diff --git a/teachopencadd/talktorials/.DS_Store b/teachopencadd/talktorials/.DS_Store
new file mode 100644
index 0000000..31cde5f
Binary files /dev/null and b/teachopencadd/talktorials/.DS_Store differ
diff --git a/teachopencadd/talktorials/T005_compound_clustering/README.md "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/README.md"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/README.md
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/README.md"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/data/README.md "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/README.md"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/data/README.md
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/README.md"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/data/cluster_dist_cutoff_0.20.png "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/cluster_dist_cutoff_0.20.png"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/data/cluster_dist_cutoff_0.20.png
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/cluster_dist_cutoff_0.20.png"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/data/cluster_representatives.svg "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/cluster_representatives.svg"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/data/cluster_representatives.svg
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/cluster_representatives.svg"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/data/molecule_set_largest_cluster.sdf "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/molecule_set_largest_cluster.sdf"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/data/molecule_set_largest_cluster.sdf
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/data/molecule_set_largest_cluster.sdf"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/images/README.md "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/images/README.md"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/images/README.md
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/images/README.md"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/images/butina_full.pdf "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/images/butina_full.pdf"
similarity index 100%
rename from teachopencadd/talktorials/T005_compound_clustering/images/butina_full.pdf
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/images/butina_full.pdf"
diff --git a/teachopencadd/talktorials/T005_compound_clustering/talktorial.ipynb "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/talktorial.ipynb"
similarity index 99%
rename from teachopencadd/talktorials/T005_compound_clustering/talktorial.ipynb
rename to "teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/talktorial.ipynb"
index cb53192..896b079 100644
--- a/teachopencadd/talktorials/T005_compound_clustering/talktorial.ipynb
+++ "b/teachopencadd/talktorials/T005_\345\214\226\345\220\210\347\211\251_\350\201\232\347\261\273/talktorial.ipynb"
@@ -4,9 +4,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# T005 · Compound clustering\n",
+ "# T005 · 化合物聚类\n",
 "\n",
- "**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.\n",
 "\n",
 "Authors:\n",
 "\n",
@@ -19,50 +18,43 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "__Talktorial T005__: This talktorial is part of the TeachOpenCADD pipeline described in the [first TeachOpenCADD paper](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0351-x), comprising of talktorials T001-T010."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Aim of this talktorial\n",
+ "## 课程目标\n",
 "\n",
 "\n",
 "\n",
- "Similar compounds might bind to the same targets and show similar effects. \n",
- "Based on this similar property principle, compound similarity can be used to build chemical groups via clustering. \n",
- "From such a clustering, a diverse set of compounds can also be selected from a larger set of screening compounds for further experimental testing."
+ "相似的化合物可能会结合到相同的靶标上并显示出类似的效果。\n",
+ "基于这种相似性质的原理,可以通过聚类来构建化合物的化学组。\n",
+ "通过这样的聚类,还可以从更大的筛选化合物集中选择出多样化的化合物组进行进一步的实验测试。"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### Contents in _Theory_\n",
+ "### _理论部分_ 目录\n",
 "\n",
- "* Introduction to clustering and Jarvis-Patrick algorithm\n",
- "* Detailed explanation of Butina clustering\n",
- "* Picking diverse compounds"
+ "* 聚类介绍和Jarvis-Patrick算法\n",
+ "* Butina聚类的详细解释\n",
+ "* 选择多样化的化合物"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### Contents in _Practical_\n",
+ "### _实战部分_ 目录\n",
 "\n",
- "* Clustering with the Butina algorithm\n",
- "* Visualizing the clusters\n",
- "* Picking the final list of compounds\n",
- "* Bonus: analysis of run times"
+ "* 使用Butina算法进行聚类\n",
+ "* 可视化聚类结果\n",
+ "* 挑选最终化合物列表\n",
+ "* 附加内容:运行时间分析"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### References\n",
+ "### 参考文献\n",
 "\n",
 "* Butina, D. Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Set. _J. Chem. Inf. Comput. Sci._ (1999)\n",
 "* Leach, Andrew R., Gillet, Valerie J. An Introduction to Chemoinformatics (2003)\n",
@@ -75,7 +67,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## Theory"
+ "## 理论"
 ]
 },
 {
@@ -84,33 +76,35 @@
 "source": [
 "### Introduction to clustering and Jarvis-Patrick algorithm\n",
 "\n",
- "[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) can be defined as _the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)_.\n",
+ "[聚类](https://en.wikipedia.org/wiki/Cluster_analysis) 是 _将一组对象分组的任务,使得同一组内(称为一个簇)的对象在某种意义上比其他组(簇)的对象更相似_ 。\n",
+ "\n",
+ "药物研究中的化合物聚类通常基于化合物之间的化学或结构相似性,以发现共享属性的组,并设计出多样化且具有代表性的集合以供进一步分析。\n",
 "\n",
- "Compound clustering in pharmaceutical research is often based on chemical or structural similarity between compounds to find groups that share properties as well as to design a diverse and representative set for further analysis. \n",
 "\n",
- "General procedure: \n",
+ "一般过程如下:\n",
 "\n",
- "* Methods are based on clustering data by similarity between neighboring points. \n",
- "* In cheminformatics, compounds are often encoded as molecular fingerprints and similarity can be described by the Tanimoto similarity (see **Talktorial T004**).\n",
+ "* 方法基于通过相似性对邻近点进行聚类的数据。\n",
+ "* 在化学信息学中,化合物通常被编码为分子指纹,相似性可以通过Tanimoto相似度来描述(见**教程T004**)。\n",
 "\n",
- "> Quick reminder:\n",
+ "\n",
+ "> 快速提醒:\n",
 "> \n",
- "> * Fingerprints are binary vectors where each bit indicates the presence or absence of a particular substructural fragment within a molecule. \n",
- "> * Similarity (or distance) matrix: The similarity between each pair of molecules represented by binary fingerprints is most frequently quantified using the Tanimoto coefficient, which measures the number of common features (bits). \n",
- "> * The value of the Tanimoto coefficient ranges from zero (no similarity) to one (high similarity).\n",
+ "> * 指纹是二进制向量,其中每个位点表示分子内特定子结构片段的存在或缺失。\n",
+ "> * 相似性(或距离)矩阵:通过二进制指纹表示的分子对之间的相似性最常用Tanimoto系数来量化,该系数测量共同特征(位点)的数量。\n",
+ "> * Tanimoto系数的值范围从零(无相似性)到一(高相似性)。\n",
 "\n",
- "There are a number of clustering algorithms available, with the [Jarvis-Patrick clustering](http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Jarvis-Patrick_Clustering_Overview.htm) being one of the most widely used algorithms in the pharmaceutical context.\n",
+ "有多种聚类算法可供选择,其中[Jarvis-Patrick聚类算法]((http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Jarvis-Patrick_Clustering_Overview.htm))是一个在制药领域广泛使用的算法之一。\n",
+ "Jarvis-Patrick聚类算法由两个参数$K$和$K_{min}$定义:\n",
 "\n",
- "Jarvis-Patrick clustering algorithm is defined by two parameters $K$ and $K_{min}$:\n",
+ "* 为每个分子计算$K$个最近邻的集合。\n",
+ "* 如果满足以下条件,两个分子将聚类在一起:\n",
+ " * 它们在彼此的最近邻列表中\n",
+ " * 它们至少有 $K_{min}$个$K$个最近邻是共同的。\n",
 "\n",
- "* Calculate the set of $K$ nearest neighbors for each molecule. \n",
- "* Two molecules cluster together if \n",
- " * they are in each others list of nearest neighbors\n",
- " * they have at least $K_{min}$ of their $K$ nearest neighbors in common.\n",
 "\n",
- "The Jarvis-Patrick clustering algorithm is deterministic and able to deal with large sets of molecules in a matter of a few hours. However, a downside lies in the fact that this method tends to produce large heterogeneous clusters (see _Butina clustering_, referenced above). \n",
+ "Jarvis-Patrick聚类算法是确定性的,并且能够在几个小时内处理大量的分子集。然而,这种方法的一个缺点是它倾向于产生大型的异质性聚类(参见上文提到的 _Butina聚类_)。\n",
 "\n",
- "More clustering algorithms can also be found in the [scikit-learn clustering module](http://scikit-learn.org/stable/modules/clustering.html)."
+ "更多聚类算法可参考 [scikit-learn clustering module](http://scikit-learn.org/stable/modules/clustering.html)."
 ]
 },
 {
@@ -119,7 +113,10 @@
 "source": [
 "### Detailed explanation of Butina clustering\n",
 "\n",
- "Butina clustering ([*J. Chem. Inf. Model.* (1999), **39** (4), 747](https://pubs.acs.org/doi/abs/10.1021/ci9803381)) was developed to identify smaller but homogeneous clusters, with the prerequisite that (at least) the cluster centroid will be more similar than a given threshold to every other molecule in the cluster.\n",
+ "Butina聚类([*J. Chem. Inf. Model.* (1999), **39** (4), 747](https://pubs.acs.org/doi/abs/10.1021/ci9803381))是为了识别更小但同质的聚类而开发的,前提是(至少)聚类中心将比给定阈值更相似于聚类中的每个其他分子。\n",
+ "\n",
+ "以下是这种聚类方法的关键步骤(见下文流程图):\n",
+ "Butina clustering () was developed to identify smaller but homogeneous clusters, with the prerequisite that (at least) the cluster centroid will be more similar than a given threshold to every other molecule in the cluster.\n",
 "\n",
 "These are the key steps in this clustering approach (see flowchart below):\n",
 "\n",
@@ -173,7 +170,7 @@
 " "
 ],
 "text/plain": [
- ""
+ ""
 ]
 },
 "execution_count": 1,
@@ -1433,7 +1430,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.9.16"
+ "version": "3.8.15"
 },
 "toc-autonumbering": true,
 "widgets": {