OrthoMCL: identification of ortholog groups for eukaryotic genomes
- PMID: 12952885
- PMCID: PMC403725
- DOI: 10.1101/gr.1224503
Abstract
The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of "recent" paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome.
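The Markov Cluster (MCL) step mentioned in the abstract can be illustrated with a small, generic sketch. This is not the OrthoMCL implementation: the function name `markov_cluster`, the edge-weight dictionary, the self-loop rule, and the parameter defaults are all assumptions chosen for illustration, and the weights stand in for whatever pairwise similarity scores link putative orthologs and paralogs. The sketch alternates expansion (matrix squaring) and inflation (element-wise powering followed by column normalization) until the matrix stops changing, then reads clusters from the rows that retain weight.

```python
def markov_cluster(edges, inflation=1.5, max_iter=100, tol=1e-9):
    """Generic MCL sketch on an undirected weighted graph.

    edges: dict mapping (node_a, node_b) -> similarity weight (hypothetical scores).
    Returns a sorted list of clusters, each a sorted list of node names.
    """
    nodes = sorted({name for pair in edges for name in pair})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    if n == 0:
        return []

    # Build a symmetric similarity matrix.
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in edges.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = float(w)
    # Add self-loops (an assumption of this sketch) so every column has weight.
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0

    def normalize_columns(mat):
        # Make each column sum to 1 so it reads as transition probabilities.
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        inflated = normalize_columns([[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Each row that still carries weight lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical protein identifiers from several species, with made-up weights.
example_edges = {
    ("fly_A", "human_A"): 1.0,
    ("human_A", "worm_A"): 1.0,
    ("fly_A", "worm_A"): 1.0,
    ("human_B", "yeast_B"): 1.0,
}
groups = markov_cluster(example_edges)
```

A higher `inflation` value generally yields finer-grained clusters, which is the usual knob for trading off group size against tightness in MCL-style clustering.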
WEB SITE REFERENCES
- http://www.cbil.upenn.edu/gene-family; Putative ortholog groups generated by OrthoMCL, University of Pennsylvania.
- http://www.ncbi.nlm.nih.gov/COG/; The Clusters of Orthologous Groups (COG) database, NCBI.
- http://www.allgenes.org; The human and mouse gene index, University of Pennsylvania.
- http://www.tigr.org/tdb/tgi/; TIGR Gene Indices.
- http://www.tigr.org/tdb/tgi/ego/index.shtml; Eukaryotic Gene Orthologs (EGO), TIGR.