Similarity Score Estimation and Gaps Trimming of Multiple Sequence Alignment for Phylogenetic Tree Analysis

Kasikrit Damkliang; Pichaya Tandayya; Unisa Sangket; Ekawat Pasomsub

doi:10.37936/ecti-cit.2017112.74783

Published: Nov 28, 2017

Keywords:

Phylogenetic Tree; Multiple Sequence Alignment (MSA); Similarity Score; Gaps Trimming; Tree Inferring; Sequence Analysis; Coding Sequences (CDS); Web Service; Workflow

Kasikrit Damkliang

Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Thailand

Pichaya Tandayya

Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Thailand

Unisa Sangket

Center for Genomics and Bioinformatics Research, Faculty of Science, Prince of Songkla University, Thailand

Ekawat Pasomsub

Department of Pathology, Faculty of Medicine, Mahidol University, Thailand

Abstract

Phylogenetic tree analysis is the process of finding the most probable evolutionary tree history of an organism of interest. The key step of the process is multiple sequence alignment (MSA), which can be performed with any MSA tool that produces its result as blocks in the Phylip format. Bioinformaticians have to manually determine and trim gaps in the MSA blocks using the relevant tools of a software package in off-line mode, and the data blocks must be manually cut and pasted between these tools. These working steps tend to be error-prone and time-consuming. In addition, selecting an improper tree-inferring algorithm without applying an MSA similarity score tends to generate a phylogenetic tree with low accuracy and to take much more time. In this work, we present a new practical approach to phylogenetic tree analysis that applies our enhancements for similarity score estimation and gaps trimming of the MSA blocks. We propose in-silico algorithms for automating the similarity score estimation and gaps trimming, and deploy them as web services. We demonstrate the use of the web services by composing them into an integrated stateful WSDL workflow. Our case study datasets are complete coding sequences (CDS) and sets of complete genomes of Dengue Virus 2, fetched from the NCBI RefSeq nucleotide database. Our proposed algorithms returned correct results, verified to the satisfaction of our bioinformaticians. Our distributions, the user manuals and endpoints of the web services, and the open source programs are available at https://bioservices.sci.psu.ac.th.
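The two operations named in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the authors' algorithms, which the abstract does not specify. The scoring rule (the fraction of columns in which every sequence carries the same non-gap residue), the trimming rule (drop any column containing a gap), the relaxed whitespace-delimited Phylip parsing, and all function names are assumptions made for illustration only.

```python
from typing import Dict

# Hypothetical toy alignment in a relaxed, sequential Phylip-like layout:
# a header "ntaxa nsites" followed by one "name sequence" line per taxon.
EXAMPLE = """3 8
seqA ACGT-ACG
seqB ACGTTACG
seqC ACCT-ACG
"""


def parse_alignment(text: str) -> Dict[str, str]:
    """Parse a relaxed sequential Phylip-like block into {name: sequence}."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    ntaxa, nsites = map(int, lines[0].split())
    aln: Dict[str, str] = {}
    for ln in lines[1:1 + ntaxa]:
        name, seq = ln.split(None, 1)
        aln[name] = seq.replace(" ", "").upper()
    if len(aln) != ntaxa or any(len(s) != nsites for s in aln.values()):
        raise ValueError("alignment does not match its header")
    return aln


def similarity_score(aln: Dict[str, str]) -> float:
    """Assumed score: share of columns that are identical and gap-free."""
    seqs = list(aln.values())
    ncols = len(seqs[0]) if seqs else 0
    if ncols == 0:
        return 0.0
    identical = sum(
        1 for col in zip(*seqs) if len(set(col)) == 1 and col[0] != "-"
    )
    return identical / ncols


def trim_gaps(aln: Dict[str, str]) -> Dict[str, str]:
    """Assumed trimming rule: remove every column that contains a gap."""
    names = list(aln)
    kept = [col for col in zip(*aln.values()) if "-" not in col]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}


if __name__ == "__main__":
    alignment = parse_alignment(EXAMPLE)
    print("score before trimming:", similarity_score(alignment))
    trimmed = trim_gaps(alignment)
    print("trimmed alignment:", trimmed)
    print("score after trimming:", similarity_score(trimmed))
```

In a pipeline like the one the abstract describes, a score of this kind could inform the choice of tree-inferring algorithm, and the trimmed block would be passed on to the tree-inference step. The actual thresholds and decision rules are those of the paper, not of this sketch.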

How to Cite

[1]

K. Damkliang, P. Tandayya, U. Sangket, and E. Pasomsub, “Similarity Score Estimation and Gaps Trimming of Multiple Sequence Alignment for Phylogenetic Tree Analysis”, ECTI-CIT Transactions, vol. 11, no. 2, pp. 129–142, Nov. 2017.

Issue

Vol. 11 No. 2 (2017): ECTI Transaction on CIT (Nov 2017)

Section

Artificial Intelligence and Machine Learning (AI)

Author Biographies

Kasikrit Damkliang, Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Thailand

Kasikrit Damkliang received a BS degree in Computer Science in 2005 and an MEng degree in Computer Engineering in 2009 from Prince of Songkla University (PSU), Thailand. Currently, he is a lecturer in the Information and Communication Technology Programme (ICT), Faculty of Science, PSU and also a PhD student at the Department of Computer Engineering, Faculty of Engineering, PSU. His research interests include HPC, Web Service, Cloud Computing, Workflow Technology, and Bioinformatics.

Pichaya Tandayya, Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Thailand

Pichaya Tandayya graduated in Electrical Engineering (Communications) from Prince of Songkla University (PSU), Thailand, in 1990. She obtained her Ph.D. in Computer Science in 2001 from the University of Manchester in the area of Distributed Interactive Simulation. Currently, she is an Assistant Professor at the Department of Computer Engineering, PSU. Her current research concerns Parallel and Distributed Computing and Systems, and Assistive Technology.

Unisa Sangket, Center for Genomics and Bioinformatics Research, Faculty of Science, Prince of Songkla University, Thailand

Unisa Sangket received the B.Sc., M.Sc. (Computer Science), and Ph.D. (Molecular Biology and Bioinformatics) degrees from Prince of Songkla University, Thailand, in 2002, 2006, and 2011, respectively. She is currently a lecturer at the Department of Molecular Biotechnology and Bioinformatics, Faculty of Science, Prince of Songkla University. Her main areas of research interest are variant, genome, and transcriptome analysis using bioinformatics tools.

Ekawat Pasomsub, Department of Pathology, Faculty of Medicine, Mahidol University, Thailand

Ekawat Pasomsub graduated in Medical Technology in 2001 and obtained his Ph.D. in Clinical Pathology in 2010 from Mahidol University (MU), Thailand. Currently, he is a lecturer in the Department of Pathology, Faculty of Medicine, Ramathibodi Hospital, MU. His current research concerns laboratory diagnosis of viruses, HIV drug resistance, genetic association studies, and applications of next-generation sequencing technology.

