Genomic Pathway Visualizer
Research Goal
- To develop text mining and data mining techniques to support automated extraction and inference of regulatory pathways from biomedical literature and experimental data.
Technological developments in genomic and proteomic research have led to an explosion of data available for biomedical research. The sheer quantity of data generated by high-throughput technologies such as DNA microarrays has exceeded the capacity of traditional data analysis techniques to extract useful information. Meanwhile, the rapid accumulation of research publications makes it difficult to keep abreast of new developments in the area.
The research goal of Arizona BioPathway is to develop novel machine learning and natural language processing (NLP) techniques to support efficient and effective data and text analysis in biomedical fields, particularly the analysis of genetic regulatory pathways, which are crucial for biological processes such as gene regulation and cancer development. Arizona BioPathway also aims to create a framework for pathway-related knowledge integration and visualization that combines several approaches. The ultimate goal is to provide biomedical researchers with a platform for pathway-related literature abstraction, data analysis, and knowledge integration, and thereby to support the development of scientific hypotheses and the discovery of new knowledge.
Funding
Funding for this research was received from the following sources:
- 1 R33 LM07299-01, 05/01/2002 - 04/30/2005
  National Institutes of Health/National Library of Medicine, $1,320,000
  GeneScene: A toolkit for gene pathway analysis
- 1R01 LM06919-01A1, 2/15/2001 - 2/14/2004
  National Institutes of Health/National Library of Medicine, $500,000
  UMLS Enhanced Dynamic Agents to Manage Medical Knowledge
- IIS-9817473, 5/1/99 - 4/31/2002
  National Science Foundation, $500,000
  DLI – Phase 2: High Performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management
Acknowledgements
- Arizona Cancer Center researchers, staff, and students for providing genomic data and helping with user evaluation of our applications.
- School of Plant Sciences, University of Arizona for providing domain expertise in evaluation of our applications.
- Arizona Health Sciences Library for their support and assistance.
- National Library of Medicine for providing the Unified Medical Language System (UMLS).
Approach & Methodology
Current focuses of the Arizona BioPathway research include automatic extraction of regulatory pathway relations from biomedical literature using NLP techniques, inference of genetic networks from genomic data using data mining approaches, and the integration of existing knowledge and text/data mining results of regulatory pathways using a variety of biomedical ontologies.
The text mining component of Arizona BioPathway is designed to extract genetic regulatory pathway relations from biomedical literature. We have experimented with two natural language processing (NLP) approaches to extracting pathway relations: shallow parsing and full parsing. The shallow parser uses templates based on closed-class words (e.g., prepositions) and models generic relations to capture relations between noun phrases, while the full parser uses a broad-coverage syntactic-semantic hybrid grammar to identify grammatical verb relations. To increase precision, both approaches use relevant biomedical lexicons such as the Gene Ontology (GO), HUGO Gene Nomenclature, and the Specialist Lexicon of UMLS to filter the extracted relations. We are also studying various statistical learning techniques for biomedical entity recognition and relation extraction from biomedical text.
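The template idea behind the shallow parser can be illustrated with a minimal sketch. It assumes a single hypothetical template ("X of Y by Z") built around the prepositions "of" and "by", and a tiny stand-in lexicon; the names `LEXICON`, `TEMPLATE`, and `extract_relations` are illustrative only and do not reproduce the actual Arizona BioPathway parser or its lexicons.

```python
import re

# Hypothetical mini-lexicon standing in for term lists drawn from GO, HUGO, or UMLS.
LEXICON = {"p53", "mdm2", "bax"}

# One illustrative template anchored on closed-class words ("of", "by"):
# "<relation noun> of <target> by <agent>"
TEMPLATE = re.compile(
    r"(?P<relation>\w+) of (?P<target>[\w-]+) by (?P<agent>[\w-]+)",
    re.IGNORECASE,
)

def extract_relations(sentence, lexicon=LEXICON):
    """Return (agent, relation, target) triples whose entities are in the lexicon."""
    relations = []
    for match in TEMPLATE.finditer(sentence):
        agent = match.group("agent")
        target = match.group("target")
        # Lexicon filter: keep the relation only if both entities are known terms.
        if agent.lower() in lexicon and target.lower() in lexicon:
            relations.append((agent, match.group("relation").lower(), target))
    return relations

print(extract_relations("Inhibition of p53 by MDM2 was observed."))
```

The lexicon check is what trades recall for precision here: a sentence such as "Activation of kinase by stress" matches the template but is discarded because neither entity is a known term.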
The data mining component is designed to extract gene regulatory relations from genomic and proteomic data, including DNA microarray data, using machine learning techniques such as Bayesian networks. We are experimenting with various techniques to learn regulatory networks from microarray data, either with existing prior knowledge or in combination with other types of biological experimental data, e.g., DNA methylation arrays or protein expression. This joint learning approach promises to learn the network more accurately by avoiding the bias and incompleteness inherent in any single type of data. Linkages extracted from heterogeneous genomic data sources provide different evidence about gene functional relations. In a recent study, we developed a Bayesian framework for integrating relations extracted from multiple sources, such as gene expression, biomedical literature, and genomic sequence information, into a genome-wide functional network. In addition, we conduct studies on cancer classification using gene array data. We are adopting and developing various feature selection techniques to identify marker genes and their interactions for cancer diagnosis and drug discovery.
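One common way to combine evidence from several sources in a Bayesian setting is to multiply a prior odds by one likelihood ratio per source. The sketch below assumes the sources are conditionally independent given whether a gene pair is functionally related (a naive Bayes simplification); the source text does not specify the structure of the project's framework, and the function name and the numbers used are hypothetical.

```python
from math import prod

def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with per-source likelihood ratios.

    Assumes each evidence source is conditionally independent given
    whether the gene pair is truly related (naive Bayes assumption).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical example: a 1% prior, with expression data and literature
# each supporting the relation (likelihood ratios 10 and 5).
print(round(posterior_probability(0.01, [10.0, 5.0]), 4))
```

A likelihood ratio above 1 raises the probability that the pair is related and a ratio below 1 lowers it, so weak or contradictory sources are down-weighted rather than ignored.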
The knowledge integration component leverages a variety of biomedical ontologies and knowledge sources to form an integrated framework for pathway-related knowledge organization. We have developed a feature decomposition approach to aggregating extracted pathway relations and resolving the redundancy, ambiguity, and inconsistency among them, using existing lexicons and ontologies such as Entrez Gene, RefSeq, HomoloGene, MeSH, UMLS, and GO. Pathway relations extracted from text and learned from data, as well as known relations from existing knowledge sources, will eventually be integrated into a consolidated knowledge base.
All these pathway relations can be combined to construct regulatory networks and be visualized by automatic graph drawing algorithms implemented in the Arizona BioPathway Visualizer (see the demo).
Testbed
- Text mining (PubMed, 2003)
  - P53 - Text Collection:
    Content: All abstracts with p53 or related genes in title or abstract
    Abstracts: 20,360
    Linguistic Parser Relations: 194,384
    Co-occurrence Relations: 2,724,099
  - AP1 - Text Collection:
    Content: All abstracts with ap1 or related genes in title or abstract
    Abstracts: 23,339
    Linguistic Parser Relations: 258,142
    Co-occurrence Relations: 3,265,524
  - Yeast - Text Collection:
    Content: All abstracts with yeast in title or abstract
    Abstracts: 66,197
    Linguistic Parser Relations: 584,502
    Co-occurrence Relations: 6,535,737
  - Arabidopsis - Text Collection:
    Content: All abstracts with MeSH terms of ‘Arabidopsis’ or ‘Arabidopsis Proteins’
    Abstracts: 10,548
    Linguistic Parser Relations: 222
    Co-occurrence Relations: 1,291
- Data Mining
  - P53 – Microarray Data:
    Content: Gene expression measurement of p53 mutant cell lines (provided by AZCC)
    Gene expression measurements: 33
    Genes (Homo sapiens ORFs): 5,306
    Genes with greatest variations: 200
  - Yeast – Microarray Data:
    Content: Microarray data of yeast cell cycle (Spellman et al. 1998)
    Gene expression measurements: 77
    Time series: 6
    Genes (S. cerevisiae ORFs): 6,177
    Genes whose expression varied over the different cell-cycle stages: 800
  - Arabidopsis – Microarray Data:
    Content: two high-quality microarray series of Arabidopsis at http://www.weigelworld.org
    Gene expression measurements: 237 for development and 298 for abiotic stress
    Genes (Arabidopsis): 22,810
  - Arabidopsis – Genome Sequence Relations:
    Content: gene relations extracted from genome sequence using four different methods in ProLink (http://dip.doe-mbi.ucla.edu/pronav)
    Relations:
      Phylogenetic profiling (PP): 132,637
      Rosetta Stone (RS): 989,795
      Gene neighbor (GN): 18,823
      Gene cluster (GC): 11,586
  - MDS – Microarray Data:
    Content: DNA methylation arrays from Arizona Cancer Center. It is derived from the epigenomic analysis of bone marrow specimens from healthy donors and individuals with myelodysplastic syndrome (MDS).
    Measurements: 55 (10 normal and 45 tumor samples)
    Genes: 678
  - Ovarian Cancer – Microarray Data:
    Content: microarray-based measurements of DNA methylation from the Gynecologic Oncology tumor bank at the University of Iowa and made available through the Arizona Cancer Center.
    Measurements: 114 (25 normal and 89 tumor samples)
    Genes: 6,560
Techniques
- A shallow parser based on closed class English words extracting noun phrase relations
- A full parser using syntax-semantic hybrid grammar extracting verb relations
- Co-occurrence analysis based on Concept Space, which generates asymmetric relations between phrases ordered according to the strength of their relation
- Conditional Random Field (CRF) methods for entity recognition
- Kernel-based learning methods for relation extraction and classification
- Feature decomposition for entity and relation aggregation
- Bayesian Network frameworks for integrating gene functional relations from multiple data sources
- Optimal search based feature subset selection methods for identifying marker genes for cancer classification
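To make the notion of asymmetric co-occurrence relations concrete, the following sketch scores a directed relation from term a to term b as the fraction of documents containing a that also contain b. This is a deliberately simplified stand-in, not the Concept Space weighting itself, and the function names and toy documents are hypothetical.

```python
from collections import Counter
from itertools import permutations

def asymmetric_cooccurrence(docs):
    """Map each ordered term pair (a, b) to: docs with a and b / docs with a."""
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing each ordered pair
    for doc in docs:
        terms = set(doc)
        term_df.update(terms)
        pair_df.update(permutations(terms, 2))
    return {(a, b): n / term_df[a] for (a, b), n in pair_df.items()}

def ranked_neighbors(weights, term):
    """List the terms related to `term`, strongest relation first."""
    neighbors = [(b, w) for (a, b), w in weights.items() if a == term]
    return sorted(neighbors, key=lambda item: (-item[1], item[0]))

# Toy collection: each document is the list of phrases found in one abstract.
docs = [["p53", "mdm2"], ["p53", "bax"], ["p53", "mdm2", "bax"], ["mdm2"]]
weights = asymmetric_cooccurrence(docs)
print(ranked_neighbors(weights, "bax"))
```

In this toy collection the relation from "bax" to "p53" is stronger than the reverse, because every document mentioning "bax" also mentions "p53" while only some "p53" documents mention "bax"; this direction-dependence is what makes the relations asymmetric.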
Team Members
- Dr. Hsinchun Chen (hchen@eller.arizona.edu)
- Dr. Zhu Zhang
- Dr. Jesse Martinez
- Cathy Larson
- Jiexun Li
- Hua Su
- Chun-Ju Tseng
- Siddharth Kaza
- Xin Li
- Nichalin Suakkaphong
- Yulei Zhang (Gavin)
- Shailesh Joshi
Publications
Text Mining Publications and Presentations
- N. Suakkaphong, Z. Zhang, and H. Chen, “Disease Named Entity Recognition using Semi-supervised Learning and Conditional Random Fields,” Journal of the American Society for Information Science and Technology, Volume 62, Number 4, Pages 727-737, 2011.
- K. D. Quiñones, H. Su, B. Marshall, S. Eggers, and H. Chen. “User-centered evaluation of Arizona BioPathway: an information extraction, integration, and visualization system.” IEEE Transactions on Information Technology in Biomedicine, 11(5): 527-536, 2007.
- B. Marshall, H. Su, D. McDonald, S. Eggers, and H. Chen. "Aggregating Automatically Extracted Regulatory Pathway Relations." IEEE Transactions on Information Technology in Biomedicine, 10:100-108, 2006.
- B. Marshall, H. Su, D. McDonald, and H. Chen. “Linking ontological resources using aggregatable substance identifiers to organize extracted relations.” In Proceedings of Pacific Symposium on Biocomputing, pp. 162-173, 2005.
- G. Leroy, H. Chen. "GeneScene: An Ontology-Enhanced Integration of Linguistic and Co-Occurrence Based Relations in Biomedical Texts," Journal of The American Society for Information Science and Technology (JASIST), 56: 457-468, 2005.
- D. McDonald, H. Chen, H. Su, and B. Marshall. "Extracting Gene Pathway Relations Using a Hybrid Grammar: The Arizona Relation Parser," Bioinformatics 20:3370-3378, 2004.
- D.M. McDonald, H. Chen, G. Leroy, and H. Su. "Combining Ontologies and Grammatical Relations to Yield Diverse Semantic Relations from Biomedical Texts," Poster presentation at Pacific Symposium on Biocomputing, January 2004.
- G. Leroy, H. Chen, and J.D. Martinez. “A Shallow Parser Based on Closed-class Words to Capture Relations in Biomedical Text.” Journal of Biomedical Informatics (JBI) 36:145-158, 2003.
- G. Leroy, H. Chen, J. Martinez, S. Eggers, R. Falsey, K. Kislin, Z. Huang, J. Li, J. Xu, D. McDonald, and G. Ng. "GeneScene: Biomedical Text and Data Mining" Presented at the Third ACM and IEEE Joint Conference on Digital Libraries (JCDL-) May 27-31, 2003, Houston, Texas, 2003.
- G. Leroy and H. Chen. "Filling preposition-based templates to capture information for medical abstracts." In Proceedings of Pacific Symposium on Biocomputing, pp. 350-361, 2002.
Data Mining Publications and Presentations
- J. Li, H. Su, H. Chen, and B. W. Futscher “Optimal search-based gene subset selection from gene array data for cancer classification.” IEEE Transactions on Information Technology in Biomedicine, accepted, 2006.
- Z. Huang, J. Li, H. Su, G. S. Watts, H. Chen "Large-scale regulatory network analysis from microarray data: modified Bayesian Network learning and association rule mining." Decision Support Systems: Special Issue on Decision Support in Medicine, forthcoming, 2006.
- J. Li, X. Li, H. Su, H. Chen, and D. W. Galbraith, "A framework of integrating gene relations from heterogeneous data sources: an experiment on Arabidopsis thaliana." Bioinformatics, 22:2037-2043, 2006.
- Z. Huang, H. Su, H. Chen “Joint learning using multiple types of data and knowledge,” in H. Chen, S. Fuller, C. Friedman, and W. Hersh (Eds.), Medical Informatics: Knowledge Management and Data Mining in Biomedicine, Springer, p.593-624. 2005.
- Z. Huang, H. Chen, H. Su, B. Marshall, B. L. Smith, G. W. Watts, J. D. Martinez. “Learning Genetic Pathways Using Bayesian Networks and Qualitative Probabilistic Networks,” Poster presentation at Pacific Symposium on Biocomputing, January 2005.
- Z. Huang, H. Chen, H. Su, B. Marshall, B. L. Smith, G. W. Watts, J. D. Martinez. “Learning Genetic Pathways Using Bayesian Networks and Qualitative Probabilistic Networks,” Poster presentation at Pacific Symposium on Biocomputing, January 2004.