To develop text mining and data mining techniques to support automated extraction and inference of regulatory pathways from biomedical literature and experimental data.
Technological developments in genomic and proteomic research have led to an explosion of data available for biomedical research. The sheer quantity of data generated by high-throughput technologies such as DNA microarrays has exceeded the capacity of traditional data analysis techniques to extract useful information. Meanwhile, the rapid accumulation of research publications makes it difficult to keep abreast of new developments in the area.
The research goal of Arizona BioPathway is to develop novel machine learning and Natural Language Processing (NLP) techniques to support efficient and effective data and text analysis in biomedical fields, particularly the analysis of genetic regulatory pathways, which are crucial for biological processes such as gene regulation and cancer development. Arizona BioPathway also aims to create a framework for pathway-related knowledge integration and visualization using a combination of approaches. The ultimate goal of Arizona BioPathway is to provide biomedical researchers with a platform for pathway-related literature abstraction, data analysis, and knowledge integration, and thereby to support the development of scientific hypotheses and the discovery of new knowledge.
Funding for this research was received from the following sources:
|1 R33 LM07299-01||05/01/2002 - 04/30/2005|
|National Institutes of Health/National Library of Medicine||$1,320,000|
|GeneScene: A toolkit for gene pathway analysis|
|1R01 LM06919-01A1||2/15/2001 - 2/14/2004|
|National Institutes of Health/National Library of Medicine||$500,000|
|UMLS Enhanced Dynamic Agents to Manage Medical Knowledge|
|IIS-9817473||5/1/99 - 4/31/2002|
|National Science Foundation||$500,000|
|DLI –Phase 2: High Performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management|
- Arizona Cancer Center researchers, staff, and students for providing genomic data and helping with user evaluation of our applications.
- Department of Plant Sciences, University of Arizona for providing domain expertise in evaluation of our applications.
- Arizona Health Sciences Library for their support and assistance.
- National Library of Medicine for providing Unified Medical Language System (UMLS).
Current focuses of the Arizona BioPathway research include automatic extraction of regulatory pathway relations from biomedical literature using NLP techniques, inference of genetic networks from genomic data using data mining approaches, and the integration of existing knowledge and text/data mining results of regulatory pathways using a variety of biomedical ontologies.
The text mining component of Arizona BioPathway is designed to extract genetic regulatory pathway relations from biomedical literature. We have experimented with two natural language processing (NLP) approaches to extracting pathway relations: shallow parsing and full parsing. The shallow parser uses templates based on closed-class words (e.g., prepositions) and model generic relations to capture relations between noun phrases, while the full parser uses a broad-coverage syntactic-semantic hybrid grammar to identify grammatical verb relations. To increase precision, both approaches filter the extracted relations against relevant biomedical lexicons such as the Gene Ontology (GO), HUGO Gene Nomenclature, and the Specialist Lexicon of UMLS. We are also studying various statistical learning techniques for biomedical entity recognition and relation extraction from biomedical text.
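The template idea behind the shallow parser can be illustrated with a minimal sketch. It is not the project's parser: the single regular-expression template, the tiny `GENE_LEXICON` set, and the function name `extract_relations` are all assumptions made for illustration. The template is anchored on the closed-class words "of" and "by", and the lexicon lookup stands in for filtering against resources such as GO or HUGO.

```python
import re

# Hypothetical mini-lexicon standing in for GO / HUGO / UMLS lookups.
GENE_LEXICON = {"p53", "mdm2", "bax", "p21"}

# One illustrative template built around the prepositions "of" and "by":
#   "<nominalization> of <target> by <agent>", e.g. "inhibition of p53 by MDM2"
TEMPLATE = re.compile(
    r"\b(?P<rel>\w+(?:tion|sion|ment))\s+of\s+(?P<target>[\w-]+)"
    r"\s+by\s+(?P<agent>[\w-]+)",
    re.IGNORECASE,
)

def extract_relations(sentence, lexicon=GENE_LEXICON):
    """Return (agent, relation, target) triples whose entities are in the lexicon."""
    relations = []
    for match in TEMPLATE.finditer(sentence):
        agent, target = match.group("agent"), match.group("target")
        # Lexicon filter: keep a relation only if both entities are known genes.
        if agent.lower() in lexicon and target.lower() in lexicon:
            relations.append((agent, match.group("rel").lower(), target))
    return relations

print(extract_relations("Inhibition of p53 by MDM2 was observed."))
```

A real template set would cover many more prepositional patterns and multi-word noun phrases; the point here is only that the lexicon filter trades recall for precision.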
The data mining component is designed to extract gene regulatory relations from genomic and proteomic data, including DNA microarrays, using machine learning techniques such as Bayesian networks. We are experimenting with various techniques to learn regulatory networks from microarray data, either with existing prior knowledge or in combination with other types of biological experimental data, e.g., DNA methylation arrays or protein expression. This joint learning approach promises to learn the network more accurately by avoiding the bias and incompleteness inherent in any single type of data. Linkages extracted from heterogeneous genomic data sources provide different evidence about gene functional relations. In a recent study, we developed a Bayesian framework for integrating relations extracted from multiple sources, such as gene expression, biomedical literature, and genomic sequence information, into a genome-wide functional network. In addition, we are conducting studies on cancer classification using gene array data, adopting and developing various feature selection techniques to identify marker genes and their interactions for cancer diagnosis and drug discovery.
The knowledge integration component leverages a variety of biomedical ontologies and knowledge sources to form an integrated framework for pathway-related knowledge organization. We have developed a feature decomposition approach to aggregating extracted pathway relations and resolving the redundancy, ambiguity, and inconsistency among them, using existing lexicons and ontologies such as Entrez Gene, RefSeq, HomoloGene, MeSH, UMLS, and GO. Pathway relations extracted from text and learned from data, as well as known relations from existing knowledge sources, will eventually be integrated into a consolidated knowledge base.
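One common way to combine evidence from several sources is a naive Bayes update, sketched below. This is an assumption for illustration, not a description of the framework used in the study: it treats the sources as conditionally independent, and the likelihood ratios and prior odds are made-up numbers that would in practice be estimated from a reference set of known relations.

```python
import math

# Hypothetical likelihood ratios: P(evidence | linked) / P(evidence | not linked).
LIKELIHOOD_RATIOS = {
    "expression": 3.0,
    "literature": 8.0,
    "sequence": 2.0,
}
# Assumed prior odds that an arbitrary gene pair is functionally linked.
PRIOR_ODDS = 0.01

def posterior_probability(evidence, prior_odds=PRIOR_ODDS, ratios=LIKELIHOOD_RATIOS):
    """Combine independent evidence sources into a posterior link probability."""
    # Work in log space so that many sources do not cause numeric underflow.
    log_odds = math.log(prior_odds)
    for source in evidence:
        log_odds += math.log(ratios[source])
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

# A gene pair supported by both expression data and the literature.
print(round(posterior_probability(["expression", "literature"]), 4))
```

Under these assumed numbers, each additional source multiplies the odds by its likelihood ratio, so a pair supported by several weak sources can outrank a pair supported by one.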
All these pathway relations can be combined to construct regulatory networks and be visualized by automatic graph drawing algorithms implemented in the Arizona BioPathway Visualizer (see the demo).
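As a rough illustration of aggregating relations before they are drawn as a network, the sketch below normalizes gene names through an alias table and merges duplicate relations into a directed graph. It does not implement the feature decomposition approach itself; the `ALIASES` table, the tuple layout, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table; a real system would draw on resources such as
# Entrez Gene or HUGO rather than a hand-written dictionary.
ALIASES = {"tp53": "TP53", "p53": "TP53", "mdm2": "MDM2", "hdm2": "MDM2"}

def normalize(name):
    """Map a gene mention to a canonical symbol (upper-cased if unknown)."""
    return ALIASES.get(name.lower(), name.upper())

def build_network(relations):
    """Merge (agent, relation, target, source) tuples into a directed graph."""
    network = defaultdict(lambda: defaultdict(set))
    for agent, relation, target, source in relations:
        # Redundant mentions collapse onto one edge; evidence accumulates in a set.
        network[normalize(agent)][normalize(target)].add((relation, source))
    return network

relations = [
    ("MDM2", "inhibits", "p53", "text"),
    ("HDM2", "inhibits", "TP53", "text"),
    ("mdm2", "inhibits", "p53", "microarray"),
]
network = build_network(relations)
for agent, targets in network.items():
    for target, evidence in targets.items():
        print(agent, "->", target, sorted(evidence))
```

The resulting adjacency structure is the kind of input a graph drawing algorithm would consume, with the evidence set available for labeling or weighting each edge.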
- Text mining (PubMed, 2003)
- P53 - Text Collection:
Content: All abstracts with p53 or related genes in title or abstract
Linguistic Parser Relations: 194,384
Co-occurrence Relations: 2,724,099
- AP1 - Text Collection:
Content: All abstracts with ap1 or related genes in title or abstract
Linguistic Parser Relations: 258,142
Co-occurrence Relations: 3,265,524
- Yeast - Text Collection:
Content: All abstracts with yeast in title or abstract
Linguistic Parser Relations: 584,502
Co-occurrence Relations: 6,535,737
- Arabidopsis - Text Collection:
Content: All abstracts with MeSH terms of ‘Arabidopsis’ or ‘Arabidopsis Proteins’
Linguistic Parser Relations: 222
Co-occurrence Relations: 1,291
- Data Mining
- P53 – Microarray Data:
Content: Gene expression measurement of p53 mutant cell lines (provided by AZCC)
Gene expression measurements: 33
Genes (Homo sapiens ORFs): 5,306
Genes with greatest variations: 200
- Yeast – Microarray Data:
Content: Microarray data of yeast cell cycle (Spellman et al. 1998)
Gene expression measurements: 77
Time series: 6
Genes (S. cerevisiae ORFs): 6,177
Genes whose expression varied over the different cell-cycle stages: 800
- Arabidopsis – Microarray Data:
Content: Two high-quality microarray series of Arabidopsis at http://www.weigelworld.org
Gene expression measurements: 237 for development and 298 for abiotic stress
Genes (Arabidopsis): 22,810
- Arabidopsis – Genome sequence relations:
Content: Gene relations extracted from genome sequence using four different methods in ProLink (http://dip.doe-mbi.ucla.edu/pronav)
Phylogenetic profiling (PP): 132,637
Rosetta Stone (RS): 989,795
Gene neighbor (GN): 18,823
Gene cluster (GC): 11,586
- MDS – Microarray Data:
Content: DNA methylation arrays from the Arizona Cancer Center, derived from the epigenomic analysis of bone marrow specimens from healthy donors and individuals with myelodysplastic syndrome (MDS).
Measurements: 55 (10 normal and 45 tumor samples)
- Ovarian Cancer – Microarray Data:
Content: Microarray-based measurements of DNA methylation from the Gynecologic Oncology tumor bank at the University of Iowa, made available through the Arizona Cancer Center.
Measurements: 114 (25 normal and 89 tumor samples)
- A shallow parser based on closed-class English words, extracting noun phrase relations
- A full parser using a syntax-semantic hybrid grammar, extracting verb relations
- Co-occurrence analysis based on Concept Space, which generates asymmetric relations between phrases ordered according to the strength of their relation
- Conditional Random Field (CRF) methods for entity recognition
- Kernel-based learning methods for relation extraction and
- Feature decomposition for entity and relation aggregation
- Bayesian Network frameworks for integrating gene functional relations from multiple data sources
- Optimal search based feature subset selection methods for identifying marker genes for cancer classification
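The asymmetric co-occurrence relations mentioned in the list above can be illustrated with a simplified weighting. The sketch is not the Concept Space weighting itself, which is more elaborate; it assumes a plain conditional frequency, w(a -> b) = (documents containing both a and b) / (documents containing a), and the function names and toy documents are invented for the example.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_weights(documents):
    """Return asymmetric weights w[(a, b)] = docs(a and b) / docs(a)."""
    term_count = Counter()
    pair_count = Counter()
    for terms in documents:
        unique = set(terms)
        term_count.update(unique)
        # Ordered pairs, so (a, b) and (b, a) are counted separately.
        pair_count.update(permutations(sorted(unique), 2))
    return {(a, b): n / term_count[a] for (a, b), n in pair_count.items()}

def related_terms(term, weights):
    """Rank the terms related to `term`, strongest relation first."""
    ranked = [(b, w) for (a, b), w in weights.items() if a == term]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

docs = [["p53", "mdm2"], ["p53", "bax"], ["p53", "mdm2", "apoptosis"], ["p53"]]
weights = cooccurrence_weights(docs)
print(weights[("p53", "mdm2")], weights[("mdm2", "p53")])
print(related_terms("p53", weights))
```

In this toy collection every document mentioning mdm2 also mentions p53, but only half of the p53 documents mention mdm2, so the two directions receive different weights. That asymmetry is what lets related phrases be ordered by strength from the point of view of a given term.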
|Dr. Hsinchun Chen|
|Dr. Zhu Zhang|
|Dr. Jesse Martinez|
|Yulei Zhang (Gavin)|
|Text Mining Publications and Presentations|
|Data Mining Publications and Presentations|
For additional information, please contact us.