BioInformatics
|
| Research Goal |
|
To develop text mining and data mining techniques
to support automated extraction and inference of
regulatory pathways from biomedical literature
and experimental data.
Technological
developments in genomic and proteomic research have
led to an explosion of data available for biomedical
research. The sheer quantity of data generated by
high-throughput technologies such as DNA microarrays
has exceeded the capacity of traditional data analysis
techniques to extract useful information. Meanwhile,
the rapid accumulation of research publications makes it
difficult to keep abreast of new developments
in the area.
The research goal of Arizona BioPathway is to develop
novel machine learning and Natural Language Processing
(NLP) techniques that support efficient and effective
data and text analysis in biomedical fields, particularly
the analysis of genetic regulatory pathways, which are
crucial to biological processes such as gene regulation
and cancer development. Arizona BioPathway also aims
to create a framework for pathway-related knowledge
integration and visualization that combines several
approaches. The ultimate goal of Arizona
BioPathway is to give biomedical researchers
a platform for pathway-related literature abstraction,
data analysis, and knowledge integration that supports
the development of scientific hypotheses and the discovery
of new knowledge.
|
|
| Funding |
|
Funding for this research was received from
the following sources:
| 1 R33 LM07299-01 | 05/01/2002 - 04/30/2005 | National Institutes of Health/National Library of Medicine | $1,320,000 | GeneScene: A toolkit for gene pathway analysis |
| 1R01 LM06919-01A1 | 2/15/2001 - 2/14/2004 | National Institutes of Health/National Library of Medicine | $500,000 | UMLS Enhanced Dynamic Agents to Manage Medical Knowledge |
| IIS-9817473 | 5/1/99 - 4/30/2002 | National Science Foundation | $500,000 | DLI – Phase 2: High Performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management |
|
| Acknowledgement |
|
| Approach & Methodology |
|
|
The Arizona BioPathway research currently
focuses on the automatic extraction of
regulatory pathway relations from biomedical
literature using NLP techniques, the inference
of genetic networks from genomic data using
data mining approaches, and the integration
of existing knowledge with text and data mining
results on regulatory pathways using a
variety of biomedical ontologies.
The text mining
component of Arizona BioPathway is
designed to extract genetic regulatory
pathway relations from biomedical
literature. We have experimented
with two natural language processing
(NLP) approaches to extracting
pathway relations: shallow parsing
and full parsing. The shallow parser
uses templates based on closed-class
words (e.g., prepositions) to model
generic relations between noun
phrases, while the full parser
uses a broad-coverage
syntactic-semantic hybrid grammar
to identify grammatical verb relations.
To increase precision, both
approaches use relevant biomedical
lexicons such as Gene Ontology (GO),
HUGO Gene Nomenclature, and the
Specialist Lexicon of UMLS to filter
the extracted relations. We are also
studying various statistical learning
techniques for biomedical entity
recognition and relation extraction
from biomedical text.
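The idea of filling preposition-based templates and then filtering with a lexicon can be illustrated with a minimal sketch. This is not the project's parser: the preposition set, the one-token approximation of a noun phrase, and the function name `extract_relations` are all assumptions made for illustration.

```python
import re

# Hypothetical closed-class words used as relation templates.
PREPOSITIONS = {"by", "of", "in"}

def extract_relations(sentence, lexicon):
    """Return (left, preposition, right) triples found in `sentence`.

    Each noun phrase is approximated by the single token next to the
    preposition. A triple is kept only if its right-hand token appears
    in `lexicon` (a set of lowercase terms), mimicking a lexicon filter.
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    relations = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i].lower()
        if word in PREPOSITIONS and tokens[i + 1].lower() in lexicon:
            relations.append((tokens[i - 1], word, tokens[i + 1]))
    return relations

# "in cells" is dropped because "cells" is not in the toy lexicon.
print(extract_relations("Expression of p53 in cells", {"p53"}))
```

A real shallow parser would chunk multi-word noun phrases and consult resources such as GO or the UMLS Specialist Lexicon rather than a hand-written set.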
The data mining
component is designed to extract gene
regulatory relations from genomic and
proteomic data, including DNA microarray
data, using machine learning techniques such as
Bayesian networks. We are experimenting
with various techniques to learn regulatory
networks from microarray data, either
with existing prior knowledge or in
combination with other types of biological
experimental data, e.g., DNA methylation
arrays or protein expression. This so-called
joint learning approach promises
to learn the network more accurately,
avoiding the bias and incompleteness inherent
in any single type of data. Linkages
extracted from heterogeneous genomic
data sources provide different evidence
about gene functional relations. In a
recent study, we developed a Bayesian
framework for integrating relations
extracted from multiple sources, such
as gene expression, biomedical literature,
and genomic sequence information, into
a genome-wide functional network. In
addition, we are conducting studies on cancer
classification using gene array data.
We are adopting and developing various
feature selection techniques to identify
marker genes and their interactions for
cancer diagnosis and drug discovery.
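One common way to combine evidence from several sources in a Bayesian setting is to multiply a prior odds by one likelihood ratio per source. The sketch below assumes the sources are conditionally independent (a naive Bayes simplification that the text does not state), and the function name and numeric values are hypothetical.

```python
import math

def posterior_probability(prior, likelihood_ratios):
    """Combine independent evidence sources for one gene pair.

    prior: prior probability that the two genes are functionally related.
    likelihood_ratios: one value per source, each equal to
        P(evidence | related) / P(evidence | unrelated).
    Returns the posterior probability that the genes are related.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical ratios for expression, literature, and sequence evidence.
print(posterior_probability(0.1, [3.0, 2.0, 1.5]))
```

Working in log-odds keeps the computation stable when many sources are combined, and a ratio below 1 lets a source count as evidence against a relation.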
The knowledge integration
component leverages a variety of biomedical
ontologies and knowledge sources to form
an integrated framework for pathway-related
knowledge organization. We have developed
a feature decomposition approach to the
aggregation of extracted pathway relations
and resolution of the redundancy, ambiguity
and inconsistency among them, using existing
lexicons and ontologies such as Entrez Gene,
RefSeq, Homologene, MeSH, UMLS and GO.
Pathway relations extracted from text and
learned from data, as well as known relations
from existing knowledge sources, will
eventually be integrated into a
consolidated knowledge base.
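To make the aggregation step concrete, the sketch below merges relation triples whose entity names resolve to the same canonical form. It is a deliberately simplified stand-in, not the feature decomposition approach itself; the synonym table and function names are hypothetical, and a real system would draw on resources such as Entrez Gene or UMLS.

```python
from collections import Counter

# Hypothetical synonym table mapping name variants to one canonical name.
SYNONYMS = {"tp53": "p53", "p53 protein": "p53", "hdm2": "mdm2"}

def normalize(name):
    """Lowercase, collapse whitespace, and map known variants."""
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, key)

def aggregate(relations):
    """Merge (agent, verb, target) triples that name the same entities.

    Returns a dict mapping each normalized triple to the number of
    extracted relations that support it.
    """
    counts = Counter()
    for agent, verb, target in relations:
        counts[(normalize(agent), verb.lower(), normalize(target))] += 1
    return dict(counts)

extracted = [("MDM2", "inhibits", "TP53"), ("HDM2", "Inhibits", "p53  protein")]
print(aggregate(extracted))  # both triples collapse into one entry with count 2
```

The support count gives a simple handle on redundancy: relations reported many times under different surface names appear once, with their evidence pooled.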
All these pathway relations can be
combined to construct regulatory networks
and visualized with the automatic graph drawing
algorithms implemented in the Arizona
BioPathway Visualizer (see the demo).
|
| |
Testbed:
|
| |
Text mining
(PubMed, 2003) |
| |
- P53 – Text Collection:
Content: All abstracts with p53
or related genes in title or abstract
Abstracts: 20,360
Linguistic Parser Relations: 194,384
Co-occurrence Relations: 2,724,099
|
| |
- AP1 – Text Collection:
Content: All abstracts with ap1
or related genes in title or abstract
Abstracts: 23,339
Linguistic Parser Relations: 258,142
Co-occurrence Relations: 3,265,524
|
| |
- Yeast – Text Collection:
Content: All abstracts with yeast
in title or abstract
Abstracts: 66,197
Linguistic Parser Relations: 584,502
Co-occurrence Relations: 6,535,737
|
| |
- Arabidopsis – Text Collection:
Content: All abstracts with MeSH
terms of ‘Arabidopsis’ or
‘Arabidopsis Proteins’
Abstracts: 10,548
Linguistic Parser Relations: 222
Co-occurrence Relations: 1,291
|
| |
Data Mining |
| |
- P53 – Microarray Data:
Content: Gene expression measurement
of p53 mutant cell lines (provided
by AZCC)
Gene expression measurements:
33
Genes (Homo sapiens ORFs): 5,306
Genes with greatest variations:
200
|
| |
- Yeast – Microarray
Data:
Content: Microarray data of yeast
cell cycle (Spellman et al. 1998)
Gene expression measurements:
77
Time series: 6
Genes (S. cerevisiae ORFs): 6,177
Genes whose expression varied
over the different cell-cycle
stages: 800
|
| |
- Arabidopsis – Microarray Data:
Content: two high-quality microarray
series of Arabidopsis at
http://www.weigelworld.org
Gene expression measurements:
237 for development and 298
for abiotic stress
Genes (Arabidopsis): 22,810
|
| |
- Arabidopsis – Genome
sequence relations:
Content: gene relations extracted
from genome sequence using four
different methods in ProLink
(http://dip.doe-mbi.ucla.edu/pronav)
Relations:
Phylogenetic profiling (PP): 132,637
Rosetta Stone (RS): 989,795
Gene neighbor (GN): 18,823
Gene cluster (GC): 11,586
|
| |
- MDS – Microarray data:
Content: DNA methylation arrays
from Arizona Cancer Center, derived
from the epigenomic
analysis of bone marrow specimens
from healthy donors and individuals
with myelodysplastic syndrome (MDS).
Measurements: 55 (10 normal and 45 tumor samples)
Genes: 678
|
| |
- Ovarian Cancer – Microarray data
Content: microarray-based measurements
of DNA methylation from the Gynecologic
Oncology tumor bank at the University
of Iowa and made available through
the Arizona Cancer Center.
Measurements: 114 (25 normal and 89 tumor samples)
Genes: 6,560
|
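Several of the data sets above are reduced to their most variable genes before analysis, for example the 200 genes with the greatest variation in the p53 microarray data. The source does not say which measure of variation was used; the sketch below assumes population variance across measurements, and the function name and toy values are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expression, k):
    """Return the k gene names with the largest variance.

    expression: dict mapping a gene name to its list of expression
        measurements across samples.
    """
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:k]

toy = {"geneA": [1, 1, 1], "geneB": [0, 5, 10], "geneC": [1, 2, 3]}
print(top_variable_genes(toy, 2))  # geneB first, then geneC
```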
| |
Techniques: |
| |
- A shallow parser based on closed
class English words extracting noun
phrase relations
- A full parser using syntax-semantic
hybrid grammar extracting verb
relations
- Co-occurrence analysis based
on Concept Space, which generates
asymmetric relations between phrases
ordered according to the strength
of their relation
- Conditional Random Field (CRF)
methods for entity recognition
- Kernel-based learning methods
for relation extraction and
classification
- Feature decomposition for
entity and relation aggregation
- Bayesian Network frameworks
for integrating gene functional
relations from multiple data
sources
- Optimal search based feature
subset selection methods for
identifying marker genes for
cancer classification
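The asymmetric co-occurrence relations in the list above can be illustrated with a small sketch. It uses a simplified weight, the share of documents containing phrase A that also contain phrase B, rather than the actual Concept Space weighting; the function names and toy documents are hypothetical.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_weights(documents):
    """Compute asymmetric co-occurrence weights between phrases.

    documents: list of sets of phrases, one set per abstract.
    Returns a dict mapping (a, b) to the fraction of documents
    containing a that also contain b, so weight(a, b) can differ
    from weight(b, a).
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for phrases in documents:
        doc_freq.update(phrases)
        pair_freq.update(permutations(sorted(phrases), 2))
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}

def related_phrases(weights, phrase):
    """List (other_phrase, weight) pairs for `phrase`, strongest first."""
    pairs = [(b, w) for (a, b), w in weights.items() if a == phrase]
    return sorted(pairs, key=lambda item: (-item[1], item[0]))

docs = [{"p53", "mdm2"}, {"p53", "apoptosis"}, {"p53"}]
weights = cooccurrence_weights(docs)
print(related_phrases(weights, "mdm2"))  # mdm2 always appears with p53
print(related_phrases(weights, "p53"))   # p53 appears with mdm2 in 1 of 3 documents
```

The asymmetry is the point: a rare phrase that always appears alongside a common one is strongly tied to it, while the reverse link is weak.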
|
|
|
| Team Members |
|
| Publications |
|
| Text
Mining Publications and Presentations |
| |
- K. D. Quiñones, H. Su, B. Marshall, S. Eggers, and H. Chen. “User-centered evaluation of Arizona BioPathway: an information extraction, integration, and visualization system.” IEEE Transactions on Information Technology in Biomedicine, 11(5): 527-536, 2007.
- B. Marshall, H. Su, D. McDonald, S. Eggers, and H. Chen. "Aggregating Automatically Extracted Regulatory Pathway Relations." IEEE Transactions on Information Technology in Biomedicine, 10:100-108, 2006.
- B. Marshall, H. Su, D. McDonald, and H. Chen. “Linking ontological resources using aggregatable substance identifiers to organize extracted relations.” In Proceedings of Pacific Symposium on Biocomputing, pp. 162-173, 2005.
- G. Leroy, H. Chen. "GeneScene: An Ontology-Enhanced Integration of Linguistic and Co-Occurrence Based Relations in Biomedical Texts," Journal of The American Society for Information Science and Technology (JASIST), 56: 457-468, 2005.
- D. McDonald, H. Chen, H. Su, and B. Marshall. "Extracting Gene Pathway Relations Using a Hybrid Grammar: The Arizona Relation Parser," Bioinformatics 20:3370-3378, 2004.
- D.M. McDonald, H. Chen, G. Leroy, and H. Su. "Combining Ontologies and Grammatical Relations to Yield Diverse Semantic Relations from Biomedical Texts," Poster presentation at Pacific Symposium on Biocomputing, January 2004.
- G. Leroy, H. Chen, and J.D. Martinez. “A Shallow Parser Based on Closed-class Words to Capture Relations in Biomedical Text.” Journal of Biomedical Informatics (JBI) 36:145-158, 2003.
- G. Leroy, H. Chen, J. Martinez, S. Eggers, R. Falsey, K. Kislin, Z. Huang, J. Li, J. Xu, D. McDonald, and G. Ng. "GeneScene: Biomedical Text and Data Mining," Presented at the Third ACM and IEEE Joint Conference on Digital Libraries (JCDL-) May 27-31, 2003, Houston, Texas, 2003.
- G. Leroy and H. Chen. "Filling preposition-based templates to capture information for medical abstracts." In Proceedings of Pacific Symposium on Biocomputing, pp. 350-361, 2002.
|
| |
| Data
Mining Publications and Presentations |
| |
- J. Li, H. Su, H. Chen, and B. W. Futscher. “Optimal search-based gene subset selection from gene array data for cancer classification.” IEEE Transactions on Information Technology in Biomedicine, accepted, 2006.
- Z. Huang, J. Li, H. Su, G.S. Watts, and H. Chen. "Large-scale Regulatory Network Analysis From Microarray Data: Modified Bayesian Network Learning and Association Rule Mining." Decision Support Systems: Special Issue on Decision Support in Medicine, forthcoming, 2006.
- J. Li, X. Li, H. Su, H. Chen, and D. W. Galbraith, "A framework of integrating gene relations from heterogeneous data sources: an experiment on Arabidopsis thaliana." Bioinformatics, 22:2037-2043, 2006.
- Z. Huang, H. Su, and H. Chen. “Joint learning using multiple types of data and knowledge,” in H. Chen, S. Fuller, C. Friedman, and W. Hersh (Eds.), Medical informatics: knowledge management and data mining in biomedicine, Springer, pp. 593-624, 2005.
- Z. Huang, H. Chen, H. Su, B.B. Marshall, B.L. Smith, G.W. Watts, J.D. Martinez. “Learning Genetic Pathways Using Bayesian Networks and Qualitative Probabilistic Networks,” Poster presentation at Pacific Symposium on Biocomputing, January 2005.
- Z. Huang, H. Chen, H. Su, B.B. Marshall, B.L. Smith, G.W. Watts, J.D. Martinez. “Learning Genetic Pathways Using Bayesian Networks and Qualitative Probabilistic Networks,” Poster presentation at Pacific Symposium on Biocomputing, January 2004.
|