- To develop text mining and data mining techniques to support automated extraction and inference of regulatory pathways from biomedical literature and experimental data.
|1 R33 LM07299-01||05/01/2002 - 04/30/2005|
|National Institutes of Health/National Library of Medicine||$1,320,000|
|GeneScene: A toolkit for gene pathway analysis|
|1R01 LM06919-01A1||2/15/2001 - 2/14/2004|
|National Institutes of Health/National Library of Medicine||$500,000|
|UMLS Enhanced Dynamic Agents to Manage Medical Knowledge|
|IIS-9817473||5/1/99 - 4/31/2002|
|National Science Foundation||$500,000|
|DLI – Phase 2: High Performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management|
- Arizona Cancer Center researchers, staff, and students for providing genomic data and helping with user evaluation of our applications.
- School of Plant Sciences, University of Arizona for providing domain expertise in evaluation of our applications.
- Arizona Health Sciences Library for their support and assistance.
- National Library of Medicine for providing Unified Medical Language System (UMLS).
- Text mining (PubMed, 2003)
- P53 - Text Collection:
Content: All abstracts with p53 or related genes in the title or abstract
Linguistic Parser Relations: 194,384
Co-occurrence Relations: 2,724,099
- AP1 - Text Collection:
Content: All abstracts with ap1 or related genes in the title or abstract
Linguistic Parser Relations: 258,142
Co-occurrence Relations: 3,265,524
- Yeast - Text Collection:
Content: All abstracts with yeast in the title or abstract
Linguistic Parser Relations: 584,502
Co-occurrence Relations: 6,535,737
- Arabidopsis - Text Collection:
Content: All abstracts with MeSH terms of ‘Arabidopsis’ or ‘Arabidopsis Proteins’
Linguistic Parser Relations: 222
Co-occurrence Relations: 1,291
- Data Mining
- P53 – Microarray Data:
Content: Gene expression measurement of p53 mutant cell lines (provided by AZCC)
Gene expression measurements: 33
Genes (Homo sapiens ORFs): 5,306
Genes with greatest variations: 200
- Yeast – Microarray Data:
Content: Microarray data of yeast cell cycle (Spellman et al. 1998)
Gene expression measurements: 77
Time series: 6
Genes (S. cerevisiae ORFs): 6,177
Genes whose expression varied over the different cell-cycle stages: 800
- Arabidopsis – Microarray Data:
Content: Two high-quality microarray series of Arabidopsis at http://www.weigelworld.org
Gene expression measurements: 237 for development and 298 for abiotic stress
Genes (Arabidopsis): 22,810
- Arabidopsis – Genome sequence relations:
Content: gene relations extracted from genome sequence using four different methods in ProLink (http://dip.doe-mbi.ucla.edu/pronav)
Phylogenetic profiling (PP): 132,637
Rosetta Stone (RS): 989,795
Gene neighbor (GN): 18,823
Gene cluster (GC): 11,586
- MDS – Microarray Data:
Content: DNA methylation arrays from the Arizona Cancer Center, derived from the epigenomic analysis of bone marrow specimens from healthy donors and individuals with myelodysplastic syndrome (MDS).
Measurements: 55 (10 normal and 45 tumor samples)
- Ovarian Cancer – Microarray Data:
Content: Microarray-based measurements of DNA methylation from the Gynecologic Oncology tumor bank at the University of Iowa, made available through the Arizona Cancer Center.
Measurements: 114 (25 normal and 89 tumor samples)
- A shallow parser based on closed-class English words that extracts noun phrase relations
- A full parser using a hybrid syntactic-semantic grammar that extracts verb relations
- Co-occurrence analysis based on Concept Space, which generates asymmetric relations between phrases ordered according to the strength of their relation
- Conditional Random Field (CRF) methods for entity recognition
- Kernel-based learning methods for relation extraction and classification
- Feature decomposition for entity and relation aggregation
- Bayesian Network frameworks for integrating gene functional relations from multiple data sources
- Optimal search-based feature subset selection methods for identifying marker genes for cancer classification
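To illustrate what an asymmetric co-occurrence relation looks like, here is a minimal Python sketch. It is a simplified stand-in, not the Concept Space weighting used in the project: it assumes the weight from phrase A to phrase B is the fraction of documents containing A that also contain B. The function names and the toy documents are hypothetical and are not drawn from the collections listed above.

```python
from collections import Counter
from itertools import permutations


def asymmetric_cooccurrence(docs):
    """Map each ordered pair (a, b) to docs containing both / docs containing a.

    Simplified assumption: a plain document-frequency ratio, not the
    Concept Space formula.
    """
    doc_freq = Counter()   # number of documents containing each phrase
    pair_freq = Counter()  # number of documents containing both phrases
    for phrases in docs:
        unique = set(phrases)
        doc_freq.update(unique)
        # permutations yields both (a, b) and (b, a) for every pair
        pair_freq.update(permutations(sorted(unique), 2))
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}


def ranked_relations(weights, phrase):
    """List phrases related to `phrase`, strongest relation first."""
    related = [(b, w) for (a, b), w in weights.items() if a == phrase]
    return sorted(related, key=lambda item: (-item[1], item[0]))


# Toy documents (made up for illustration), each a list of extracted phrases.
docs = [
    ["p53", "mdm2", "apoptosis"],
    ["p53", "mdm2"],
    ["p53", "apoptosis"],
    ["p53"],
    ["mdm2", "p53"],
]

weights = asymmetric_cooccurrence(docs)
print(weights[("mdm2", "p53")])   # weight of the relation mdm2 -> p53
print(weights[("p53", "mdm2")])   # weight of the relation p53 -> mdm2
print(ranked_relations(weights, "p53"))
```

In this toy data every document that mentions mdm2 also mentions p53, but not the reverse, so the two directions receive different weights. That difference is what makes the relation asymmetric and lets related phrases be ordered by strength.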
|Dr. Hsinchun Chen|
|Dr. Zhu Zhang|
|Dr. Jesse Martinez|
|Yulei Zhang (Gavin)|
Text Mining Publications and Presentations
- N. Suakkaphong, Z. Zhang, and H. Chen, “Disease Named Entity Recognition using Semisupervised Learning and Conditional Random Fields,” Journal of the American Society for Information Science and Technology, Volume 62, Number 4, Pages 727-737, 2011.
- K. D. Quiñones, H. Su, B. Marshall, S. Eggers, and H. Chen. “User-centered evaluation of Arizona BioPathway: an information extraction, integration, and visualization system.” IEEE Transactions on Information Technology in Biomedicine, 11(5): 527-536, 2007.
- B. Marshall, H. Su, D. McDonald, S. Eggers, and H. Chen. "Aggregating Automatically Extracted Regulatory Pathway Relations." IEEE Transactions on Information Technology in Biomedicine, 10:100-108, 2006.
- B. Marshall, H. Su, D. McDonald, and H. Chen. “Linking ontological resources using aggregatable substance identifiers to organize extracted relations.” In Proceedings of Pacific Symposium on Biocomputing, pp. 162-173, 2005.
- G. Leroy, H. Chen. "GeneScene: An Ontology-Enhanced Integration of Linguistic and Co-Occurrence Based Relations in Biomedical Texts," Journal of The American Society for Information Science and Technology (JASIST), 56: 457-468, 2005.
- D. McDonald, H. Chen, H. Su, and B. Marshall. "Extracting Gene Pathway Relations Using a Hybrid Grammar: The Arizona Relation Parser," Bioinformatics 20:3370-3378, 2004.
- D.M. McDonald, H. Chen, G. Leroy, and H. Su. "Combining Ontologies and Grammatical Relations to Yield Diverse Semantic Relations from Biomedical Texts," Poster presentation at Pacific Symposium on Biocomputing, January 2004.
- G. Leroy, H. Chen, and J.D. Martinez. “A Shallow Parser Based on Closed-class Words to Capture Relations in Biomedical Text.” Journal of Biomedical Informatics (JBI) 36:145-158, 2003.
- G. Leroy, H. Chen, J. Martinez, S. Eggers, R. Falsey, K. Kislin, Z. Huang, J. Li, J. Xu, D. McDonald, and G. Ng. "GeneScene: Biomedical Text and Data Mining" Presented at the Third ACM and IEEE Joint Conference on Digital Libraries (JCDL-) May 27-31, 2003, Houston, Texas, 2003.
- G. Leroy and H. Chen. "Filling preposition-based templates to capture information for medical abstracts." In Proceedings of Pacific Symposium on Biocomputing, pp. 350-361, 2002.
Data Mining Publications and Presentations
- J. Li, H. Su, H. Chen, and B. W. Futscher. “Optimal search-based gene subset selection from gene array data for cancer classification.” IEEE Transactions on Information Technology in Biomedicine, accepted, 2006.
- Z. Huang, J. Li, H. Su, G. S. Watts, H. Chen. "Large-scale regulatory network analysis from microarray data: modified Bayesian Network learning and association rule mining." Decision Support Systems: Special Issue on Decision Support in Medicine, forthcoming, 2006.
- J. Li, X. Li, H. Su, H. Chen, and D. W. Galbraith, "A framework of integrating gene relations from heterogeneous data sources: an experiment on Arabidopsis thaliana." Bioinformatics, 22:2037-2043, 2006.
- Z. Huang, H. Su, H. Chen. “Joint learning using multiple types of data and knowledge,” in H. Chen, S. Fuller, C. Friedman, and W. Hersh (Eds.), Medical Informatics: Knowledge Management and Data Mining in Biomedicine, Springer, pp. 593-624, 2005.
- Z. Huang, H. Chen, H. Su, B. Marshall, B. L. Smith, G. W. Watts, J. D. Martinez. “Learning Genetic Pathways Using Bayesian Networks and Qualitative Probabilistic Networks,” Poster presentation at Pacific Symposium on Biocomputing, January 2005.
- Z. Huang, H. Chen, H. Su, B. Marshall, B. L. Smith, G. W. Watts, J. D. Martinez. “Learning Genetic Pathways Using Bayesian Networks and Qualitative Probabilistic Networks,” Poster presentation at Pacific Symposium on Biocomputing, January 2004.