As digital library applications become more overwhelming, pressing, and diverse, several well-known information retrieval (IR) problems have become even more urgent in this network-centric information age. The conventional approaches to addressing information overload and interoperability problems are manual in nature: human experts act as information intermediaries, creating knowledge structures and/or classification systems (e.g., the National Library of Medicine's Unified Medical Language System, UMLS) to bridge vocabulary differences. As information content and collections become even larger and more dynamic, we believe a system-aided, algorithmic, bottom-up approach to creating large-scale digital library classification systems is needed. This research addresses two questions:
(1) Can various clustering algorithms produce classification results comparable to classification systems generated by human beings? Which algorithm produces the best results, and under what conditions?
(2) Is it computationally feasible to use these clustering algorithms to create classification systems from large-scale digital library collections? What optimization and parallelization techniques are needed to achieve such scalability?
The proposed research aims to develop an architecture and the associated techniques needed to automatically generate classification systems from large textual collections and to unify them with manually created classification systems to assist in effective digital library retrieval and analysis. The project will include both algorithmic development and user evaluation in several sample domains. Scalable automatic clustering methods, including Ward's clustering, multi-dimensional scaling, latent semantic indexing, and self-organizing maps, will be developed and compared. Most of these algorithms are computationally intensive and will be optimized by exploiting the sparsity of common keywords in textual document representations. Using parallel, high-performance platforms as a time machine for simulation, we plan to parallelize and benchmark the above clustering algorithms for large-scale collections (on the order of millions of documents) in several domains. The resulting automatic classification systems will be presented using several novel hierarchical display methods.
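To illustrate the kind of sparsity-based optimization meant here, the following is a minimal sketch and not the project's actual implementation. It assumes each document is represented as a small keyword-to-weight mapping and uses an inverted index so that similarity is computed only for document pairs that share at least one keyword. The function name `sparse_cosine_pairs` and the sample documents are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from math import sqrt


def sparse_cosine_pairs(docs):
    """Return {(i, j): cosine} for i < j, only for pairs sharing a keyword.

    docs is a list of dicts mapping keyword -> weight (an assumed,
    illustrative document representation).
    """
    # Inverted index: keyword -> list of (document id, weight).
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))

    # Vector length of each document, needed for cosine normalization.
    norms = [sqrt(sum(w * w for w in vec.values())) for vec in docs]

    # Accumulate dot products keyword by keyword. Pairs with no keyword
    # in common are never visited, which is where the savings come from.
    dots = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            i, wi = postings[a]
            for b in range(a + 1, len(postings)):
                j, wj = postings[b]
                dots[(i, j)] += wi * wj

    return {
        (i, j): dot / (norms[i] * norms[j])
        for (i, j), dot in dots.items()
        if norms[i] > 0 and norms[j] > 0
    }


# Hypothetical toy collection: two medical documents, one geoscience document.
example_docs = [
    {"tumor": 1.0, "therapy": 1.0},
    {"tumor": 1.0, "gene": 1.0},
    {"basin": 1.0, "shale": 1.0},
]
example_sims = sparse_cosine_pairs(example_docs)
```

In this toy collection only the first two documents share a keyword, so only that one pair is scored; the third document is never compared with the others. A clustering method such as Ward's could then consume these sparse similarities instead of a dense all-pairs matrix, though how the project would integrate the two is not specified here.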
The research testbed will include three application domains, each consisting of both large-scale collections and existing classification systems: (1) medicine: CancerLit (700,000 cancer abstracts) and the NLM's UMLS (500,000 medical concepts); (2) geoscience: GeoRef and Petroleum Abstracts (800,000 abstracts) and the GeoRef thesaurus (26,000 geoscience terms); and (3) Web application: a WWW collection (1.5M web pages) and the Yahoo! classification (20,000 categories). Medical subjects, geoscientists, and WWW search engine users will participate in our evaluation.