OOHAY: Visualizing the Web
As digital library applications grow in volume,
urgency, and diversity, several
well-known information retrieval (IR) problems
have become even more urgent in this network-centric
information age. The conventional approaches
to addressing information overload and interoperability
problems are manual in nature, requiring human
experts as information intermediaries to create
knowledge structures and/or classification
systems (e.g., the National Library of Medicine's
Unified Medical Language System, UMLS) to
bridge vocabulary differences.
As information content and collections become
even larger and more dynamic, we believe a
system-aided, algorithmic, bottom-up approach
to creating large-scale digital library classification
systems is needed. This raises two research questions:
(1) Can various clustering algorithms produce
classification results comparable to classification
systems generated by human beings? Which algorithm
produces the best result and under what condition?
(2) Is it computationally feasible to use these
clustering algorithms to create classification systems
from large-scale digital library collections?
What optimization and parallelization techniques
are needed to achieve such scalability?
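
One simple way to make the comparison in question (1) concrete is to score
how well automatically generated clusters agree with categories assigned by
human experts. The sketch below computes cluster purity, which is an assumed
example measure rather than the evaluation method of this project; the
function name `cluster_purity` and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, human_labels):
    """Fraction of documents that carry the majority human label of their cluster.

    cluster_ids[k] is the automatic cluster of document k and human_labels[k]
    is its manually assigned category (both are illustrative inputs).
    """
    if len(cluster_ids) != len(human_labels):
        raise ValueError("both sequences must describe the same documents")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Group the human labels by automatic cluster.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, human_labels):
        members[cluster].append(label)

    # Count, per cluster, the documents that share its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)


# Four documents, two automatic clusters, human categories "a" and "b".
score = cluster_purity([0, 0, 1, 1], ["a", "a", "b", "a"])
```

A purity of 1.0 means every cluster contains documents from a single human
category. Purity alone rewards producing many small clusters, so in practice
it would be read alongside a complementary measure.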
The proposed research aims to develop an architecture
and the associated techniques needed to automatically
generate classification systems from large
textual collections and to unify them with
manually created classification systems to
assist in effective digital library retrieval
and analysis. The project will include both algorithmic
development and user evaluation in several sample
domains. Scalable
automatic clustering methods, including Ward's
clustering, multi-dimensional scaling, latent
semantic indexing, and self-organizing maps,
will be developed and compared. Most of these
algorithms, which are computationally intensive,
will be optimized based on the sparsity of
common keywords in textual document representations.
Using parallel, high-performance platforms
as a time machine for simulation, we plan
to parallelize and benchmark the above clustering
algorithms for large-scale collections (on
the order of millions of documents) in several
domains. Results of these automatic classification
systems will be represented using several
novel hierarchical display methods.
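
To illustrate how keyword sparsity can reduce the cost of such algorithms,
the minimal sketch below computes pairwise document similarities through an
inverted index, so that only documents sharing at least one keyword are ever
compared. The dictionary-based document representation, the function name
`sparse_cosine_pairs`, and the sample keywords are assumptions made for
illustration, not the optimization actually used in this project.

```python
from collections import defaultdict
from math import sqrt


def sparse_cosine_pairs(docs):
    """Cosine similarity for document pairs that share at least one keyword.

    docs is a list of {keyword: weight} dicts (an assumed representation).
    Returns {(i, j): similarity} with i < j; pairs with no shared keyword
    are never visited and are simply absent from the result.
    """
    # Inverted index: keyword -> [(doc_id, weight), ...], zero weights skipped.
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            if weight != 0:
                index[term].append((doc_id, weight))

    # Vector length of each document, needed to normalize the dot products.
    norms = [sqrt(sum(w * w for w in vec.values())) for vec in docs]

    # Accumulate dot products one postings list at a time.
    dots = defaultdict(float)
    for postings in index.values():
        for a, (i, wi) in enumerate(postings):
            for j, wj in postings[a + 1:]:
                dots[(i, j)] += wi * wj

    return {(i, j): dot / (norms[i] * norms[j]) for (i, j), dot in dots.items()}


docs = [
    {"cancer": 1.0, "therapy": 1.0},
    {"cancer": 1.0, "gene": 1.0},
    {"seismic": 1.0, "basin": 1.0},
]
# Only documents 0 and 1 share a keyword, so only that pair is scored.
pairs = sparse_cosine_pairs(docs)
```

Because most document pairs in a large textual collection share few or no
keywords, the work in this scheme grows with the lengths of the postings
lists rather than with the square of the collection size.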
The research testbed will include three
application domains that consist of both large-scale
collections and existing classification systems:
(1) medicine: CancerLit (700,000 cancer abstracts)
and the NLM's UMLS (500,000 medical concepts),
(2) geoscience: GeoRef and Petroleum Abstracts
(800,000 abstracts) and the GeoRef thesaurus (26,000
geoscience terms), and (3) Web application:
a WWW collection (1.5M web pages) and the
Yahoo! classification (20,000 categories).
Medical subjects, geoscientists, and WWW
search engine users will be used in our evaluation