Online ISSN: 1097-4571    Print ISSN: 0002-8231
Journal of the American Society for Information Science
Volume 51, Issue 4, 2000. Pages: 352-370

(Special Issue: Digital Libraries: Part 2. Issue Edited by Hsinchun Chen.)
Published Online: 11 Feb 2000

Copyright © 2000 John Wiley & Sons, Inc.


 Research Article
Comparing noun phrasing techniques for use with medical digital library tools
Kristin M. Tolle, Hsinchun Chen
Management Information Systems Department, University of Arizona, Tucson, AZ 85721
email: Kristin M. Tolle (ktolle@bpa.arizona.edu)

Funded by:
 NSF/ARPA/NASA Digital Library Initiative; Grant Number: IRI-9411318
 NSF CISE; Grant Number: IRI-9525790
 National Computational Science Alliance (NCSA); Grant Number: IRI970000N, IRI970002N
 National Library of Medicine (NLM)
 National Cancer Institute
 National Institutes of Health

Abstract
Abstract 1. Introduction 2. Background 3. Experiment Details and Metrics 4. Results and Analysis References
In an effort to assist medical researchers and professionals in accessing information necessary for their work, the AI Lab at the University of Arizona is investigating the use of a natural language processing (NLP) technique called noun phrasing. The goal of this research is to determine whether noun phrasing could be a viable technique to include in medical information retrieval applications. Four noun phrase generation tools were evaluated as to their ability to isolate noun phrases from medical journal abstracts. Tests were conducted using the National Cancer Institute's CANCERLIT database. The NLP tools evaluated were Massachusetts Institute of Technology's (MIT's) Chopper, the University of Arizona's Automatic Indexer, Lingsoft's NPtool, and the University of Arizona's AZ Noun Phraser. In addition, the National Library of Medicine's SPECIALIST Lexicon was incorporated into two versions of the AZ Noun Phraser to be evaluated against the other tools as well as a nonaugmented version of the AZ Noun Phraser. Using the metrics relative subject recall and precision, our results show that, with the exception of Chopper, the phrasing tools were fairly comparable in recall and precision. It was also shown that augmenting the AZ Noun Phraser by including the SPECIALIST Lexicon from the National Library of Medicine resulted in improved recall and precision.


Digital Object Identifier (DOI)

10.1002/(SICI)1097-4571(2000)51:4<352::AID-ASI5>3.0.CO;2-8

Article Text

1. Introduction

The explosion of medical information accessible via the Internet has created a growing need for development of a cohesive online medical digital library. Much of the work being done in this area has been focused on the development of tools to index and provide access to the increasing number of online medical data collections (Houston, Chen, Hubbard et al., [1999a]). The proliferation of online information and the diversity of interfaces to data collections have led to a medical information gap. Users who need access to such information must visit a variety of sources, which can be excessively time consuming and potentially dangerous if the information is urgently needed for treatment decisions. In addition, information generated by using existing search engines often may be too general or inaccurate. Particularly frustrating is that simple queries can result in an excessive number of documents retrieved - too many to search through to determine which are and which are not relevant.

The potential beneficiaries of research to improve interfaces to medical digital libraries are many. First, there are information providers - the institutions that provide online medical information and bear the cost of its maintenance and availability. There also are medical researchers and health care providers, who can benefit from more ready access to critical information that will result from reduced search time and retrieval of more relevant information. Finally, patients and their families will gain from having more knowledgeable service providers and by being able to conduct searches for information on their own. Ultimately we hope to develop a more effective way for medical professionals, librarians, and researchers to retrieve information over the Internet.

At the University of Arizona, the AI Lab Medical group's goal is to develop tools that can improve the capabilities of our existing medical digital library interfaces (Cancer Space can be found at http://ai.bpa.arizona.edu/ and is shown in Figure 1). Our approach is to combine existing AI Lab techniques for concept extraction (Concept Space) (Chen, Martinez, Kirchhoff, Ng, & Schatz, [1998a]; Chen, Schatz, Ng, Martinez, Kirchhoff, & Lin, [1996a]) and data visualization (Kohonen Self-Organizing Maps, or SOMs) (Chen, Schuffels, & Orwig, [1996b]) with an NLP technique called noun phrasing. An example of a dynamic SOM can be seen in Figure 2.

Figure 1. Concept space for CANCERLIT.

Figure 2. Java SOM categorization map using noun phrases to scrutinize topics in 300 CANCERLIT documents.

Our experiment was designed to test an NLP tool called the Arizona Noun Phraser that we developed in order to improve key phrase extraction from medical text. Previously the AI Lab used a technique called automatic indexing (Chen et al., [1998a]). We compared the AZ Noun Phraser's performance with those of other phrase generation techniques: AI Lab automatic indexing, MIT's Chopper, and LingSoft's NPtool. We also compared different versions of the AZ Noun Phraser; versions with a standard set of corpus-based dictionaries and versions which combined these dictionaries with use of the SPECIALIST Lexicon from the National Library of Medicine.

The article is organized as follows: Section 2 gives a general background on digital libraries and the groups that need a medical digital library, and discusses the techniques employed. Section 3 describes our experimental design, our research questions, and the metrics used for evaluation. Section 4 discusses the results of our experiment. Finally, we present conclusions and discuss plans for future research.

2. Background

Fox and Marchionini ([1998]) describe information as a basic human need, suggesting that civilization advances when people can apply the right information at the right time. They also suggest that substantial progress will be necessary if the world's many digital libraries are to be linked. The AI Lab is focusing its attention on development of tools to lend support to the advancements needed in this area, from information retrieval to user-interface development (Chen et al., [1998a]; Ramsey, Chen, Zhu, & Schatz, [1999]).

Digital libraries focus on interactions between information producers (authors, publishing companies, government organizations), librarians (information locators, indexers, and filterers), and information seekers (Adam & Yesha, [1996]; Lynch & Garcia-Molina, [1995]). In the case of medical information, diversity and overlap among these different groups generates a very strong need for the creation of a medical digital library.

Much textual medical information is captured, stored, and made available through a set of public and private institutions representing the information providers. These institutions often distinguish themselves by providing a subset of the available medical document collections specified by a particular domain. MEDLINE from the National Library of Medicine (NLM) provides access to general medical journals whereas CANCERLIT, maintained by the National Cancer Institute (NCI), contains cancer-related journals and symposia. Lost in these divided collections are interrelationships among the different domains of medical information. For instance, articles on chlorine, a common carcinogen, appear in toxicology and general medicine as well as cancer-related journals.

The cost of storing and updating document information is borne by the institutions that provide these services; access to documents is accomplished through information retrieval interfaces with the document databases. Because an organization's cost for accessing a document is far more than the cost of filing it (Halverson, [1995]), it is important to provide users with efficient retrieval applications.

Information producers are mainly medical researchers and practitioners, and it is estimated that they produce more than 360,000 medical journal articles each year (Detmer & Shortliffe, [1997]). In addition to proliferating journal publications and conference proceedings, government reports are also generated by a variety of sources, including information providers.

Considerably more diverse are information seekers, who include information producers and, increasingly, public consumers. Although the World Wide Web (WWW) contains a vast amount of medical information, information seekers are often frustrated by the difficulty of finding relevant information quickly (Detmer & Shortliffe, [1997]). Users who wish to access information from a variety of medical sources may have to learn several different information retrieval systems and several different indexing vocabularies in order to locate relevant information (Houston, Chen, Schatz, Hubbard, Sewell, & Ng, [1999b]), there being no consistent means of access to medical literature databases. An additional problem is that retrieved information is often not specific enough to be considered useful. Many search tool interfaces do not allow a sufficient level of specificity because they rely on simple keyword searching and ambiguous query formation. As a result, many of the documents retrieved are only slightly related to the searcher's topic. Also, interfaces are inconsistent, both in query formation and in the type of medical information available. Some applications provide only reference information (i.e., title, author, publication information), while others contain abstracts and/or full text documents. These obstacles make it difficult for medical professionals to gather information electronically in an efficient, well-informed manner.

2.1. Keyword Generation and Searching

Most academic papers are accompanied by a set of keywords and/or thesaurus headings, usually listed at the beginning of the document. These keywords or phrases are intended to represent the dominant topics discussed in the article and, in general, are either designated by the author or generated by domain experts. Since they are often used for document retrieval, they must be highly representative of a document's content. Frequently this is not the case.

Keyword assignment by human indexers relies entirely on the indexer's expertise in both the subject domain and in using the standard subject domain thesaurus. This method works well in narrow topic domains having a limited number of documents. However, manual indexing becomes extremely difficult as the number of documents increases and/or the documents are from varied domains (Houston et al., [1999b]).

The effectiveness of keyword-based retrieval systems is limited. Bates ([1986]) found that different indexers, well trained in an indexing scheme, often assigned different index phrases to the same document. Indexers may also use different phrases for the same document at different times. Meanwhile, different users also use different phrases to retrieve the same information. Furnas, Landauer, Gomez, and Dumais ([1987]), showed that the probability of two people using the same term to classify an object was less than 20%.

Manually indexing documents can be difficult and time consuming. A method called automatic indexing has been used to extract concepts from textual data (Salton, Wong, & Yang, [1975]) and, according to Salton ([1986]), its effectiveness in information retrieval systems compares favorably with that of human indexers. However, most automatic indexing programs can perform only keyword indexing - fetching keywords physically present in a document.

2.2. Automatic Thesaurus Generation

A common method of addressing keyword information retrieval problems is to use either a thesaurus or a vector space representation based on the pioneering work of Salton and colleagues ([1975]). Thesauri are mainly used to expand users' queries so that query terms can be translated into alternative phrases that match document indexes. These automatically generated indexes contain phrases that appear in a document collection. Srinivasan ([1996]) evaluated different query expansion strategies using the MEDLINE test collection to demonstrate the viability of this technique in the medical domain.

Virtually all automatically generated thesauri are based on syntactic analysis using statistical co-occurrence of word types in text and vector space representation of the documents (Salton, [1989]). Most have also incorporated other statistical techniques such as cluster analysis, co-occurrence efficiency analysis, and factor analysis. These techniques employ mathematical matrices to represent relationships between documents, between phrases and documents, and between phrases and phrases.
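
As a rough illustration of the matrix representations mentioned above, the following Python sketch builds a phrase-by-document incidence matrix and derives a phrase-by-phrase co-occurrence matrix from it. The function names and the three toy documents are assumptions made for this example; they are not taken from any of the systems cited here, and real thesaurus generators apply further weighting and normalization.

```python
def incidence_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is 1 if phrase i occurs in document j."""
    vocab = sorted({phrase for doc in docs for phrase in doc})
    doc_sets = [set(doc) for doc in docs]
    matrix = [[1 if phrase in doc else 0 for doc in doc_sets] for phrase in vocab]
    return vocab, matrix


def cooccurrence_matrix(matrix):
    """Multiply the incidence matrix by its transpose.

    Entry [i][j] is the number of documents containing both phrase i and
    phrase j; the diagonal holds each phrase's document frequency.
    """
    n = len(matrix)
    return [
        [sum(x * y for x, y in zip(matrix[i], matrix[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy collection: each document is a list of index phrases.
docs = [
    ["breast cancer", "tamoxifen", "clinical trial"],
    ["breast cancer", "tamoxifen"],
    ["lung cancer", "clinical trial"],
]
vocab, matrix = incidence_matrix(docs)
cooc = cooccurrence_matrix(matrix)
```

Here `cooc` relates phrases to phrases, while `matrix` relates phrases to documents; a document-by-document matrix could be obtained the same way by multiplying in the other order.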

As the availability of computer processing and storage power has increased, research into the automatic generation of thesauri has also increased. Chen, Schatz, Ng, and Yang ([1999]) discuss experimentation using these techniques on parallel supercomputers to address the scalability issues of large-scale information retrieval. Automatically generated neural-like thesauri or concept spaces have generated a great deal of interest (Gallant, [1988]). In a neural knowledge base, concepts or phrases are represented as nodes and their relationships as weighted links. The associative memory feature of this thesaurus type has created a new paradigm for knowledge discovery and document searching using spreading activation algorithms such as the Hopfield net (Chen, Zhang, & Houston, [1998b]).
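
The idea of concepts as nodes joined by weighted links can be sketched as follows. This is a deliberately simplified spreading-activation loop, not the Hopfield net algorithm used in the cited work: the `decay` and `threshold` parameters, the max-based update, and the example links are all assumptions chosen for illustration.

```python
def spread_activation(links, seeds, iterations=3, decay=0.5, threshold=0.1):
    """Propagate activation from seed concepts along weighted links.

    links maps a concept to {neighbor: weight}. On each pass, a node passes
    level * weight * decay to each neighbor, and a neighbor keeps the largest
    activation it has received so far. Nodes below threshold are dropped
    from the final result.
    """
    activation = {node: 1.0 for node in seeds}
    for _ in range(iterations):
        updated = dict(activation)
        for node, level in activation.items():
            for neighbor, weight in links.get(node, {}).items():
                incoming = level * weight * decay
                if incoming > updated.get(neighbor, 0.0):
                    updated[neighbor] = incoming
        activation = updated
    return {node: level for node, level in activation.items() if level >= threshold}


# Hypothetical concept links with made-up weights.
links = {
    "breast cancer": {"tamoxifen": 0.8, "mammography": 0.6},
    "tamoxifen": {"estrogen receptor": 0.9},
}
related = spread_activation(links, ["breast cancer"])
```

Starting from a single query concept, the loop reaches "estrogen receptor" only through "tamoxifen", which is the kind of indirect association that term suggestion over such a network relies on.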

Most statistical methods have concentrated on solving the synonymy problem by adding more associative phrases to keyword indexes. A major disadvantage of this technique is that noise is introduced because some of the added phrases have meanings that are different from those intended. This can result in the rapid degradation of precision in information retrieval (Deerwester, Dumais, Furnas, Landauer, & Harshman, [1990]).

Document recall improvements on the order of 10% to 20% have been demonstrated when a thesaurus is used in an environment similar to that in which the original thesaurus was constructed (Crouch, [1990]). However, Cimino, Johnson, Peng, and Aguirre ([1994]) documented that problems are associated with automated translation of medical information using thesauri.

2.3. Other Approaches

Another way to address the vocabulary problem is to index documents semantically so users can conduct a search based on conceptual meanings rather than keywords. These methods attempt to create spaces in which documents are placed according to their meanings, not simply by document keywords. The most representative methods of the multi-dimensional semantic space techniques are Metric Similarity Modeling (MSM) and Latent Semantic Indexing (LSI) (Dumais, [1994]).

Another approach to semantic indexing has been to use natural language processing techniques to analyze sentences and word context to interpret content (Lewis & Sparck-Jones, [1996]). CITE (Doszkocs, [1983]) and AQUA (Johnson, Aguirre, Peng, & Cimino, [1994]) are two examples of the use of natural language parsers in medical information access.

2.4. Natural Language Processing

The object of Natural Language Processing (NLP) is to make the computer a fluent user of ordinary (human) language. In order to make this possible, some type of computational or conceptual analysis must be performed to form a meaningful structure directly from input sentences. A variety of research disciplines are involved in the successful development of NLP systems.

The fundamental aim of linguistic theory is to show how larger units of meaning arise out of the combination of smaller ones. Computational linguistics tries to implement this process efficiently by subdividing the task into syntax and semantics. Syntax describes the formal elements of a textual unit (typically a sentence). Semantics describes how an interpretation is calculated (Zaenen & Uszkoreit, [1995]). The mapping of words into meaning representation is driven by whatever morphological, syntactic, semantic, and contextual cues are available in a textual unit (Cullingford, [1986]). Barriers to correctly assigning meaning to words range from inconsistent user input to word-form ambiguity.

2.4.1. Lexical analysis

The fundamental problem of lexical analysis is determining how to provide, fully and adequately, lexical knowledge to NLP systems. The answer, toward which the computational linguistic community is converging today, is to extract the lexicon from the texts themselves (Boguraev & Pustejovski, [1996]). A significant bottleneck in this approach occurs because it is necessary to populate a lexicon with entries for tens to hundreds of thousands of words. Natural Language Processing systems typically have a limited application range that is constrained by their small number of lexical entries. A further problem concerns the inability of most experimental systems to scale up laboratory prototypes or application programs, since these usually have been tuned for particular domains and primarily contain lexical entries specific to a particular system in its current state (Boguraev & Pustejovski, [1996]).

2.4.2. Machine-readable dictionaries and text corpora

There are two types of text resources for lexical data acquisition: machine-readable dictionaries (MRDs) and text corpora. Dictionaries are, by definition, repositories of lexical information. Text corpora reflect language as it is used and are employed by studying regularities of use and patterns of word behavior, which only emerge from analysis of very large samples of text (Boguraev & Pustejovski, [1996]).

Both MRDs and text corpora have drawbacks. A computational lexicon derived from MRD sources tends to be incomplete, both with respect to coverage (words) and content (lexical properties). Such a lexicon leaves much to be desired regarding data consistency and organization. Organizing extracted lexical knowledge is equally problematic when using text corpora. Furthermore, the transition from corpus to lexicon will be delayed until there is resolution of issues like what constitutes the right kind of corpus given certain lexical properties, how to balance the corpus and how to abstract the information acquired through learning from the learning mechanism itself (Boguraev & Pustejovski, [1996]).

2.4.3. Morphological disambiguation: Part-of-speech assignment

Ambiguity, in particular, has been difficult for researchers to address. Morphemes, or word stems, have a variety of prefix and suffix options, each with a slightly different meaning and/or part of speech (Boguraev & Pustejovski, [1996]). Morphological disambiguators have had some success in addressing both of these issues. There are two basic approaches to disambiguation: rule-based and probabilistic. Rule-based taggers typically leave some of the ambiguities unresolved but make very few errors; statistical, corpus-based taggers generally provide a fully disambiguated output but have a higher error rate (Karlsson & Karttunen, [1995]).

Rule-based disambiguators have reportedly surpassed standard probabilistic approaches (Karlsson & Karttunen, [1995]). As with corpus-based approaches, rule-based approaches also initially tag text. However, the word is first reduced to its base form (simple stem) before it is assigned a part-of-speech tag. It is also assigned a dependency-oriented function (e.g., red, an adjective, modifies car, a noun which follows) and its syntactic properties (object, subject, subject complement, etc.). Sentences are evaluated as a whole by comparing every possible syntactic representation of the sentence with a large set of linguistic rules. In the analysis of a sentence, a successful grammar will discard all readings except the appropriate one (Voutilainen, [1997]).

Probabilistic (stochastic) corpus-based processing techniques identify categories of lexical properties by empirically studying word occurrences in large bodies of text (Boguraev & Pustejovski, [1996]). This method derives tagging probabilities from a manually tagged training corpus on which the analysis is based. Stochastic systems based on a Hidden Markov Model can also be trained on an untagged corpus with a reported success rate of 96% for English (Cutting, Kupiec, Pedersen, & Sibun, [1992]). In either method, the corpus is fundamentally used to create a lexicon for tagging words with their respective parts of speech.

Corpus-based approaches are often able to succeed while ignoring the true complexities of language, relying on the fact that complex linguistic phenomena can often be indirectly observed through simple epiphenomena (Brill, [1995]). Using this technique, stochastic methods have been able to obtain very high rates of tagging accuracy simply by observing fixed-length word sequences, without recourse to the underlying linguistic structure.

An advancement of this technique is called transformation-based error-driven learning (Brill, [1995]). Text is first passed through an initial state annotator and assigned its most likely tag (noun if unknown). Once this has occurred, the annotated text is then compared with existing corpora, for instance the Brown Corpus or the Wall Street Journal (WSJ) Corpus, and errors observed in this comparison are noted. The tagger is then trained using this information and the annotations are augmented to more closely resemble the training corpus. The last step is repeated until there is no further reduction in errors (the tagger is fully trained).
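
The training loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Brill's tagger: it uses a single rule template ("change tag A to B when the previous tag is C"), a hypothetical six-word gold-tagged sample in place of the Brown or WSJ corpus, and a toy lexicon. All function names are invented for the example.

```python
def initial_tags(words, lexicon):
    """Initial state annotator: most likely tag per word, noun (NN) if unknown."""
    return [lexicon.get(word, "NN") for word in words]


def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) across the tag sequence."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(words, gold, lexicon, max_rules=10):
    """Greedily add the rule that most reduces errors until none helps."""
    tags = initial_tags(words, lexicon)
    rules = []
    while len(rules) < max_rules:
        # Propose one candidate rule for every position that is currently wrong.
        candidates = {
            (tags[i], gold[i], tags[i - 1])
            for i in range(1, len(tags))
            if tags[i] != gold[i]
        }
        best_rule, best_tags, best_err = None, tags, count_errors(tags, gold)
        for rule in sorted(candidates):
            new_tags = apply_rule(tags, rule)
            err = count_errors(new_tags, gold)
            if err < best_err:
                best_rule, best_tags, best_err = rule, new_tags, err
        if best_rule is None:  # no further reduction in errors
            break
        rules.append(best_rule)
        tags = best_tags
    return rules, tags


# Hypothetical training sample; "tumor" is missing from the lexicon on purpose.
words = ["the", "tumor", "grew", "they", "race", "daily"]
gold = ["DT", "NN", "VBD", "PRP", "VBP", "RB"]
lexicon = {"the": "DT", "grew": "VBD", "they": "PRP", "race": "NN", "daily": "RB"}
rules, tags = learn_rules(words, gold, lexicon)
```

In this toy run the only initial error is "race", tagged as a noun by default, and the learner recovers a single contextual rule that retags it as a verb after a pronoun.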

2.5. Refinement of Searching Using NLP

The motivation for using NLP techniques in document retrieval is quite intuitive. Devanbu, Brachman, Selfridge, and Ballard ([1991]) and Girardi and Ibrahim ([1993]) discuss the comparative utility of natural language and keyword-based retrieval system interfaces. They found that it is simpler for users to search for a concept than to search using keywords corresponding to static classification schemes and/or Boolean combinations of keywords. Boolean operators are especially problematic since precedence is often ambiguous.

2.6. Noun Phrasing

Quirk ([1985]) defines a noun phrase as a subject, object and complement of a clause, and as a complement of a prepositional phrase. Noun phrasing has been used in information retrieval to capture a richer linguistic representation of document content (Anick & Vaithyanathan, [1997]). It has the potential to improve precision over other document indexing techniques, since it allows for multi-word queries (or phrases) to be matched with words (or phrases) present in the text documents (Girardi & Ibrahim, [1993]). Anick and Vaithyanathan ([1997]) give the following motivations for using noun phrases in information retrieval:
  Noun compounds are widely used across sub-language domains for describing concepts succinctly
  Contiguity makes them relatively easy to detect and extract from text corpora
  Unlike many phrasal constructions, noun compounding is generally applied to express tighter, more long lived relationships between concepts, thereby contributing less noise
  Most proper nouns are captured by the technique
  Nouns and noun compounds account for the bulk of the phrases that show up in actual queries
  In most cases, the relationship between a noun compound and the head noun of the compound is a strict conceptual specialization.

A disadvantage of using noun phrasing for the generation of thesauri and concept spaces is that a phrase which most succinctly defines a textual passage may not actually appear in the passage (Lewis & Croft, [1990]). However, this problem also exists for other automatic indexing techniques.

Devanbu and colleagues ([1991]) discuss the application of NLP techniques. Their application, LaSSIE, parses user queries into noun phrases for searching. The phrases are then used to locate code segments in complex large-scale computer applications for code reusability. Anick and Vaithyanathan ([1997]) use noun phrases in both query formation and document clustering in the Internet domain. In the medical information retrieval domain, the use of noun phrasing has also been championed by Carnegie Mellon's CLARIT project (Evans, [1994]).

Finding the appropriate MeSH terms to identify appropriate medical literature using existing search engines was identified as a problem by Cooper and Miller ([1998]). MeSH terms are the document keywords assigned by human indexers at the National Library of Medicine. Cooper and Miller created tools to extract noun phrases from the free text portion of a patient's medical record in order to map these relevant report concepts to MeSH terminology for searching the online medical literature databases. Using both lexical (PostDoc) and statistical (Pindex) tools, they were able to generate MeSH terms which they referred to as a controlled vocabulary of noun phrases. They reported that a hybrid of both techniques was able to extract 66% of the relevant concepts in the patient record and generated, on average, 71 terms per patient record.

One goal of noun phrasing research is to investigate the possibility of combining traditional keyword and syntactic approaches with semantic approaches to improve information retrieval quality. A common criticism of keyword based data mining techniques and searches is that single word terms lack an appropriate level of context for meaning evaluation. Incorporating a noun phrase parser into the descriptor term identification phase of the search engine would enable information retrieval systems to identify noun phrases and evaluate word meaning in the context of an entire noun phrase, potentially improving the accuracy and level of detail of the information retrieved and the quality of the relationships identified.

Our approach is to extract the relevant concepts from the free text in the documents and allow searchers to choose from the MeSH and the noun phrases which occur in the literature itself.

2.7. UMLS® Knowledge Sources

The UMLS Knowledge Sources, created by the NLM, consist of four components: the Metathesaurus, the Semantic Network, the SPECIALIST Lexicon, and the Information Sources Map. The Unified Medical Language System (UMLS) is designed to facilitate the retrieval and integration of information from multiple machine-readable biomedical information sources (UMLS, [1998]). The most current version (9th edition) of the UMLS was released in January 1998. Since the UMLS is based on several existing standard medical vocabularies, we believe it can be useful to medical experts who are familiar with those vocabularies by helping them refine their queries.

The Metathesaurus contains semantic information about biomedical concepts and terms from various controlled medical vocabularies and classifications. Manually created by the NLM, the Metathesaurus has captured special word relationships that are not available through statistically generated thesauri. For example, the Metathesaurus contains the following relationships: synonyms, parent terms, children terms, broader and narrower terms, and terms related in other ways (similar terms that are not synonyms). The Semantic Network is designed to provide a consistent categorization of the semantic types to which all concepts in the Metathesaurus have been assigned. It captures hierarchical relationships, such as isa and inverse isa relationships, and nonhierarchical relationships, such as spatially related to and functionally related to. In total there are 132 semantic types and 53 relationships.

The SPECIALIST Lexicon is an English language lexicon similar to the Wall Street Journal or the Brown Corpora except that it contains many biomedical terms that do not exist in either of these two sources. The current version contains approximately 100,000 lexical records that map to 32,000 words. The Information Sources Map contains a database that describes biomedical information resources both inside and outside the NLM. Because of the increased availability of biomedical services on the WWW and the explosive growth of the Internet in general, NLM has focused a great deal of effort on the upgrade and development of a WWW JAVA interface (which supports CORBA-based communication) to the database. These tools can be used as building blocks for biomedical application developers.

3. Experiment Details and Metrics

The following section discusses the testing and analysis phases of a study conducted using phrasing tools developed at the University of Arizona AI Lab. This study was designed to determine the best algorithms to reach the goal of developing a medical digital library. First we will describe the research questions we addressed in our study, followed by a description of our test collection and the phrasing tools used for our experiments.

3.1. Research Questions

Several research questions were addressed in our experiments. In general we hoped to determine whether NLP techniques can improve medical information retrieval. This, however, is a broad question that encompasses more territory than is covered by this paper. The narrower and more realistic questions we addressed were as follows. First, we wanted to discover whether the AZ Noun Phraser was at least as good as (and ideally better than) other phrase generation techniques at isolating relevant noun phrases from medical abstracts. Second, we hoped to see whether the SPECIALIST Lexicon would further improve the AZ Noun Phraser's ability to generate relevant noun phrases from medical abstracts. Finally, we wanted to show that the AZ Noun Phraser could successfully be used to process a large-scale document collection.

3.2. Test Collection

Our experiment used a portion of the CANCERLIT literature databank provided by the NCI. CANCERLIT is a comprehensive archival file of more than 1,000,000 bibliographic records (most with abstracts) describing cancer research results published over the past 30 years. Approximately 200 core journals contribute the bulk of the records in this database. Other information is derived from proceedings of meetings, government reports, symposia reports, theses and selected monographs. The database is updated monthly to provide a comprehensive, up-to-date resource of published cancer research results. NCI estimates that the collection increases by more than 70,000 abstracts each year.

Record format is standard and includes the following fields: authors, their addresses, MeSH headings, document source, document title, and abstract. Preformulated searches for more than 80 clinical topics are updated and available on the Web site. The complete CANCERLIT database is now searchable on the Web (http://cnetdb.nci.nih.gov/canlit.htm). It is also available through the NLM and a variety of commercial database vendors, some of which also offer a CD-ROM product (http://wwwicic.nci.nih.gov/canlit/canlit.htm). Our portion of the collection is made up of 714,451 of the most recent records and requires 1.1 gigabytes of storage space.

3.3. Phrasers

As part of an effort to improve information retrieval (IR) of text documents, the AZ Noun Phraser was created as an adjunct to existing concept space IR and term suggestion tools (Chen et al., [1996a]). Previously the AI Group had focused on a phrase generation technique called automatic indexing, which had good recall results but lacked precision (Chen et al., [1998a]). Seeking an alternative, we began investigating NLP techniques - noun phrasing in particular.

Using noun phrases makes it possible to search for terms in the context of the free text in which they are found (Tolle, [1997]). This has a distinct advantage over the most commonly used IR technique, keyword searching, in which improvement in performance is not possible beyond a certain point referred to as the keyword barrier (Mauldin, [1991]).

After performing a search of available phrasing tools, followed by testing of those that would best suit our needs, we decided to implement our own phrasing application based on the Brill part-of-speech tagger and using a pattern matching algorithm which searched for a set of generally accepted patterns of noun phrases.

The experiment reported on in this study compared a phrasing (not noun phrasing) technique called Chopper, a commercial noun phrasing technique called NPtool, automatic indexing, and four different versions of the AZ Noun Phraser. A brief description of each of these tools is presented below. We have also provided sample output from each of the tools. These were generated using a common medical abstract, although this may not be readily apparent since, in the interest of space, we have chosen to present only a subset of the phrases generated by each tool. A summary of the different techniques tested can be found in Table 1.

 
Table 1. Phrase generation techniques tested.

Phraser              Description                           Phrase type

Chopper              Machine Understanding Group, MIT      All types of phrases
NPtool               Commercially developed by LingSoft    Noun phrases: best and interim
Automatic Indexing   AI Lab Phrase Generation tool         All types of phrases: best and interim
AZ Noun Phraser I    Generic Noun Phraser                  Noun phrases: best only
AZ Noun Phraser II   Noun Phraser with SPECIALIST Lexicon  Noun phrases: best only
AZ Noun Phraser III  Generic Noun Phraser                  Noun phrases: best and interim
AZ Noun Phraser IV   Noun Phraser with SPECIALIST Lexicon  Noun phrases: best and interim

3.3.1. Chopper

Chopper is a natural language analysis engine developed by the Machine Understanding Group at the MIT Media Laboratory under the direction of Dr. Ken Haase. Few publications explaining how Chopper works are currently available, but the tool's ability to break sentences into phrases qualified it as a candidate for our testing. Also, preliminary tests to determine its ability to correctly isolate nouns in the highly technical text of the NLM's TOXLINE abstracts looked very promising. However, since the testing was to determine which of the tools best located noun phrases, Chopper, which categorizes all types of phrases, was at a distinct disadvantage against tools developed for that specific purpose.

One of Chopper's advantages is that text can be submitted with no need for special processing, such as tokenization (the separation of punctuation and symbols from the words), before phrase generation can take place. This can also be a disadvantage, since the lack of tokenization results in problems with classifying words containing punctuation (e.g., c.elegans). Because Chopper also lists non-noun phrases (verb phrases, prepositional phrases, etc.), an associated problem is that its phrases contain every word in the submitted text. Words are even repeated where necessary to generate complete phrases. An example of Chopper's output can be seen in Figure 3.

Figure 3. Single sentence phrase output from Chopper.

Phrases include all parts of speech: determiners, conjunctions, etc. For our purposes these words were considered extraneous information. During testing, Chopper also appeared to drop sections and in some cases reported only the beginning of the submitted abstracts (i.e., it encountered a term it could not process and stopped). Another potential integration problem is that the lexicon may be difficult or impossible to update with medical terms. However, even without the ability to incorporate medical terminology, Chopper's sophisticated part-of-speech tagger produced good results identifying parts of speech in abstracts from the NLM's TOXLINE literature databank.

3.3.2. Automatic Indexing

At the University of Arizona, automatic indexing has been used extensively to extract concepts from textual data. Experiments in a variety of domains, from the Internet (Chen et al., [1996b]) to the INSPEC scientific engineering abstract collection (Chen et al., [1998a]), have been conducted. These experiments have shown the usefulness of this technique in thesaurus generation. The Automatic Indexing application parses input files by combining adjacent words into phrases using punctuation and stopwords as phrase delimiters. Due to the application's inability to assign parts-of-speech to a word, it generates phrases by making multiple passes through the stopword list and applies hard-coded word adjacency rules. This is followed by word stemming and finally, by term-phrase formation.
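The delimiter-based phrase formation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AI Lab implementation: the stopword list, the function name `stopword_phrases`, and the `max_words` cap are all hypothetical, and the stemming step is omitted.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list is far larger).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "by", "to",
             "was", "were", "is", "with", "for"}


def stopword_phrases(text, max_words=3):
    """Form phrases from adjacent words, using punctuation and stopwords
    as phrase delimiters. Every contiguous sub-run of up to max_words
    words is emitted, so shorter (interim) phrases are included."""
    phrases = []
    # Punctuation delimits phrases: split into punctuation-free segments.
    for segment in re.split(r"[.,;:!?()]+", text.lower()):
        run = []  # current run of adjacent non-stopwords
        for word in segment.split() + [None]:  # None flushes the last run
            if word is None or word in STOPWORDS:
                for i in range(len(run)):
                    for j in range(i + 1, min(i + max_words, len(run)) + 1):
                        phrases.append(" ".join(run[i:j]))
                run = []
            else:
                run.append(word)
    return phrases


print(stopword_phrases("Analysis of the tumor cell line."))
```

Because no part-of-speech information is used, a sketch like this emits any run of adjacent non-stopwords, whether or not it is a noun phrase, which is consistent with the low precision reported for this technique.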

Automatic indexing has been quite useful for replacing human indexers when the number of files to be processed is extremely large. The greatest problem associated with the tool is low precision in information retrieval results, compared with human indexer terms. It is for this reason that natural language parsing generation tools are being investigated as a replacement for this application. A subset of the output from Automatic Indexing can be found in Figure 4.

Figure 4. Subset of phrase output from automatic indexing of a medical abstract.

3.3.3. NPtool

NPtool, a commercially available noun phrase detector, relies on rule-based disambiguation instead of statistical/stochastic methods to determine part-of-speech (Voutilainen, [1997]). Originally developed by Dr. Atro Voutilainen at the Department of General Linguistics at Helsinki University using Karlsson's 1100 disambiguation rules and constraint grammar formalism, NPtool is currently distributed by a Finnish company, LingSoft.

NPtool lists words with associated parts of speech and indicates to which phrase the word belongs. Word output is normalized (input terms are returned as root words and all output is lowercase). Phrases that the designers consider irrelevant, such as "in spite of" and "in addition to", are removed. The final output is a list of noun phrases and an indication of how certain the tool is of the output (Arppe, [1995]): "?" for uncertain and "OK" for certain, as shown in Figure 5.

Figure 5. A subset of phrases generated from a medical abstract by NPtool.

The major advantage of NPtool is its ability to locate noun phrases. Also, unlike Chopper and the AZ Noun Phraser, it not only lists the best choice for the noun phrase (OK), it also lists other possible options (?). In evaluation of the tool, we concluded that restricting comparison to the best choice should improve precision, since only the most accurate phrases are reported. Including other potential noun phrases should improve recall, since more phrases would be generated and thus be considered as valid. Conversely, including more phrases, especially ones the tool suggests are not the best possible phrases, would introduce noise and reduce precision. For our experiments, only the OK choices were evaluated.

A third NPtool advantage is that tokenization is not required. A potential problem associated with the normalization of the output to lowercase is that it could cause phrase-matching problems. For instance, single-strand breaks, typically represented by the acronym SSBs, would be returned from NPtool as ssbs, a term with little or no value to the user.

The greatest disadvantage of NPtool is that, because it is a commercial product, source code is not available. This makes it difficult to integrate the tool with existing applications and impossible to alter it in the future. Updates must come from the software developer. Also, there are monetary and availability considerations associated with commercially developed software.

3.3.4. AZ Noun Phraser

The AZ Noun Phraser was developed at the University of Arizona AI Lab to extract high-quality phrases from textual data. It is made up of three main components: a tokenizer, a tagger, and a noun phrase generator. The tokenizer module is designed to take raw text input and create output that conforms to the UPenn Treebank word tokenization rules. Its task is to separate all punctuation and symbols from text without interfering with textual content. The tool operates on both plain text and a specialized version of SGML (Standard Generalized Markup Language) tagged files. Sample output from tokenization is shown in Figure 6.

Figure 6. Tokenization of a paragraph from a medical abstract by AZ Noun Phraser.
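The tokenizer's task, separating punctuation and symbols from words without disturbing word-internal characters, can be illustrated with a short sketch. This is an assumption-laden simplification: the function name `tokenize` and the regular expression are hypothetical, and the sketch does not reproduce the full UPenn Treebank conventions (for example, it does not split contractions).

```python
import re


def tokenize(text):
    """Separate punctuation and symbols from words, keeping word-internal
    periods, hyphens, and apostrophes (e.g. "c.elegans") intact."""
    # A token is either a run of word characters, optionally joined by
    # internal '.', '-' or "'", or a single non-space symbol.
    return re.findall(r"\w+(?:[.\-']\w+)*|[^\w\s]", text)


print(tokenize("Cells (c.elegans) were fixed, stained."))
# ['Cells', '(', 'c.elegans', ')', 'were', 'fixed', ',', 'stained', '.']
```

Keeping internal periods attached avoids the misclassification of terms such as c.elegans noted in the discussion of Chopper, while sentence-final punctuation still becomes its own token.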

The tagger module of the AZ Noun Phraser is a significantly revised version of the Brill tagger (Brill, [1993]). The Brill tagger, generally considered to be one of the more robust part-of-speech taggers, is rule-based and trainable, relying on transformation-based error-driven learning to improve its accuracy. Our initial implementation included two existing multi-domain corpora: the Wall Street Journal Corpus and the Brown Corpus. Two later versions that also included the SPECIALIST Lexicon from the National Library of Medicine were tested to determine the performance gain of adding a domain-specific lexicon. Figure 7 shows the output generated by the tagger.

Figure 7. Text tagging of a paragraph from a medical abstract by AZ Noun Phraser.

To make the tool better able to handle large textual collections, an extensive effort was required to rewrite the tagger so that it could operate in a wider variety of environments (NT, Solaris, IRIX). Memory and CPU usage were dramatically decreased, while part-of-speech tagging was executed with equal precision.

The third major component of the AZ Noun Phraser is the phrase generation module, which converts the words and associated part-of-speech tags generated by the tagger into noun phrases. The phraser utilizes a pattern-matching algorithm to isolate phrases from the tagged text. Two of the versions of the noun phraser generated only the longest possible phrase. The other two versions generated both the longest phrase and all possible interim phrases. Our aim was to create a direct comparison between the AZ Noun Phraser and NPtool and automatic indexing, each of which also generates all phrases, not just the longest possible phrase. An example of the phrases generated from the tagged text above appears in Figure 8.

Figure 8. Phrase output from AZ Noun Phraser.
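The difference between longest-phrase and interim-phrase generation can be sketched as follows. This is a minimal illustration, not the AZ Noun Phraser's algorithm: the single pattern used here (any run of adjectives and nouns ending in a noun), the function name `noun_phrases`, and the tag sets are assumptions standing in for the set of noun phrase patterns the phraser actually matches.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}          # Penn Treebank noun tags
MODIFIER_TAGS = {"JJ", "JJR", "JJS"} | NOUN_TAGS  # adjectives and nouns


def noun_phrases(tagged, interim=False):
    """Extract phrases matching the assumed pattern (adjective | noun)* noun.

    tagged  -- list of (word, tag) pairs with Penn Treebank tags
    interim -- False: only the longest phrase in each run;
               True: also every shorter sub-phrase that ends in a noun
    """
    phrases = []
    run = []  # current run of adjacent adjectives and nouns
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in MODIFIER_TAGS:
            run.append((word, tag))
            continue
        # Trim trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            if interim:
                for i in range(len(run)):
                    for j in range(i + 1, len(run) + 1):
                        if run[j - 1][1] in NOUN_TAGS:
                            phrases.append(" ".join(w for w, _ in run[i:j]))
            else:
                phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


tagged = [("the", "DT"), ("human", "JJ"), ("tumor", "NN"),
          ("cells", "NNS"), ("were", "VBD"), ("fixed", "VBN")]
print(noun_phrases(tagged))                # ['human tumor cells']
print(noun_phrases(tagged, interim=True))
```

In this toy sentence the longest-only mode yields one phrase and the interim mode yields five, which mirrors the much larger phrase counts reported for versions III and IV.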

3.4. Experimental Design

The 19 subjects who participated in our study were required to be familiar with medical terminology, particularly cancer literature. The subjects were medical librarians, doctors, medical students, and researchers. We intentionally chose a cross-section of medical terminology experts so that our findings would not be limited to a single cohesive group, which would have undermined external validity. Our subjects were recruited from the Arizona Cancer Center, the University of Arizona College of Medicine, and the University of Arizona Health Sciences Library.

Ten abstracts were randomly selected from our portion of the CANCERLIT collection. Each abstract was processed by each phrase extraction tool. The lists of phrases from the different techniques were ordered alphabetically and combined into a single list with duplicates removed. These lists were paired with the abstracts to create the subject test packets. The order of the abstracts in the packets was randomly assigned using a pseudo-random number generator to avoid problems associated with learning and/or subject fatigue. The instructions that accompanied the packet directed the subjects to read the abstracts one at a time and choose relevant representative phrases from the associated list. They were instructed to complete phrase selection for each abstract before proceeding to the next. During the experiment, subjects were allowed to look at the abstract when selecting phrases. Although no time limit for completing the packet assignment was imposed, most subjects were able to complete the experiment in 1 hour. The total number of phrases generated was 1415 unique phrases - approximately 140 per abstract.
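The packet assembly procedure can be sketched in a few lines. The function name `build_packet`, the input layout, and the `seed` parameter are hypothetical; the sketch only illustrates the merge, de-duplicate, sort, and shuffle steps described above.

```python
import random


def build_packet(phrases_by_tool, seed=None):
    """phrases_by_tool: {abstract_id: {tool_name: [phrase, ...]}}.

    Returns a list of (abstract_id, phrase_list) pairs in pseudo-random
    abstract order; each phrase_list is alphabetical with duplicates
    across tools removed."""
    rng = random.Random(seed)
    packet = []
    for abstract_id, by_tool in phrases_by_tool.items():
        merged = set()
        for phrases in by_tool.values():
            merged.update(phrases)  # set union removes duplicates
        packet.append((abstract_id, sorted(merged)))
    rng.shuffle(packet)  # randomize abstract order for each subject
    return packet
```

Merging the tools' output into one alphabetical list means a subject cannot tell which tool produced a given phrase.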

3.5. Relevance Assessment

Since subjects were expected to bring their own biases regarding relevance to the experiment, we attempted to minimize this effect by giving specific instructions.

Relevance was defined to our subjects using the following text:
  Relevant: Highly relevant phrases are ones that best represent the topic covered in the text of the document. For example, a document on hunting dogs might contain the following relevant phrases: bird hunting or Labrador Retriever.
  Not Relevant: Not relevant phrases are ones that are not closely related to the topic. Phrases such as water bottle or sunny day would be considered not relevant, since even though they may appear in a document about hunting dogs and are well-formed phrases, they are not, in general, representative of the topic.
  Error: Error phrases are those which are malformed or incomplete. For instance the phrases bradore Retri or the dog was would be considered errors.

To avoid adding our own biases to the experiment, these instructions were reviewed and revised by a social scientist, a librarian, and a medical professional prior to conducting the experiment. We also conducted a pilot study using a different test collection.

3.6. Metrics

In information retrieval experiments, the most commonly used techniques measure recall and precision or some derivative of these metrics (Harter, [1996]), (Hersh, [1996]). Recall is the extent to which known relevant phrases are generated (comprehensiveness) and precision represents how well the generation of non-relevant phrases is suppressed (accuracy). Often these can be considered to be conflicting goals, since, typically, the more phrases generated the higher the percentage of recall and the lower the percentage of precision. This can clearly be seen in the results of our experiment. As defined by Hersh in (Hersh, [1996]), recall (R) is the proportion of relevant documents (or phrases) selected from the collection. It is calculated using the following formula:
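In its standard information retrieval form, which is assumed here to be the definition intended, recall applied to phrases can be written as:

```latex
R = \frac{\text{number of relevant phrases generated}}
         {\text{total number of relevant phrases in the collection}}
```

The denominator requires knowing every relevant phrase in the collection, which is what makes plain recall impractical for large collections and motivates the relative measure discussed next.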

Relative recall is often used as an evaluation technique when the size of a collection is too large to determine the total number of relevant phrases. It is also used by researchers to avoid problems associated with disparate cognitive evaluation of relevance by expert subjects (Hersh, [1996]). It has been shown that subjects frequently do not agree on relevance for reasons ranging from bias to task difficulty (Harter, [1996]). To avoid these problems, we chose a modified relative recall to which we refer herein as subject recall (SR) and calculated as follows:

Subject precision (P) is the proportion of relevant phrases in the search and was calculated as follows:
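A minimal sketch of how per-subject recall and precision could be computed for one tool is given below. The definitions are assumptions inferred from the surrounding description rather than the paper's exact formulas: subject recall is taken as the share of all phrases a subject marked relevant (from the combined list) that the tool generated, and subject precision as the share of the tool's phrases that the subject marked relevant. The function name is hypothetical.

```python
def subject_recall_precision(tool_phrases, subject_relevant):
    """Return (subject recall, subject precision) for one tool and one subject.

    tool_phrases     -- phrases generated by the tool for an abstract
    subject_relevant -- phrases the subject marked relevant in the
                        combined list from all tools (assumed pool)
    """
    tool = set(tool_phrases)
    relevant = set(subject_relevant)
    hits = tool & relevant  # relevant phrases that this tool generated
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(tool) if tool else 0.0
    return recall, precision


sr, sp = subject_recall_precision(
    ["tumor cell", "cell line", "the cells", "assay"],
    ["tumor cell", "cell line", "microtubule"],
)
print(sr, sp)
```

Under these assumed definitions, a tool that generates more phrases can raise recall only by adding hits, while every added non-relevant phrase lowers precision, which is the trade-off described above.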

4. Results and Analysis

By testing multiple NLP tools simultaneously, we were able to gather a great deal of information from our experiment. In the following section, we first present our comparison of all the phrase generation tools, followed by a comparison of the tools that generate interim phrases in addition to the longest phrase. Next we compared the different versions of the AZ Noun Phraser in order to determine which version test subjects preferred. Finally we offer a pair-wise comparison of the effects of including the SPECIALIST Lexicon in the AZ Noun Phraser.

4.1. Comparison of All NLP Tools

Our results show a significant performance difference among the NLP tools. Chopper's results were dramatically different from those of the other phrase generation tools. Both recall and precision values for Chopper were significantly lower than those for all other techniques. This result was not surprising since Chopper is not specifically a noun phrase generation tool. One possible interpretation of this result is that NLP tools that produce noun phrases are preferable to more general phrase generation tools. Our statistical analysis of this result can be seen in Table 2 and Figure 9. Table 2 is a summary of the one-way analysis of variance shown in Figure 9.

 
Table 2. Summary of recall and precision results.

Phraser              Recall (%)  Precision (%)

Chopper              12.90       6.95
NPtool               71.74       34.32
Automatic Indexing   60.00       34.90
AZ Noun Phraser I    37.31       39.05
AZ Noun Phraser II   39.79       40.00
AZ Noun Phraser III  51.16       35.05
AZ Noun Phraser IV   51.53       36.00

Figure 9. Analysis of variance for subject recall and precision.

4.1.1. Recall

The technique with the highest recall value was NPtool. This result was followed closely by those of Automatic Indexing and the AZ Noun Phraser versions III and IV, showing that these tools are closely comparable. Also significant is that these techniques produced the largest numbers of phrases: 600, 511, 633, and 623, respectively, compared with 265 and 273, respectively, for the AZ Noun Phraser versions I and II.

4.1.2. Precision

Precision was higher for the tools that generated fewer phrases: AZ Noun Phraser I (AZNPI; longest phrase generation only; used Brown and WSJ Corpora) and AZ Noun Phraser II (AZNPII; longest phrase generation only; used Brown, WSJ, and the SPECIALIST Lexicon). This indicated that test subjects considered a higher percentage of the phrases generated by these tools to be relevant. An alternate way of looking at this result is to say that these tools generated fewer phrases that were of no search value and that would therefore have reduced the effectiveness of information retrieval.

4.2. Comparison of Interim Noun Phrase Generation Tools

There were no significant differences among the precision of the interim phrase generation tools. As previously mentioned, several of the tools generate not just the longest phrase, but also interim phrases. NPtool had the best recall of all the techniques tested, and AZNP IV had the best precision. Overall the tools appeared to be fairly comparable, although the argument could be made that the AZ Noun Phraser is probably not as useful for recall as either Automatic Indexing or NPtool. These results can be observed in the statistical output shown in Figure 10.

Figure 10. Recall and precision comparison of interim phrase generation tools.

4.3. Comparison of AZ Noun Phraser Versions

Highest recall was generated by AZ Noun Phraser versions III and IV; precision was highest for AZ Noun Phraser versions I and II. Figure 11 compares the different versions of the AZ Noun Phraser. The versions that generated interim phrases (AZNPIII and AZNPIV) performed significantly better on recall than the original AZ Noun Phraser, both with and without the SPECIALIST Lexicon (AZNPI and AZNPII). However, the latter performed better than the former for precision. Choosing which of these tools suits an application becomes a matter of deciding whether the goal is precision or recall.

Figure 11. Recall and precision comparison of AZ noun phraser versions.

4.4. Effect of the SPECIALIST Lexicon

The SPECIALIST Lexicon improved both recall and precision for medical journal abstracts. To address our research question regarding the SPECIALIST Lexicon, we compared the versions of the AZ Noun Phraser in pairs - those with interim phrasing and those without. Our results show that, although not with statistical significance, the incorporation of the SPECIALIST Lexicon improved both recall and precision. The slight improvement is likely due to the fact that the Brown Corpus and Wall Street Journal lexicons, while not domain specific, contain numerous medical terms. Also, since the AZ Noun Phraser automatically tags unknown words as nouns, technical medical terms contained in the abstracts (which usually are nouns) are combined into noun phrases. Our results of this analysis are shown in Figures 12 and 13.

Figure 12. Recall and precision comparison of AZ Noun Phraser I and II with and without SPECIALIST lexicon.

Figure 13. Recall and precision comparison of AZ Noun Phraser III and IV with and without SPECIALIST lexicon.

4.5. Generalizability of Results

In order to support our results shown in sections 4.1-4.4, we conducted a generalizability study to test for high inter-rater reliability and low intra-rater covariance. We accomplished this by using a MIVQUE0 method (Hartley, Rao, & LaMotte, [1978]) in an SAS® procedure, VARCOMP, which provides variance components estimates based on expected mean squares. MIVQUE0 produces unbiased estimates that are invariant with respect to the fixed effects of the model and are locally best quadratic unbiased estimates, given that the true ratio of each component to the residual effects are adjusted only for the fixed effects. The independent variables in our analysis were raters and phrases; response (selecting a phrase as relevant) was the dependent variable. We ran this analysis for each individual tool. The results are summarized in Table 3.

 
Table 3. Breakdown of estimated response variance by tool.

Variance component  Chopper    NPtool     Automatic Index  AZ Noun Phraser I  AZ Noun Phraser II  AZ Noun Phraser III  AZ Noun Phraser IV

Phrase              .02228019  .08184487  .07890830        .07338153          .07637109           .07540029            .07608341
Rater               .00043666  .01223199  .01365748        .01942137          .01884889           .01485272            .01495364
Phrase × Rater      .02963704  .12170564  .12457624        .14079048          .14097374           .12375209            .12504054

Table 3 breaks down the estimated response variance by tool (the population we were evaluating) against phrases and raters. It can be established from our analysis that across techniques, raters had a 2% or smaller variance in their responses. For Chopper there was almost complete agreement, with an estimated variance of .04%. The other tools ranged from 1.2% to 1.9% rater variance, as shown in row two in Table 3. Phrase variance with respect to response was also low for Chopper, at roughly 2% variance. The other tools ranged from 7% (AZ Noun Phraser I) to 8% (NPtool) variance, as shown in row one in Table 3. Row three in Table 3 shows an interaction value between phrases and raters that can be explained by random error. The results of this study show a high level of agreement across raters, both as to phrases chosen as relevant and those that were not.

4.6. Overlap Analysis

Moderate to high phrase generation overlap occurred between the NLP Tools. In order to determine whether phrases that the different tools generated were unique, we performed a phrase overlap analysis. Moderate to high overlap was found between differing phrasing tools, the highest being between AZ Noun Phrasers III and IV and NPtool. A high overlap between different AZ Noun Phrasers is to be expected, as is a low overlap between Chopper and the other techniques, since Chopper is designed to generate all phrases, not just noun phrases. This indicates that the different techniques, with the exception of Chopper, generated similar phrases given the same text as input. These results further support the claim that these techniques are fairly comparable.

The results of the overlap analysis can be seen in Tables 4 and 5. In Table 4 the diagonals show the number of phrases generated by each technique. The off-diagonals are raw scores of the number of phrases that overlapped with those generated by the other techniques. For instance, in the output examples in the previous section describing the different phrasing techniques, Figures 4, 5, and 8 show that NPtool, Automatic Indexing, and the AZ Noun Phraser generated the phrase immunfluorescence analysis of microtubule(s). Table 5 shows the Jaccard score for these relationships, which is a percentage of overlap.
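The relationship between the raw counts in Table 4 and the percentages in Table 5 can be checked with a short sketch. It assumes the usual Jaccard definition (intersection size divided by union size), which reproduces the published values; the function name `jaccard_percent` is hypothetical.

```python
def jaccard_percent(count_a, count_b, overlap):
    """Jaccard score as a percentage: overlap divided by the size of the
    union, where union = count_a + count_b - overlap."""
    union = count_a + count_b - overlap
    return 100.0 * overlap / union


# Diagonal and off-diagonal counts taken from Table 4.
print(round(jaccard_percent(559, 600, 48), 2))   # Chopper vs NPtool -> 4.32
print(round(jaccard_percent(265, 273, 253), 2))  # AZNP I vs AZNP II -> 88.77
print(round(jaccard_percent(633, 623, 591), 2))  # AZNP III vs AZNP IV -> 88.87
```

Because the union appears in the denominator, two tools that generate very different numbers of phrases cannot reach a high score even when the smaller set is almost entirely contained in the larger one.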

 
Table 4. Phrase overlap - Raw score.

            Chopper  NPtool  Auto Index  AZNP I  AZNP II  AZNP III  AZNP IV

Chopper     559
NPtool      48       600
Auto Index  33       240     511
AZNP I      22       212     124         265
AZNP II     25       217     125         253     273
AZNP III    54       442     378         240     240      633
AZNP IV     59       439     361         246     251      591       623

 
Table 5. Phrase overlap - Jaccard score.

            Chopper (%)  NPtool (%)  Auto Index (%)  AZNP I (%)  AZNP II (%)  AZNP III (%)  AZNP IV (%)

Chopper     100
NPtool      4.32         100
Auto Index  3.18         27.55       100
AZNP I      2.74         32.47       19.02           100
AZNP II     3.10         33.08       18.97           88.77       100
AZNP III    4.75         55.88       49.35           36.47       36.04        100
AZNP IV     5.25         55.99       46.70           38.32       38.91        88.87         100

4.7. Performance Analysis

Collection size is an important issue in digital libraries. The literature test collection used for our experiment was a subset of a larger collection of documents - a collection that continues to grow. It is therefore important to conduct a large-scale performance analysis to determine whether the proposed noun phrasing technique is computationally scalable to real-world applications.

Using 8 nodes of a 32-node Origin2000 at the National Computational Science Alliance, at the University of Illinois Urbana-Champaign, we processed 623,690 CANCERLIT journal abstracts, our portion of the CANCERLIT collection at the time. For the AZ Noun Phraser I, the combined total CPU time for all nodes was 13 hours, 48 minutes, and 33 seconds (13:48:33). The resultant CPU time was .08 seconds per abstract. As a performance benchmark, Automatic Indexing was able to process the same collection at the speed of .005 seconds per abstract.
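The per-abstract figure follows directly from the reported totals, as the short calculation below shows. The function name `seconds_per_abstract` is hypothetical; the inputs are the numbers quoted above.

```python
def seconds_per_abstract(hours, minutes, seconds, n_abstracts):
    """Convert a total CPU time (h:m:s) into CPU seconds per abstract."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds / n_abstracts


az = seconds_per_abstract(13, 48, 33, 623_690)
print(round(az, 2))       # 0.08 seconds per abstract
print(round(az / 0.005))  # about 16 times the Automatic Indexing cost
```

The ratio makes the cost of part-of-speech tagging concrete: roughly a sixteenfold increase in CPU time per abstract relative to Automatic Indexing.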

Though computationally expensive, the AZ Noun Phraser is capable of processing large collections of documents and can therefore be considered a viable tool for generating noun phrases for large-scale medical digital libraries.

5. Conclusions and Discussion

The AI Lab Medical Group's goal is to develop techniques to improve medical information retrieval and to create more effective interfaces for both medical and non-medical users. Several conclusions can be drawn from our experiment:

5.1. The AZ Noun Phraser Is Comparable to Other Noun Phrase Generation Techniques

Our results clearly show that the AZ Noun Phraser is as good as or better than other phrase generation techniques we compared in our study.

5.2. The AZ Noun Phraser Performance Improved With the Addition of SPECIALIST Lexicon

We were able to show that the SPECIALIST Lexicon increased the ability of the AZ Noun Phraser to generate relevant noun phrases using the CANCERLIT test collection. However, it is surprising to see how well the AZ Noun Phraser was able to perform in the medical domain without the aid of the medical lexicon. For this experiment we used the SPECIALIST Lexicon issued in 1997. The newer version released by NLM in January 1998 will be incorporated into later releases of AZ Noun Phraser.

5.3. The AZ Noun Phraser Is Computationally Feasible for Processing Large Collections

Although more computationally expensive than the existing efficient automatic indexing tool, the AZ Noun Phraser demonstrated its ability to process large-scale digital library collections.

Our experiments provide evidence of the viability of our technique for future experimentation and application. Future research will further analyze its performance by incorporating the AZ Noun Phraser into existing tools such as our Concept Spaces for medical and non-medical domains. User studies are currently underway to compare the usefulness of the AZ Noun Phraser and Concept Spaces in geographical information systems, Internet searching, and law enforcement, and work is underway to expand the Concept Space to the complete CANCERLIT collection. These experiments will focus on determining which version of the AZ Noun Phraser is best suited to each of these domains.

We were pleased with the improvement generated by incorporating NLM's SPECIALIST Lexicon. Additional opportunities to incorporate more of the NLM's UMLS tools exist. We intend to pursue the incorporation of the Semantic Net and the Metathesaurus in future releases of the AZ Noun Phraser and Concept Space to investigate query expansion and high precision filtering of retrieved documents. These improvements will eventually be built into a medical information retrieval interface to the entire CANCERLIT collection that we hope can more effectively serve the needs of seekers of information in the literature database.

Acknowledgements

This project was supported in part by the following grants:
  NSF/ARPA/NASA Digital Library Initiative, IRI-9411318, 1994-1998 (B. Schatz, H. Chen, et al., Building the Interspace: Digital Library Infrastructure for a University Engineering Community),
  NSF CISE, IRI-9525790, 1995-1998 (H. Chen, Concept-based Categorization and Search on the Internet: A Machine Learning, Parallel Computing Approach),
  National Computational Science Alliance (NCSA) IRI970000N and IRI970002N, enabling the use of the NCSA SGI/CRAY Origin2000, 1997-1999.
  National Library of Medicine (NLM) Toxicology and Environmental Health Research Participation Program through the Oak Ridge Institute for Science and Education (ORISE), 1996-1997.

Additional funding and support were provided by the National Cancer Institute and the National Institutes of Health.

We would like to thank AI Lab staff members, in particular Robin Sewell, for her assistance with the user study; Nuala Bennett at CANIS Research Lab, University of Illinois and Tamas Doszkocs at the National Library of Medicine for their suggestions and knowledge of NLP systems; and Alexa McCray at the National Library of Medicine and Susan Hubbard and Nick Martin at the National Cancer Institute.

References
Adam, N.R., & Yesha, Y. (1996). Strategic directions in electronic commerce and digital libraries: Towards a digital agora. ACM Computing Surveys, 28, 818-835.
Anick, P.G., & Vaithyanathan, S. (1997). Exploiting clustering and phrases for context-based information retrieval. Paper presented at the 20th annual international ACM SIGIR conference on research and development, Philadelphia, PA.
Arppe, A. (1995). Term extraction from unrestricted text. Paper presented at the 10th Nordic conference on computational linguistics (NODALIDA-95), Helsinki, Finland.
Bates, M.J. (1986). Subject access in online catalogs: A design model. Journal of the American Society for Information Science, 37, 357-376.
Boguraev, B., & Pustejovsky, J. (1996). Issues in text-based lexical acquisition. Corpus processing for lexical acquisition. Cambridge, MA: MIT Press.
Brill, E. (1993). A corpus-based approach to language learning. Unpublished Ph.D. Dissertation, University of Pennsylvania, Philadelphia.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing. Computational Linguistics, 21, 543-565.
Chen, H., Martinez, J., Kirchhoff, A., Ng, T.D., & Schatz, B.R. (1998a). Alleviating search uncertainty through concept associations: Automatic indexing, co-occurrence analysis, and parallel computing. Journal of the American Society for Information Science, 49, 206-216.
Chen, H., Schatz, B.R., Ng, D.T., & Yang, M.S. (1999). Breaking the semantic barrier: A concept space experiment on the convex exemplar parallel supercomputers. Submitted to Journal of the American Society for Information Science.
Chen, H., Schatz, B.R., Ng, T.D., Martinez, J.P., Kirchhoff, A.J., & Lin, C. (1996a). A parallel computing approach to creating engineering concept spaces for semantic retrieval: The Illinois digital library initiative project (Grant Submission).
Chen, H., Schuffels, C., & Orwig, R. (1996b). Internet categorization and search: A machine learning approach. Journal of Visual Communications and Image Representation, 7, 88-102.
Chen, H., Zhang, Y., & Houston, A.L. (1998b). Semantic indexing and searching using a Hopfield net. Journal of Information Science, 24.
Cimino, J.J., Johnson, S.B., Peng, P., & Aguirre, A. (1994). From ICD9-CM to MeSH using the UMLS: A how-to guide. Paper presented at the annual symposium on computer applications in medical care.
Cooper, G.F., & Miller, R.A. (1998). An experiment comparing lexical and statistical methods for extracting MeSH terms from clinical free text. Journal of the American Medical Informatics Association, 5, 62-75.
Crouch, C.J. (1990). An approach to the automatic construction of global thesauri. Information Processing and Management, 26, 629-640.
Cullingford, R.E. (1986). Natural language processing. Totowa, NJ: Rowman and Littlefield.
Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A practical part of speech tagger. Paper presented at the 3rd conference on applied language processing, Trento, Italy.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.
Detmer, W.M., & Shortliffe, E.H. (1997). Using the internet to improve knowledge diffusion in medicine. Communications of the ACM, 40, 101-108.
Devanbu, P., Brachman, R., Selfridge, P., & Ballard, B. (1991). LaSSIE: A knowledge-based software information system. Communications of the ACM, 34, 34-49.
Doszkocs, T.E. (1983). CITE NLM: Natural-language searching in an online catalog. Information Technology and Libraries, 2, 364-380.
Dumais, S.T. (1994). Latent semantic indexing (LSI) and TREC-2. Text retrieval conference (TREC-2) (pp. 105-115).
Evans, D.A. (1994). Specifying adverse drug reactions for formulating contexts through CLARIT processing of medical abstracts. Paper presented at the proceedings of RIAO '94, New York, NY.
Fox, E.A., & Marchionini, G. (1998). Toward a worldwide digital library. Communications of the ACM, 41, 29-32.
Furnas, G.W., Landauer, T.K., Gomez, L.M., & Dumais, S.T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30, 964-971.
Gallant, S.I. (1988). Connectionist expert system. Communications of the ACM, 31, 152-169.
Girardi, M.R., & Ibrahim, B. (1993, April 30th). An approach to improve the effectiveness of software retrieval. Paper presented at the 3rd annual Irvine software symposium, University of California, Irvine, CA.
Halverson, P. (1995). Document processing: Overview. In R.A. Cole (Ed.), Survey of the state of the art in human language technology (pp. 255-258). New York, NY: Cambridge University Press.
Harter, S.P. (1996). Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47, 37-49.
Hartley, H.O., Rao, J.N.K., & LaMotte, L. (1978). A simple synthesis-based method of variance component estimation. Biometrics, 34, 233-244.
Hersh, W.R. (1996). Information retrieval: A health care perspective, 1st ed. New York, NY: Springer-Verlag.
Houston, A.L., Chen, H., Hubbard, S.M., Schatz, B.R., Ng, T.D., Sewell, R.R., & Tolle, K.M. (1999a). Data mining on the Internet: Research on a cancer information system. AI Review.
Houston, A.L., Chen, H., Schatz, B.R., Hubbard, S.M., Sewell, R.R., & Ng, T.D. (1999b). Exploring the use of concept space to improve medical information retrieval. International Journal of Decision Support Systems.
Johnson, S.B., Aguirre, A., Peng, P., & Cimino, J. (1994). Interpreting natural language queries using the UMLS. Paper presented at the annual symposium on computer applications in medical care.
Karlsson, F., & Karttunen, L. (1995). Sub-sentential processing. New York, NY: Cambridge University Press.
Lewis, D.D., & Croft, B. (1990). Term clustering of syntactic phrases. Paper presented at the proceedings of the 13th international ACM SIGIR conference on research and development in information retrieval.
Lewis, D.D., & Sparck-Jones, K. (1996). Natural language processing for information retrieval. Communications of the ACM, 39, 92-101.
Lynch, C., & Garcia-Molina, H. (1995). Interoperability, scaling and the digital libraries research agenda. Reston, VA: Information Infrastructure Technology and Applications (IITA) Digital Libraries Workshop.
Mauldin, M. (1991). Retrieval performance in Ferret. Paper presented at the proceedings of the 14th ACM SIGIR conference on research and development in information retrieval, Chicago, IL.
Quirk, R. (1985). A comprehensive grammar of the English language. London, UK: Longman.
Ramsey, M., Chen, H., Zhu, B., & Schatz, B. (1999). A collection of visual thesauri for browsing large collections of geographic images. Journal of the American Society for Information Science (Perspectives Issue on Visual Information Retrieval Interfaces).
Salton, G. (1986). Another look at automatic text-retrieval systems. Communications of the ACM, 29, 648-656.
Salton, G. (1989). Automatic text processing. Addison-Wesley Publishing Company Inc.
Salton, G., Wong, A., & Yang, C.S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18, 613-620.
Srinivasan, P. (1996). Query expansion and MEDLINE. Information Processing and Management, 32, 431-443.
Tolle, K.M. (1997). Improving concept extraction from text using noun phrasing tools: An experiment in medical information retrieval. Unpublished Masters Thesis, University of Arizona, Tucson.
UMLS. (1998). UMLS knowledge sources, 9th ed., U.S. Dept. of Health and Human Services.
Voutilainen, A. (1997). A short introduction to NPtool. Available at: http://www.lingsoft.fi/doc/nptool/intro/.
Zaenen, A., & Uszkoreit, H. (1995). Language analysis and understanding. In R.A. Cole (Ed.), Survey of the state of the art in human language technology (pp. 109-110). New York, NY: Cambridge University Press.