OOHAY: Visualizing the Web

Research Goals

DLI-1: The University of Arizona's phase 1 Digital Library project was one of the original NSF-funded Digital Library Initiative (DLI) projects. The project aimed to develop techniques that enhance information retrieval over large digital collections and support semantic interoperability across subject domains.

OOHAY: Visualizing the Web (DLI-2): The Object Oriented Hierarchical Automatic Yellow Page (OOHAY) project is the University of Arizona's project under the NSF Digital Library Initiative phase 2 (DLI-2). Its goal is to develop techniques and methodologies for automatically analyzing and visualizing large collections of unstructured documents. The project will integrate system-generated and human-generated classification systems to create a high-performance digital library classification system (i.e., the OOHAY system).

Introduction

As digital library applications grow larger, more pressing, and more diverse, several well-known information retrieval (IR) problems have become even more urgent in this network-centric information age. Conventional approaches to information overload and interoperability are manual in nature, requiring human experts to act as information intermediaries who create knowledge structures and/or classification systems (e.g., the National Library of Medicine's Unified Medical Language System, UMLS) to bridge vocabulary differences. As information content and collections become still larger and more dynamic, we believe a system-aided, algorithmic, bottom-up approach to creating large-scale digital library classification systems is needed.

Research Questions

(1) Can various clustering algorithms produce classification results comparable to classification systems generated by humans? Which algorithm produces the best results, and under what conditions? (One way to score such comparisons is sketched after this list.)
(2) Are these clustering algorithms computationally feasible for creating classification systems from large-scale digital library collections? What optimization and parallelization techniques are needed to achieve such scalability?
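To make question (1) concrete, agreement between automatic clusters and a human-built classification can be quantified with a standard index such as the adjusted Rand index. The snippet below is illustrative only: the labels are hypothetical toy data, and scikit-learn is used merely as a convenient reference implementation, not as part of the project.

    # Hypothetical sketch: scoring how well automatic clusters agree with
    # a human-generated classification, using the adjusted Rand index.
    from sklearn.metrics import adjusted_rand_score

    # Toy data: human category labels vs. labels from some clustering run.
    human_categories = ["oncology", "oncology", "geology", "geology", "geology"]
    cluster_labels = [0, 0, 1, 1, 0]

    # 1.0 means perfect agreement with the human classification;
    # values near 0 indicate chance-level agreement.
    print(adjusted_rand_score(human_categories, cluster_labels))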

Research Plan

The proposed research aims to develop an architecture, and the associated techniques, for automatically generating classification systems from large textual collections and unifying them with manually created classification systems to support effective digital library retrieval and analysis. The project will include both algorithmic development and user evaluation in several sample domains. Scalable automatic clustering methods, including Ward's clustering, multidimensional scaling, latent semantic indexing, and self-organizing maps, will be developed and compared (a toy sketch of this step follows this paragraph). Because most of these algorithms are computationally intensive, they will be optimized to exploit the sparsity of keyword-based document representations. Using parallel, high-performance platforms as a simulation "time machine," we plan to parallelize and benchmark these clustering algorithms on large-scale collections (on the order of millions of documents) in several domains. The resulting automatic classification systems will be presented using several novel hierarchical display methods.
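As a toy illustration of the clustering step (a sketch under assumed tooling, not the project's implementation), the following Python snippet builds sparse TF-IDF vectors for a handful of documents and applies Ward's clustering via scikit-learn. Note the dense conversion at the end: it is harmless for a toy collection, and it is precisely the kind of cost the proposed sparsity-aware optimizations are meant to avoid at scale.

    # Illustrative sketch only: Ward's clustering over TF-IDF document vectors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    docs = [
        "gene therapy trial for lung cancer",
        "chemotherapy outcomes in breast cancer patients",
        "seismic imaging of petroleum reservoirs",
        "sedimentary basin analysis for oil exploration",
    ]

    # Keyword-based document vectors are extremely sparse in practice,
    # which is exactly what the proposed optimizations exploit.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # This reference implementation requires a dense matrix; a scalable
    # version would operate on the sparse representation directly.
    labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(
        vectors.toarray()
    )
    print(labels)  # expected grouping: medical vs. geoscience documents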

The research testbed will include three application domains, each pairing a large-scale collection with an existing classification system: (1) medicine: CancerLit (700,000 cancer abstracts) and the NLM's UMLS (500,000 medical concepts); (2) geoscience: GeoRef and Petroleum Abstracts (800,000 abstracts) and the GeoRef Thesaurus (26,000 geoscience terms); and (3) the Web: a WWW collection (1.5M web pages) and the Yahoo! classification (20,000 categories). Medical experts, geoscientists, and WWW search engine users will take part in our evaluation plan.

Funding Sources and Acknowledgements

  • National Science Foundation (NSF)
  • Advanced Research Projects Agency (ARPA)
  • National Aeronautics and Space Administration (NASA)
  • National Library of Medicine (NLM)
  • Library of Congress (LOC)
  • National Endowment for the Humanities (NEH)
  • Federal Bureau of Investigation (FBI)

University of Arizona Partners:

  • Artificial Intelligence Lab
  • Department of Management Information Systems
  • Health Sciences Library
  • Arizona Cancer Center
  • Science and Engineering Library

Participating Institutions and Agencies:

  • National Center for Supercomputing Applications (NCSA)
  • American Geological Institute (GeoRef Abstracts)
  • University of Tulsa (Petroleum Abstracts)
  • National Library of Medicine (UMLS)

Corporate Affiliates/Industrial Partners:

  • Silicon Graphics Inc. (SGI)

Approach and Methodology

Technologies:

  • Multi-threaded spiders for web page collection (a minimal sketch follows this list)
  • High-precision web page noun phrasing and entity identification
  • Multi-layered, parallel, automatic web page topic directory/hierarchy generation
  • Dynamic web search result summarization and visualization
  • Adaptive, 3D web-based visualization
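As a minimal sketch of the first technology above (hypothetical code, not the lab's spider), worker threads share a queue of URLs and fetch pages concurrently; the seed URL and thread count are placeholders. A real spider would also need politeness delays, robots.txt handling, and link extraction to feed new URLs back into the queue.

    # Hypothetical minimal multi-threaded spider sketch.
    import queue
    import threading
    import urllib.request

    url_queue = queue.Queue()
    pages = {}

    def worker():
        # Each worker repeatedly takes a URL from the shared queue,
        # fetches the page, and stores its content.
        while True:
            url = url_queue.get()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    pages[url] = resp.read()
            except OSError:
                pass  # skip pages that cannot be fetched
            finally:
                url_queue.task_done()

    for _ in range(4):  # four concurrent fetcher threads
        threading.Thread(target=worker, daemon=True).start()

    url_queue.put("http://example.com/")  # placeholder seed URL
    url_queue.join()
    print(len(pages), "page(s) fetched")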

Team Members

Dr. Hsinchun Chen hchen@eller.arizona.edu
Chia-Jung Hsu  
Chunju Tseng  
Michael Chau  
Jialun Qin  
Wei Xi  
Yilu Zhou  

Publications

  1. H. Chen, Y. Chung, M. Ramsey, and C. Yang "A Smart Itsy Bitsy Spider for the Web," Journal of the American Society for Information Science Special Issue on "AI Techniques for Emerging Information Systems Applications," Volume 49, Number 7, Pages 604-618, 1998.
  2. H. Chen, Y. Chung, M. Ramsey, and C. Yang "An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching", Decision Support Systems , Volume 23, Pages 41-58, May 1998.
  3. D. Roussinov and H. Chen "Document Clustering for Electronic Meetings: An Experimental Comparison of Two Techniques" Decision Support Systems , Volume 27, Pages 67-80, November 1999.
  4. H. Chen "Semantic Research for Digital Libraries" D-Lib Magazine, Volume 5, Number 10/11, October/November 1999.
  5. H. Chen "Digital Libraries" Journal of the American Society for Information Science, Special Issue on Digital Libraries, Volume 51, Number 3, 2000.
  6. H. Chen, Introduction to the Special Topic Issue: Part 2, Towards Building a Global Digital Library, Journal of the American Society for Information Science, Special Issue on Digital Libraries, Volume 51, Number 4, Pages  
    311-312, 2000.
  7. B. Zhu and H. Chen, "Validating a Geographic Image Retrieval System", Journal of the American Society for Information Science, Volume 51, Number 7, Pages 625-634, 2000.
  8. C. Lin, H. Chen and J. F. Nunamaker, "Verifying the Proximity Hypothesis for Self-Organizing Maps,"Journal of Management Information Systems, Volume 16, Number 3, Pages 57-70, 2000.
  9. L. Houston, H. Chen, B. R. Schatz, R. R. Sewell, K. M. Tolle, T. E. Doszkocs, S. M. Hubbard, and D. T. Ng, Exploring the Use of Concept Spaces to Improve Medical Information Retrieval. Decision Support Systems, Special Issue on Decision Support for Health Care in a New Information Age, Volume 30, Number 2, Pages 171-186, 2000.
  10. K. M. Tolle, H. Chen, and H. Chow, Estimating drug/plasma concentration levels by applying neural networks to pharmacokinetic data sets, Decision Support Systems, Special Issue on Decision Support for Health Care in a New Information Age, Volume 30, Number 2, Pages 139-152, 2000.
  11. G. Leroy, K. M. Tolle, and H. Chen, Customizable and Ontology-Enhanced Medical Information Retrieval Interfaces, Methods of Information in Medicine, 2000, forthcoming.
  12. C. C. Yang, J. Yen, and H. Chen, Intelligent Internet Searching Engine based on Hybrid Simulated Annealing, Decision Support Systems, 2000, forthcoming.
  13. D. G. Roussinov and H. Chen, Information Navigation on the Web by Clustering and Summarizing Query Results, Information Processing and Management, 2000, forthcoming.