Helioid Applies for NSF SDCI Grant

The National Science Foundation’s Software Development for Cyberinfrastructure (SDCI) program is dedicated to the development of widely redeployable software systems that advance the methods of research and education employed across a broad range of fields in science and engineering. SDCI directly addresses the software-centric issues laid out in NSF’s Cyberinfrastructure Vision for 21st Century Discovery. It seeks to fund and promote innovative software solutions to problems such as managing and analyzing extremely large collections of data, collaborating across disparate disciplines and institutions, and making high-performance computing systems both more productive and more accessible. SDCI addresses these issues through awards given across five software focus areas (HPC systems, digital data management, broadband and networking, middleware, and cybersecurity) as well as three cross-cutting issues in software development (sustainability, self-manageability, and energy efficiency).

Helioid has proposed, under the Digital Data focus area, to develop a set of hybridized clustering algorithms capable of generating optimally structured ontologies from multiple graphs of unstructured data. These algorithms would support a knowledge management software suite that allows more productive and intuitive exploration of extremely large collections of unstructured data through interactive visualizations of the clusterings the hybrid algorithms generate. By structuring the documents or other data segments in a collection so that they intuitively represent both the conceptual connections between them and the topic structure of the corpus, these knowledge management tools would allow researchers to more easily discover data from pre-existing research pertinent to their interests. The tools would also make it easier for relative novices in a given subject area to familiarize themselves with the ongoing research in that domain, essentially by providing them with an interactive map of the published work or shared data available for that field. To further meet individual researchers’ needs, the knowledge management tools would be self-refining, using data generated by user behavior to assess which topics the user seems most interested in and adjusting the visual representations accordingly. We have also proposed to apply our algorithms to the generation of metadata for large corpuses, given a set of pre-tagged data from which to extrapolate the appropriate context for each tag. Since most of our work has been in clustering documents using text-based and link-based graphs of a given corpus, we have proposed to focus on structuring large collections of textual data, while designing the hybridization algorithms, and the knowledge management tools that employ them, to be adaptable to as wide a range of data types as possible.
By giving as much meaningful structure as possible to these unstructured datasets, and by better enabling scientists to productively navigate the results of past research in a given field, the open-source knowledge management software Helioid proposes to develop would help accelerate the pace of scientific innovation and discovery.

For a more substantive description of what Helioid has proposed for this program, see our project summary below.

Project Summary

The development of a suite of knowledge management and discovery software is proposed for the Software Development for Cyberinfrastructure program under the Software for Digital Data focus area. The proposed project will build upon established theory in knowledge representation to develop hybrid clustering algorithms using multiple views of datasets; design an intuitively navigable information visualization schema to graphically represent data clusterings; and utilize these algorithms and visualization schemas to develop software for knowledge management and discovery. These knowledge management tools will enable researchers to browse large document collections with increased dexterity, efficiency, and efficacy. The hybrid clustering algorithms work by generating several directed graphs in which the nodes are documents, the edges are links between documents (e.g., bibliographical citations, hyperlinks, and text similarity), and each edge carries a relative weight, and then applying a spectral clustering technique to find the optimal clustering given the different graphs. The knowledge management tools present users with navigable, malleable representations of the document corpus, displaying conceptual connections between documents or topics. As users interact with the search refinement tools, information regarding their interests and intentions is gleaned from their browsing behavior and is then used to refine the clustering algorithms to reflect the users’ particular needs. These tools will facilitate explorative search and will better enable understanding of the bodies of knowledge encompassed by online corpuses such as Citeseer.
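To make the multi-graph idea concrete, here is a minimal sketch of one way several document graphs could be merged and then split with a spectral method. It is not Helioid's algorithm: the fixed weighted sum used to combine the views, the two-way split, the toy data, and the names `combine_views` and `spectral_bipartition` are all assumptions made for illustration. It uses only the Python standard library and finds the second eigenvector of the normalized affinity matrix by power iteration.

```python
import math


def combine_views(views, weights):
    """Merge several directed graphs (adjacency matrices) into one
    symmetric affinity matrix. The weighted sum is an assumed scheme."""
    n = len(views[0])
    combined = [[0.0] * n for _ in range(n)]
    for view, weight in zip(views, weights):
        for i in range(n):
            for j in range(n):
                # A link in either direction counts as affinity between i and j.
                combined[i][j] += weight * 0.5 * (view[i][j] + view[j][i])
    return combined


def spectral_bipartition(affinity, iterations=500):
    """Split nodes into two clusters by the sign of the second eigenvector
    of S = D^-1/2 A D^-1/2. Assumes every node has positive degree."""
    n = len(affinity)
    degree = [sum(row) for row in affinity]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    S = [[affinity[i][j] * inv_sqrt[i] * inv_sqrt[j] for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of S is proportional to sqrt(degree); remove it.
    total = math.sqrt(sum(degree))
    top = [math.sqrt(d) / total for d in degree]

    def deflate(vec):
        dot = sum(a * b for a, b in zip(vec, top))
        return [a - dot * b for a, b in zip(vec, top)]

    v = [(-1.0) ** i * (i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        v = deflate(v)
        # Multiply by (S + I) so all eigenvalues are non-negative.
        w = [v[i] + sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    v = deflate(v)
    return [0 if x >= 0 else 1 for x in v]


# Toy corpus of six documents: two citation cycles joined by one cross-citation.
citations = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
]
# Hypothetical text-similarity scores: high within a topic, low across topics.
text_similarity = [
    [0.0 if i == j else (0.8 if (i < 3) == (j < 3) else 0.1) for j in range(6)]
    for i in range(6)
]

labels = spectral_bipartition(
    combine_views([citations, text_similarity], [0.5, 0.5])
)
print(labels)
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping is meaningful. A fuller treatment would learn the view weights rather than fix them, and would use more eigenvectors to produce more than two clusters.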

Intellectual Merit of the Proposed Activity
The knowledge management tools developed by the project expand upon recent innovations in topic modeling and information visualization. A wide range of studies have explored the use of hierarchically structured visual representations of corpuses of published work to detect emerging research paradigms, identify leaders or exemplars within research subject areas, and even point the way to new research topics. These studies do so by using concept maps or category collages to illustrate the conceptual connections between different papers, collections of papers, or disciplines. The project aims to build upon these successes by combining multiple algorithms for clustering the documents in a given corpus. The proposed project applies these innovations to collections of scientific publications and to the large collections of documents and information available on the web. By graphically representing the conceptual connections between documents pertaining to a given knowledge domain, it enables productive exploratory search of these corpuses and provides a means of gaining a more substantial understanding of the overarching ideas in a subject area.

Broader Impact of the Proposed Activity
The hybridization algorithm developed by the proposed project can be used to optimize across clustering algorithms that take a wide variety of input data types, beyond the text-based and link-based clustering algorithms on which the project focuses. The knowledge management tools developed by the project will be open source and can be redeployed across a broad range of research activities in any subject domain. The project builds a wide network of researchers, students, and educators who will use the resources it develops to more effectively discover new information or gain an understanding of a subject area by exploring a corpus of published work in that area. By enabling researchers to more effectively locate existing research related to their own, including work of which they were previously unaware, and to familiarize themselves with peripherally related research, the proposed project’s knowledge management tools will help accelerate the pace of scientific innovation. These tools will also allow relative novices to more efficiently gain an understanding of unfamiliar subject areas or engage in meaningful research. This will effectively lower the barriers to entry for groups underrepresented in these disciplines, as well as for low-income students.
