The Foundations of Data and Visual Analytics (FODAVA) program at the NSF promotes innovative approaches to information visualization and representation that allow users to easily see existing patterns and discover new patterns in their data. Particular attention is given to large and potentially dynamic data sets. Algorithms and techniques developed within the program are potentially applicable to a number of areas, including scientific, engineering, commercial, financial, and governmental applications. The program is operated in conjunction with the Department of Homeland Security, as the tools developed are expected to be applicable to intelligence analysis.
Helioid’s proposal is to develop knowledge management tools based on hybrids of current clustering and topic modeling algorithms (see below). These tools will integrate adaptive algorithms with visualization interfaces responsive to user feedback. With regard to intelligence applications, consider the task of locating Osama Bin Laden. We’d like to organize the available information so that, given the name, a path is constructed to potential locations, which can then be further investigated and discriminated between. The visualization tool’s task is to present the information leading outward from the name so that the user can accurately and quickly narrow down the future path. In this case, with such a poorly defined fitness function (whether the path being followed is leading to Bin Laden’s location), we’d want to allow the user to return to previous paths if the ones explored aren’t producing results. We can picture the algorithm as providing the user with a partial automaton: at each step the user is presented with potential transitions and chooses a transition (or transitions) to reach the next state. The algorithm provides priors on how likely each potential transition is to lead to the goal.
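To make the automaton picture concrete, here is a minimal Python sketch of prior-weighted exploration with backtracking. It is not the proposal's algorithm: the class name `PathExplorer`, the state labels, and the prior values are all hypothetical, and for simplicity the user follows one transition at a time.

```python
class PathExplorer:
    """Prior-weighted state exploration with backtracking (illustrative sketch)."""

    def __init__(self, transitions, start):
        # transitions[state] maps each reachable next state to a prior weight.
        self.transitions = transitions
        self.path = [start]

    @property
    def current(self):
        return self.path[-1]

    def options(self):
        """Candidate next states from the current state, highest prior first."""
        candidates = self.transitions.get(self.current, {})
        return sorted(candidates.items(), key=lambda item: item[1], reverse=True)

    def choose(self, state):
        """Follow the transition the user selected."""
        if state not in self.transitions.get(self.current, {}):
            raise ValueError(f"no transition from {self.current!r} to {state!r}")
        self.path.append(state)

    def backtrack(self):
        """Return to the previous state when the current path is unproductive."""
        if len(self.path) > 1:
            self.path.pop()
        return self.current


# Hypothetical states and priors, invented for illustration only.
transitions = {
    "name": {"known associates": 0.6, "past sightings": 0.3, "media mentions": 0.1},
    "known associates": {"region A": 0.7, "region B": 0.3},
    "past sightings": {"region B": 0.8, "region C": 0.2},
}

explorer = PathExplorer(transitions, "name")
explorer.choose("known associates")
explorer.choose("region A")
explorer.backtrack()  # "region A" led nowhere, so return to "known associates"
print(explorer.current, explorer.options())
```

Keeping the full path as a list is what makes returning to earlier states cheap. In a real tool the priors would presumably come from the clustering algorithms and be updated from user feedback rather than fixed by hand.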
There is certainly much more that could be said on both the applications and the theory behind this process. For now, refer to the summary below for details on our proposal.
Project Summary
The proposed project will build upon established theory in knowledge representation to develop a set of multi-view clustering algorithms; develop a hybridization algorithm, using a combination of spectral analysis and stacking algorithms, to find an optimal synthesis of these clustering algorithms; and utilize these algorithms to support a set of graphical tools for knowledge management and discovery. These knowledge management tools will enable users to browse large document collections with increased dexterity, efficiency, and efficacy, by applying novel hybrid clustering/topic-modeling algorithms to the collections. These hybrid clustering algorithms are derived by generating various directed graphs of documents, links between documents (e.g. bibliographical citations, hyperlinks and text similarity), and relative weights of the links, and then applying a spectral clustering technique to find the optimal clustering given the different graphs. The project will develop a suite of information visualization schemas which will intuitively represent the topic model structure of a given corpus derived by these algorithms. The hybrid clustering algorithms and information visualization framework developed by the project will together support a set of interactive knowledge management tools. These knowledge management tools present users with navigable, malleable representations of the document corpus, displaying conceptual connections between documents or topics. The document clusterings and their graphical representations will be continuously refined by feedback generated by user activity. As users interact with the search refinement tools, information regarding their interests and intentions is gleaned from their browsing behavior and is then used to refine the clustering algorithms to reflect the users’ particular needs. These tools will facilitate explorative search and will better enable understanding of the bodies of knowledge encompassed by the document collections.
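As a rough illustration of the hybrid clustering step described above, the sketch below blends several weighted document graphs into one and splits the result in two using the sign of the Fiedler vector of its Laplacian, a basic form of spectral clustering. This is only a sketch under stated assumptions: directed weights are symmetrized, the blend weights are fixed by hand rather than optimized, and the function names and example graphs are hypothetical.

```python
import math


def combine_graphs(graphs, weights):
    """Blend several adjacency matrices into one symmetric weighted graph."""
    n = len(graphs[0])
    combined = [[0.0] * n for _ in range(n)]
    for graph, weight in zip(graphs, weights):
        for i in range(n):
            for j in range(n):
                # Assumption: a directed link counts half in each direction.
                combined[i][j] += weight * (graph[i][j] + graph[j][i]) / 2
    return combined


def fiedler_vector(adj, iterations=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue.

    Runs power iteration on (c*I - L) while projecting out the constant vector.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    c = 2 * max(degree)  # upper bound on the eigenvalues of L = D - A
    v = [i - (n - 1) / 2 for i in range(n)]  # start orthogonal to the constant vector
    for _ in range(iterations):
        w = [(c - degree[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]  # remove the trivial constant component
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def spectral_bisect(adj):
    """Label each node 0 or 1 by the sign of its Fiedler component (node 0 gets 0)."""
    v = fiedler_vector(adj)
    return [0 if (x >= 0) == (v[0] >= 0) else 1 for x in v]


# Hypothetical six-document corpus: a symmetric text-similarity graph
# and a directed citation graph over the same documents.
text = [[0.0] * 6 for _ in range(6)]
for i, j, sim in [(0, 1, 0.8), (0, 2, 0.8), (1, 2, 0.8),
                  (3, 4, 0.7), (3, 5, 0.7), (4, 5, 0.7), (2, 3, 0.1)]:
    text[i][j] = text[j][i] = sim

cites = [[0.0] * 6 for _ in range(6)]
for src, dst in [(0, 1), (1, 2), (3, 4), (4, 5), (5, 3), (2, 3)]:
    cites[src][dst] = 1.0

combined = combine_graphs([text, cites], [0.5, 0.5])
print(spectral_bisect(combined))
```

Splitting into more than two clusters would need additional eigenvectors and a step such as k-means on the resulting embedding. The user-feedback loop could then be modeled as adjusting the blend weights passed to `combine_graphs`.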
Intellectual Merit of the Proposed Activity
The knowledge management tools developed by the project expand upon recent innovations in topic modeling and information visualization. Implementations of these innovative approaches to knowledge management have already proven successful in stimulating innovation amongst professionals within specific academic disciplines. A wide range of studies have explored the use of hierarchically structured representations of corpora of published work, together with graphic tools such as concept maps or category collages, to detect emerging research paradigms, identify leaders or exemplars within research subject areas, and even point the way to new research topics. These studies accomplish this by using concept maps and category collages to illustrate the conceptual connections between different papers, collections of papers, or different disciplines. The primary problem the project addresses is then to find and deploy an optimal synthesis of these different approaches to knowledge management. The proposed project applies these innovations to collections of scientific publications and to the large collections of documents and information available on the web. This enables productive exploratory search of these corpora, and provides a means of gaining a more substantial understanding of the overarching ideas in a subject area, by graphically representing the conceptual connections between documents pertaining to a given knowledge domain. Moreover, the project increases the involvement of relative novices in given academic subjects by getting educators and students acquainted with and involved in contemporary research relating to a given lesson.
Broader Impact of the Proposed Activity
The hybridization algorithm developed by the proposed project can be used to optimize across clustering algorithms utilizing a wide variety of types of input data, beyond the text-based and link-based clustering algorithms focused on by the project. Any data transformation that can represent a corpus as a graph with nodes, edges connecting nodes, and weights for the edges can be hybridized using these algorithms with any other graph representing the same data. The project will make the hybridization algorithms available to future FODAVA projects, and to other projects in the fields of document classification, knowledge management, visual analytics, etc., in order to enable these future projects to experiment with combinations of different data transformations. The project will also make the knowledge management system it develops available via the web, in order to provide as wide a user base as possible with access to these research tools. The project builds a wide network of researchers, students and educators who will use the resources developed by the project to more effectively discover new information or gain an understanding of a subject area through the exploration of a corpus of published work in that area. By enabling researchers to more effectively locate existing research related to their own, including work of which they were previously unaware, and to familiarize themselves with peripherally related research, the proposed project’s knowledge management tools will facilitate an acceleration in the pace of scientific innovation. By presenting users with an intuitive, navigable representation of the topics studied in a given subject area, and their relationships to one another, these knowledge management tools will allow relative novices to more efficiently gain an understanding of these subject areas.
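The graph requirement stated at the start of this section (nodes, edges, and edge weights over the same data) can be illustrated with a small Python sketch. Two different data transformations, bag-of-words cosine similarity and link structure, are each turned into a weighted edge dictionary over the same document indices and then blended. The function names, the edge-dictionary representation, and the fixed blend weights are assumptions for illustration, not the project's actual interface.

```python
import math
from collections import Counter


def text_similarity_graph(docs):
    """Edges {(i, j): cosine similarity} between bag-of-words vectors, for i < j."""
    vectors = [Counter(text.lower().split()) for text in docs]
    norms = [math.sqrt(sum(c * c for c in vec.values())) for vec in vectors]
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            dot = sum(count * vectors[j][term] for term, count in vectors[i].items())
            if dot:
                edges[(i, j)] = dot / (norms[i] * norms[j])
    return edges


def link_graph(links):
    """Edges {(i, j): 1.0} for each citation or hyperlink, treated as undirected."""
    edges = {}
    for source, target in links:
        if source != target:
            edges[(min(source, target), max(source, target))] = 1.0
    return edges


def hybridize(graphs, weights):
    """Weighted sum of edge weights across graphs over the same node set."""
    combined = {}
    for graph, weight in zip(graphs, weights):
        for edge, value in graph.items():
            combined[edge] = combined.get(edge, 0.0) + weight * value
    return combined


# Hypothetical three-document corpus with one citation from document 2 to document 0.
docs = [
    "spectral clustering of graphs",
    "graphs and spectral methods",
    "visual analytics interfaces",
]
combined = hybridize([text_similarity_graph(docs), link_graph([(2, 0)])], [0.5, 0.5])
print(combined)
```

Because every transformation produces the same kind of edge dictionary, a new data source only needs its own `*_graph` function to take part in the blend.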
By making it easier for novice researchers to familiarize themselves with new subject areas and engage in meaningful scientific or other academic research, the project’s knowledge management tools will also lower the barriers to entry for underrepresented groups in these disciplines (e.g. minorities, women and low-income students). Making the scientific research tools discussed above publicly available will create a means of disseminating information and, more importantly, of promoting an understanding of ongoing scientific research.
