INSITE Center for Business Intelligence and Analytics is developing graph algorithms for biomedical big data.
Non-communicable diseases (NCDs) such as diabetes are caused by social, environmental, and genetic factors. Rapidly growing datasets from big data sources such as genome sequencing, social media, and medical sensors make it possible to explore factors associated with NCDs in new ways. These data hold the promise of a new understanding of complex disease that can be applied to prevention and directed therapy. Yet the toolkit and big data platform needed to analyze these large-scale datasets are underdeveloped, which hinders both the use of new data and its translation into biomedical knowledge. To unlock these valuable resources, a new ecosystem of interoperable, scalable, and discoverable biomedical data resources is required. To this end, we propose to develop big data tools using graph database technology for biomedical research through an interdisciplinary collaboration called BRIDGE (Biomedical Research Innovation through Dynamic Graph Engineering).
We aim to use and expand on rapidly developing graph database technology from open source projects to: (1) build tools to convert biomedical data into interoperable graph databases, (2) develop methods to interconnect disparate graph databases, (3) develop robust strategies for preserving privacy when translating biomedical data to graph database networks, and (4) develop methods for visualizing network data for exploratory and predictive analyses. These tools can be applied broadly in any biomedical or disease area to uncover unseen patterns across disparate big data sources. Further, the proposed research addresses unsolved problems in graph database theory, especially privacy preservation and interoperability of graph database networks, and its results are broadly applicable to big data analytics in any field of study.
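Aims (1) through (3) can be illustrated with a minimal sketch in plain Python. It is not the BRIDGE toolkit or any Neo4j or Hadoop API. The record fields, the `SECRET_KEY` value, and the functions `pseudonymize`, `records_to_graph`, and `link_graphs` are all hypothetical names introduced here for illustration. The sketch assumes that flat records become patient and observation nodes, that raw identifiers are replaced by a keyed hash before they enter the graph, and that two graphs are linked by merging nodes that share a key. Keyed hashing is only pseudonymization and stands in for the more robust privacy strategies the proposal sets out to develop.

```python
import hashlib
import hmac

# Hypothetical secret key, for illustration only.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace a raw identifier with a truncated keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def records_to_graph(records, source):
    """Convert flat records into a small property graph.

    Nodes are keyed by (label, identifier) and remember which sources
    mentioned them; edges are (node, relationship, node) triples.
    """
    nodes, edges = {}, set()
    for rec in records:
        patient = ("Patient", pseudonymize(rec["patient_id"]))
        observation = ("Observation", rec["observation"])
        for key in (patient, observation):
            nodes.setdefault(key, {"sources": set()})["sources"].add(source)
        edges.add((patient, "HAS_OBSERVATION", observation))
    return nodes, edges


def link_graphs(graph_a, graph_b):
    """Link two graphs by treating nodes with identical keys as one entity."""
    nodes = {}
    for graph_nodes in (graph_a[0], graph_b[0]):
        for key, props in graph_nodes.items():
            merged = nodes.setdefault(key, {"sources": set()})
            merged["sources"].update(props["sources"])
    edges = graph_a[1] | graph_b[1]
    return nodes, edges


# Made-up example records from a traditional and a non-traditional source.
ehr_records = [{"patient_id": "P001", "observation": "HbA1c elevated"}]
sensor_records = [
    {"patient_id": "P001", "observation": "low daily step count"},
    {"patient_id": "P002", "observation": "low daily step count"},
]

merged_nodes, merged_edges = link_graphs(
    records_to_graph(ehr_records, "ehr"),
    records_to_graph(sensor_records, "sensor"),
)
```

In this sketch the same patient appearing in both sources collapses to a single node because both conversions apply the same keyed hash, which is what allows linking without storing the raw identifier. A production system would use the graph database's own merge and constraint mechanisms in place of in-memory dictionaries.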
This work has the capacity to transform how biomedical data are stored, analyzed, and visualized, moving towards an integrated biomedical knowledge environment. Our proposal focuses on four core areas of data science research to develop a robust data toolkit: (1) graph database storage and fast retrieval using the Hadoop architecture, (2) data linking via graph databases, (3) privacy preservation for network data in graph databases, and (4) large-scale map-based data visualization of networks. This big data toolkit will initially be developed in partnership with clinicians and biomedical researchers using data from both traditional sources (e.g. electronic health records and lab tests) and non-traditional sources (e.g. sensors and social media), allowing for connectivity across data modalities. Moreover, the characteristics of different data types will be explored within the same unified framework, from very small to very large datasets and from sparse to dense datasets. This work therefore offers advances in: (1) how large-scale clinical datasets are stored and mined, (2) data sharing, by using informatics methods to lower the barriers that patient privacy requirements impose, and (3) visualization of large-scale multi-dimensional datasets for discovery and clinical interpretation. BRIDGE proposes to develop innovative tools for big data analytics by extending open source big data technologies and providing new health-related data functionality. These big data tools will be fully integrated with one another and made available as open source software. Further, to contribute to the broader big data analytics infrastructure, all code will be contributed back to the open source projects it is based on (e.g. Neo4j, Hadoop, GMap) for continued community development.