Informatics and Data Science

Informatics plays a key role in the scientific discovery process by building the infrastructure and tools that connect the EDRN research institutions into a virtual knowledge system.

In collaboration with the Jet Propulsion Laboratory (JPL), the EDRN has developed informatics tools and databases to support biomarker development and validation. These include tools for the processing, capture, curation, and sharing of data before publication; a national biomarker knowledge system; a biomarker data infrastructure consisting of 1,500 biomarkers, 200 protocols, 2,500 publications, and 100 terabytes of data; and pilot imaging projects. These tools and databases are accessible online.

This infrastructure captures and links data from across the EDRN, connecting nearly 1,500 annotations of cancer biomarkers to terabytes of analysis results in the EDRN data commons, known as LabCAS. LabCAS supports the capture of data from validation studies, linking data from laboratory tracking tools at the DMCC to the analysis data captured in the EDRN. Several tools are open source and are developed through collaborations with NCI’s Information Technology for Cancer Research (ITCR) program. The entire knowledge environment is integrated with the EDRN website portal, providing secure, multi-layer access to data for the EDRN, NCI, research, and public communities.

Big Data and Data Science

EDRN is actively researching next-generation capabilities for crowdsourcing, machine learning, computational analysis, and visualization.

JPL and Caltech have been working to bring in tools, such as Zooniverse, that have been used successfully in fields such as astronomy. These tools have been useful for generating the large labeled data sets required to train machine learning algorithms to classify features in images. JPL and Caltech have been working with different PIs to explore how these tools, through crowdsourcing and collaborative methods in data analysis, can improve the capture, annotation, and construction of databases. The goal is to examine how data-driven approaches (e.g., deep learning) can be applied. The capture of EDRN data in LabCAS provides a foundation for opening up new possibilities in these areas and enabling new analysis approaches for consortiums like the EDRN, which are highly distributed and diverse.

Additionally, Caltech, JPL, and NCI have explored the use of nascent virtual reality (VR) capabilities for the analysis of multidimensional data. A recent prototype exploring lung cancer was presented at SIGGRAPH 2019, one of the foremost research conferences in visualization and graphics. The VR prototype demonstrated multidimensional 3D visualization of EDRN’s radiology data, integrated with LabCAS. It also demonstrated opportunities for entirely new approaches to data exploration as data increases in size and complexity.

The following provides an overview of EDRN’s progress in data science: