Informatics and Data Science
Informatics plays a key role in the scientific discovery process by building the infrastructure and tools that connect the EDRN research institutions into a virtual knowledge system.
In collaboration with the Jet Propulsion Laboratory (JPL), the EDRN has developed informatics tools and databases to support biomarker development and validation. These include tools for processing, capturing, curating, and sharing data before publication; a national biomarker knowledge system; a biomarker data infrastructure comprising 1,500 biomarkers, 200 protocols, 2,500 publications, and 100 terabytes of data; and pilot imaging projects. The tools and databases are accessible online.
This infrastructure captures data from across the EDRN and links nearly 1,500 cancer biomarker annotations to terabytes of analysis results in the EDRN data commons, known as LabCAS. LabCAS supports data capture from validation studies, linking data from laboratory tracking tools at the DMCC to the analyses of data captured in the EDRN. Several tools are open source and are developed through collaborations with NCI’s Information Technology for Cancer Research (ITCR) program. The entire knowledge environment is integrated with the EDRN website portal, providing secure, multi-layer access to data for the EDRN, NCI, research, and public communities.
Big Data and Data Science
EDRN is active in research on next-generation capabilities for crowdsourcing, machine learning, computational analysis, and visualization.
JPL and Caltech have been working to bring in tools, such as Zooniverse, that have been used successfully in fields such as astronomy. These tools have been useful for generating the large labeled data sets required to train machine learning algorithms to classify features in images. JPL and Caltech have been working with different PIs to explore the use of these tools to improve the capture, annotation, and construction of databases through crowdsourcing and collaborative data analysis, with the goal of examining how data-driven approaches (e.g., deep learning) can be applied. The capture of EDRN data in LabCAS provides a foundation for opening up new possibilities in these areas and enabling new analysis approaches for consortiums like the EDRN, which are highly distributed and diverse.
Additionally, Caltech, JPL, and NCI have explored the use of nascent virtual reality (VR) capabilities for analysis of multidimensional data. A recent prototype exploring lung cancer was presented at SIGGRAPH 2019, one of the foremost research conferences in visualization and graphics. The VR prototype, integrated with LabCAS, demonstrated multidimensional 3D visualization of EDRN’s radiology data. It also showed opportunities for entirely new approaches to data exploration as data increases in size and complexity.
The following provides an overview of EDRN’s progress in data science:
- Standards and a process for capturing highly annotated cancer biomarker data.
- Development of the LabCAS data commons infrastructure, enabling EDRN to capture data, run pipelines, and link analytics; it includes 77 collections and 23,200 files.
- Deployment and instantiation of LabCAS on Amazon Web Services to support massive scalability and computation as well as collaborative analysis tools.
- Integration of analytical tools including OHIF, caMicroscope, QuPath, and 3D Slicer.
- Development of cancer biomarker ontology and common data elements for annotating data in the data commons, and sharing of these standards with the research community by deposition in caDSR.
- Integration of EDRN’s data commons LabCAS into validation studies/reference sets including:
  - Breast Reference Set
  - Prostate MRI
- Development of a genomics pipeline.
- Development of a Secretome tool/pipeline.
- Development of a pipeline for miRNA measurements.
- Capture and curation of 1,500 EDRN biomarkers in the EDRN biomarker database:
  - Linking of EDRN biomarker studies to researched and discovered biomarkers
  - Linking of publications and data
  - Linking of external data sources
- Development and collaboration on OncoMX Portal through ITCR with George Washington University and integration with EDRN to link biomarkers and gene mutations.
- Development of crowdsourcing techniques based on Zooniverse for analysis of lung imaging data with UCLA and Moffitt.
- Development of the EDRN Portal to integrate sites, protocols, biomarkers, and data into a searchable knowledge environment:
  - Collaboration and coordination with the DMCC to maintain portal information
  - Daily integration of databases to support an integrated biomarker data environment
  - Collaboration with NCI’s Center for Biomedical Informatics and Information Technology to provide operational support for running the portal
- Research and development of image alignment and an automated data pipeline to support automated alignment of 3D imaging for biomarker discovery in pancreatic cancer.
- Presentations to the HTAN on the biomarker knowledge environment and model-driven architecture for large-scale data integration, sharing, and analytics.