Welcome to Nightingale Open Science
Today, health data are mostly locked up in small sandboxes, controlled by a handful of private companies or well-resourced researchers. Nightingale Open Science aims to unlock those data, securely and ethically, and make them available for the public good.
Just as ImageNet jumpstarted the field of machine vision, Nightingale seeks to build a community of researchers working in a new scientific field: ‘computational medicine.’
The Datasets
Waveforms
- ed-bwh-ecg: Assessing Heart Attack Risk (104,000 ECG waveforms)
- silent-cchs-ecg: Diagnosing ‘Silent’ Heart Attack (48,000 ECG waveforms)
- arrest-ntuh-ecg: Subtyping Cardiac Arrest (24,106 ECG waveforms)
- mcmed-stanford-multi: Multimodal Clinical Monitoring in the ED (1,000,000 Waveforms)
Microscopy Images
- brca-psj-path: Identifying High-Risk Breast Cancer (175,000 Biopsy Slides)
- tb-wellgen-smear: Detecting Active Tuberculosis (75,000 TB Smear Images)
X-ray Images
- fracture-aimi-xray: Predicting Fractures (224,000 Chest X-ray Images)
- covid-psj-xray: Emergency Triage of Covid-19 Patients (27,500 Chest X-ray Images)
Multiple Diagnostics
- tamil-jpal-multi: Tamil Nadu J-PAL Data Dictionary (82,000 Diagnostic Images)
To view our platform, visit app.nightingalescience.org.
💡 To help us with capacity planning, we would appreciate your input on the resources you think you’ll need for your research in terms of GPU-hours. Please let us know here.
What makes these datasets special
Our datasets are curated around medical mysteries—heart attack, cancer metastasis, cardiac arrest, bone aging, Covid-19—where machine learning can be transformative. We designed these datasets with four key principles in mind:
- The core of each dataset is a large collection of medical images: x-rays, ECG waveforms, digital pathology (and more to come). These rich, high-dimensional signals are too complex for humans to fully see or process, so machine vision can add huge value.
- Each image is linked to at least one ground-truth outcome: data on what happened to the patient, not a doctor’s interpretation of the image. This allows researchers to build algorithms that learn from nature, not from humans.
- The data are diverse: we work with health systems across the US and the world, including under-resourced ones whose data aren’t usually represented in machine learning. This lets the resulting algorithms speak to the needs of diverse populations.
- Access is secure and ethical: all data are completely deidentified, and as an extra precaution, no download is allowed. We carefully track everything anyone does on the platform. Only non-commercial use is allowed, so the knowledge generated from the data benefits everyone.
What’s in the documentation
In addition to standard data dictionaries, our documentation contains the following materials:
- The medical problem these data solve. We always start with why we and our health system partners are so excited about these datasets, and why we think they can make a difference for real health outcomes.
- Our partners. We work closely with health leaders and researchers around the world, who are as passionate about solving these problems as we are. They will be watching carefully to see what results you produce: they created these data because they want to find ways to partner with you and apply your tools to make a difference for real patients.
- Dataset construction and contents. Some of the biggest problems in machine learning (failures to generalize, dataset shift, lack of representation, etc.) happen when researchers overlook critical details about how the data were made. Our documentation explains exactly how an observation came to be in the dataset, and flags the tradeoffs or problems inherent in that process. We also carefully define the key variables, and show the relationships among them in an easy-to-read schematic.
- What’s next. In addition to adding new datasets, we are constantly expanding our existing datasets. So we always note exactly what is in the current version of the dataset, and what will be added in the next version. Our version terminology keeps track of what’s being added: for example, if we add observations (rows) to a version 1.0 dataset, without changing the basic dataset schema, we’ll call the resulting dataset 1.1; if we expand the set of variables (i.e., columns), we’ll call the resulting dataset 2.0.
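The version terminology described in the last item can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not code from the platform: the function name `next_version` and its parameters are hypothetical. The sketch also assumes two things the text does not spell out: that the minor number resets to 0 when the schema changes, and that a release adding both rows and columns counts as a schema change.

```python
def next_version(current: str, *, adds_rows: bool = False, adds_columns: bool = False) -> str:
    """Return the next dataset version string (hypothetical helper).

    Rule from the documentation:
      - new observations (rows), same schema -> bump the minor number (1.0 -> 1.1)
      - new variables (columns)              -> bump the major number (1.0 -> 2.0)
    """
    major, minor = (int(part) for part in current.split("."))
    if adds_columns:
        # Assumption: a schema change takes precedence and resets minor to 0.
        return f"{major + 1}.0"
    if adds_rows:
        return f"{major}.{minor + 1}"
    # Nothing was added, so the version is unchanged.
    return current


print(next_version("1.0", adds_rows=True))     # 1.1
print(next_version("1.0", adds_columns=True))  # 2.0
```

In practice this means code written against version 1.0 should still run on 1.1, while a move to 2.0 signals that new variables are available.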
A word on working across datasets
We’ve tried to make sure the same semantic concept (e.g., age) is named the same way across datasets. We use patient_ngsci_id to denote individuals, and [study]_id to denote studies (e.g., biopsies, x-rays). Note that one study can generate multiple images: for example, the many magnified parts of a biopsy specimen, or the two orthogonal views taken in most chest x-rays. Please be aware, however, that some other variable names may still be inconsistent (and please let us know when you find inconsistencies, so we can fix them). Finally, note that image files were not normalized in any way: they are made available in the same format in which our health system partners gave them to us.
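As a rough illustration of the patient, study, and image hierarchy, the sketch below groups image-level records first by patient_ngsci_id and then by a study identifier. Only patient_ngsci_id and the [study]_id naming pattern come from the documentation. The concrete key `xray_id`, the `image_file` field, the helper `group_images`, and all record values are made up for this example, so check each dataset’s data dictionary for the real names.

```python
from collections import defaultdict

# Hypothetical image-level records. "xray_id" stands in for the [study]_id
# pattern; "image_file" and every value here are invented for illustration.
images = [
    {"patient_ngsci_id": "p001", "xray_id": "x10", "image_file": "x10_frontal.png"},
    {"patient_ngsci_id": "p001", "xray_id": "x10", "image_file": "x10_lateral.png"},
    {"patient_ngsci_id": "p001", "xray_id": "x11", "image_file": "x11_frontal.png"},
    {"patient_ngsci_id": "p002", "xray_id": "x20", "image_file": "x20_frontal.png"},
]


def group_images(records, study_key):
    """Nest image files as patient -> study -> [image files]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        grouped[rec["patient_ngsci_id"]][rec[study_key]].append(rec["image_file"])
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {pid: dict(studies) for pid, studies in grouped.items()}


by_patient = group_images(images, study_key="xray_id")
for patient, studies in by_patient.items():
    for study, files in studies.items():
        print(patient, study, len(files))
```

Keeping the grouping at the patient level is useful when building train and test splits, since it makes it easier to ensure that images from the same person do not end up on both sides.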
We think one of the most exciting things about Nightingale is that diverse but fundamentally similar data are available in the same place: there is no reason you can’t, for example, (pre-)train a model in Boston and see how it works in Taipei. We hope that researchers will do exactly this, along with many other things we could never dream of.
Now what
- Register for the Nightingale OS Platform
- Wait to get admitted; we’ll onboard users as quickly as possible
- Use free Nightingale cpu.xsmall instances to explore the data in a JupyterLab environment
- Fund a project using a credit card to use our full range of instance sizes, including GPUs
- Invite collaborators to join your Nightingale project