Welcome to Nightingale Open Science

The Datasets

Waveforms

ed-bwh-ecg: Assessing Heart Attack Risk (104,000 ECG waveforms)
silent-cchs-ecg: Diagnosing ‘Silent’ Heart Attack (48,000 ECG waveforms)
arrest-ntuh-ecg: Subtyping Cardiac Arrest (24,106 ECG waveforms)
mcmed-stanford-multi: Multimodal Clinical Monitoring in the ED (1,000,000 waveforms)

Microscopy Images

brca-psj-path: Identifying High-Risk Breast Cancer (175,000 Biopsy Slides)
tb-wellgen-smear: Detecting Active Tuberculosis (75,000 TB Smear Images)

X-ray Images

mrkr-emory-xray: Emory Knee Radiograph (500,000 Knee X-ray Images)
fracture-aimi-xray: Predicting Fractures (224,000 Chest X-ray Images)

Multiple Diagnostics

tamil-jpal-multi: Tamil Nadu J-PAL Data Dictionary (82,000 Diagnostic Images)

What makes these datasets special

Our datasets are curated around medical mysteries (heart attack, cancer metastasis, cardiac arrest, bone aging, COVID-19) where machine learning can be transformative. We designed these datasets with four key principles in mind:

The core of each dataset is a large collection of medical images: x-rays, ECG waveforms, digital pathology (and more to come). These rich, high-dimensional signals are too complex for humans to fully see or process—so machine vision can add huge value.
Each image is linked to at least one ground truth outcome: data on what happened to the patient, not a doctor’s interpretation of the image. This allows researchers to build algorithms that learn from nature—not from humans.
The data are diverse: we work with health systems across the US and the world, including under-resourced ones whose data aren’t usually represented in machine learning. This lets the resulting algorithms speak to the needs of diverse populations.
Access is secure and ethical: all data are completely deidentified, and as an extra precaution, no download is allowed. Only non-commercial use is allowed, so the knowledge generated from the data benefits everyone.