
Identifying high-risk breast cancer using digital pathology images: A Nightingale Open Science dataset

Authors: Carlo Bifulco1, Brian Piening1, Tucker Bower1, Ari Robicsek1, Roshanthi Weerasinghe1, Soohee Lee1, Nick Foster2, Nathan Juergens2, Josh Risley2, Senthil Nachimuthu2, Katy Haynes2, Ziad Obermeyer2,3

1 Providence Cancer Institute
2 Nightingale Open Science
3 University of California, Berkeley

Lead Nightingale analysts: Nick Foster

When using this resource, please cite:
(missing reference)

Additionally, please cite:
Sendhil Mullainathan and Ziad Obermeyer. 2022. Solving medicine’s data bottleneck: Nightingale Open Science. Nature Medicine 28, 5 (May 2022), 897–899. DOI:

The problem

Every year, 40 million women get a mammogram; some go on to have an invasive biopsy to better examine a concerning area. Underneath these routine tests lies a deep—and disturbing—mystery. Since the 1990s, we have found far more ‘cancers’, which has in turn prompted vastly more surgical procedures and chemotherapy. But death rates from metastatic breast cancer have hardly changed.

When a pathologist looks at a biopsy slide, she is looking for known signs of cancer: tubules, cells with atypical-looking nuclei, evidence of rapid cell division. These features, first identified in 1928, still underlie critical decisions today: which women must receive urgent treatment with surgery and chemotherapy? And which can be prescribed “watchful waiting”, sparing them invasive procedures for cancers that would never harm them?

Dataset overview

There is already evidence that algorithms can predict, on the basis of the biopsy image, which cancers will metastasize and harm patients. Fascinatingly, these algorithms also home in on features that humans neglect, for example, the nature of the non-cancerous tissue surrounding the tumor. But to date, the datasets linking biopsy images to patient outcomes (metastasis, death) have been far smaller than what is needed to apply modern approaches.

This dataset will link 175,000 biopsy slides from 11,000 unique patients to cancer registry data on cancer stage, electronic health record data on presence of metastasis, and Social Security data on mortality. Linking these rich biopsy data to outcome labels will allow researchers to train algorithms to identify patients at high risk of poor outcomes (metastasis, death), and compare them to the pathologist’s initial staging decision.

Each observation in the dataset corresponds to a microscopy image from a breast biopsy specimen, collected at the Providence Cancer Institute (Portland, OR) between January 1st, 2010 and December 31st, 2020. We retrieved the physical microscope slides from the Institute’s biospecimen repository and digitized each slide at 40x magnification as a NanoZoomer Digital Pathology Image (NDPI) file on a Hamamatsu NanoZoomer S360. We were very fortunate to work with Hamamatsu on this project: their state-of-the-art scanners are built on a foundation of 15 years of product innovation in digital pathology, and another 65 years of photonics experience. Each resulting file contains the high-resolution image of the slide in addition to several down-sampled (lower-resolution) versions. These multiple resolutions allow a pathologist to examine the entire slide and then quickly zoom into areas of interest at a higher resolution.
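As an illustration of how the multi-resolution files might be explored, the sketch below lists the resolution levels of one slide and reads fixed-size tiles from it. It assumes the third-party `openslide-python` package, which can read Hamamatsu NDPI files; the function names, the tile size, and any file path passed in are hypothetical and not part of the dataset.

```python
from typing import Iterator, Tuple


def tile_origins(width: int, height: int, tile: int) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of a non-overlapping tile grid.

    Tiles on the right and bottom edges may extend past the image;
    the caller decides whether to pad or discard them.
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y)


def describe_slide(path: str) -> None:
    """Print each resolution level with its dimensions and downsample factor."""
    import openslide  # third-party; imported lazily so tile_origins works without it

    slide = openslide.OpenSlide(path)
    try:
        for level in range(slide.level_count):
            print(level, slide.level_dimensions[level], slide.level_downsamples[level])
    finally:
        slide.close()


def read_tile(path: str, x: int, y: int, level: int, tile: int):
    """Read one RGBA tile; (x, y) are pixel coordinates at level 0."""
    import openslide

    slide = openslide.OpenSlide(path)
    try:
        return slide.read_region((x, y), level, (tile, tile))
    finally:
        slide.close()
```

A typical workflow would call `describe_slide` once to choose a level, then iterate over `tile_origins` with the level-0 dimensions to feed tiles to a model. Reopening the file for every tile is done here only for brevity.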

Our partners

Providence is a not-for-profit health care system operating in seven states and serves as the parent organization for 100,000 caregivers. The combined system includes 51 hospitals, 829 clinics, and other health, education and social services across Washington, Oregon, California, Alaska, Montana, New Mexico, and Texas.

This dataset was conceived of and created by Carlo Bifulco, MD, Director of Molecular Pathology and Pathology Informatics, and Brian Piening, PhD, Technical Director of Clinical Genomics, with the leadership of Ari Robicsek, Chief Medical Analytics Officer at Providence. We are particularly proud of this dataset because it holds the promise of revealing new patterns in breast cancer tumors, providing insight into which patients may be at risk and need preventive treatment.

We are very grateful to Hamamatsu, developers of the NanoZoomer S360 platform, who supported this work with a grant from their Product Marketing Division. Hamamatsu cares deeply about deploying products and technology that can empower researchers and advance patient outcomes, and has been a key collaborator here.

Dataset details


dataset versions

This version (v2): The dataset contains images and outcomes for 72,400 biopsy slides, corresponding to 4,200 cases from 2014 to 2020. These images currently occupy 133 TB, and we’re working on transferring the next batch as soon as possible. 25% of the data is reserved as a holdout for validating research and for future competitions. The split is done at the patient level.
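To show what a patient-level split means in practice, here is a minimal sketch that assigns every slide from the same patient to the same partition. It is not the procedure used to build the actual holdout: the hashing scheme, the `salt` value, and the function names are assumptions chosen for illustration, and a hash-based split only approximates the 25% target.

```python
import hashlib
from typing import Dict, List


def assign_split(patient_id: str, holdout_fraction: float = 0.25, salt: str = "demo") -> str:
    """Map a patient ID to 'train' or 'holdout' deterministically."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "holdout" if bucket < holdout_fraction else "train"


def split_slides(slide_to_patient: Dict[str, str],
                 holdout_fraction: float = 0.25) -> Dict[str, List[str]]:
    """Group slide IDs by the split assigned to their patient."""
    splits: Dict[str, List[str]] = {"train": [], "holdout": []}
    for slide_id, patient_id in slide_to_patient.items():
        splits[assign_split(patient_id, holdout_fraction)].append(slide_id)
    return splits
```

Because the assignment depends only on the patient ID, no patient can contribute slides to both partitions, which avoids leakage between training and validation data.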


What’s next: We are expecting additional cancer registry outcomes from the state cancer registry. Currently, we only have outcomes from Providence’s cancer registry, which only includes patients who received their initial diagnosis and treatment at Providence.

We will also add more whole slide images and outcomes to the dataset, including cases from 2010 to 2014. We estimate that the dataset will contain 175,000 whole slide images in total.

Dataset schema

[Diagram: Dataset Observations Connection to Key Outcomes]
Dataset construction and key outcome variables are shown in the diagram above. A note on color choices: burnt sienna (orange) indicates the node that corresponds to the observations (rows) in the dataset, and grape (purple) indicates key patient outcomes.

Key variables


We identified patients in our cohort with a metastatic disease diagnosis using the encounters and diagnosis tables in Providence’s Epic EMR. To find metastatic diagnoses, we searched for instances of ICD-9 codes 174.xx or 175.xx (primary breast malignancy) that also had codes 196.xx, 197.xx, or 198.xx recorded on the same date (as previously described in the literature [1, 2]). The earliest such diagnosis for each patient is used. Of note, the earliest metastatic diagnosis may predate a given biopsy: this dataset is representative of all breast cancer biopsies at Providence, and a biopsy may be involved in determining whether a cancer has recurred or progressed.

This method is ‘strict’ in the sense that it requires a breast cancer diagnosis to be present on the same day as a metastatic diagnosis. A less strict definition would be to look only for the presence of a metastatic code (in case the coding of breast cancer was implied). In any case, the raw data are provided and researchers can use the definitions they prefer.
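The difference between the strict and the less strict definition can be sketched as follows. This is an illustrative reimplementation, not the code used to build the dataset: the row layout `(patient_id, diagnosis_date, icd9_code)` and the function names are assumptions, and codes are matched on their first three characters.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

PRIMARY_BREAST = ("174", "175")       # primary breast malignancy
METASTATIC = ("196", "197", "198")    # secondary (metastatic) malignancy

Diagnosis = Tuple[str, date, str]     # (patient_id, diagnosis_date, icd9_code)


def first_metastatic_dates(rows: Iterable[Diagnosis], strict: bool = True) -> Dict[str, date]:
    """Return the earliest qualifying metastatic diagnosis date per patient.

    strict=True requires a primary breast cancer code on the same date as the
    metastatic code; strict=False accepts any metastatic code.
    """
    # Collect the three-character ICD-9 categories seen per patient and day.
    categories = defaultdict(set)
    for patient_id, day, code in rows:
        categories[(patient_id, day)].add(code.strip()[:3])

    earliest: Dict[str, date] = {}
    for (patient_id, day), cats in categories.items():
        has_metastatic = any(c in METASTATIC for c in cats)
        has_primary = any(c in PRIMARY_BREAST for c in cats)
        if has_metastatic and (has_primary or not strict):
            if patient_id not in earliest or day < earliest[patient_id]:
                earliest[patient_id] = day
    return earliest
```

With `strict=False`, a patient whose record shows only a metastatic code on a given day is still counted, which corresponds to the case where the breast cancer coding was implied.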

v2 Dataset metastatic diagnosis and mortality outcomes

N Biopsies: 3,256
N Patients: 2,568

Years after biopsy   First Metastatic Diagnosis (strict)   Mortality
Before biopsy        1.3%                                  0%
0-1                  3.6%                                  1.9%
1                    1%                                    1.7%
2                    0.86%                                 1.9%
3                    0.58%                                 1.4%
4                    0.37%                                 0.92%
5                    0.31%                                 0.61%
6                    0.092%                                0.58%
7                    0.031%                                0.031%
Total                8.2%                                  9%

Notes: Percentages are calculated as a percent of biopsies. These cases are drawn from later dates in our overall dataset (2016 to 2020), so they are ‘right-censored’ when it comes to longer follow-up times. We will be adding more as they are scanned and transferred.


We identified patients with records of death using multiple data sources. The three sources are the Epic EMR, the cancer registry, and the Social Security Death Index. Unfortunately, these sources are not death certificates, so we do not know the cause of death.


Using the Providence cancer registry, we identified the cancer stage in the records for the patient during the year of the biopsy. Cancer stage typically incorporates information from a biopsy, but additional information is needed to establish the TNM (Tumor, Node, Metastasis) stage. For more information, see the National Cancer Institute’s descriptions of registries and staging.

Staging information is only available for the subset of biopsies in patients whose cancer was first diagnosed, or formally re-staged, at Providence and thus recorded in their cancer registry. For example, if the biopsy is the result of a patient seeking a second opinion after receiving initial treatment at another health system, the stage will not be included in the dataset.

v2 Cancer stage for biopsies

Stage   Count   Metastatic Diagnosis (strict)   Mortality
0       418     0.96%                           2.6%
I       1425    3.9%                            5%
II      644     12%                             13%
III     173     35%                             32%
IV      61      72%                             51%
Total   2721    9%                              9.2%

Demographics & Comorbidities:

In v2, basic demographic data were obtained from the EMR and linked to case IDs. We also obtained information on patient comorbidities using the ICD-9 and ICD-10 codes in the Charlson comorbidity index (CCI), a validated index of conditions associated with mortality. Conditions were only included if patients were diagnosed in the two years preceding their biopsy date. The list of conditions includes: dementia, peripheral vascular disease, pulmonary disease, liver disease, diabetes, cerebral vascular accident, congestive heart failure, diabetes complications, cancer, peptic ulcer, severe liver disease, metastatic cancer, connective tissue disorder, acute myocardial infarction, renal disease, HIV, and paraplegia.

v2 Comorbidities

Cancer 85%
Pulmonary disease 21%
Diabetes 13%
Metastatic cancer 13%
Renal disease 9.3%
Cerebral vascular accident 8.1%
Congestive heart failure 6.5%
Peripheral vascular disease 5.8%
Diabetes complications 4.9%
Acute myocardial infarction 3.3%
Dementia 2.7%
Connective tissue disorder 2.3%
Peptic ulcer 1.2%
Liver disease 0.98%
Paraplegia 0.83%
Severe liver disease 0.55%
HIV 0%
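The two-year lookback used for comorbidities can be illustrated with a short sketch. It assumes diagnoses have already been mapped from ICD codes to condition names, and it treats the window as starting exactly two years before the biopsy and excluding the biopsy date itself; both the boundary handling and the function names are assumptions, not documented behavior of the dataset.

```python
from datetime import date
from typing import Iterable, Set, Tuple


def in_lookback(dx_date: date, biopsy_date: date, years: int = 2) -> bool:
    """True if dx_date falls in the `years` preceding biopsy_date."""
    try:
        start = biopsy_date.replace(year=biopsy_date.year - years)
    except ValueError:
        # Biopsy on Feb 29 and the start year is not a leap year.
        start = biopsy_date.replace(year=biopsy_date.year - years, day=28)
    return start <= dx_date < biopsy_date


def comorbidities_at_biopsy(diagnoses: Iterable[Tuple[date, str]],
                            biopsy_date: date) -> Set[str]:
    """Return the set of conditions diagnosed inside the lookback window."""
    return {cond for dx_date, cond in diagnoses if in_lookback(dx_date, biopsy_date)}
```

Researchers who prefer a different window, or who want to include diagnoses recorded on the biopsy date, would only need to change the comparison in `in_lookback`.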


A critical input to these predictions is data on cancer treatments: clearly, women who received treatment for breast cancer must be handled differently in predictions from women who were not treated, because treatment decreases the likelihood of metastasis and death. While coarse treatment approaches can be inferred from the stage assigned, we have now added more data on the specific initial therapy women received for the treatment of their cancer, including the surgical procedure and margins, as well as chemotherapy, radiation, hormone therapy or immunotherapy. Data on initial therapeutics were only available for those patients in the Providence cancer registry.


In addition to the histologic features pathologists describe from biopsy slides, there are an increasing number of molecular tumor features with prognostic and therapeutic value. These include hormone or growth factor receptors (ER, PR, HER2), as well as panels of tumor genes known to promote cancer growth and spread (Oncotype Dx, multigene signature assays). As above, these additional pathologic details were only available for patients in the Providence cancer registry.

v2 Growth factor receptor positivity

               Biopsies with results   Biopsies with positive results   Percent Positive
ER Summary     2773                    2385                             86%
PR Summary     2756                    2067                             75%
HER2 Summary   2763                    304                              11%

Notably, a small subset of patients had duplicate entries in the Providence cancer registry, and some of those duplicates had differing values for a treatment or pathology variable (for example, a single breast cancer case with two entries, one documenting a surgical margin with “no residual tumor” and the other documenting “microscopic residual tumor”). In these cases, we aggregated duplicates into one entry per patient, and we replaced any conflicting values with NA.
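The aggregation rule described above can be sketched as follows. This is a hypothetical reimplementation: the record layout, the field names, and the use of `None` to stand in for NA are assumptions. It also assumes that a value present in one entry and missing in the other is kept, which the text does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional

Record = Dict[str, Optional[str]]


def collapse_duplicates(records: Iterable[Record], key: str = "patient_id") -> Dict[str, Record]:
    """Merge duplicate registry entries into one record per patient.

    A field keeps its value when all non-missing entries agree, and becomes
    None (NA) when entries disagree or every entry is missing.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for rec in records:
        fields = seen[rec[key]]
        for name, value in rec.items():
            if name == key:
                continue
            observed = fields[name]  # creates the field even if value is missing
            if value is not None:
                observed.add(value)

    return {
        pid: {name: (next(iter(vals)) if len(vals) == 1 else None)
              for name, vals in fields.items()}
        for pid, fields in seen.items()
    }
```

For the example in the text, two entries for one patient with margins “no residual tumor” and “microscopic residual tumor” collapse to a single entry whose margin is NA, while fields that agree across both entries are preserved.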


Copyright © 2021-2023 Nightingale Open Science. All rights reserved.