
Identifying high-risk breast cancer using digital pathology images: A Nightingale Open Science dataset

Authors: Carlo Bifulco1, Brian Piening1, Tucker Bower1, Ari Robicsek1, Roshanthi Weerasinghe1, Soohee Lee1, Nick Foster2, Nathan Juergens2, Josh Risley2, Katy Haynes2, Ziad Obermeyer2,3

1 Providence Cancer Institute
2 Nightingale Open Science
3 University of California, Berkeley

Lead Nightingale analysts: Nick Foster, Nathan Juergens

When using this resource, please cite:
Carlo Bifulco, Brian Piening, Tucker Bower, Ari Robicsek, Roshanthi Weerasinghe, Soohee Lee, Nick Foster, Nathan Juergens, Josh Risley, Senthil Nachimuthu, Katy Haynes, and Ziad Obermeyer. 2021. Identifying high-risk breast cancer using digital pathology images. DOI:

Additionally, please cite:
Sendhil Mullainathan and Ziad Obermeyer. 2022. Solving medicine’s data bottleneck: Nightingale Open Science. Nature Medicine 28, 5 (May 2022), 897–899. DOI:

The problem

Every year, 40 million women get a mammogram; some go on to have an invasive biopsy to better examine a concerning area. Underneath these routine tests lies a deep—and disturbing—mystery. Since the 1990s, we have found far more ‘cancers’, which has in turn prompted vastly more surgical procedures and chemotherapy. But death rates from metastatic breast cancer have hardly changed.

When a pathologist looks at a biopsy slide, she is looking for known signs of cancer: tubules, cells with atypical looking nuclei, evidence of rapid cell division. These features, first identified in 1928, still underlie critical decisions today: which women must receive urgent treatment with surgery and chemotherapy? And which can be prescribed “watchful waiting”, sparing them invasive procedures for cancers that would not harm them?

Dataset overview

There is already evidence that algorithms can predict which cancers will metastasize and harm patients on the basis of the biopsy image, which would help doctors make this decision. Fascinatingly, these algorithms also home in on features that humans neglect, for example the nature of the non-cancerous tissue surrounding the tumor. But to date, the datasets linking biopsy images to patient outcomes (metastasis, death) have been far smaller than what is needed to apply modern approaches.

This dataset will link 175,000 biopsy slides from 11,000 unique patients to cancer registry data on cancer stage, electronic health record data on presence of metastasis, and Social Security data on mortality. Linking these rich biopsy data to outcome labels will allow researchers to train algorithms to identify patients at high risk of poor outcomes (metastasis, death), and compare them to the pathologist’s initial staging decision.

Each observation in the dataset corresponds to a microscopy image from a breast biopsy specimen, collected at the Providence Cancer Institute (Portland, OR) between January 1st, 2010 and December 31st, 2020. We retrieved the physical microscope slides from the Institute’s biospecimen repository and digitized each slide at 40x magnification as a NanoZoomer Digital Pathology Image (NDPI) file on a Hamamatsu NanoZoomer S360. We were very fortunate to work with Hamamatsu on this project: their state-of-the-art scanners are built on a foundation of 15 years of product innovation in digital pathology, and another 65 years of photonics experience. The resulting files contain the high-resolution image of the slide in addition to several down-sampled (lower resolution) versions. These multiple resolutions allow a pathologist to examine the entire slide and then quickly zoom into areas of interest at a higher resolution.
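To illustrate how the multiple resolutions in a slide file can be used, here is a minimal sketch of choosing a resolution level for a given zoom. The level dimensions below are hypothetical, not taken from the dataset, and the helper names are illustrative. In practice a slide-reading library such as OpenSlide, which supports Hamamatsu NDPI files, would report the actual levels stored in each file.

```python
from typing import List, Tuple


def level_downsamples(level_dimensions: List[Tuple[int, int]]) -> List[float]:
    """Downsample factor of each level relative to level 0 (full resolution)."""
    base_width, _ = level_dimensions[0]
    return [base_width / width for width, _ in level_dimensions]


def best_level_for_downsample(downsamples: List[float], target: float) -> int:
    """Return the most down-sampled level whose factor does not exceed `target`.

    Assumes `downsamples` is sorted from full resolution (1.0) upward.
    """
    best = 0
    for level, factor in enumerate(downsamples):
        if factor <= target:
            best = level
    return best


# Hypothetical pyramid: full resolution plus three down-sampled copies.
dims = [(80000, 60000), (20000, 15000), (5000, 3750), (1250, 938)]
factors = level_downsamples(dims)

overview_level = best_level_for_downsample(factors, target=64)  # whole-slide view
detail_level = best_level_for_downsample(factors, target=1)     # zoomed-in view
```

Reading the overview from a coarse level and only the regions of interest from level 0 avoids loading the full-resolution image into memory at once.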

Our partners

Providence is a not-for-profit health care system operating in seven states and serves as the parent organization for 100,000 caregivers. The combined system includes 51 hospitals, 829 clinics, and other health, education and social services across Washington, Oregon, California, Alaska, Montana, New Mexico, and Texas.

This dataset was conceived of and created by Carlo Bifulco, MD, Director of Molecular Pathology and Pathology Informatics, and Brian Piening, PhD, Technical Director of Clinical Genomics, with thanks to the leadership of Ari Robicsek, Chief Medical Analytics Officer at Providence. We are particularly proud of this dataset because it holds the promise of targeting new patterns in breast cancer tumors, providing insight into which patients may be at risk and need preventive treatment.

We are very grateful to Hamamatsu, developers of the NanoZoomer S360 platform, who supported this work with a grant from their Product Marketing Division. Hamamatsu cares deeply about deploying products and technology that can empower researchers and advance patient outcomes and has been a key collaborator here.

Dataset details


Dataset versions

This dataset (v1): Images and outcomes for 24,939 biopsy slides that correspond to 1,648 cases ranging from late 2017 to 2020. At the time of launch these images already occupy 37 TB of space, and we’re working on transferring the next batch as soon as possible.

What’s next for v1.1 (released: April 2022): Images and outcomes for 1,500 more biopsies. This will add 41,000 more slides to the dataset and expand it to span cases from 2016 to 2020.

Slides per dataset version (chart): v1: 24,939; v1.1: 68,000

What’s next for v2 (released: April 2022): A critical input to these predictions is data on cancer treatments: clearly, women who received treatment for breast cancer must be handled differently in predictions from women who were not treated, because treatment decreases the likelihood of metastasis and death. While coarse treatment approaches can be inferred from the stage assigned, we will add more data on procedures (such as surgery or chemotherapy).

We have currently completed linkage of biopsy specimens to the hospital’s cancer registry data. This captures cancer stage for women who received their initial diagnosis at the cancer center, but not those who were referred after an initial diagnosis elsewhere. In the next version, we will add linkage to the state-level cancer registry, to capture additional staging information from other facilities.

Dataset schema

[Diagram: dataset observations and their connection to key outcomes]

Dataset construction and key outcome variables are shown in the diagram above. A note on color choices: the burnt sienna (orange) indicates the node that corresponds to the observations (rows) in the dataset, and the grape (purple) indicates key patient outcomes.

Key variables


Metastatic diagnosis

We identified patients in our cohort who had a metastatic disease diagnosis using the encounters and diagnosis tables in Providence’s EPIC system. To find metastatic diagnoses, we searched for instances with ICD-9 codes 174.xx or 175.xx (for primary breast malignancy) that also had codes 196.xx, 197.xx, or 198.xx recorded on the same date (as previously described in the literature [1, 2]). The earliest such diagnosis for each patient is used. Of note, the earliest metastatic diagnosis may predate a given biopsy, because this dataset is representative of all the biopsies for breast cancer at Providence, and a biopsy may be involved in determining whether there is a recurrence or progression of cancer.

This method is ‘strict’ in the sense that it requires a breast cancer diagnosis to be present on the same day as a metastatic diagnosis. A less strict definition would be to look only for the presence of a metastatic code (in case the coding of breast cancer was implied). In any case, the raw data are provided and researchers can use the definitions they prefer.
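The two definitions can be sketched as follows. This is an illustration only, not the code used to build the dataset: the `(patient_id, diagnosis_date, icd9_code)` row layout and the function name are assumptions, and the raw tables may be organized differently.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

BREAST_PREFIXES = ("174", "175")             # primary breast malignancy
METASTATIC_PREFIXES = ("196", "197", "198")  # metastatic codes named in the text

# Hypothetical row layout: (patient_id, diagnosis_date, icd9_code)
Diagnosis = Tuple[str, date, str]


def first_metastatic_date(rows: Iterable[Diagnosis], strict: bool = True) -> Dict[str, date]:
    """Earliest metastatic diagnosis date per patient.

    strict=True requires a breast cancer code on the same date as the
    metastatic code; strict=False accepts any metastatic code.
    """
    codes_by_day = defaultdict(set)
    for patient_id, day, code in rows:
        codes_by_day[(patient_id, day)].add(code)

    earliest: Dict[str, date] = {}
    for (patient_id, day), codes in codes_by_day.items():
        has_metastatic = any(c.startswith(METASTATIC_PREFIXES) for c in codes)
        has_breast = any(c.startswith(BREAST_PREFIXES) for c in codes)
        if has_metastatic and (has_breast or not strict):
            if patient_id not in earliest or day < earliest[patient_id]:
                earliest[patient_id] = day
    return earliest


example_rows = [
    ("p1", date(2018, 3, 1), "174.9"),
    ("p1", date(2018, 3, 1), "197.0"),
    ("p1", date(2019, 1, 5), "198.5"),
    ("p2", date(2019, 6, 10), "196.3"),  # metastatic code, no same-day breast code
    ("p3", date(2020, 2, 2), "174.4"),   # breast code only
]
strict_dates = first_metastatic_date(example_rows, strict=True)
loose_dates = first_metastatic_date(example_rows, strict=False)
```

In this made-up example, patient `p2` counts as metastatic only under the less strict definition, and `p3` under neither.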

v1 Dataset metastatic diagnosis and mortality outcomes

N Biopsies: 1,648
N Patients: 1,436

| Years after biopsy | First Metastatic Diagnosis (strict) | Mortality |
|---|---|---|
| Before biopsy | 0.91% | 0% |
| 0-1 | 3.3% | 0.55% |
| 1 | 1.2% | 1.4% |
| 2 | 0.73% | 0.73% |
| 3 | 0.24% | 0.73% |
| 4 | 0.06% | 0.00% |
| Total | 6.4% | 3.4% |

Notes: Percentages are calculated as percent of biopsies. These cases are drawn from later dates in our overall dataset (2017 to 2020), so are ‘right-censored’ when it comes to longer follow up times. We will be adding more as they are scanned and transferred.
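To make the right-censoring point concrete, here is a minimal sketch that checks whether an outcome at a given horizon could have been observed for a biopsy at all. The cutoff date is a hypothetical placeholder, not the actual date through which outcomes were collected, and the function names are illustrative.

```python
from datetime import date

# Hypothetical last date of outcome follow-up; replace with the real one.
DATA_CUTOFF = date(2021, 12, 31)


def full_years_of_follow_up(biopsy_date: date, data_cutoff: date = DATA_CUTOFF) -> int:
    """Number of complete years between the biopsy and the data cutoff."""
    years = data_cutoff.year - biopsy_date.year
    # Subtract a year if the anniversary has not yet been reached.
    if (data_cutoff.month, data_cutoff.day) < (biopsy_date.month, biopsy_date.day):
        years -= 1
    return max(years, 0)


def outcome_observable(biopsy_date: date, horizon_years: int,
                       data_cutoff: date = DATA_CUTOFF) -> bool:
    """True if the biopsy has at least `horizon_years` of complete follow-up."""
    return full_years_of_follow_up(biopsy_date, data_cutoff) >= horizon_years


early = outcome_observable(date(2017, 11, 15), horizon_years=4)  # enough follow-up
late = outcome_observable(date(2020, 6, 1), horizon_years=4)     # right-censored
```

Biopsies from the later years cannot contribute events to the longer follow-up rows, so the low percentages in those rows should not be read as low long-term risk.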


Mortality

We identified patients with records of death using three data sources: EPIC, the cancer registry, and the Social Security Death Index. Unfortunately, these sources are not death certificates, so we do not know the cause of death.


Cancer stage

Using the Providence cancer registry, we identified the first recorded cancer stage for a patient within a year after the biopsy. Cancer stage typically incorporates information from a biopsy, but additional information is needed to establish the TNM (Tumor, Node, Metastasis) stage. For more information, see the National Cancer Institute’s description of cancer registries and staging.

Staging information is only available for the subset of biopsies in patients whose cancer was first diagnosed, or formally re-staged, at Providence and thus recorded in their cancer registry. For example, if the biopsy is the result of a patient seeking a second opinion after receiving initial treatment at another health system, the stage will not be included in the dataset. Some cases were eligible for staging at Providence, but no stage was recommended (indicated with “No Stage Rec,” as opposed to missing stage information).
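The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's construction code: the `(recorded_date, stage_label)` record layout is hypothetical, "a year" is taken as 365 days, and the window is assumed to include the biopsy date itself.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Hypothetical registry record layout: (date the stage was recorded, stage label)
StageRecord = Tuple[date, str]


def first_stage_within_year(biopsy_date: date,
                            stage_records: List[StageRecord]) -> Optional[str]:
    """First stage recorded from the biopsy date up to 365 days later.

    Returns None when the registry holds no stage in that window (missing
    stage information). A registry entry labeled "No Stage Rec" is returned
    as-is, so it stays distinct from a missing stage.
    """
    window_end = biopsy_date + timedelta(days=365)
    in_window = [rec for rec in stage_records
                 if biopsy_date <= rec[0] <= window_end]
    if not in_window:
        return None
    return min(in_window, key=lambda rec: rec[0])[1]


records = [
    (date(2019, 9, 1), "II"),   # later re-staging
    (date(2019, 2, 20), "I"),   # first stage after the biopsy
    (date(2021, 1, 10), "IV"),  # outside the one-year window
]
stage = first_stage_within_year(date(2019, 2, 1), records)
no_stage = first_stage_within_year(date(2016, 5, 1), records)
```

Returning `None` for a missing stage while passing "No Stage Rec" through unchanged mirrors the distinction the text draws between the two cases.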

v1 First cancer stage recorded for biopsies

| Stage | Count | Metastatic Diagnosis (strict) | Mortality |
|---|---|---|---|
| No Stage Rec | 199 | 13% | 5.5% |
| 0 | 131 | 0.8% | 0.76% |
| I | 629 | 3.3% | 3.2% |
| II | 77 | 14% | 3.9% |
| III | 19 | 47% | 16% |
| IV | 17 | 59% | 24% |
| Total | 1072 | 7.2% | 3.9% |
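As a quick consistency check on the table above, the Total row can be approximately recovered as a count-weighted average of the per-stage rates. The sketch below uses the rounded percentages as printed, so it only agrees with the published totals to within rounding; the variable and function names are illustrative.

```python
# (stage, biopsy count, metastatic diagnosis %, mortality %) copied from the table
STAGE_ROWS = [
    ("No Stage Rec", 199, 13.0, 5.5),
    ("0", 131, 0.8, 0.76),
    ("I", 629, 3.3, 3.2),
    ("II", 77, 14.0, 3.9),
    ("III", 19, 47.0, 16.0),
    ("IV", 17, 59.0, 24.0),
]


def count_weighted_rate(rows, column):
    """Average of a percentage column, weighted by each stage's biopsy count."""
    total = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total


total_count = sum(row[1] for row in STAGE_ROWS)
metastatic_total = count_weighted_rate(STAGE_ROWS, 2)
mortality_total = count_weighted_rate(STAGE_ROWS, 3)

print(total_count, round(metastatic_total, 1), round(mortality_total, 1))
```

The same weighting explains why the overall rates sit close to the Stage I figures: Stage I accounts for more than half of the staged biopsies.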


Copyright © 2021-2023 Nightingale Open Science. All rights reserved.