Data dictionary v2
Dataset Changes
Dataset Reorganization (Feb 2023)
- Demographic fields moved from outcomes.csv to demographics.csv.
- ndpi/ directory flattened. The year directories were removed.

File tree changes
  .
  └── brca-psj-path
      ├── ...
      ├── v2
      │   ├── cancer-dx.csv
      │   ├── comorbidities.csv
+     │   ├── demographics.csv
      │   ├── outcomes.csv
      │   ├── pathology-items.csv
      │   ├── slide-biopsy-map.csv
      │   ├── social-determinants.csv
      │   └── treatments.csv
      └── ndpi
-         ├── 2016
          │   ├── 0035de3d-81ec-4945-a760-55518ba8b376.ndpi
          │   └── ...
-         ├── 2017
          │   ├── 00a94273-e9ab-42f5-a47e-512a13e8603e.ndpi
          │   └── ...
          ...
Note
The dates in this dataset have been shifted by a random number of days. All dates for any particular patient have been shifted by the same amount in order to preserve the time duration between events.
Male patients made up less than 2% of biopsy patients and were excluded from the dataset.
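Because every date for a given patient moves by the same offset, the interval between two of that patient's events is unchanged by the shift, even though the calendar dates themselves are no longer real. The Python sketch below illustrates that property. The dates, the offset, and the helper `shift_dates` are made up for illustration; the actual offsets used for this dataset are not given here.

```python
from datetime import date, timedelta


def shift_dates(dates, offset_days):
    """Shift every date belonging to one patient by the same number of days."""
    return [d + timedelta(days=offset_days) for d in dates]


# Hypothetical true dates for one patient (not taken from the dataset).
biopsy, death = date(2018, 1, 1), date(2021, 8, 24)

# Hypothetical per-patient offset, chosen only for this example.
shifted_biopsy, shifted_death = shift_dates([biopsy, death], offset_days=48942)

original_gap = (death - biopsy).days
shifted_gap = (shifted_death - shifted_biopsy).days

# The duration between the two events is identical before and after the shift.
assert original_gap == shifted_gap
```

In practice this means durations such as time from biopsy to death can be computed directly from the shifted columns, while the shifted dates should not be read as real calendar dates.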
File tree
.
└── brca-psj-path
    ├── ...
    ├── v2
    │   ├── cancer-dx.csv
    │   ├── comorbidities.csv
    │   ├── demographics.csv
    │   ├── outcomes.csv
    │   ├── pathology-items.csv
    │   ├── slide-biopsy-map.csv
    │   ├── social-determinants.csv
    │   └── treatments.csv
    └── ndpi
        ├── 0035de3d-81ec-4945-a760-55518ba8b376.ndpi
        ├── 00a94273-e9ab-42f5-a47e-512a13e8603e.ndpi
        └── ...
Slide biopsy mapping
slide-biopsy-map.csv
This table contains the mapping between the digital pathology image files and the corresponding biopsies. One or more slides are produced from the tissue samples of a biopsy procedure. The number of slides for each biopsy in the dataset can vary from 1 to 100.
Column Name | Description | Sample |
---|---|---|
slide_id | Unique identifier for each digital pathology image | c9cc2d38-a042-4883-9ab1-141e7b876678 |
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
slide_path | Filepath for the NDPI file of the slide | /path/to/{slide_id}.ndpi |
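Since one biopsy can map to many slides, a common first step is to group slide paths by biopsy_id. The following is a minimal sketch using only the Python standard library. The in-memory CSV stands in for slide-biopsy-map.csv, and the short identifiers and paths in it are placeholders, not real values from the dataset.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows with the same columns as slide-biopsy-map.csv.
sample = io.StringIO(
    "slide_id,biopsy_id,slide_path\n"
    "slide-1,biopsy-a,/path/to/slide-1.ndpi\n"
    "slide-2,biopsy-a,/path/to/slide-2.ndpi\n"
    "slide-3,biopsy-b,/path/to/slide-3.ndpi\n"
)


def slides_by_biopsy(handle):
    """Return a dict mapping each biopsy_id to the list of its slide paths."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["biopsy_id"]].append(row["slide_path"])
    return dict(groups)


groups = slides_by_biopsy(sample)
slide_counts = {biopsy_id: len(paths) for biopsy_id, paths in groups.items()}
```

With the real file, the same function could be called on `open("slide-biopsy-map.csv", newline="")`, assuming the file is a comma-separated CSV with a header row.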
Outcomes
outcomes.csv
This table contains outcomes for each biopsy case. Some patients have multiple biopsies.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715 |
case_year | Year of biopsy | 2018 |
biopsy_dt | Date of biopsy | 2152-01-01 |
mortality | Whether there is a record of patient death. 0: no death record; 1: death | 1 |
death_dt | Date of patient death | 2155-08-24 |
in_registry | Whether entries were found in the Providence cancer registry for the patient that match the time of the biopsy. 0: not in cancer registry; 1: in cancer registry | 1 |
stage | Cancer stage for patient in the year of the biopsy | IA |
strict_metastatic_dx | Whether patient has a strict metastatic diagnosis as described in the documentation. 0: no; 1: yes | 0 |
strict_metastatic_dx_dt | Date of first strict metastatic disease diagnosis | 2154-01-30 |
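As a rough illustration of how these columns fit together, the sketch below computes days from biopsy to death for each biopsy and counts biopsies per patient. It assumes dates are stored as ISO `YYYY-MM-DD` strings (as the samples suggest) and that a missing death_dt appears as an empty field; both are assumptions about the file encoding. The identifiers are placeholders.

```python
import csv
import io
from collections import Counter
from datetime import date

# Placeholder rows using a subset of the outcomes.csv columns.
sample = io.StringIO(
    "biopsy_id,patient_ngsci_id,biopsy_dt,mortality,death_dt\n"
    "biopsy-a,patient-1,2152-01-01,1,2155-08-24\n"
    "biopsy-b,patient-1,2153-06-15,1,2155-08-24\n"
    "biopsy-c,patient-2,2152-03-10,0,\n"
)


def days_to_death(row):
    """Days from biopsy to death, or None when no death is recorded."""
    if row["mortality"] != "1" or not row["death_dt"]:
        return None
    biopsy = date.fromisoformat(row["biopsy_dt"])
    death = date.fromisoformat(row["death_dt"])
    return (death - biopsy).days


rows = list(csv.DictReader(sample))
survival_days = {row["biopsy_id"]: days_to_death(row) for row in rows}

# Patients can appear more than once, so count biopsies per patient.
biopsies_per_patient = Counter(row["patient_ngsci_id"] for row in rows)
```

Grouping by patient_ngsci_id matters whenever an analysis needs one record per patient rather than one per biopsy.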
Demographics
demographics.csv
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
sex | Sex of patient | F |
race | Self-identified race. 1: White or Caucasian; 2: Black or African American; 3: American Indian or Alaska Native; 4: Asian; 5: Native Hawaiian or Pacific Islander; 8: other; 9: unknown | 1 |
ethnicity | Self-identified ethnicity. 0: Not Hispanic or Latino; 1: Hispanic or Latino; 9: unknown | 0 |
birth_dt | De-identified date of birth | 2041-03-15 |
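The coded columns can be decoded with a simple lookup, and birth_dt can be combined with biopsy_dt from outcomes.csv (joined on biopsy_id) to estimate age at biopsy. The sketch below is one way to do this under a few assumptions: dates are ISO `YYYY-MM-DD` strings, codes are read as strings, and age is approximated as days divided by 365.25. The rows are placeholders, not dataset values.

```python
import csv
import io
from datetime import date

# Code-to-label lookups taken from the table above.
RACE_LABELS = {
    "1": "White or Caucasian",
    "2": "Black or African American",
    "3": "American Indian or Alaska Native",
    "4": "Asian",
    "5": "Native Hawaiian or Pacific Islander",
    "8": "Other",
    "9": "Unknown",
}
ETHNICITY_LABELS = {
    "0": "Not Hispanic or Latino",
    "1": "Hispanic or Latino",
    "9": "Unknown",
}

# Placeholder rows standing in for demographics.csv and outcomes.csv.
demographics = io.StringIO(
    "biopsy_id,sex,race,ethnicity,birth_dt\n"
    "biopsy-a,F,1,0,2090-03-15\n"
)
outcomes = io.StringIO(
    "biopsy_id,biopsy_dt\n"
    "biopsy-a,2152-01-01\n"
)

# Index biopsy dates by biopsy_id so the two tables can be joined.
biopsy_dates = {
    row["biopsy_id"]: date.fromisoformat(row["biopsy_dt"])
    for row in csv.DictReader(outcomes)
}

decoded = []
for row in csv.DictReader(demographics):
    birth = date.fromisoformat(row["birth_dt"])
    age_years = (biopsy_dates[row["biopsy_id"]] - birth).days / 365.25
    decoded.append(
        {
            "biopsy_id": row["biopsy_id"],
            "race": RACE_LABELS.get(row["race"], "Unrecognized code"),
            "ethnicity": ETHNICITY_LABELS.get(row["ethnicity"], "Unrecognized code"),
            "age_at_biopsy": age_years,
        }
    )
```

Using a day difference keeps the age estimate consistent with the date shifting, since both dates for a patient move by the same amount.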
Social determinants
social-determinants.csv
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
bmi | The last recording of BMI at or before the date of biopsy [units: kg/m2] | 25 |
tobacco | 0: no documented smoking; 1: ICD10 codes F17.XX or Z72.0X | 0 |
Comorbidities
comorbidities.csv
Comorbidities are those included in the Charlson Comorbidity Index (CCI), and were obtained from patient charts using ICD-9 and ICD-10 codes. Comorbidities were only included if patients were diagnosed in the two years before the biopsy date.
For each included comorbidity, 0: does not have diagnosis, 1: has diagnosis.
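With one 0/1 flag per comorbidity, a simple summary is the number of flagged comorbidities per biopsy. The sketch below assumes that comorbidities.csv has a biopsy_id column plus one flag column per comorbidity; the column names `comorbidity_a`, `comorbidity_b`, and `comorbidity_c` are hypothetical, since the actual column names are not listed here. Note that this is an unweighted count, not the weighted Charlson score.

```python
import csv
import io

# Placeholder rows; the flag column names are hypothetical.
sample = io.StringIO(
    "biopsy_id,comorbidity_a,comorbidity_b,comorbidity_c\n"
    "biopsy-a,0,1,1\n"
    "biopsy-b,0,0,0\n"
)


def comorbidity_counts(handle, id_column="biopsy_id"):
    """Count the flags set to 1 in each row, ignoring the identifier column."""
    counts = {}
    for row in csv.DictReader(handle):
        flags = [value for name, value in row.items() if name != id_column]
        counts[row[id_column]] = sum(1 for value in flags if value == "1")
    return counts


counts = comorbidity_counts(sample)
```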
Treatments
treatments.csv
This table contains the treatments for patients in Providence’s cancer registry. Treatment at another health system would not be recorded in this table. A helpful resource is the SEER Program Coding and Staging Manual 2021.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
cancer_registry_dx_dt | Cancer diagnosis date | 2156-01-01 |
most_definitive_surgical_procedure_cd | For codes and additional detail, see SEER 2021 Manual, Breast Surgery Codes | 22 |
most_definitive_radiation_modality_cd | For codes and additional detail, see SEER Program Coding and Staging Manual 2021, pg. 191 | 31 |
surgical_margin_cd | For codes and additional detail, see SEER Program Coding and Staging Manual 2021, pg. 166 | 8 |
radiation_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 134a | 1 |
chemo_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 137b | 87 |
immuno_therapy_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 139b | 1 |
hormone_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 138b | 87 |
{therapeutic modality}_dt | Multiple columns, each giving the date a therapy was administered (e.g. rx_chemo_dt, first_surgery_dt) | 2156-01-13 |
stg_dx_summ_cd | For codes and additional detail, see NAACCR archives | 2 |
Pathology items
pathology-items.csv
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
grade_clinical | Grade before any treatment. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 2 |
grade_pathological | Grade after resection. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 2 |
er_summary | 0: ER negative; 1: ER positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1 |
pr_summary | 0: PR negative; 1: PR positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1 |
her2_summary | 0: HER2 negative; 1: HER2 positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 0 |
multigene_signature_method | For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1 |
multigene_signature_result | For codes and additional detail, see NAACCR Site Specific Data Items, breast | X4 |
response_neoadjuvant_therapy | For codes and additional detail, see SEER 2021 Manual, Neoadjuvant treatment effect, breast | 2 |
Cancer diagnosis
cancer-dx.csv
This table contains the diagnoses for the patient cohort whose ICD-9 codes have a whole-number part of 174, 175, 196, 197, or 198. These codes cover breast cancer and metastatic cancer diagnoses. The ‘strict’ metastatic diagnosis in the outcomes table is derived from these codes as described in the documentation.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | 8dc5bacb-5904-45c4-9136-8aa8e16c3711 |
icd9 | ICD-9 diagnosis code for the patient, with whole-number part 174, 175, 196, 197, or 198 | 174.1 |
dx_dt | Date of diagnosis | 2153-04-03 |
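The whole-number part of an ICD-9 code is the portion before the decimal point, so a code such as 174.1 belongs to the 174 group. The sketch below shows one way to extract it and sort codes into the two broad groups named above: breast cancer (174, 175) and secondary, that is metastatic, neoplasms (196, 197, 198). The function names are illustrative, and this grouping is not the ‘strict’ metastatic definition, which is described in the documentation.

```python
# Whole-number ICD-9 groups covered by cancer-dx.csv.
BREAST_CODES = {174, 175}
SECONDARY_CODES = {196, 197, 198}


def icd9_whole_number(code):
    """Return the integer part of an ICD-9 code such as '174.1'."""
    return int(str(code).split(".")[0])


def classify_icd9(code):
    """Label a code as 'breast', 'secondary', or 'other' by its whole number."""
    whole = icd9_whole_number(code)
    if whole in BREAST_CODES:
        return "breast"
    if whole in SECONDARY_CODES:
        return "secondary"
    return "other"


# Example codes; only 174.1 appears as a sample in the table above.
labels = {code: classify_icd9(code) for code in ["174.1", "197.0", "198"]}
```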