Data dictionary v2.1
Dataset Changes
v2.1.1 Changes (Aug 2024)
- Changes to Existing Fields
  - Date fields have been reshifted
- New Fields
  - Added to outcomes.csv
    - specimen_taken_dt
    - vital_status
    - last_contact_dt
    - recurrence_dt
    - recurrence_type
- New Table
v2.1 Changes (Mar 2024)
- Major change: An error was identified in the date shifting. The dates were shifted per biopsy and not per patient. This has been fixed, and the date differences between events for a patient will be the same as the date differences before de-identification.
- All cancer diagnosis codes were added to cancer-dx.csv
- slide-biopsy-map.csv file name changed to biopsy-slides.csv
- Two new tables
Dataset Reorganization (Feb 2023)
- Demographic fields moved from outcomes.csv to demographics.csv
- .ndpi/ directory flattened. The year directories were removed.
File tree changes
  .
  └── brca-psj-path
      ├── ...
      ├── v2
      │   ├── cancer-dx.csv
      │   ├── comorbidities.csv
+     │   ├── demographics.csv
      │   ├── outcomes.csv
      │   ├── pathology-items.csv
      │   ├── slide-biopsy-map.csv
      │   ├── social-determinants.csv
      │   └── treatments.csv
      └── ndpi
-         ├── 2016
          │   ├── 0035de3d-81ec-4945-a760-55518ba8b376.ndpi
          │   └── ...
-         ├── 2017
          │   ├── 00a94273-e9ab-42f5-a47e-512a13e8603e.ndpi
          │   └── ...
          ...
Note
The dates in this dataset have been shifted by a random number of days. All dates for any particular patient have been shifted by the same amount in order to preserve the time duration between events.
Male patients made up less than 2% of biopsy patients and were excluded from the dataset.
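
Because every date for a patient is shifted by the same amount, absolute dates carry no meaning but intervals between a patient's events do. A minimal sketch of working with intervals follows. The helper name `days_between` and the offset value are hypothetical and chosen only for illustration; the two dates are the sample values from the outcomes table.

```python
from datetime import date, timedelta


def days_between(start_iso: str, end_iso: str) -> int:
    """Return the number of days from start to end (ISO YYYY-MM-DD strings)."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


# Sample values from the outcomes.csv table in this document.
biopsy_dt = "2152-01-01"
death_dt = "2155-08-24"
interval = days_between(biopsy_dt, death_dt)

# Apply the same hypothetical offset to both dates, as a per-patient shift would.
offset = timedelta(days=-48000)
shifted_biopsy = (date.fromisoformat(biopsy_dt) + offset).isoformat()
shifted_death = (date.fromisoformat(death_dt) + offset).isoformat()

# The interval is unchanged by the shift.
shifted_interval = days_between(shifted_biopsy, shifted_death)
```

Analyses should therefore be built on differences such as `interval` rather than on calendar years or seasons derived from the shifted dates.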
File tree
.
└── brca-psj-path
    ├── ...
    ├── v2.1
    │   ├── biopsy-slides.csv
    │   ├── cancer-dx.csv
    │   ├── cancer-staging.csv
    │   ├── ccsr-dx.csv
    │   ├── comorbidities.csv
    │   ├── demographics.csv
    │   ├── other-dx.csv
    │   ├── outcomes.csv
    │   ├── pathology-items.csv
    │   ├── social-determinants.csv
    │   └── treatments.csv
    └── ndpi
        ├── 0035de3d-81ec-4945-a760-55518ba8b376.ndpi
        ├── 00a94273-e9ab-42f5-a47e-512a13e8603e.ndpi
        └── ...
Slide biopsy mapping
biopsy-slides.csv
This table contains the mapping between the digital pathology image files and the corresponding biopsies. One or more slides are produced from the tissue samples of a biopsy procedure. The number of slides for each biopsy in the dataset can vary from 1 to 100.
Column Name | Description | Sample |
---|---|---|
slide_id | Unique identifier for each digital pathology image | c9cc2d38-a042-4883-9ab1-141e7b876678 |
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715 |
slide_path | Filepath for the NDPI file of the slide | /path/to/{slide_id}.ndpi |
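
Since one biopsy can map to many slides, a common first step is to group slide paths by `biopsy_id`. The sketch below assumes the CSV header matches the column names in the table above; the rows are hypothetical placeholders, not real identifiers, and `slides_by_biopsy` is an illustrative helper name.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the biopsy-slides.csv layout; the IDs are shortened placeholders.
SAMPLE_CSV = """slide_id,biopsy_id,patient_ngsci_id,slide_path
s1,b1,p1,/path/to/s1.ndpi
s2,b1,p1,/path/to/s2.ndpi
s3,b2,p2,/path/to/s3.ndpi
"""


def slides_by_biopsy(csv_file):
    """Map each biopsy_id to the list of slide paths recorded for it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["biopsy_id"]].append(row["slide_path"])
    return dict(grouped)


slide_map = slides_by_biopsy(io.StringIO(SAMPLE_CSV))
# Number of slides per biopsy.
slide_counts = {biopsy_id: len(paths) for biopsy_id, paths in slide_map.items()}
```

To run this against the real file, pass an open file handle for biopsy-slides.csv instead of the in-memory sample.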
Outcomes
outcomes.csv
This table contains outcomes for each biopsy case. Some patients have multiple biopsies.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715 |
case_year | Year of biopsy | 2018 |
biopsy_dt | Date the biopsy was ordered | 2152-01-01 |
specimen_taken_dt | Date the specimen was taken | 2152-01-02 |
mortality | Whether there is a record of patient death. 0 : no death record; 1 : death | 1 |
death_dt | Date of patient death, if on record | 2155-08-24 |
vital_status | Whether the patient is determined to be alive. 0 : not alive; 1 : alive. NAACCR Data Item | 0 |
last_contact_dt | Date the patient was last contacted to determine vital status. NAACCR Data Item | 2155-08-24 |
recurrence_dt | Date of cancer recurrence. NAACCR Data Item | 2154-05-02 |
recurrence_type | Type of cancer recurrence. NAACCR Data Item | 04 |
stage | Cancer stage for patient in the year of the biopsy | IA |
strict_metastatic_dx | Whether the patient has a strict metastatic diagnosis as described in the documentation. 0 : no; 1 : yes | 0 |
strict_metastatic_dx_dt | Date of first strict metastatic disease diagnosis | 2154-01-30 |
in_registry | Whether entries were found in the Providence cancer registry for the patient that match the time of the biopsy. 0 : not in cancer registry; 1 : in cancer registry | 1 |
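
Because a patient can appear in several rows, patient-level analyses usually need one row per patient. The sketch below keeps the earliest biopsy per patient. It is one possible convention, not a rule stated by this dataset. The rows are hypothetical placeholders that use only three of the documented columns, and `first_biopsy_per_patient` is an illustrative helper name.

```python
import csv
import io
from datetime import date

# Hypothetical rows using a subset of the outcomes.csv columns.
SAMPLE_OUTCOMES = """biopsy_id,patient_ngsci_id,biopsy_dt
b1,p1,2152-01-01
b2,p1,2153-06-15
b3,p2,2152-03-10
"""


def first_biopsy_per_patient(csv_file):
    """Return {patient_ngsci_id: biopsy_id} for each patient's earliest biopsy_dt."""
    first = {}
    for row in csv.DictReader(csv_file):
        patient_id = row["patient_ngsci_id"]
        biopsy_date = date.fromisoformat(row["biopsy_dt"])
        # Keep this row if the patient is new or this biopsy is earlier.
        if patient_id not in first or biopsy_date < first[patient_id][1]:
            first[patient_id] = (row["biopsy_id"], biopsy_date)
    return {pid: biopsy_id for pid, (biopsy_id, _) in first.items()}


index_biopsies = first_biopsy_per_patient(io.StringIO(SAMPLE_OUTCOMES))
```

Comparing shifted dates within one patient is safe because all of that patient's dates share the same shift.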
Cancer Staging
cancer-staging.csv
Cancer staging quantifies the extent to which cancer has progressed in a patient. Here is an overview of cancer staging.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715 |
stage | Cancer Stage | IA |
assessment_type | The type of staging, which describes when the staging takes place. clinical : clinical staging takes place before treatment; pathological : pathological staging takes place as part of the first surgery | clinical |
t_value | Tumor component of staging system. Describes the size and extent of the main tumor | T1 |
n_value | Node component of staging system. The number of nearby lymph nodes that have cancer | N0 |
m_value | Metastatic component of staging system. Whether the cancer has metastasized | M0 |
stage_dt | Date of staging | 2152-01-25 |
Demographics
demographics.csv
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
sex | Sex of patient | F |
race | Self-identified race. 1 : White or Caucasian; 2 : Black or African American; 3 : American Indian or Alaska Native; 4 : Asian; 5 : Native Hawaiian or Pacific Islander; 8 : other; 9 : unknown | 1 |
ethnicity | Self-identified ethnicity. 0 : Non-Hispanic or Latino; 1 : Hispanic or Latino; 9 : unknown | 0 |
birth_dt | De-identified date of birth | 2041-03-15 |
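
The coded `race` and `ethnicity` values can be turned into readable labels with a lookup table. The sketch below copies the code lists from the table above. The helper name `decode_demographics` is hypothetical, and it assumes codes arrive as strings, as they would from `csv.DictReader`.

```python
# Code labels copied from the demographics.csv table in this document.
RACE_LABELS = {
    "1": "White or Caucasian",
    "2": "Black or African American",
    "3": "American Indian or Alaska Native",
    "4": "Asian",
    "5": "Native Hawaiian or Pacific Islander",
    "8": "other",
    "9": "unknown",
}
ETHNICITY_LABELS = {
    "0": "Non-Hispanic or Latino",
    "1": "Hispanic or Latino",
    "9": "unknown",
}


def decode_demographics(row: dict) -> dict:
    """Return a copy of a demographics row with race and ethnicity codes replaced by labels.

    Codes that are missing from the lookup tables are passed through unchanged.
    """
    decoded = dict(row)
    decoded["race"] = RACE_LABELS.get(str(row["race"]).strip(), row["race"])
    decoded["ethnicity"] = ETHNICITY_LABELS.get(
        str(row["ethnicity"]).strip(), row["ethnicity"]
    )
    return decoded


# Sample values from the table above.
sample_row = {
    "biopsy_id": "b2423e0f-b92f-44ad-8d83-c45b0066a68a",
    "sex": "F",
    "race": "1",
    "ethnicity": "0",
    "birth_dt": "2041-03-15",
}
decoded_row = decode_demographics(sample_row)
```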
Social determinants
social-determinants.csv
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
bmi | The last recording of BMI at or before the date of biopsy [units: kg/m²] | 25 |
tobacco | 0 : no documented smoking; 1 : ICD-10 codes F17.XX or Z72.0X | 0 |
Comorbidities
comorbidities.csv
Comorbidities are those included in the Charlson Comorbidity Index (CCI) and were obtained from patient charts using ICD-9 and ICD-10 codes. Comorbidities were only included if patients were diagnosed in the two years before the biopsy date.
For each included comorbidity, 0 : does not have diagnosis, 1 : has diagnosis.
Treatments
treatments.csv
This table contains the treatments for patients in Providence’s cancer registry. Treatment at another health system would not be recorded in this table. A helpful resource is the SEER Program Coding and Staging Manual 2021.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
cancer_registry_dx_dt | Cancer diagnosis date | 2156-01-01 |
most_definitive_surgical_procedure_cd | For codes and additional detail, see SEER 2021 Manual, Breast Surgery Codes | 22 |
most_definitive_radiation_modality_cd | For codes and additional detail, see SEER Program Coding and Staging Manual 2021, pg. 191 | 31 |
surgical_margin_cd | For codes and additional detail, see SEER Program Coding and Staging Manual 2021, pg. 166 | 8 |
radiation_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 134a | 1 |
chemo_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 137b | 87 |
immuno_therapy_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 139b | 1 |
hormone_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 138b | 87 |
{therapeutic modality}_dt | Multiple data items with the date of administered therapy (e.g. rx_chemo_dt, first_surgery_dt) | 2156-01-13 |
stg_dx_summ_cd | For codes and additional detail, see NAACCR archives | 2 |
Pathology items
pathology-items.csv
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
grade_clinical | Grade before any treatment. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 2 |
grade_pathological | Grade after resection. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 2 |
er_summary | 0 : ER negative; 1 : ER positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1 |
pr_summary | 0 : PR negative; 1 : PR positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1 |
her2_summary | 0 : HER2 negative; 1 : HER2 positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 0 |
multigene_signature_method | For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1 |
multigene_signature_result | For codes and additional detail, see NAACCR Site Specific Data Items, breast | X4 |
response_neoadjuvant_therapy | For codes and additional detail, see SEER 2021 Manual, Neoadjuvant treatment effect, breast | 2 |
Cancer diagnosis
cancer-dx.csv
This table contains all the cancer diagnosis codes that can be found in the Providence EHR for the patient cohort.
Column Name | Description | Sample |
---|---|---|
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715 |
icd9 | ICD-9 diagnosis codes | 174.4 |
icd10 | ICD-10 diagnosis codes | C50.411 |
dx_dt | Date of diagnosis (date-shifted) | 2153-04-03 |
Other Diagnosis
other-dx.csv
This table contains diagnosis codes for the patient cohort. Currently, it contains only cardiovascular codes.
Column Name | Description | Sample |
---|---|---|
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715 |
icd9 | ICD-9 diagnosis codes | 429.3 |
icd10 | ICD-10 diagnosis codes | I51.7 |
dx_dt | Date of diagnosis (date-shifted) | 2153-07-05 |
CCSR Diagnosis Categories
ccsr-dx.csv
The Clinical Classifications Software Refined (CCSR) for ICD-10-CM diagnoses aggregates more than 70,000 ICD-10-CM diagnosis codes into over 530 clinically meaningful categories. For more details about CCSR, refer here.
For a quick reference of the CCSR categories, refer here.
This table has one row for each combination of biopsy and CCSR category. There are three flags.
- Prior - Whether the patient had any diagnosis code from the category before the biopsy.
- Post_1yr - Whether the patient had any diagnosis code from the category within the year after the biopsy.
- Post_1yr_plus - Whether the patient had any diagnosis code from the category past one year after the biopsy.
If all three are 0, the row is omitted.
Column Name | Description | Sample |
---|---|---|
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a |
ccrs_category | CCSR category | NEO070 |
prior | The patient had a diagnosis in this category prior to the biopsy | 1 |
post_1yr | The patient had a diagnosis in this category within one year after the biopsy | 1 |
post_1yr_plus | The patient had a diagnosis in this category past one year after the biopsy | 0 |
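
Because rows with all three flags equal to 0 are omitted, a missing (biopsy, category) pair should be read as "no diagnosis in this category" and not as missing data. The sketch below shows one way to handle that. It assumes the CSV header matches the column names in the table above, including the `ccrs_category` spelling. The rows are hypothetical, `XXX000` is a made-up category code, and the helper names are illustrative.

```python
import csv
import io

FLAG_COLUMNS = ("prior", "post_1yr", "post_1yr_plus")

# Hypothetical rows in the ccsr-dx.csv layout; XXX000 is a made-up category code.
SAMPLE_CCSR = """biopsy_id,ccrs_category,prior,post_1yr,post_1yr_plus
b1,NEO070,1,1,0
b1,XXX000,0,1,0
b2,NEO070,0,0,1
"""


def load_flags(csv_file):
    """Index the three flags by (biopsy_id, category)."""
    flags = {}
    for row in csv.DictReader(csv_file):
        key = (row["biopsy_id"], row["ccrs_category"])
        flags[key] = {name: int(row[name]) for name in FLAG_COLUMNS}
    return flags


def get_flags(flags, biopsy_id, category):
    """Return the flags for a pair; an omitted row means all three flags are 0."""
    return flags.get((biopsy_id, category), dict.fromkeys(FLAG_COLUMNS, 0))


def is_new_within_year(flags, biopsy_id, category):
    """True if the category first appears within the year after the biopsy."""
    entry = get_flags(flags, biopsy_id, category)
    return entry["prior"] == 0 and entry["post_1yr"] == 1


ccsr_flags = load_flags(io.StringIO(SAMPLE_CCSR))
```

With the sample rows, `get_flags(ccsr_flags, "b2", "XXX000")` falls back to all zeros because that pair has no row.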