Data dictionary v1
File tree
    .
    └── fracture-aimi-xray
        ├── ADT.csv
        ├── DX.csv
        ├── encounter.csv
        ├── flowsheet.csv
        ├── Lab
        │   ├── lab000000000000.csv
        │   ├── lab000000000001.csv
        │   ├── lab000000000002.csv
        │   ├── ...
        │   └── lab000000000027.csv
        ├── procedure.csv
        └── images
            ├── train [64540 entries]
            │   ├── patient00001
            │   │   └── study1
            │   │       └── view1_frontal.jpg
            │   ├── patient00002
            │   │   ├── study1
            │   │   │   ├── view1_frontal.jpg
            │   │   │   └── view2_lateral.jpg
            │   │   └── study2
            │   │       └── view1_frontal.jpg
            │   ├── patient00003
            │   │   └── study1
            │   │       └── view1_frontal.jpg
            │   ├── patient00004
            │   │   └── study1
            │   │       ├── view1_frontal.jpg
            │   │       └── view2_lateral.jpg
            │   ...
            ├── train.csv
            ├── valid [200 entries]
            └── valid.csv
EHR
ADT
ADT.csv contains admissions data.
Index | Column Name | Data Type | Num. Unique |
---|---|---|---|
0 | Chexpert_ID | object | 50675 |
1 | event_id | int64 | 6642690 |
2 | department_id | float64 | 343 |
3 | room_id | float64 | 1399 |
4 | room_csn_id | float64 | 1965 |
5 | bed_id | float64 | 4532 |
6 | bed_csn_id | float64 | 5052 |
7 | bed_status_c | float64 | 3 |
8 | jittered_effective_time | object | 5253 |
9 | jittered_event_time | object | 5020 |
- Rows: 6,642,690
- Columns: 10
- Median patient has 66 events (25th: 26, 75th: 158, Max: 3936)
- Most records (80%) fall between 2010 and 2016
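The per-patient statistics above could be recomputed along the following lines. This is a minimal sketch using only the Python standard library. It assumes that each row of ADT.csv is one event and that Chexpert_ID identifies the patient. The helper names (`events_per_patient`, `summarize`, `summarize_adt`) are illustrative, and the quantile method may differ from the one used to produce the figures in this document.

```python
import csv
from collections import Counter
from statistics import quantiles


def events_per_patient(rows):
    """Count rows per Chexpert_ID from an iterable of dict-like rows."""
    return Counter(row["Chexpert_ID"] for row in rows)


def summarize(counts):
    """Return the 25th percentile, median, 75th percentile and max of the counts."""
    values = sorted(counts.values())
    # "inclusive" interpolates linearly between observed values (an assumption
    # about how the original figures were computed).
    p25, p50, p75 = quantiles(values, n=4, method="inclusive")
    return {"p25": p25, "median": p50, "p75": p75, "max": values[-1]}


def summarize_adt(path):
    """Stream ADT.csv so that its ~6.6M rows are never held in memory at once."""
    with open(path, newline="") as f:
        return summarize(events_per_patient(csv.DictReader(f)))
```

The same two helpers apply unchanged to any of the other tables keyed by Chexpert_ID.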
Diagnoses
DX.csv contains diagnosis data.
Index | Column Name | Data Type | Num. Unique |
---|---|---|---|
0 | Chexpert_ID | object | 51056 |
1 | dx_id | int64 | 133534 |
2 | dx_name | object | 124375 |
3 | icd10 | object | 27896 |
4 | icd9 | object | 14698 |
5 | diagnosis_date | object | 7165 |
- Rows: 5,290,238
- Columns: 6
- Median patient has 61 diagnoses (25th: 24, 75th: 131, Max: 1442)
- Top 5000 diagnoses make up 80% of all diagnoses
- Most common diagnoses:
  - nonspecific abnormal finding of lung field
  - Unspecified essential hypertension
  - Nonspecific abnormal electrocardiogram (ECG) (EKG)
  - Shortness of breath
  - Unspecified pleural effusion
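A figure such as "the top 5000 diagnoses make up 80% of all diagnoses" can be checked with a frequency count. The sketch below is illustrative only: `top_k_coverage` is a hypothetical helper, and which column identifies a diagnosis (dx_id or dx_name) is an assumption that this document does not settle.

```python
from collections import Counter


def top_k_coverage(values, k):
    """Fraction of all rows accounted for by the k most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total
```

For example, passing a generator over the dx_name column of DX.csv (read with `csv.DictReader`) together with `k=5000` would return the share of rows covered by the 5000 most frequent diagnosis names.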
Encounter
Encounter.csv contains encounters with Stanford Hospitals.
Index | Column Name | Data Type | Num. Unique |
---|---|---|---|
0 | Chexpert_ID | object | 63973 |
1 | pt_class | object | 52 |
2 | Contact_Date | object | 6110 |
3 | ADT_Arrival | object | 3012 |
4 | Hospital_Admission | object | 5786 |
5 | Appointment | object | 6115 |
6 | appt_type | object | 520 |
7 | appt_status | object | 7 |
8 | admission_type | object | 7 |
9 | appt_description | object | 48 |
- Rows: 5,591,038
- Columns: 10
- Median patient has 5 encounters (25th: 2, 75th: 18, Max: 3680)
- Seven (non-NA) admission types:
  - Elective (1.4M)
  - Emergency (81K)
  - Urgent (38K)
  - Trauma Center (3.3K)
  - Outpatient (129)
  - Newborn (76)
  - Emergent (16)
- Most common encounters are:
  - Appointment (1.62M)
  - Orders Only (763K)
  - BPA (422K)
  - Scan (284K)
  - Hx Scan (261K)
  - Surgery (233K)
  - Hx Clinic (215K)
  - Admission (Discharged) (138K)
  - Anesthesia Event (123K)
  - TransChart Notes Only (113K)
- The appointment description (appt_description) only contains the reason for cancellation
Flowsheet
Flowsheet.csv contains patient health metrics. There are some outliers that are most likely data entry errors.
Index | Column Name | Data Type | Num. Unique |
---|---|---|---|
0 | Chexpert_ID | object | 37186 |
1 | Recorded_Date | object | 4933 |
2 | WEIGHT | float64 | 4266 |
3 | WEIGHT_UNITS | object | 1 |
4 | HEIGHT | float64 | 817 |
5 | HEIGHT_UNITS | object | 1 |
6 | BP | object | 5322 |
7 | BP_UNITS | object | 1 |
8 | TEMP | float64 | 58 |
9 | TEMP_UNITS | object | 1 |
10 | BMI | float64 | 23965 |
11 | BMI_UNITS | object | 1 |
- Rows: 37,186
- Columns: 12
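Because BP is stored as text and the table contains likely data entry errors, a small parsing and plausibility check may help before analysis. The sketch below assumes BP values look like "120/80" (systolic/diastolic). That format is not confirmed by this document. The bounds in `PLAUSIBLE` are hypothetical values chosen for illustration and are not clinical guidance.

```python
def parse_bp(bp):
    """Split a 'systolic/diastolic' string into two ints, or return None."""
    if not bp:
        return None
    parts = bp.strip().split("/")
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


# Hypothetical plausibility bounds, for illustration only.
PLAUSIBLE = {"systolic": (50, 300), "diastolic": (20, 200)}


def is_plausible_bp(bp):
    """True if the value parses and both readings fall inside the assumed bounds."""
    parsed = parse_bp(bp)
    if parsed is None:
        return False
    systolic, diastolic = parsed
    lo_s, hi_s = PLAUSIBLE["systolic"]
    lo_d, hi_d = PLAUSIBLE["diastolic"]
    return lo_s <= systolic <= hi_s and lo_d <= diastolic <= hi_d and diastolic < systolic
```

Rows that fail the check are better flagged than silently dropped, since a missing or malformed reading is not necessarily an error.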
Procedure
Procedure.csv contains patient health procedures.
Index | Column Name | Data Type | Num. Unique |
---|---|---|---|
0 | Chexpert_ID | object | 37002 |
1 | description | object | 9389 |
2 | code_type | object | 3 |
3 | code | object | 9228 |
4 | Procedure_Date | object | 2923 |
- Rows: 6,476,021
- Columns: 5
- Most common procedures:
Index | description | count |
---|---|---|
0 | RADEX CH 1 VIEW FRNT | 1064973 |
1 | SUBSEQ HOSPITAL EVAL/MGMT/HIGH COMPLEX/35 MIN | 389407 |
2 | CRITICAL CARE/EVAL/MGTMT; FIRST 30-74 MINUTES | 369392 |
3 | ECG ROUTINE ECG W/LEAST 12 LDS I&R ONLY | 239297 |
4 | SUBSEQ HOSPITAL EVAL/MGMT/MOD COMPLEX/25 MIN | 232169 |
Lab
There are twenty-eight lab files, e.g. lab000000000000.csv.
Index | Column Name | Data Type | Num. Unique |
---|---|---|---|
0 | Chexpert_ID | object | 40645 |
1 | Lab_Order_Date | object | 3811 |
2 | Lab_Taken_Date | object | 3825 |
3 | Lab_Result_Date | object | 3864 |
4 | order_type | object | 18 |
5 | proc_code | object | 1637 |
6 | group_lab_name | object | 3294 |
7 | lab_name | object | 2816 |
8 | base_name | object | 3579 |
9 | ord_value | object | 73117 |
10 | ord_num_value | float64 | 12756 |
11 | reference_unit | object | 195 |
12 | result_flag | object | 8 |
13 | ordering_mode | object | 1 |
- Rows: 6,110,538
- Columns: 14
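Since the lab data is split across twenty-eight files, it is convenient to read them as one stream. The following is a minimal sketch. It assumes every shard carries its own header row with the same 14 columns, and `iter_lab_rows` is an illustrative name.

```python
import csv
from pathlib import Path


def iter_lab_rows(lab_dir):
    """Yield every row of every lab*.csv shard, in filename order."""
    # Zero-padded names sort lexicographically in numeric order.
    for shard in sorted(Path(lab_dir).glob("lab*.csv")):
        with shard.open(newline="") as f:
            yield from csv.DictReader(f)
```

Streaming row by row keeps memory use flat, at the cost of leaving every value as a string. Columns such as ord_num_value need an explicit conversion.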
Images
Each patient has a series of studies, each corresponding to a group of chest X-rays taken together. Each study has one or more views, e.g. view1_frontal.jpg. From the CheXpert documentation:
CheXpert is a dataset consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination from Stanford University Medical Center between October 2002 and July 2017, in both inpatient and outpatient centers.
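The patient/study/view hierarchy is encoded in each image path, so it can be recovered with a regular expression. This sketch is based on the paths shown in the file tree. It assumes the valid directory follows the same layout as train, which the tree does not show. `ViewPath` and `parse_view_path` are illustrative names.

```python
import re
from typing import NamedTuple, Optional


class ViewPath(NamedTuple):
    split: str        # "train" or "valid"
    patient: str      # e.g. "patient00002"
    study: int        # study number within the patient
    view: int         # view number within the study
    orientation: str  # e.g. "frontal" or "lateral"


_PATTERN = re.compile(
    r"(?P<split>train|valid)/(?P<patient>patient\d+)/study(?P<study>\d+)/"
    r"view(?P<view>\d+)_(?P<orientation>[a-z]+)\.jpg$"
)


def parse_view_path(path) -> Optional[ViewPath]:
    """Parse an image path into its components, or return None if it does not match."""
    m = _PATTERN.search(str(path).replace("\\", "/"))
    if m is None:
        return None
    return ViewPath(m["split"], m["patient"], int(m["study"]), int(m["view"]), m["orientation"])
```

Grouping the parsed results by `(patient, study)` then gives the set of views taken together in one study.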
Labels
From the CheXpert documentation:
Each report was labeled for the presence of 14 observations as positive, negative, or uncertain. We decided on the 14 observations based on the prevalence in the reports and clinical relevance, conforming to the Fleischner Society’s recommended glossary whenever applicable. We then developed an automated rule-based labeler to extract observations from the free text radiology reports to be used as structured labels for the images.
Our labeler is set up in three distinct stages: mention extraction, mention classification, and mention aggregation. In the mention extraction stage, the labeler extracts mentions from a list of observations from the Impression section of radiology reports, which summarizes the key findings in the radiographic study. In the mention classification stage, mentions of observations are classified as negative, uncertain, or positive. In the mention aggregation stage, we use the classification for each mention of observations to arrive at a final label for the 14 observations (blank for unmentioned, 0 for negative, -1 for uncertain, and 1 for positive).
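The four-way encoding described above (blank, 0, -1, 1) can be decoded into readable categories as sketched below. The labels are presumably stored in train.csv and valid.csv, but this document does not state that. `decode_label` is a hypothetical helper, and treating a literal "nan" as blank is an assumption meant to cover values that have passed through a dataframe library.

```python
def decode_label(cell):
    """Map a raw label cell to 'unmentioned', 'negative', 'uncertain' or 'positive'."""
    text = "" if cell is None else str(cell).strip()
    # Blank means the observation was not mentioned in the report.
    if text == "" or text.lower() == "nan":
        return "unmentioned"
    names = {0.0: "negative", -1.0: "uncertain", 1.0: "positive"}
    value = float(text)  # raises ValueError on non-numeric text
    if value not in names:
        raise ValueError(f"unexpected label value: {cell!r}")
    return names[value]
```

Keeping "uncertain" and "unmentioned" as separate categories leaves the decision of how to treat them (for example as negative, as positive, or as masked) to the downstream model.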