Emory Knee Radiograph Dataset (MRKR)
Authors: Brandon Price1, Jason Adleberg2, Kaesha Thomas3, Zach Zaiman4, Aawez Mansuri5, Beatrice Brown-Mulry4, Chima Okecheukwu6, Judy Gichoya3, Hari Trivedi3
1 Department of Radiology, University of Florida, Gainesville, FL
2 Department of Radiology, Mount Sinai Health System, New York City, NY
3 Department of Radiology, Emory University, Atlanta, GA
4 Department of Computer Science, Emory University, Atlanta, GA
5 School of Medicine, Emory University, Atlanta, GA
6 Department of Computer Science, Georgia Institute of Technology, Atlanta, GA
Lead Nightingale analyst: Nick Foster
When using this resource, please cite:
Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, and Hari Trivedi. 2025. Emory Knee Radiograph Dataset. DOI:https://doi.org/10.48815/N53W2J
Additionally, please cite:
Sendhil Mullainathan and Ziad Obermeyer. 2022. Solving medicine’s data bottleneck: Nightingale Open Science. Nature Medicine 28, 5 (May 2022), 897–899. DOI:https://doi.org/10.1038/s41591-022-01804-4
Dataset Overview
The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of whom are African American. The dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. It also includes imaging metadata such as image laterality, view type, and presence of hardware, which increases its value for research and model development.
MRKR addresses significant gaps in existing datasets by offering a more representative sample for studying osteoarthritis and related outcomes, particularly among minority populations, making it a valuable resource for clinicians and researchers.
Dataset Details
Patient Data
This dataset contains 503,261 knee radiographs of 83,011 patients. The patient cohort was developed by identifying adult patients who received knee radiographs between 2002 and 2021 at four hospitals affiliated with Emory University in Atlanta, GA, USA: two community hospitals, one urban hospital, and one academic hospital. Patients under 18 years of age were excluded from the dataset.
After the patient cohort was identified, patient data was extracted from a Clinical Data Warehouse (CDW), which regularly receives information from an Electronic Health Record system. This patient data includes demographic data, procedure codes, diagnosis codes, and patient pain records.
Dataset Summary
Characteristic | Group | Number of Patients | Percentage of Patients |
---|---|---|---|
Patient Demographics | | | |
Unique Patients | | 83,011 | 100% |
Sex | Female | 51,175 | 61.6% |
| Male | 31,836 | 38.4% |
Age, Years | Mean (St. Dev.) | 59.2 (15.4) | |
| Median | 61 | |
Race | White | 36,927 | 44.5% |
| Black | 33,503 | 40.4% |
| Asian | 2,893 | 3.5% |
| Unknown/Unreported | 8,751 | 10.5% |
| Multiple | 536 | 0.6% |
| American Indian or Alaskan Native | 244 | 0.3% |
| Native Hawaiian or Other Pacific Islander | 157 | 0.2% |
Ethnicity | Hispanic | 2,501 | 3.0% |
| Non-Hispanic | 66,378 | 80.0% |
| Unknown/Unreported | 14,132 | 17.0% |
Clinical Outcomes | | | |
Arthroplasty | | 14,843 | 17.9% |
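The percentages in the Dataset Summary can be reproduced from the patient counts. The following is a minimal sketch of that arithmetic for the race rows, using only the counts reported in the table; the variable and function names are illustrative and are not part of the dataset.

```python
# Counts copied from the race rows of the Dataset Summary table.
TOTAL_PATIENTS = 83_011

race_counts = {
    "White": 36_927,
    "Black": 33_503,
    "Asian": 2_893,
    "Unknown/Unreported": 8_751,
    "Multiple": 536,
    "American Indian or Alaskan Native": 244,
    "Native Hawaiian or Other Pacific Islander": 157,
}


def percent_of_total(count, total=TOTAL_PATIENTS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# The race groups partition the cohort, so their counts sum to the total.
assert sum(race_counts.values()) == TOTAL_PATIENTS

race_percentages = {group: percent_of_total(n) for group, n in race_counts.items()}

for group, pct in race_percentages.items():
    print(f"{group}: {pct}%")
```

The same helper applies to any other row, for example `percent_of_total(14_843)` for the arthroplasty count.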
Radiographs
The X-ray images for this dataset were extracted from the institution’s Picture Archiving and Communications System (PACS) in DICOM format. DICOM files that included non-radiographic modalities or secondary captures were dropped. The pixel data of the images was de-identified using MD.ai, which removed names, dates, and times from the DICOM images.
The original imaging data was found to include radiographs of other anatomic regions, such as the chest or hand. To identify these images, a ConvNeXt model was trained on an annotated subset of the images, and the non-knee radiographs were excluded from the dataset.
The DICOM metadata was also found to be unreliable for the laterality and view type of the radiograph, as well as for weight-bearing status and the presence of arthroplasty. To address this, 6,000 manually annotated radiographs were used to train a multi-class ConvNeXt classifier to generate labels for these features.
Key Variables
Pain Score:
This dataset includes patient pain scores with the associated date. Each pain score is an integer from 0 to 10 and is accompanied by a free-text location of pain. Locations that are associated with the knee have been flagged. These scores can be found in the pain score table.
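As an illustration of how the knee flag and the 0 to 10 range might be used together, the sketch below selects knee-related pain records from a small set of made-up rows. The field names (`patient_id`, `recorded_on`, `score`, `location`, `knee_related`) and the example values are assumptions for illustration only; the actual column names in the pain score table may differ.

```python
from datetime import date

# Hypothetical rows; field names and values are illustrative assumptions,
# not the actual schema or contents of the pain score table.
pain_records = [
    {"patient_id": "P001", "recorded_on": date(2015, 3, 2), "score": 7,
     "location": "left knee", "knee_related": True},
    {"patient_id": "P001", "recorded_on": date(2015, 6, 9), "score": 3,
     "location": "lower back", "knee_related": False},
    {"patient_id": "P002", "recorded_on": date(2018, 1, 20), "score": 0,
     "location": "R knee", "knee_related": True},
]


def knee_pain_scores(records):
    """Keep knee-flagged records whose score is an integer from 0 to 10."""
    return [
        r for r in records
        if r["knee_related"]
        and isinstance(r["score"], int)
        and 0 <= r["score"] <= 10
    ]


knee_only = knee_pain_scores(pain_records)
for r in knee_only:
    print(r["patient_id"], r["recorded_on"].isoformat(), r["score"])
```

Because each score carries a date, the filtered records can then be aligned with radiograph dates for the same patient.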
Osteoarthritis Severity:
This dataset includes a Kellgren-Lawrence Grading Scale (KLG) score per knee to measure osteoarthritis severity. These scores were predicted using a state-of-the-art, open-source model developed by Duke University. The KLG score ranges from 0 (no radiographic evidence of osteoarthritis) to 4 (severe features of osteoarthritis). No systematic manual review or verification of the model's accuracy was performed on MRKR; however, an anecdotal review of several hundred images showed good performance, suggesting that the model generalized to the dataset.
These scores can be found in the x-ray image metadata table.
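A small sketch of how the per-knee KLG scores might be interpreted is shown below. The labels for grades 0 and 4 follow the description above; the labels for grades 1 to 3 are the conventional Kellgren-Lawrence terms. The threshold of grade 2 or higher for radiographic osteoarthritis is a common convention in the literature and is an assumption here, not a definition provided by the dataset. All names are illustrative.

```python
# Grade labels: 0 and 4 follow the dataset description; 1-3 are the
# conventional Kellgren-Lawrence terms.
KLG_LABELS = {
    0: "no radiographic evidence of osteoarthritis",
    1: "doubtful",
    2: "mild",
    3: "moderate",
    4: "severe",
}


def describe_klg(grade):
    """Return the label for a KLG grade, rejecting values outside 0-4."""
    if grade not in KLG_LABELS:
        raise ValueError(f"KLG grade must be an integer from 0 to 4, got {grade!r}")
    return KLG_LABELS[grade]


def has_radiographic_oa(grade, threshold=2):
    """Binarize a KLG grade. The default threshold of 2 is an assumed
    convention, not something specified by the dataset."""
    describe_klg(grade)  # validates the grade
    return grade >= threshold


for grade in range(5):
    print(grade, describe_klg(grade), has_radiographic_oa(grade))
```

Since the scores are model predictions that were not systematically verified on MRKR, any downstream threshold is worth checking against a manually reviewed sample.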
Diagnosis of Diseases of Interest:
This dataset includes diagnosis codes for patients. Several diseases of interest have been flagged, such as knee osteoarthritis. These have been flagged using diagnosis code prefixes and can be found in the diagnosis table.
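Prefix-based flagging can be sketched as follows. The dataset's actual prefix list is not given here, so the example assumes ICD-10-CM codes and uses `M17`, the ICD-10-CM category for osteoarthritis of the knee, as the only prefix; the function and variable names are hypothetical.

```python
# Assumption: ICD-10-CM codes, with M17 (osteoarthritis of knee) as the
# prefix of interest. The dataset's own prefix list may differ.
KNEE_OA_PREFIXES = ("M17",)


def matches_prefix(code, prefixes=KNEE_OA_PREFIXES):
    """Return True if a diagnosis code starts with any of the given prefixes.

    Codes are normalized by trimming whitespace, upper-casing, and removing
    the decimal point so that "M17.11" and "m1711" are treated alike.
    """
    normalized = code.strip().upper().replace(".", "")
    cleaned_prefixes = tuple(p.upper().replace(".", "") for p in prefixes)
    return normalized.startswith(cleaned_prefixes)


# Made-up example codes for illustration.
example_codes = ["M17.11", "m17.0", "M16.0", "S83.511A"]
flags = {code: matches_prefix(code) for code in example_codes}
print(flags)
```

Normalizing before matching avoids missing codes that are stored with inconsistent case or punctuation.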