Aaron Smith, BS

Evaluating GPT-4o for Automated Classification of Skin Lesions Using the HAM10000 Dataset

Article Type

Article

Changed

Thu, 02/26/2026 - 10:29

To the Editor:

The widespread availability and popularity of ChatGPT (OpenAI) have sparked interest in its potential applications within various fields, including medical diagnostics.¹ In dermatology, large language models (LLMs) already are being cited as a possible way to reliably respond to common patient queries and produce concise patient education materials.²,³ However, there is skepticism regarding the technology’s efficacy and reliability in producing accurate treatment plans, and performance varies among popular LLMs; for example, a recent study by Chau et al⁴ demonstrated that ChatGPT provided more specific and accurate patient-facing responses to questions about 5 dermatologic diagnoses than Google Bard (now rebranded as Google Gemini) and Bing AI (now rebranded as Microsoft Copilot), which more often produced inaccurate or nonspecific responses. Google Bard also declined to answer one prompt.⁴ Large language models also have been evaluated in diagnosing skin lesions. In 2024, SkinGPT-4 (a pretrained multimodal LLM developed by Zhou et al⁵) achieved just over 80% accuracy in interpreting images of skin lesions and was considered informative by 82.5% of board-certified dermatologists, demonstrating that LLMs may have the potential to become integrated into clinical practice.⁵

Our study aimed to evaluate the performance of GPT-4o (OpenAI)—a widely accessible, low-cost LLM—in diagnosing dermatologic conditions using the HAM10000 dataset, a well-curated collection of dermatoscopic images developed for training and benchmarking artificial intelligence (AI) algorithms.⁶ HAM10000 comprises images representing 7 distinct skin conditions: actinic keratoses (ak), basal cell carcinoma (bcc), benign keratosis (bk), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular skin lesions (vsl), providing a robust platform for multiclass classification assessment. We evaluated GPT-4o using 100 dermatoscopic images per condition to assess diagnostic accuracy, potential biases, and limitations in skin lesion identification. The HAM10000 dataset was selected because it offers a large standardized reference set of dermatoscopic (rather than conventional clinical) images commonly used in dermatologic AI research. GPT-4o was chosen due to its patient-friendly interface, widespread use, and prior reports suggesting greater reliability in skin lesion assessment compared with other LLMs.

One hundred images from each of the 7 dermatologic categories were randomly selected for use in our analysis in 2024. The images were selected by our data scientist (J.C.) through random sampling from the dataset. Each image was separately presented to GPT-4o without any preprocessing or modification alongside 2 prompts designed to evaluate the diagnostic capabilities of GPT-4o. Both prompts included the same list of 7 dermatologic conditions for answer choices but differed in contextual information, where prompt 1 provided patient demographic information and localization of the dermatological condition but prompt 2 did not provide these details (Table). No follow-up questions were presented.
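The per-class random sampling described in this paragraph can be sketched roughly as follows. This is an illustration only, not the authors' actual procedure: the helper name `sample_per_class`, the `(image_id, label)` record layout, the toy image ids, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_per_class(records, per_class=100, seed=0):
    """Draw `per_class` image ids at random from each label (hypothetical helper).

    `records` is an iterable of (image_id, label) pairs; the layout and
    the seed are assumptions for illustration, not taken from the study.
    """
    by_label = defaultdict(list)
    for image_id, label in records:
        by_label[label].append(image_id)

    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    sample = {}
    for label in sorted(by_label):
        ids = by_label[label]
        if len(ids) < per_class:
            raise ValueError(f"only {len(ids)} images available for {label!r}")
        sample[label] = rng.sample(ids, per_class)  # sampling without replacement
    return sample

# Toy data: 7 labels with 150 made-up image ids each.
labels = ["ak", "bcc", "bk", "df", "mel", "nv", "vsl"]
toy = [(f"{lab}_{i:04d}", lab) for lab in labels for i in range(150)]
picked = sample_per_class(toy, per_class=100, seed=42)
print({lab: len(ids) for lab, ids in picked.items()})
```

Sampling within each label, rather than from the pooled dataset, is what yields equal class sizes regardless of how common each diagnosis is in the source collection.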

For prompt 1, the confusion matrix showed a strong bias toward detecting mel and bcc, with high true positives (mel, 83%; bcc, 37%) (eFigure 1). This pattern possibly suggests a tendency to favor malignant labels (eg, mel, bcc) when uncertainty is present. Interestingly, df and vsl also had notable true positives (46% and 37%, respectively), which was unexpected for less critical conditions and indicates that the model’s correct classifications were uneven across benign lesions. Actinic keratoses and nv showed higher misclassification rates, suggesting the model struggled to distinguish them from other lesions.

eFIGURE 1. Confusion matrix for prompt 1. GPT-4o showed a bias toward predicting basal cell carcinoma and melanoma. The true category of each image was compared with its predicted category, and the image was counted in the corresponding cell of the confusion matrix.

As shown in eTable 1, prompt 1 exhibited the highest recall for mel at 0.83 but performed worse in precision (0.242) and specificity (0.567) compared to ak, which had an extremely low recall (0.03) but very high specificity (0.992) and a moderate precision score (0.375). The highest precision score was seen with vsl (0.738), which also achieved high scores in specificity (0.982) and accuracy (0.88) and performed moderately well in recall (0.31). All performance metrics are reported as proportions (0-1.0), wherein 1.0 indicates 100%.
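For readers who want to see how per-class figures like these follow from a confusion matrix, a minimal one-vs-rest sketch is given here. The function name `per_class_metrics` is hypothetical, and the example counts are assumptions: they are back-calculated from the reported mel proportions under the assumption that all 100 mel and 600 non-mel images received a label, and they are not taken from eTable 1 or eFigure 1.

```python
from typing import Dict, List

def per_class_metrics(matrix: List[List[int]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """One-vs-rest metrics; matrix[i][j] counts images of true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    out = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]                        # true class k, predicted k
        fn = sum(matrix[k]) - tp                 # true class k, predicted something else
        fp = sum(row[k] for row in matrix) - tp  # other true class, predicted k
        tn = total - tp - fn - fp                # everything else
        out[label] = {
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            "accuracy": (tp + tn) / total if total else 0.0,
        }
    return out

# Assumed counts: mel vs all other classes collapsed into "other".
# 83 of 100 mel images labeled mel; 260 of 600 non-mel images labeled mel (back-calculated).
metrics = per_class_metrics([[83, 17], [260, 340]], ["mel", "other"])
for name, value in metrics["mel"].items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, recall is 83/100, precision is 83/343, and specificity is 340/600, which round to the 0.83, 0.242, and 0.567 quoted for mel. The example also shows why high recall and low precision can coexist when a label is predicted far more often than it occurs.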

For prompt 2, the second confusion matrix followed trends similar to those for prompt 1 but differed in key areas (eFigure 2). Melanoma detection remained strong (true positives, 95%), while bcc showed slightly fewer true positives (24%). Vascular skin lesions improved in true positives (40%), and df dropped slightly (33%). The model continued to struggle with ak and nv, with notable misclassifications observed across other categories.

eFIGURE 2. Confusion matrix for prompt 2. GPT-4o showed a slight bias toward predicting basal cell carcinoma and melanoma. The true category of each image was compared with its predicted category, and the image was counted in the corresponding cell of the confusion matrix.

Similar to prompt 1, prompt 2 achieved its highest recall for mel (0.95) but demonstrated lower precision (0.223) and specificity (0.488) for this class. Prompt 2 also produced the highest accuracy for vsl (0.90). The highest specificity was observed for both bk and ak (0.992 each); however, ak again demonstrated the lowest recall, with a value of 0.01.

A previous study using a binary classification model to distinguish between mel and benign dermatologic conditions demonstrated poor performance.¹ Additionally, prior studies have employed a less strict, open-ended question approach to examine ChatGPT’s ability to diagnose mel, with limited efficacy.⁷ The HAM10000 dataset was specifically selected despite its limitations (including the absence of clinical images and limited diversity in skin tones) due to its comprehensive nature, robust annotation standards, and widespread acceptance in dermatologic AI research. Although it lacks the skin tone diversity of the Diverse Dermatology Images dataset, HAM10000 provides a balanced representation of several dermatologic conditions crucial for multiclass classification tasks, making it suitable for benchmarking AI performance. This study aimed to address the limitations of these earlier approaches by employing a multiclass classification approach; however, despite this switch, our results indicate continued and major limitations in the diagnostic capabilities of GPT-4o.

In its current form, GPT-4o appeared biased toward identifying specific and severe dermatologic conditions (eg, mel, bcc) but showed low and variable class-level performance for other categories (eg, ak, nv, df, vsl), with frequent misclassification as mel or bcc and low recall for some classes (eTables 1 and 2). This finding emphasizes that GPT-4o currently lacks the reliability needed for real-life clinical applications in dermatology, as both binary and multiclass approaches fail to achieve consistently accurate performance across all skin conditions. Notably, GPT-4o may generate false-positive malignant classifications because its predicted labels skew toward labeling benign lesions as malignant.

From the patient perspective, younger individuals may upload images of benign nevi only to unnecessarily fear a mel diagnosis after receiving GPT-4o results. Statistically, younger patients are less likely than older patients to have malignant lesions and more likely to instead present with common vsl or df—lesions that GPT-4o appears likely to identify correctly.⁸ For older users, however, the situation may differ. Beyond ak being misclassified as bcc, older patients also may encounter GPT-4o outputs that mislabel lesions as mel, raising concerns and heightening anxiety. Given the technology’s tendency to overestimate the risk of serious dermatologic conditions, this behavior poses a considerable challenge in its current state and may inadvertently intensify public anxiety around mel.

A notable limitation of our study was that the HAM10000 dataset, unlike some other publicly available datasets, includes only dermatoscopic images rather than a combination of clinical and dermatoscopic images. Furthermore, the HAM10000 dataset comprises images primarily from White patients, whereas more diverse databases (eg, the Diverse Dermatology Images dataset) may be more suitable for training AI algorithms to accurately diagnose skin lesions in individuals with a variety of skin tones.⁹

Ultimately, our results signal that major advancements in the design and training of LLMs such as GPT-4o are necessary before these systems can be integrated into dermatologic diagnostic decision-making to offer benefit rather than cause harm. Consulting a health care professional should remain the recommended standard of care for patients who suspect a skin lesion; relying solely on AI might lead to avoidable stress, unnecessary alarm, and potentially increased health care costs due to unwarranted follow-up and testing.

References

1. Caruccio L, Cirillo S, Polese G, et al. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst Appl. 2024;235:121186. doi:10.1016/j.eswa.2023.121186
2. Ferreira AL, Chu B, Grant-Kels JM, et al. Evaluation of ChatGPT dermatology responses to common patient queries. JMIR Dermatol. 2023;6:E49280. doi:10.2196/49280
3. Chen R, Zhang Y, Choi S, et al. The chatbots are coming: risks and benefits of consumer-facing artificial intelligence in clinical dermatology. J Am Acad Dermatol. 2023;89:872-874. doi:10.1016/j.jaad.2023.05.088
4. Chau C, Feng H, Cobos G, et al. The comparative sufficiency of ChatGPT, Google Bard, and Bing AI in answering diagnosis, treatment, and prognosis questions about common dermatological diagnoses. JMIR Dermatol. 2025;8:E60827. doi:10.2196/60827
5. Zhou J, He X, Sun L, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat Commun. 2024;15:5649. doi:10.1038/s41467-024-50043-3
6. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. doi:10.1038/sdata.2018.161
7. Shifai N, van Doorn R, Malvehy J, et al. Can ChatGPT vision diagnose melanoma? An exploratory diagnostic accuracy study. J Am Acad Dermatol. 2024;90:1057-1059. doi:10.1016/j.jaad.2023.12.062
8. Cortez JL, Vasquez J, Wei ML. The impact of demographics, socioeconomics, and health care access on melanoma outcomes. J Am Acad Dermatol. 2021;84:1677-1683. doi:10.1016/j.jaad.2020.07.125
9. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022;8:Eabq6147. doi:10.1126/sciadv.abq6147

Author and Disclosure Information

Nitin Chetla and Aaron Smith are from the School of Medicine, University of Virginia, Charlottesville. Matthew Chen and Priyanka Kadam are from the Renaissance School of Medicine, Stony Brook University, New York. Tamer R. Hage is from the School of Medicine, Virginia Commonwealth University, Richmond. Joseph Chang is from the University of Passau, Germany. Dr. Ladrigan is from Comprehensive Dermatology of Rochester, New York.

The authors have no relevant financial disclosures to report.

Correspondence: Tamer R. Hage, BS (tamerwh@gmail.com).

Cutis. 2026 March;117(3):98-100, E2-E4. doi:10.12788/cutis.1359

Issue

Cutis - 117(3)

Publications

Cutis

MDedge Dermatology

Topics

Mixed Topics

Page Number

98-100

Sections

Original Research

Inside the Article

Practice Points

Even with a multiclass classification framework designed to assist GPT-4o, the model encountered notable challenges in accurately diagnosing skin lesions.
In its current form, GPT-4o may provide inaccurate and misleading information to patients who use its interface to evaluate suspected skin lesions. Patients should continue to seek clinical consultation from health care professionals.


Is Simultaneous Bilateral Total Knee Arthroplasty (BTKA) as Safe as Staged BTKA?

Article Type

Article

Changed

Thu, 09/19/2019 - 13:21

Author(s)

Scott Hadley, MD

Michael Day, MD, MPhil

Take-Home Points

Complication rates did not statistically significantly differ between simultaneous and staged TKA.
Length of stay of 2 TKA admissions was greater than 1 BTKA admission.
Transfusion requirements were greater in BTKA.
Avoid bilateral procedures in ASA 3 patients.
Develop institutional protocols for BTKA with multidisciplinary input.

In the United States, osteoarthritis is the most common cause of knee pain and one of the leading causes of disability.¹ Total knee arthroplasty (TKA) is an effective treatment for end-stage osteoarthritis of the knee.² Whether patients with severe, debilitating bilateral disease should undergo simultaneous bilateral TKA (BTKA) or staged BTKA (2 separate procedures during separate hospital admissions) continues to be debated. The relative risks and benefits of simultaneous BTKA relative to staged BTKA or unilateral TKA are controversial.³⁻⁶ Proponents of simultaneous BTKA have argued that this surgery results in shorter hospital length of stay (LOS) and higher patient satisfaction without increased risk of perioperative complications,⁷⁻⁹ and opponents have argued that it leads to increased perioperative mortality and complications and should not be performed routinely.¹⁰,¹¹

The safety of simultaneous BTKA cannot necessarily be extrapolated from data on unilateral TKA. Authors have argued that the complication rate for simultaneous BTKA is not comparable to the rate for unilateral TKA but instead is double the rate.¹² Although a doubled rate may more closely approximate the true risk of simultaneous BTKA, it still does not account for the increased surgical impact of 2 procedures (vs 1 procedure) on a patient. In this regard, comparing simultaneous and staged BTKA provides a more accurate assessment of risk, as long as the interval between surgeries is not excessive. The major stress experienced during TKA affects the cardiovascular, pulmonary, and musculoskeletal systems, and full recovery may take up to 6 months.¹³⁻¹⁵ Outcome studies have found significant improvement in validated measures of function and pain up to but not past 6 months.¹³,¹⁵ Furthermore, a large study comparing American Society of Anesthesiologists (ASA) scores with morbidity and mortality rates recorded in the New Zealand Total Joint Database established 6 months as a best approximation of postoperative mortality and morbidity risk.¹⁴ Given these data, we propose that the most accurate analysis of postoperative morbidity and mortality would be a comparison of simultaneous BTKA with BTKA staged <6 months apart. The staged procedures fall within the crucial postoperative period when increased morbidity and mortality would more likely be present. A between-surgeries interval >6 months would effectively separate the 2 procedures, rendering their risks not truly representative.

We retrospectively analyzed all simultaneous BTKA and staged BTKA (<6 months apart) surgeries performed at our orthopedic specialty hospital between 2005 and 2009. We hypothesized there would be no significant difference in perioperative morbidity or mortality between the groups.

Methods and Materials

Our institution’s Institutional Review Board approved this study. All patients who underwent either simultaneous BTKA or staged BTKA (<6 months apart) at a single orthopedic specialty hospital between 2005 and 2009 were retrospectively identified. Twenty-five surgeons performed the procedures. Which procedure to perform (simultaneous or staged) was decided by the attending surgeon in consultation with an anesthesiologist. Preoperative medical diagnostic testing was determined by the internist, who provided medical clearance, and was subject to review by the anesthesiologist. A patient was excluded from simultaneous BTKA only if the medical or anesthesiology consultant deemed the patient too high risk for bilateral procedures. Revision TKAs were excluded from the study.

Implant, approach, tourniquet use, and TKA technique were selected by the individual surgeons. Strategies for the simultaneous procedures were (1) single surgeon, single team, sequential, start second knee after closure of first, and (2) single surgeon, single team, sequential, start second knee after implantation of first but before closure. The decision to proceed with the second knee was confirmed in consultation with the anesthesiologist after implantation and deflation of the tourniquet on the first knee.

Individual electronic patient charts were reviewed for information on demographics, comorbidities, anesthesia type, antibiotics, and postoperative venous thromboembolism prophylaxis. Demographic variables included age, sex, height, weight, and body mass index (BMI). Comorbidities recorded were diabetes mellitus, coronary artery disease, prior myocardial infarction, stroke, and endocrinopathies. In addition, available ASA scores were recorded. The primary outcome was perioperative complications, defined as any complications that occurred within 6 months after surgery. These included death, pulmonary embolism (PE), and deep surgical-site infections (SSIs). Secondary outcome measures were LOS, discharge location (rehabilitation or home), and blood transfusion requirements.

The 2 groups (simultaneous BTKA, staged BTKA) were compared using Student t test for continuous variables and χ² test for categorical variables. Subgroup analysis was performed to compare healthier patients (ASA score 1 or 2) with patients who had more severe comorbidities (ASA score 3). Statistical significance was set at P < .05.
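As an illustration of the χ² comparison of categorical variables named here, a minimal sketch for a 2 × 2 table is given. The function name `chi_square_2x2` is hypothetical, the counts are made up and are not study data, and the sketch assumes a Pearson χ² test without continuity correction; the text does not state which variant the authors used.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be nonzero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts (NOT study data): rows are groups, columns are event / no event.
stat, p = chi_square_2x2(10, 90, 20, 80)
print(f"chi2 = {stat:.3f}, P = {p:.3f}")
```

When expected cell counts are small, as is typical for rare complications, an exact test is often preferred over this large-sample approximation.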

Results

Between 2005 and 2009, 371 patients had simultaneous BTKA, and 67 had staged BTKA (134 procedures) <6 months apart (Table 1).

Mean recovery interval between staged procedures was 4.3 months (range, 2-6 months). Mean age was 63.9 years (range, 44-88 years) for the simultaneous BTKA patients and 63.1 years (range, 35-81 years) for the staged BTKA patients (P = .105). Both groups had proportionately more female patients (69.8% in the simultaneous BTKA group, 64.2% in the staged BTKA group), but there was no sex difference between the groups (P = .359). There were 71 (19.1%) morbidly obese patients (body mass index [BMI], ≥40 kg/m²) in the simultaneous group and 14 (20.9%) in the staged group (P = .739). The groups had statistically similar proportions of diabetes mellitus and coronary artery disease (P = .283).

Most surgeries (84.4% simultaneous, 90.3% staged) were performed with the patient under spinal anesthesia, and there was a trend (P = .167) toward more frequent use of general anesthesia in the simultaneous group relative to the staged group (Table 2).

Intraoperative antibiotics were given in all cases, and there were no significant differences in antibiotic type between the groups. Postoperative chemical venous thromboembolism prophylaxis was administered to all patients, depending on surgeon preference, and there were no significant differences between the groups.

The 2 cohorts’ perioperative complication rates were not statistically significantly different (P = .97) (Table 3).

The simultaneous BTKA group had 13 complications: 7 PEs (1.9%), 5 deep SSIs (1.08%), and 1 respiratory arrest (0.27%). The staged BTKA group had only 1 complication, a deep SSI (0.75%). There were no significant differences in rates of individual complications (deep vein thrombosis, PE, SSI; P = .697) or intensive care unit admission (P = .312). Mean number of transfusion units was 1.39 for simultaneous BTKA and 0.66 for both staged TKAs combined (P = .042). Mean aggregated LOS for both procedures in the staged BTKA was 8.93 days per patient, and mean LOS for simultaneous BTKA was 4.94 days per patient, significantly shorter (P = .0001). The percentage of postoperative discharges from hospital to an inpatient acute rehabilitation center was significantly higher (P = .0001) in the simultaneous BTKA group (92.7%) than in the staged BTKA group (50.7%).

There was no statistically significant difference (P = .398) in occurrence of postoperative complications between the 2 cohorts compared on ASA scores, and the difference between patients with ASA score 1 or 2 and those with ASA score 3 was not statistically significant (P = .200) (Table 4).

There was a trend (P = .161) toward more complications in the 85 patients with BMI of ≥40 kg/m² (morbidly obese), of whom 5 (5.9%) had a complication, than in the patients with BMI of <40 kg/m², of whom 9 (2.6%) had a complication, but the difference was not statistically significant because of the sample size.

Discussion

Although there was no significant difference in postoperative complication rates within 6 months after surgery between the simultaneous and staged BTKA groups, the incidence of complications in the simultaneous group was notable. The disproportionate size of the 2 comparison groups limited the power of our study to analyze individual perioperative complications. This study may be underpowered to detect differences in complications occurring relatively infrequently, which may explain why the difference in number of complications (13 in simultaneous group, 1 in staged group) did not achieve statistical significance (β = 0.89). Post hoc power analysis showed 956 patients would be needed in each group to adequately power for such small complication rates. However, our results are consistent with those of other studies.¹³⁻¹⁵ The 1.9% PE rate in our simultaneous BTKA group does not vary from the average PE rate for TKA in the literature and is actually lower than the PE rate in a previous study at our institution.¹⁶ Fat embolism traditionally is considered more of a concern in bilateral cases than in unilateral cases. Although fat embolism surely is inherent to the physiologic alterations caused by TKA, we did not find clinically significant fat embolism in either cohort.
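To give a sense of why rare complications demand large groups, a sketch of one common normal-approximation formula for comparing two proportions is given here. The source does not state which method, rates, significance level, or target power the post hoc analysis used, so the function name `n_per_group`, the default α of 0.05, the default power of 0.80, and the input rates are all assumptions, and the result need not reproduce the figure of 956.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison of two proportions."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    p_bar = (p1 + p2) / 2
    pooled = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil((pooled + unpooled) ** 2 / (p1 - p2) ** 2)

# Illustrative inputs only: per-patient complication rates implied by the counts
# in the text (13 of 371 simultaneous, 1 of 67 staged).
print(n_per_group(13 / 371, 1 / 67))
```

The required size grows with the inverse square of the difference between the two rates, which is why event rates of a few percent or less push the estimate toward roughly a thousand patients per group.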

Similarly, the 1.08% rate of deep SSIs is within the range for postoperative TKA infections at our institution and others.¹⁷ Our staged BTKA group’s complication rate, 0.75% (1 SSI), was slightly lower than expected. However, 0.75% is in keeping with institutional norms (typical rate, ~1%). We would have expected a nonzero rate for venous thromboembolism, and perhaps such a rate would have come with an inclusion period longer than 6 months. Last, the death in the simultaneous BTKA group was not an outlier, given the published rate of mortality after elective total joint surgery.¹⁸

The characteristics of our simultaneous and staged BTKA groups were very similar (Table 1), though the larger number of staged-group patients with diabetes mellitus and coronary artery disease may represent selection bias. Nevertheless, the proportions of patients with each of 3 ASA scores were similar. It is also important to note that, in this context, a high percentage of patients in each group (33.6% simultaneous, 37.5% staged) received ASA score 3 from the anesthesiologist (P > .05). This may be an important factor in explaining the larger though not statistically significant number of complications in the simultaneous group (13) relative to the staged group (1).

We therefore consider ASA score 3 to be a contraindication to a bilateral procedure, and for simultaneous BTKA we have developed a set of exclusion criteria that include ASA score 3 or 4 (Table 5). These criteria reflect input from our surgeons, anesthesiologist, and medical specialists, as well as the data presented here.

Other authors have studied the safety of simultaneous vs staged BTKA and drawn conflicting conclusions.¹¹,¹⁹⁻²¹ Walmsley and colleagues²¹ found no differences in 90-day mortality between 3 groups: patients with simultaneous BTKA, patients with BTKA staged within 5 years, and patients with unilateral TKA. Stefánsdóttir and colleagues¹¹ found that, compared with simultaneous BTKA, BTKA staged within 1 year had a lower 30-day mortality rate. Meehan and colleagues²⁰ compared simultaneous BTKA with BTKA staged within 1 year and found a lower risk of infection and device malfunction and a higher risk of adverse cardiovascular outcomes in the simultaneous group. A recent meta-analysis found that, compared with staged BTKA, simultaneous BTKA had a higher risk of perioperative complications.¹⁹ A systematic review of retrospective studies found simultaneous BTKA had higher rates of mortality, PE, and transfusion and lower rates of deep SSI and revision.²² A survey of Medicare data found higher 90-day mortality and myocardial infarction rates for simultaneous BTKA but no difference in infection and revision rates.²³ Clearly, there is no consensus as to whether simultaneous BTKA carries higher risks relative to staged BTKA.

The amount of blood transfused in our simultaneous BTKA group was more than double that in the 2 staged TKAs combined. It is intuitive that the blood loss in 2 concurrent TKAs is always more than in 1 TKA, but the clinical relevance of this fact is unknown. Transfusions have potential complications, and this risk needs to be addressed in the preoperative discussion.

LOS for simultaneous BTKA was on average 4 days shorter than the combined LOS (2 hospitalizations) for staged BTKA. This shorter LOS has been shown to provide the healthcare system with a cost savings.⁸ However, not considered in the equation is the difference in cost of rehabilitations, 2 vs 1. In the present study, 92.7% of simultaneous BTKA patients and only 50.7% of staged BTKA patients were discharged to an inpatient acute rehabilitation unit. Interestingly, the majority of the staged patients who went to inpatient rehabilitation did so after the second surgery. At our institution at the time of this study, simultaneous BTKA patients, and staged BTKA patients with the second surgery completed, were more likely than unilateral TKA patients to qualify for inpatient acute rehabilitation. Staged BTKA patients’ higher cost for 2 rehabilitations, rather than 1, adds to the cost savings realized with simultaneous BTKA. In the context of an episode-based payment system, the cost of posthospital rehabilitation enters the overall cost equation and may lead to an increase in the number of simultaneous BTKAs being performed.

Conclusion

In this study, the incidence of postoperative complications was higher for simultaneous BTKA than for staged BTKA performed <6 months apart, but the difference was not statistically significant. As expected, LOS and blood transfusion rates differed significantly between the groups. At present, only patients with ASA score 1 or 2 are considered for simultaneous BTKA at our institution. Patients with ASA score 3 or higher are not eligible.

Am J Orthop. 2017;46(4):E224-E229. Copyright Frontline Medical Communications Inc. 2017. All rights reserved.

References

1. Hootman JM, Helmick CG. Projections of US prevalence of arthritis and associated activity limitations. Arthritis Rheum. 2006;54(1):226-229.

2. Kolettis GT, Wixson RL, Peruzzi WT, Blake MJ, Wardell S, Stulberg SD. Safety of 1-stage bilateral total knee arthroplasty. Clin Orthop Relat Res. 1994;(309):102-109.

3. Kim YH, Choi YW, Kim JS. Simultaneous bilateral sequential total knee replacement is as safe as unilateral total knee replacement. J Bone Joint Surg Br. 2009;91(1):64-68.

4. Luscombe JC, Theivendran K, Abudu A, Carter SR. The relative safety of one-stage bilateral total knee arthroplasty. Int Orthop. 2009;33(1):101-104.

5. Memtsoudis SG, Ma Y, González Della Valle A, et al. Perioperative outcomes after unilateral and bilateral total knee arthroplasty. Anesthesiology. 2009;111(6):1206-1216.

6. Zeni JA Jr, Snyder-Mackler L. Clinical outcomes after simultaneous bilateral total knee arthroplasty: comparison to unilateral total knee arthroplasty and healthy controls. J Arthroplasty. 2010;25(4):541-546.

7. March LM, Cross M, Tribe KL, et al; Arthritis C.O.S.T. Study Project Group. Two knees or not two knees? Patient costs and outcomes following bilateral and unilateral total knee joint replacement surgery for OA. Osteoarthritis Cartilage. 2004;12(5):400-408.

8. Reuben JD, Meyers SJ, Cox DD, Elliott M, Watson M, Shim SD. Cost comparison between bilateral simultaneous, staged, and unilateral total joint arthroplasty. J Arthroplasty. 1998;13(2):172-179.

9. Ritter MA, Harty LD. Debate: simultaneous bilateral knee replacements: the outcomes justify its use. Clin Orthop Relat Res. 2004;(428):84-86.

10. Restrepo C, Parvizi J, Dietrich T, Einhorn TA. Safety of simultaneous bilateral total knee arthroplasty. A meta-analysis. J Bone Joint Surg Am. 2007;89(6):1220-1226.

11. Stefánsdóttir A, Lidgren L, Robertsson O. Higher early mortality with simultaneous rather than staged bilateral TKAs: results from the Swedish Knee Arthroplasty Register. Clin Orthop Relat Res. 2008;466(12):3066-3070.

12. Noble J, Goodall J, Noble D. Simultaneous bilateral total knee replacement: a persistent controversy. Knee. 2009;16(6):420-426.

13. Fortin PR, Penrod JR, Clarke AE, et al. Timing of total joint replacement affects clinical outcomes among patients with osteoarthritis of the hip or knee. Arthritis Rheum. 2002;46(12):3327-3330.

14. Hooper GJ, Rothwell AG, Hooper NM, Frampton C. The relationship between the American Society of Anesthesiologists physical rating and outcome following total hip and knee arthroplasty: an analysis of the New Zealand Joint Registry. J Bone Joint Surg Am. 2012;94(12):1065-1070.

15. MacWilliam CH, Yood MU, Verner JJ, McCarthy BD, Ward RE. Patient-related risk factors that predict poor outcome after total hip replacement. Health Serv Res. 1996;31(5):623-638.

16. Hadley SR, Lee M, Reid M, Dweck E, Steiger D. Predictors of pulmonary embolism in orthopaedic patient population. Abstract presented at: 43rd Annual Meeting of the Eastern Orthopaedic Association; June 20-23, 2012; Bolton Landing, NY.

17. Hadley S, Immerman I, Hutzler L, Slover J, Bosco J. Staphylococcus aureus decolonization protocol decreases surgical site infections for total joint replacement. Arthritis. 2010;2010:924518.

18. Singh JA, Lewallen DG. Ninety-day mortality in patients undergoing elective total hip or total knee arthroplasty. J Arthroplasty. 2012;27(8):1417-1422.e1.

19. Hu J, Liu Y, Lv Z, Li X, Qin X, Fan W. Mortality and morbidity associated with simultaneous bilateral or staged bilateral total knee arthroplasty: a meta-analysis. Arch Orthop Trauma Surg. 2011;131(9):1291-1298.

20. Meehan JP, Danielsen B, Tancredi DJ, Kim S, Jamali AA, White RH. A population-based comparison of the incidence of adverse outcomes after simultaneous-bilateral and staged-bilateral total knee arthroplasty. J Bone Joint Surg Am. 2011;93(23):2203-2213.

21. Walmsley P, Murray A, Brenkel IJ. The practice of bilateral, simultaneous total knee replacement in Scotland over the last decade. Data from the Scottish Arthroplasty Project. Knee. 2006;13(2):102-105.

22. Fu D, Li G, Chen K, Zeng H, Zhang X, Cai Z. Comparison of clinical outcome between simultaneous-bilateral and staged-bilateral total knee arthroplasty: a systematic review of retrospective studies. J Arthroplasty. 2013;28(7):1141-1147.

23. Bolognesi MP, Watters TS, Attarian DE, Wellman SS, Setoguchi S. Simultaneous vs staged bilateral total knee arthroplasty among Medicare beneficiaries, 2000–2009. J Arthroplasty. 2013;28(8 suppl):87-91.

Article PDF

ajo04604224e.pdf

Author and Disclosure Information

Authors’ Disclosure Statement: The authors report no actual or potential conflict of interest in relation to this article.

Acknowledgment: The authors thank Emmanuel Koli, BS, for his help with data collection.

Issue

The American Journal of Orthopedics - 46(4)

Publications

The American Journal of Orthopedics

MDedge Surgery

Topics

Knee

Arthroplasty/Joint Replacement

Orthopedics

Page Number

E224-E229

Read more about Is Simultaneous Bilateral Total Knee Arthroplasty (BTKA) as Safe as Staged BTKA?

Sections

Original Research

Author(s)

Scott Hadley, MD

Michael Day, MD, MPhil

Take-Home Points

Complication rates did not differ significantly between simultaneous and staged BTKA.
Combined length of stay for the 2 admissions of staged BTKA was greater than that for the single admission of simultaneous BTKA.
Transfusion requirements were greater in simultaneous BTKA.
Avoid simultaneous bilateral procedures in ASA 3 patients.
Develop institutional protocols for BTKA with multidisciplinary input.

Methods and Materials

Results

Between 2005 and 2009, 371 patients had simultaneous BTKA, and 67 had staged BTKA (134 procedures) <6 months apart (Table 1).

Discussion

Conclusion
