Table-based language models for ophthalmology assessment in the emergency department
- Juan M. Lavista Ferres,
- Shu Feng,
- Mary Kim,
- Nadia Popovici,
- Lauren Lee,
- Kaden Moore,
- Karine D. Bojikian
Graefes Arch Clin Exp Ophthalmol
Purpose
General-domain large language models (LLMs) have emerged as valuable tools in healthcare; however, their ability to understand and perform tasks based on data stored in tabular form has not been explored in ophthalmology. We aimed to assess the performance of OpenAI’s Generative Pre-trained Transformer 4o (GPT-4o) on real emergency department (ED) eye-related encounters extracted from electronic medical records (EMRs) in tabular format.
Methods
We input an Excel spreadsheet containing data on 1,419 unique eye-related ED encounters into GPT-4o via Microsoft’s Azure OpenAI Service using chain-of-thought (CoT) prompting, under three input scenarios: (1) chief complaint (CC), history of present illness (HPI), and eye examination; (2) CC and eye examination; and (3) eye examination only. We then evaluated the LLM’s diagnosis and assessment performance on the presented data. GPT-4o answers were reviewed by board-certified ophthalmologists and classified as follows: (1) GPT-4o provided a correct diagnosis and assessment; (2) GPT-4o provided an incorrect diagnosis and assessment; (3) GPT-4o was unable to provide a correct diagnosis because the encounter documentation was incorrect; or (4) GPT-4o was unable to provide a correct diagnosis because ancillary tests were required. A sample of encounters was reviewed by a second board-certified ophthalmologist to assess inter-grader agreement. Accuracy rates were used to evaluate performance and to compare statistical significance across scenarios. A second round of CoT prompting was performed after providing the LLM with the final encounter diagnosis, to evaluate disagreements and inconsistencies between the presented documentation and the reported diagnosis.
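For concreteness, a pipeline of this kind might look roughly like the following Python sketch, which uses the openai SDK’s Azure client and pandas. The deployment name, spreadsheet column names, file name, and prompt wording are illustrative assumptions, not the study’s exact implementation.

```python
# Minimal sketch: sending one tabular ED encounter to GPT-4o on Azure
# with chain-of-thought prompting. Column names, deployment name, and
# prompt wording are assumptions for illustration only.
import os

import pandas as pd
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

# One encounter per row; hypothetical file and column names.
df = pd.read_excel("ed_eye_encounters.xlsx")


def assess_encounter(row, scenario_cols):
    """Build a CoT prompt from the selected columns and query GPT-4o."""
    encounter_text = "\n".join(f"{col}: {row[col]}" for col in scenario_cols)
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name (assumption)
        messages=[
            {
                "role": "system",
                "content": "You are an ophthalmology consultant in the ED.",
            },
            {
                "role": "user",
                "content": (
                    f"{encounter_text}\n\n"
                    "Think step by step, then state your most likely "
                    "diagnosis and assessment."
                ),
            },
        ],
    )
    return response.choices[0].message.content


# Scenario (2): chief complaint + eye examination only.
print(assess_encounter(df.iloc[0], ["chief_complaint", "eye_exam"]))
```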
Results
GPT-4o (CoT) overall accuracy was 0.76 (95% confidence interval [CI], 0.74–0.79); no significant difference in accuracy was found when GPT-4o was presented with CC, HPI, and eye findings vs. CC and eye findings vs. eye findings only (P = 0.675). The inter-grader agreement kappa was 0.841 (P < 0.001). GPT-4o identified that 6.6% of all encounters lacked EMR documentation supporting the final encounter diagnosis. When encounters with incorrect EMR documentation and encounters requiring ancillary tests (5.2%) were excluded, GPT-4o accuracy was 0.87 (95% CI, 0.85–0.89).
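The summary statistics above can be reconstructed approximately as follows. The abstract does not state the interval method, so a Wilson interval is assumed here, and the grader labels in the kappa example are hypothetical.

```python
# Sketch of the reported summary statistics: overall accuracy with a
# 95% CI and inter-grader Cohen's kappa. Interval method and grader
# labels are assumptions for illustration only.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.proportion import proportion_confint

n_total = 1419
n_correct = round(0.76 * n_total)  # reconstructed from the 0.76 rate

acc = n_correct / n_total
lo, hi = proportion_confint(n_correct, n_total, alpha=0.05, method="wilson")
print(f"accuracy = {acc:.2f} (95% CI {lo:.2f}-{hi:.2f})")

# Inter-grader agreement on the four review categories (hypothetical data).
grader1 = np.array([1, 1, 2, 3, 1, 4, 1, 2])
grader2 = np.array([1, 1, 2, 3, 1, 1, 1, 2])
print(f"kappa = {cohen_kappa_score(grader1, grader2):.3f}")
```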
Conclusions
GPT-4o could accurately synthesize tabular data and provide assessments and diagnoses in real-world ophthalmology encounters, in addition to identifying encounters whose documentation did not support the final ED encounter diagnosis. This capability has the potential to support the clinician’s diagnosis.