Citation
Gao Fei, Gao Xue, Shao Yan, et al. Application of large language models in health education for patients with diabetic retinopathy[J]. Chin J Exp Ophthalmol, 2024, 42(12):1111-1118. DOI: 10.3760/cma.j.cn115989-20240723-00207.
ABSTRACT
Objective To evaluate the accuracy, completeness, and repeatability of domestic open-source large language models (LLMs) in health education for patients with diabetic retinopathy (DR), and to explore their potential as intelligent virtual assistants for DR patient education.
Methods A total of 41 questions and answers related to the diagnosis and treatment of DR were compiled in five categories: risk factors, screening and examination, symptoms and staging, diagnosis, and treatment and prognosis. Each question was posed to the LLM twice, each time in a "new dialogue", and all answers were recorded. Three senior fundus physicians independently rated the answers on a 6-point Likert scale for accuracy and 3-point Likert scales for completeness and repeatability; for each answer, the evaluators were also asked to state a preference between the LLM-generated and the manually written answer. Five randomly selected questions were used to screen three domestic open-source LLMs, ERNIE Bot 3.5, Qwen, and Kimi chat, and the LLM with the best overall performance was evaluated further on the full question bank.
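The collection protocol above can be expressed as a short script. The sketch below is illustrative only, not taken from the paper: ask_llm is a hypothetical stand-in for a fresh "new dialogue" query to whichever model is under test, and the question list is abbreviated.

    # Hypothetical sketch of the question-collection protocol.
    def ask_llm(question: str) -> str:
        # Stand-in for a fresh "new dialogue" query to the model under test;
        # in practice this would wrap a vendor API or a scripted web session.
        return f"[model answer to: {question}]"

    QUESTIONS = [
        "What is diabetic retinopathy?",  # ... 41 items across the five categories
    ]

    records = []
    for q in QUESTIONS:
        for run in (1, 2):  # each question is asked twice, each time in a new dialogue
            records.append({"question": q, "run": run, "answer": ask_llm(q)})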
Results Among the three LLMs, Kimi chat performed best overall: on the 5 screening questions, the proportions of responses scoring 6 points for accuracy, 3 points for completeness, and 3 points for repeatability were 90%, 90%, and 100%, respectively. Across all questions, the word count of manual replies was 106 (70, 202), significantly lower than the 505 (386, 600) of Kimi chat replies (Z=-7.866, P<0.001). The length of Kimi chat replies was not significantly correlated with the accuracy score (rs=-0.044, P=0.492) but was positively correlated with the completeness score (rs=0.239, P<0.001). The intraclass correlation coefficients for accuracy and completeness scores among the three evaluators were all above 0.700, with the highest agreement for repeatability (0.853), followed by completeness of the first response (0.771). The proportion of responses scoring ≥5 points for accuracy was 87.0% (214/246), the proportion scoring ≥2 points for completeness was 98.0% (241/246), and the proportion with repeatability higher than 70% was 78.5% (193/246). Kimi chat excelled at basic questions about the disease, such as its definition, staging, screening frequency, and common risk factors, but performed poorly on questions involving treatment choices that require a doctor's professional judgment. Evaluators chose the Kimi chat response as superior for 69.5% (171/246) of answers; reasons for not choosing it included answers lacking individualization, inclusion of too much irrelevant information, and failure to answer questions requiring a high degree of medical expertise.
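For readers wishing to reproduce this style of analysis, the sketch below shows how the kinds of statistics reported above (a Mann-Whitney test on word counts, Spearman correlations, intraclass correlation coefficients) could be computed in Python. All data here are synthetic placeholders, not the study's data, and pingouin is an assumed dependency for the ICC.

    import numpy as np
    import pandas as pd
    from scipy.stats import mannwhitneyu, spearmanr
    import pingouin as pg  # assumed dependency for the ICC computation

    rng = np.random.default_rng(0)

    # Synthetic word counts: 41 manual replies vs. 82 LLM replies (41 questions x 2 runs)
    manual_words = rng.integers(70, 203, 41)
    llm_words = rng.integers(386, 601, 82)
    u_stat, p_mwu = mannwhitneyu(manual_words, llm_words)  # abstract reports Z=-7.866, P<0.001

    # Synthetic 6-point accuracy ratings: 82 replies x 3 evaluators = 246 ratings
    ratings = pd.DataFrame({
        "reply": np.repeat(np.arange(82), 3),
        "rater": np.tile(["R1", "R2", "R3"], 82),
        "accuracy": rng.integers(1, 7, 246),
    })

    # Spearman correlation between reply length and accuracy score
    rs, p_rs = spearmanr(np.repeat(llm_words, 3), ratings["accuracy"])

    # Inter-rater agreement: intraclass correlation coefficient (ICC)
    icc = pg.intraclass_corr(data=ratings, targets="reply", raters="rater", ratings="accuracy")

    print(f"Mann-Whitney U={u_stat:.1f}, P={p_mwu:.3g}")
    print(f"Spearman rs={rs:.3f}, P={p_rs:.3g}")
    print(icc[["Type", "ICC"]])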
Conclusions Kimi chat answers questions related to the diagnosis and treatment of DR in a detailed and well-organized manner, with high accuracy, completeness, and repeatability.