The performance of ChatGPT in day surgery and pre-anesthesia risk assessment: a case-control study of 150 simulated patient presentations

Abstract

Background

Day surgery has developed rapidly in China in recent years, although it still faces a shortage of anesthesiologists to handle pre-anesthesia routine before surgery. We hypothesized that ChatGPT may assist anesthesia practitioners in preoperative assessment and answer questions on the concerns of patients. The aims of this study were to examine the ability of ChatGPT to assess preoperative risk and determine its accuracy in answering questions regarding knowledge and management of day surgery anesthesia.

Methods

One hundred fifty patient profiles were generated to simulate day surgery patient presentations that involved complications of varying acuity and severity. The ChatGPT group and the expert group were both required to evaluate the profiles of 150 simulated patients to determine their ASA-PS classification and whether day surgery was recommended. ChatGPT was then asked to answer 131 questions about day surgery anesthesia that represented the most common issues encountered in clinical practice. The performance of ChatGPT was assessed and graded independently by two experienced anesthesiologists.

Results

A total of 150 patient profiles were included in the study (75 males [50.0%] and 75 females [50.0%]). There was no difference between the ChatGPT group and the expert group for the ASA-PS classification and assessment of anesthesia risk in the patient profiles (P > 0.05). Regarding recommendation for day surgery in patients with certain comorbidities (ASA ≥ II), the expert group was more inclined to require further examination or treatment. ChatGPT reached these more cautious conclusions less often than the experts (i.e., ChatGPT n (%) vs. expert n (%): day surgery can be performed, 67 (47.9) vs. 31 (25.4); needs further treatment and evaluation, 56 (37.3) vs. 66 (44.0); and day surgery is not recommended, 18 (12.9) vs. 29 (9.3), P < 0.05). We showed that ChatGPT had extensive knowledge related to day surgery anesthesia (94.0% correct), with most of the points (70%) considered comprehensive. The performance of ChatGPT was also better in the domains of peri-anesthesia concerns, lifestyle, and emotional support.

Conclusions

ChatGPT can assist anesthesia practitioners and surgeons by alerting them to the ASA-PS classification and assessing perioperative risk in day surgery patients. ChatGPT can also be trusted to answer questions and concerns related to pre-anesthesia and therefore has the potential to provide important assistance in clinical work.

Introduction

Day-case surgery involves the management of a patient who is scheduled to be admitted and discharged on the same day (Bailey et al. 2019). This process involves admitting patients for investigation or an operation on a planned, nonresidential basis, with the provision of adequate facilities for recovery in a ward or separate unit (Goodwin et al. 1992). Historically, patients were typically hospitalized for surgical procedures and remained there until they regained independence, were able to walk on their own, and had their stitches removed. This practice was attributed to a lack of comprehensive healthcare within the community, suboptimal home environments for patient care, a reduction in incomplete wound healing, and elevated rates of anesthetic and surgical complications (Ojo et al. 2010). The benefits of early mobilization after an operation are now well recognized and minimally invasive surgery is well established, resulting in more procedures being performed as day surgery (Albornoz et al. 2011).

Several factors should be taken into account when considering the characteristics of the patient and selection of the procedure suitable for day surgery (Albornoz et al. 2011). Firstly, the procedure should carry a minimal risk of severe postoperative complications. Secondly, postoperative symptoms should typically be manageable using oral medications or local anesthesia techniques. Thirdly, patients should be capable of independent movement prior to their discharge. Day surgery therefore facilitates a more rapid recovery, imposes less disruption on patients’ lives, and also diminishes the risk of hospital-acquired infections.

Pre-anesthetic assessment is carried out to determine whether the planned procedure is appropriate for the patient and has an acceptably low risk for perioperative complications. The assessment of anesthesia risk prior to surgery is essential to identify and exclude patients with medium- to high-risk profiles who are considered unsuitable for day-case procedures in outpatient clinics, emergency departments, or primary care settings. Of these evaluations, the American Society of Anesthesiologists-Physical Status (ASA-PS) classification is the most basic and effective (Voney et al. 2007). The ASA-PS classification is a suitable index for assessing the physical status of surgical patients and predicting adverse events during surgical anesthesia, such as the length of stay in the operating room, duration of anesthesia and surgery, and blood loss and infections, with these factors resulting in higher ASA-PS classification (Ansell et al. 2004). The increasing number of patients requiring day-case surgery and anesthesia poses a growing challenge of ensuring adequate preoperative assessment. The large population of China has resulted in a substantial demand for surgeries, although this is hindered by a shortage of anesthesiologists.

Large language models (LLM) are sophisticated artificial intelligence (AI) systems that utilize deep neural networks to learn and generate human-like natural language through training on extensive text datasets (Thirunavukarasu et al. 2023). The Chat Generative Pre-trained Transformer (ChatGPT) was developed in recent years and is considered to have the ability to help doctors relieve the pressure of preoperative assessments. ChatGPT is an advanced natural language processing (NLP) model pioneered by OpenAI (Ali et al. 2023). The model exhibits a human-like capacity for generating text, especially in the domain of chatbot dialogues. Many medical professionals are already investigating the possibility of using ChatGPT in clinical work (Grünebaum et al. 2023; Lahat et al. 2023). However, to date there have been no extensive studies that have evaluated the effectiveness of ChatGPT in answering questions accurately and holistically about day-case surgery based on preoperative assessments.

The day surgery procedures performed at our medical center are typically straightforward and uncomplicated, although patients with more complex conditions require a greater allocation of labor costs. The aim of the current study was to use ChatGPT to assess the physical status of simulated patients prior to surgery and to examine the dependability and precision of ChatGPT in the preoperative evaluation of anesthesia. The precision, comprehensiveness, and consistency of ChatGPT’s responses to frequently asked questions about treatment and patient care prior to anesthesia were also assessed.

Methods

OpenAI has released ChatGPT-4. The current study used the ChatGPT-4 online interface (ChatGPT Version 4.0), a model trained by OpenAI. ChatGPT was questioned on the patient profiles, including medical history, physical examination, current vital signs, and the results of diagnostic tests. A panel of three anesthesiologists, all of whom had over 10 years of clinical experience, then evaluated the correctness and accuracy of ChatGPT’s responses.

Ability of ChatGPT for preoperative risk assessment

Hypothetical and standardized patient profiles were generated in order to simulate day surgery patient presentations for each of the defined complications according to the checklist components of the process-oriented score (PRO-score) for risk evaluation of adults (Vogelsang et al. 2020). The score is based on national and international guidelines for preoperative risk assessment of adult patients and contains current guidelines and the assessment of vital signs (Vogelsang et al. 2020; Kristensen et al. 2014). The contents are expressed as 234 single parameters, with these items sorted by organ systems and following the airway (A), breathing (B), circulation (C), disability (D), and endocrinology (E) scheme. Overall, a total of 150 patient profiles and presentations were generated for the purpose of this study (Supplementary Table 1).

Based on the patient information we provided, such as history, symptoms, and test results, ChatGPT drew conclusions about the patient’s ASA-PS classification (Doyle et al. 2024) (Table 1) and also determined whether they could undergo the day surgery procedure. For the expert group, two anesthesiologists were invited to analyze the ASA-PS classification of the patients and conclude whether day surgery could be performed, with a third experienced expert presiding over controversial results.
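The expert workflow described above (two independent raters, with a third expert deciding disputed cases) can be sketched in a few lines. This is only an illustrative sketch: the function name `consensus` and the string labels are assumptions made for the example and are not taken from the study.

```python
from typing import Optional


def consensus(rater_a: str, rater_b: str, adjudicator: Optional[str] = None) -> str:
    """Return the agreed label, deferring to a third reviewer on disagreement.

    Hypothetical helper: labels could be ASA-PS classes ("I", "II", ...)
    or day-surgery recommendations.
    """
    if rater_a == rater_b:
        # Both raters agree, so no adjudication is needed.
        return rater_a
    if adjudicator is None:
        # A disputed case cannot be resolved without the third reviewer.
        raise ValueError("raters disagree; an adjudicator decision is required")
    return adjudicator


# Example usage with made-up ratings.
agreed = consensus("II", "II")
resolved = consensus("II", "III", adjudicator="III")
```

The same rule applies wherever two independent assessments are reconciled by a third reviewer, including the grading of ChatGPT’s answers.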

Table 1 ASA-PS classification: physical status of the patients as classified by the American Society of Anesthesiologists

In order to ensure the uniformity of the patient cohort, those with complex conditions were not simulated in our selection: (1) patients with difficult airways caused by anatomical problems in the upper respiratory tract, (2) comatose patients, and (3) patients with heart murmurs. The following were the reasons why these three types of factors were not selected:

  1. Due to the invention of video laryngoscopy, common difficult airways (i.e., small mandibles or short spacing) are no longer challenging problems in airway management. However, surgery and tumor space-occupying lesions remain difficult to treat. Since the identification of difficult airways in patients requires professional anesthesiologists, it was not conducive to providing clinical treatment for the problems described in this experiment.

  2. Patients with a high Glasgow score or coma were not admitted to the day surgery clinic.

  3. Heart murmurs are more subjective and could not be evaluated scientifically and so were assessed by more objective echocardiography.

Performance of ChatGPT for answering questions regarding anesthesia

ChatGPT was instructed to answer frequently asked questions about knowledge or concerns related to peri-anesthesia. These questions were collected from online sources and from professional associations and institutions, and also reflected the conditions most commonly encountered in clinical practice. Each question was entered as a separate, independent prompt using the “New Chat” function. A total of 145 questions about day surgery anesthesia were collected. Of these, 11 similar or duplicate questions were deleted, and the remaining 134 questions were answered by ChatGPT. Two questions with no clear answer and one question unrelated to the topic were then excluded, leaving a total of 131 questions in the final analysis (Fig. 1, Supplementary Table 2).

Fig. 1

Flow chart of question selection for pre-anesthesia assessment. Frequently asked questions about knowledge or concerns related to peri-anesthesia were collected from questions on networks, professional associations and institutions, and experiences in clinical practice

We considered the 131 questions to be unique and pertinent and to encompass a wide array of physical status evaluations. We therefore categorized the 131 questions into 5 distinct groups: (1) basic knowledge: anesthesia procedures, including issues related to anesthesia education (e.g., whether to continue taking high blood pressure medication) and postoperative rehabilitation (e.g., how long will I be awake after anesthesia?); (2) peri-anesthesia concerns and preoperative preparation (e.g., why do I need to fast before surgery?); (3) emotional support, including problems related to surgery or anesthesia (e.g., how does anesthesia affect my body?); (4) lifestyle (e.g., I have insomnia, does it matter after anesthesia?); and (5) others.

The responses of ChatGPT to the 131 questions were graded independently by 2 anesthesiologists, and if there were controversial results, these were resolved by a third reviewer. The performance of ChatGPT was graded and divided into four types: (1) Comprehensive, (2) correct but inadequate, (3) mixed with correct and incorrect/outdated data, and (4) completely incorrect.

Statistical analysis

The data collected were analyzed using standard statistical methods. All the calculations were performed using IBM SPSS Statistical Package version 28. Descriptive statistics were calculated to describe the baseline characteristics of the simulated patients. Categorical variables were expressed as frequencies and percentages.

The differences in ASA assessment results between the ChatGPT group and the expert group were analyzed using the Mann-Whitney U-test. The results of the conclusion of recommendations for day surgery between the two groups were analyzed by the chi-square test. All patients who were labelled ASA ≥ II during the initial risk assessment were later re-evaluated, with differences in the recommendations between the two groups analyzed by the chi-square test. A P-value < 0.05 between the two groups was considered to indicate a statistically significant difference.
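As a rough illustration of the two tests named above, the following sketch computes the Pearson chi-square statistic for a contingency table and the Mann-Whitney U statistic by pair counting. It is not the SPSS procedure used in the study. The function names and the example counts are hypothetical, and the closed-form p-value applies only to 2 degrees of freedom, which is the case for a table of two groups by three recommendation categories.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and category.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi_square_p_value_df2(stat):
    """Upper-tail p-value of the chi-square distribution; exact only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y; each tied pair counts one half."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical counts (NOT the study data): rows are the two groups,
# columns are the three recommendation categories.
example_table = [[40, 45, 15], [30, 50, 20]]
example_stat, example_dof = chi_square_statistic(example_table)
example_p = chi_square_p_value_df2(example_stat)

# Hypothetical ASA-PS classes coded as integers for two small groups.
example_u = mann_whitney_u([1, 2, 2, 3], [2, 2, 3, 3])
```

A statistical package would additionally convert U into a p-value with a tie correction, which matters here because ASA-PS classes produce many tied ranks.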

The proportions of each aforementioned grading for responses of each pre-anesthesia domain were calculated and reported as percentages.
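The per-domain tabulation can be sketched as follows. The four grade labels follow the grading scheme described above, but the function name, the data layout, and the example records are assumptions made purely for illustration and do not reproduce the study data.

```python
from collections import Counter, defaultdict

# The four grading categories used for ChatGPT's answers.
GRADES = (
    "comprehensive",
    "correct but inadequate",
    "mixed correct and incorrect/outdated",
    "completely incorrect",
)


def grade_percentages(records):
    """Map each domain to the percentage of answers in every grade.

    `records` is an iterable of (domain, grade) pairs, one per graded answer.
    """
    counts = defaultdict(Counter)
    for domain, grade in records:
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade!r}")
        counts[domain][grade] += 1
    result = {}
    for domain, counter in counts.items():
        total = sum(counter.values())
        # Grades with no answers are reported explicitly as 0.0.
        result[domain] = {g: round(100.0 * counter[g] / total, 1) for g in GRADES}
    return result


# Hypothetical graded answers (NOT the study data).
example_records = (
    [("basic knowledge", "comprehensive")] * 7
    + [("basic knowledge", "correct but inadequate")] * 2
    + [("basic knowledge", "mixed correct and incorrect/outdated")]
    + [("lifestyle", "comprehensive")] * 4
)
example_summary = grade_percentages(example_records)
```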

Results

Assessments of the 150 simulated patients (75 males [50.0%] and 75 females [50.0%]) were performed during the study. The characteristics of the patients including sex, age, and BMI are shown in Table 2. The comorbidities (e.g., airway-related diseases, pulmonary inhalation risk, respiratory diseases, circulation-related diseases, neurological disorders, endocrine/blood system diseases, allergic history, and operation history) of the simulated patients are listed in Table 3.

Table 2 Characteristics of the simulated patients
Table 3 Comorbidities of the simulated patients

The two experts assigned different ASA-PS classes in three cases (3/150, 2%) and reached different conclusions regarding the feasibility of day surgery in seven cases (7/150, 4.7%). A third expert was required to reach a consensus for these disputed conclusions. There were no significant differences between the ChatGPT and expert responses in the majority of cases (P = 0.064) (Table 4). However, there were some differences between ChatGPT and the experts regarding the conclusion as to whether day surgery could be performed after comprehensive consideration of the patients’ conditions (ASA ≥ II).

Table 4 Concordance between ChatGPT and expert preoperative assessment

For patients with certain comorbidities (ASA ≥ II), the expert group was more inclined to evaluate whether the patient was suitable for day surgery after further examination or treatment or considered that the patient’s current physical condition was not suitable for day surgery. In contrast, this proportion of conclusions made by ChatGPT was smaller (ChatGPT n (%) vs. expert n (%): day surgery can be performed, 67 (47.9) vs. 31 (25.4); needs further treatment and evaluation, 56 (37.3) vs. 66 (44.0); day surgery is not recommended, 18 (12.9) vs. 29 (9.3), P = 0.001, < 0.05) (Table 5).

Table 5 Comparison of assessment results for patients with comorbidities

ChatGPT answered 131 questions that were then evaluated by three anesthesiologists. For the answers of ChatGPT on basic knowledge related to anesthesia, the anesthesiologists considered that 70% were comprehensive, 24% were correct but inadequate, and 6% were mixed with correct and incorrect/outdated data. For peri-anesthesia concerns, 95.3% were comprehensive, and 4.7% were correct but inadequate. All of ChatGPT’s responses to questions regarding emotional support (100%) were graded as comprehensive by the experts. Although there was no standard answer to questions on emotional support, it was possible to judge ChatGPT’s potential in this area. For lifestyle and other questions, 91.7% of ChatGPT’s answers were comprehensive. The expert panel did not consider any of the answers to be completely incorrect. Taken together, these results indicated that although the grades for basic knowledge were relatively low, ChatGPT still had a certain reference value for solving problems related to anesthesia (Fig. 2).

Fig. 2

Grading of responses by ChatGPT to questions related to peri-anesthesia. The percentage of responses graded as comprehensive, correct but inadequate, mixed with correct and incorrect/outdated data, and completely incorrect are provided

Discussion

It is widely recognized that China has a vast population, yet the nation continues to grapple with significant medical challenges, particularly due to a shortage of anesthesiologists. At present, the number of day surgery procedures is increasing, and therefore, preoperative evaluation is a very important part of anesthesia (Ojo et al. 2010). Currently, ChatGPT’s exploration in the medical field is still focused mainly on medical education and scientific writing, and there is relatively little use of it in clinical and research scenarios (Kung et al. 2023; Shay et al. 2023). One of the key benefits of ChatGPT is its ability to provide instant, accurate, and personalized responses to a wide range of questions related to health care (Cascella et al. 2023; Liu et al. 2023; Odom-Forren 2023). A study by Gupta et al. (2024) searched the database to determine how ChatGPT could be helpful to anesthesia providers, including preoperative management, ICU management, pain management, and palliative care. The results of Gupta’s study showed that ChatGPT can be extremely useful for anesthesiologists, especially for determining the dose of anesthetics, assisting in retrieving research materials, or providing guidance on how to perform certain procedures.

ASA is an important index for preoperative evaluation of both anesthesia and surgical risk and has been used widely, resulting in the index being recognized throughout the world (Riley et al. 2014; Mayhew et al. 2019). Lim et al. (2023) used ChatGPT to evaluate 10 standardized hypothetical patient scenarios and suggested that ChatGPT was able to classify ASA-PS consistently and correctly in multiple simulated patient scripts with appropriate justification and had similar performance to that of human anesthesiologists in the majority of cases. Our current study expanded the size of the patient cohort, broadened the spectrum of diseases under investigation, and showed that ChatGPT had a significant degree of utility for assessing the physical status of patients according to the ASA classification system, with its evaluations aligning largely with those of an expert panel.

This study used ChatGPT to analyze data of the patient’s medical history, examination outcomes, surgical procedures, and anesthesia techniques. Leveraging this data, surgeons and anesthesiologists can acquire suitable risk assessment indicators, thereby saving a significant amount of energy and time. There were no significant differences between ChatGPT and the responses of the experts in the majority of cases. This underscores ChatGPT’s proficiency for evaluating the physical condition of the simulated patients and achieving the correct ASA ratings. This ability of ChatGPT makes it an exceptionally efficient method for preoperative evaluation. As far as we are aware, this is the first study to assess the ability of ChatGPT to make ASA grading and preoperative evaluation of patients, a function that would have major clinical value.

To guarantee that all the anesthesiologists use the same criteria as ChatGPT for considering suitability for day surgery, experts need to make their judgments based on the standardized guidelines for day-case surgery 2019 (Bailey et al. 2019). In our current study, ChatGPT and the clinical experts may have had different views as to whether patients were eligible for day surgery procedures in some conditions. ChatGPT mostly recommended patients for day surgery directly after assessing their physical condition, surgical method, and anesthesia risk. For patients with an ASA ≥ II, the panel preferred to recommend further examination and treatment before considering the suitability for day surgery. For more seriously ill patients, the panel also recommended canceling day surgery at a higher rate.

Do these results indicate that ChatGPT is more liberal when considering the risks of anesthesia surgery while the panel is more conservative? We considered the reasons for this difference may be as follows:

  1. The difference is possibly related to the working habits of each expert group, with some groups applying stricter indications for day surgery than others. Ansell and his colleagues carried out a retrospective case-controlled review of 896 ASA III patients who had undergone day case procedures and concluded that with good pre-assessment and adequate preparation, these patients could be treated safely in the day surgery setting (Ansell et al. 2004). Alternatively, Rasmussen considered that fitness for a procedure should relate to the patient’s functional status as determined by a pre-anesthetic assessment, and not by ASA physical status, age, or body mass index (Rasmussen et al. 2015).

  2. ChatGPT analyzes a patient’s objective indicators with the conclusion made after synthesizing all these indicators, thereby defining the reference value.

  3. The medical staff could not consider which decision was right or wrong, with the actual decision based on either the surgeon or anesthesiologist’s understanding of the guidelines or the patient’s condition.

Although our study showed the benefits of ChatGPT as a tool, there remain problems that need to be considered. Firstly, the correctness and validity of the content must be considered, as incorrect content may mislead the patient. While ChatGPT can provide a great deal of information and helpful assistance, at present, it cannot completely replace human healthcare workers in all situations. Unlike with search engines, it is not possible to trace the source of ChatGPT’s information. Taken together, these findings indicate that ChatGPT sometimes answers questions incorrectly, with the information it uses currently not updated since 2021. In addition, ChatGPT cannot access the Internet in real time. To overcome these shortcomings, manual auditing can be used to screen the content generated and allow its accuracy to be judged (Lee et al. 2023).

Secondly, attention should be paid to ethical and privacy issues. During the process of communicating with ChatGPT, patients provide their personal basic information and medical conditions, sometimes including sensitive pictures of their private parts. Although ChatGPT claims that it does not save conversations with users, it needs to be understood that sensitive health information may be damaged or abused during transmission and browsing. It is therefore necessary to implement sound data protection measures, including encryption of sensitive information and secure data transmission. In addition, because ChatGPT has extremely rich emotional value, it is necessary to be careful that anxious patients do not become psychologically dependent on this “friend”. Strict ethical and privacy regulations therefore need to be established to limit the scale of information input and emotional output of ChatGPT.

ChatGPT has the potential to revolutionize the way programs are evaluated for patients by providing accurate and effective clinical help. As with any new technology, there are shortcomings that need to be addressed, although the potential benefits of ChatGPT in the field of ambulatory surgical evaluation are enormous. Development is the order of the day, and therefore, healthcare workers need to keep up with this trend and explore this promising area of technology. In this regard, it has been proposed that LLMs such as ChatGPT have the potential to add a new dimension to solving clinical problems. It is also important to realize that ChatGPT is just a machine and cannot replace the humanity and compassion that are so essential to our profession (Odom-Forren 2023).

ChatGPT can therefore be regarded as impartial and potentially offers a sense of reassurance. In a profit-oriented healthcare system, such as that of the USA, it is evident that financial incentives can influence the guidance provided to patients. Therefore, having an independent assessor to oversee medical assessments and decisions would be highly beneficial. As we continue to explore the possibilities of AI in health care, it is important to embrace these new technologies and use them to augment, rather than replace, important clinical work.

While our research offers valuable insights, it is important to acknowledge its limitations. We used simulated patient data, which, although closely mirroring real patients, inevitably differ from them in certain aspects. This discrepancy may introduce some degree of error that future studies using real patient data could address. In addition, the limited number of simulated cases restricted the generalizability of our findings. Expanding the case pool would therefore enhance the robustness of our conclusions. Furthermore, our study focused solely on day surgery patients. Further research is needed to assess the applicability of ChatGPT in pre-anesthetic evaluation of major surgery or emergency procedures.

Conclusions

ChatGPT can assist anesthesia practitioners and surgeons by alerting them to the ASA-PS classification and assessing perioperative risk in patients. ChatGPT can also be trusted to answer questions and concerns related to pre-anesthesia, thereby providing important assistance in clinical work.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

ChatGPT:

Chat Generative Pre-trained Transformer

ASA-PS:

American Society of Anesthesiologists-Physical Status

NLP:

Natural language processing

AI:

Artificial intelligence

LLM:

Large language model

Acknowledgements

The authors would like to express their gratitude to EditSprings (https://www.editsprings.cn) for the expert linguistic services provided.

Funding

This work was supported by the Shanghai Municipal Jiading District New Key Subject Program (No. 2020-jdyxzdxk-03) and Shanghai Jiading District Health Commission Traditional Chinese Medicine Project (Youth) (2022-QN-ZYY-03).

Author information

Contributions

TTC and YL helped design the study, conduct the study, collect data, analyze the data, and prepare the manuscript. JQG helped design the study, analyze the data, and prepare the manuscript. YBH helped collect data and prepare the manuscript. GBH and PPZ helped collect data and prepare the manuscript. SYL and HX helped design the study, conduct the study, collect data, analyze the data, and prepare the manuscript. YB and XJW attest to the integrity of the data, approved the final manuscript, and are the archival authors. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yang Bao or Xuejun Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13741_2024_469_MOESM1_ESM.docx

Additional file 1: Supplementary Table 1. Summary of the clinical profiles of the 150 patients. Supplementary Table 2. Grading of the responses from ChatGPT on peri-anesthesia concerns.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Cheng, T., Li, Y., Gu, J. et al. The performance of ChatGPT in day surgery and pre-anesthesia risk assessment: a case-control study of 150 simulated patient presentations. Perioper Med 13, 111 (2024). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13741-024-00469-6

  • DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13741-024-00469-6
