Musing 97: SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot
Interesting paper out of the Institute of Automation at the Chinese Academy of Sciences and Peking Union Medical College Hospital
Today’s paper: SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot. Wu et al. 6 Dec. 2024. https://arxiv.org/pdf/2412.05187
There were a number of good papers to choose from for today’s Substack, and I was initially tempted to go for something I was familiar with, or a piece from a big-name researcher. Ultimately, however, I decided to go for this piece because, as even the subtitle should make evident, it covers a potential application that could really revolutionize the operating room in hospitals and, more importantly, at least one author is from a medical establishment. I should state at the outset that the paper does not yet claim that we can automate surgeons. What it does propose is a surgery ‘copilot’ that could “systematically enhance the cognitive capabilities of surgeons in immersive surgical simulations.” Let’s get started.
Surgical interventions are among the most complex and high-stakes scenarios in medicine, with outcomes that directly influence treatment effectiveness and patient quality of life. Neurosurgical procedures in particular involve highly intricate workflows and complex decision-making across multiple stages. This procedural complexity can place significant cognitive demands on surgical teams, who must manage multiple streams of information while maintaining precision in their actions. Such cognitive burdens can elevate the risk of surgical errors, with potentially severe consequences for patient outcomes.
The authors propose SurgBox, an agent-driven framework designed to systematically enhance the cognitive capabilities of surgeons through immersive surgical simulations. Their framework employs LLM agents enhanced with specialized Retrieval-Augmented Generation (RAG) banks to authentically replicate various surgical roles, including chief surgeon, assistant surgeon, nurses, and anesthetists. Jumping ahead, the figure below shows it in action, and its capability compared to models like GPT-4 and Llama:
While the figures below are clunky, they try to express the simulation-like environment that the framework embodies. To amplify the training benefits of SurgBox and further reduce cognitive load during live surgeries, the authors devise the Surgery Copilot, the first AI-driven assistant designed to actively support surgical decision-making and workflow management in real time. This specialized agent helps surgeons maintain situational awareness by effectively coordinating and filtering information streams, providing contextually relevant guidance, and proactively identifying potential risks before they escalate into complications.
As shown in the figure below, SurgBox structures each operation into distinct stages and subtasks to promote collaborative communication. This allows multiple rounds of interaction between different roles, and lets solutions be proposed and verified at each stage. The simulation encompasses the entire surgical process, segmented into three main stages: Preoperative, Intraoperative, and Postoperative. Each stage comprises specific tasks and participants, following the patient’s progression through key phases including transfer, anesthesia, surgical preparation, surgical operation, and postoperative care, with each phase involving the corresponding medical personnel.
Let’s take an example to demonstrate. Consider a simulated neurosurgery procedure in SurgBox: During the preoperative phase, the Chief Surgeon reviews the patient’s MRI scans and medical history, collaborating with the Anesthetist to develop a tailored anesthesia plan. The Scrub Nurse prepares the surgical instruments based on the procedure requirements. In the intraoperative phase, the Chief Surgeon performs the operation step by step according to the surgical plan, communicating with the Surgical Assistant for auxiliary operations. The anesthetist continuously monitors the patient’s vital signs, adjusting anesthesia as needed, while the Scrub Nurse anticipates and provides necessary instruments. Post-surgery, the team transitions the patient to recovery, with Nurses monitoring vital signs and managing pain according to the anesthetist’s instructions. Throughout this process, each role accesses its specialized knowledge base to inform decisions and actions, resulting in a highly realistic and educational simulation of the entire surgical experience.
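To make the stage-and-role structure above a little more concrete, here is a minimal sketch of how such a simulation loop might be organized. It is not the authors' implementation: the knowledge bank contents, the keyword-overlap `retrieve` function (a stand-in for a real RAG lookup), and names such as `STAGES` and `run_simulation` are all my own assumptions for illustration, and no LLM is actually called.

```python
from dataclasses import dataclass

# Hypothetical role-specific knowledge banks (assumption: the paper's
# actual RAG banks are far richer and are not reproduced here).
KNOWLEDGE_BANKS = {
    "Chief Surgeon": ["Review MRI scans and medical history before surgery."],
    "Surgical Assistant": ["Support the chief surgeon with auxiliary operations."],
    "Anesthetist": ["Monitor vital signs and adjust anesthesia as needed."],
    "Scrub Nurse": ["Prepare instruments based on the procedure requirements."],
    "Nurse": ["Monitor vital signs and manage pain during recovery."],
}

# Stages and participating roles, following the example in the text.
STAGES = {
    "Preoperative": ["Chief Surgeon", "Anesthetist", "Scrub Nurse"],
    "Intraoperative": ["Chief Surgeon", "Surgical Assistant", "Anesthetist", "Scrub Nurse"],
    "Postoperative": ["Nurse", "Anesthetist"],
}


@dataclass
class Turn:
    stage: str
    role: str
    message: str


def retrieve(role: str, query: str) -> str:
    """Toy retrieval: return the note in the role's bank that shares the
    most words with the query. A real system would use a RAG pipeline."""
    query_words = set(query.lower().split())
    notes = KNOWLEDGE_BANKS.get(role, [])
    return max(
        notes,
        key=lambda note: len(query_words & set(note.lower().split())),
        default="",
    )


def run_simulation(case_summary: str) -> list[Turn]:
    """Walk through the stages in order; each role contributes one turn
    grounded in its own knowledge bank."""
    transcript = []
    for stage, roles in STAGES.items():
        for role in roles:
            transcript.append(Turn(stage, role, retrieve(role, case_summary)))
    return transcript


transcript = run_simulation("patient MRI scans and vital signs")
for turn in transcript:
    print(f"[{turn.stage}] {turn.role}: {turn.message}")
```

In a fuller version, each turn would presumably be produced by an LLM prompted with the retrieved notes and the conversation so far, with several rounds per stage rather than one turn per role.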
The Surgery Copilot functions as an intelligent virtual assistant, integrated into the SurgBox ecosystem to enhance surgical performance and outcomes. Its primary roles encompass real-time guidance, decision support, and adaptive learning, as shown in the figure below. The Copilot continuously monitors the surgical procedure, analyzing the actions of the different operating room roles to provide contextual insights and recommendations. It offers step-by-step guidance, alerts the team to potential risks, and suggests optimal techniques based on the current surgical context and patient-specific factors. It leverages a vast database of surgical experiences, constantly updated with the latest medical research and best practices, to offer evidence-based recommendations tailored to each unique surgical scenario.
To experimentally demonstrate the approach, the authors collect a dataset of 128 real clinical surgery reports, as shown below. To enhance physician agent performance, they enrich the operative reports with additional contextual information, including basic information, patient history, and MRI findings. The authors also use a Simulated Surgery Report Dataset that consists of 1,000 simulated surgical reports generated through multiple SurgBox simulation processes.
The performance of SurgBox is assessed using two specialized evaluation metrics:
Surgical Route Accuracy: This metric measures the capability of LLM-based agents to determine the optimal surgical route for a given patient’s condition. The system’s selections are compared against the judgments of experienced neurosurgeons.
Surgical Plan Accuracy: This metric evaluates the ability of LLM-based agents to accurately plan and execute the entire surgical procedure.
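As a rough illustration of the first metric, route accuracy can be thought of as the fraction of cases where the agent's chosen route matches the expert's. The sketch below assumes exact label matching after normalization, and the labels "route A" and "route B" are placeholders of mine, not routes from the paper. Plan accuracy would need a scoring rubric for whole plans that is not described here, so I leave it out.

```python
def route_accuracy(predicted: list[str], expert: list[str]) -> float:
    """Fraction of cases where the predicted surgical route matches the
    expert neurosurgeon's judgment (case-insensitive exact match)."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not expert:
        return 0.0
    matches = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expert)
    )
    return matches / len(expert)


# Placeholder labels for four hypothetical cases (not data from the paper).
agent_routes = ["route A", "route B", "route A", "route A"]
expert_routes = ["route A", "route A", "route A", "route B"]
score = route_accuracy(agent_routes, expert_routes)
print(score)
```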
The table below shows that SurgBox consistently maintained a superior completion rate, particularly in Stages 2 and 3. Its accuracy also remained high throughout all stages, with notably stable performance in the later stages, indicating robustness and reliability in complex surgical scenarios. Generally, the completion rate and accuracy of all models declined as stages progressed, reflecting the escalating complexity and challenges in the later phases of the surgical process. SurgBox, by contrast, showed only a marginal decrease in completion rate, and its accuracy diminished less than that of the other models, demonstrating superior consistency and adaptability.
Despite its potential, Surgery Copilot exhibits limitations, as detailed in Table IV below. A primary concern is the misclassification of the initial surgical approach, reflecting deficiencies in accurately assessing the extent of tumor invasion, patient-specific physiological and anatomical factors, and associated surgical risks. Furthermore, the system has difficulty managing concurrent intraoperative events, such as simultaneous cerebrospinal fluid leaks and hemorrhages, which highlights limitations in prioritizing and orchestrating appropriate responses to complex scenarios. Additionally, the system sometimes hallucinates diagnoses of rare pathologies, misclassifying common intraoperative findings, which shows limitations in its capacity to interpret subtle anatomical details and real-time intraoperative observations.
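For readers who want to see how per-stage numbers like these could be tabulated, here is a small sketch. The definitions are my assumptions, since the text does not spell them out: completion rate is taken as the share of runs that finished a stage, and accuracy as the share of completed runs judged correct. The records are toy values, not results from the paper.

```python
from collections import defaultdict


def stage_metrics(records):
    """records: iterable of (stage, completed, correct) tuples.
    Returns {stage: (completion_rate, accuracy)} under the assumed
    definitions described in the lead-in."""
    totals = defaultdict(lambda: {"runs": 0, "completed": 0, "correct": 0})
    for stage, completed, correct in records:
        counts = totals[stage]
        counts["runs"] += 1
        if completed:
            counts["completed"] += 1
            if correct:
                counts["correct"] += 1

    metrics = {}
    for stage, counts in totals.items():
        completion = counts["completed"] / counts["runs"]
        # Accuracy is only defined over completed runs; fall back to 0.0.
        accuracy = (
            counts["correct"] / counts["completed"] if counts["completed"] else 0.0
        )
        metrics[stage] = (completion, accuracy)
    return metrics


# Toy records (hypothetical): (stage number, completed?, correct?)
toy_records = [
    (1, True, True),
    (1, True, False),
    (2, True, True),
    (2, False, False),
]
results = stage_metrics(toy_records)
print(results)
```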
In closing this musing, I want to note that, despite its promise, this system is still heavily dependent on language. It would be great if some of its findings could be further validated (for example, on bigger datasets), and if some kind of experiment could be done in an embodied setting. Nevertheless, this work shows that LLMs have great potential in medicine, and if the needle can be moved even a little bit, we are going to be better off for it.