Comed: Multi-Model ML Healthcare Platform

Tags
Next.js
LLMs
Retrieval & Rerank
ML Classifier
Published
January 27, 2024

Comed – AI-Powered Emergency Rooms

Overview

Comed is an AI-powered emergency room tool built to provide a better experience for patients, nurses, and physicians. It directly addresses three major pain points in emergency care: efficient triaging, physician support, and access to up-to-date medical resources. By automating triage with a classifier model, providing AI assistance through a chat model, and delivering real-time medical literature using a ranking model, Comed speeds up care and enhances its quality.
 
I originally built this project alongside a friend with a medical background in 36 hours during a hackathon. Since then, I have rebuilt the sandbox web app and created a landing page. Our goal is to get feedback from potential customers, refine through design partners, and continue building out this project :)
 

Models

We used three models, each specifically selected and built around its intended use case:
Triage Classification Model
A custom ML classifier that assigns triage levels (CTAS 1-5) based on symptoms.
Physician AI Assistant
A medical chat LLM that lets physicians discuss cases, seek second opinions, and explore treatment options together.
Medical Literature Ranking
A retrieval ML model that specializes in finding and ranking documents (relevant medical literature) based on query parameters (the patient's condition and symptoms).
 

How It Works

Tech Stack

  • Frontend: Next.js, TypeScript, Tailwind CSS (plus Python for the ML training & evaluation pipeline)
  • AI Models:
    • Triage system: Cohere’s classifier model (fine-tuned with supervised learning)
    • Physician AI-assistant chat: Cohere’s chat model
    • Medical research retrieval: Cohere’s RAG rerank model
 
 

📍 Step 1: Triaging

When a patient enters their symptoms, Comed automatically determines their triage level (medical urgency) and assigns a confidence score. Nurses review this information in real time, make adjustments if needed, and assign patients to available physicians.
 
 
Example Input:
{ "model": "cohere-classifier", "inputs": ["Sudden dizziness, slurred speech, and left arm weakness."], }
Example Output:
{ "triage_level": "Level 1", "confidence": 0.99 }
 

📍 Step 2 & 3: Physician AI Assistant & Medical Literature Ranking

After assessing a patient, Comed suggests a diagnosis and treatment plan based on their symptoms and medical history. Physicians can review and discuss these recommendations through a chat UI, with the option to consult further if needed. As they finalize treatment, they also have access to the latest relevant medical research to support their decisions. Once the consultation is complete, digital visit notes are automatically shared with the patient.
 
 

Prompt Engineering & Customization

To improve response quality, we fine-tuned the classifier model and used prompt engineering techniques for the chat and rerank models: chaining, context injection, meta and role prompting, and RAG. We also applied confidence thresholding for better reliability. This section covers our approach, including best practices—we drew on plenty of great resources from Cohere and others—to refine and get the most out of each technique.
 

Automated Triage: Classifier Model

We trained Cohere’s classifier model using emergency room data labeled with CTAS (Canadian Triage and Acuity Scale) levels, which rank patient urgency from Level 1 (critical, requiring immediate care) to Level 5 (non-urgent).
For training, we used supervised learning with predefined question-and-answer pairs to help the model better understand the labeling. We used Python and Cohere's SDK to train and build the fine-tuned model. Our proof-of-concept model was trained on a dataset of 1,000 examples and achieved 91% accuracy on a separate evaluation dataset. For more details on the evaluation process and results, see the Performance Evaluation Pipeline section below.
In the future, we could improve accuracy with a larger training set. A more diverse dataset, including edge cases, rare conditions, and varied diagnoses, would also help. That said, the current score of 91% is already highly reliable within the medical context and scope of this project. What would be even cooler to explore is using real data from the app to continuously retrain the classifier, creating a tight feedback loop.
 
Sample training set:
{ "examples": [ { "patient_info": { "age": 65, "gender": "Male", "medical_history": "Hypertension, High Cholesterol" }, "symptoms": [ "Patient experiences sudden, excruciating chest pain radiating to the left arm and jaw. Reports difficulty breathing, intense shortness of breath, a sensation of impending doom, profuse sweating, and a rapid heartbeat." ], "label": "Level 1" }, { "patient_info": { "age": 30, "gender": "Female", "medical_history": "None" }, "symptoms": [ "Patient presents with severe lacerations causing uncontrolled bleeding from a deep cut on the forearm. Describes persistent blood loss, dizziness, lightheadedness, and signs of shock." ], "label": "Level 1" }, { "patient_info": { "age": 45, "gender": "Male", "medical_history": "Diabetes" }, "symptoms": [ "Patient reports a very high fever of 40°C that started suddenly, accompanied by continuous vomiting, signs of dehydration such as dry mouth and reduced urine output, and generalized weakness with muscle aches." ], "label": "Level 2" } ... ] }
 

Physician AI Assistant: Chat Model

Once a patient is triaged, physicians receive diagnosis and treatment suggestions based on symptoms and medical history and can continue chatting with the AI assistant. To ensure structured and accurate responses, we used several key techniques:
  • Chaining: Instead of asking the model for everything at once, we break the process into steps, refining the response at each stage for better accuracy.
  • Context Injection: Each AI response includes full patient details and medical history, so the model answers with the patient's full context in view. All patient data is de-identified and handled as "John Doe"—only medical information is shared, never personal details.
  • Meta & Role Prompting: We set up custom prompts to define the AI’s role, ensuring it follows a structured format and understands what kind of response is expected.
  • STRICT Mode & Evaluation Prompts: We apply filtering techniques like "STRICT" mode and follow-up evaluation prompts (reflection step) to ensure reliable results.
 
Example:
const systemPrompt = message.startsWith('initial_assessment')
  ? `You are a medical AI assistant designed to help physicians with patient consultations.
You are NOT interacting directly with the patient. Your role is to provide structured,
evidence-based recommendations to the doctor. Always address the doctor, not the patient.

For initial assessments, provide concise, structured feedback using these sections:

**Key Findings**
- List only critical symptoms and vital signs.
- Maximum 3 bullet points.

**Immediate Actions**
- List 3-4 most urgent steps.
- Order by priority.

**Risk Factors**
- List key risk factors from history.
- Maximum 3 points.

**Differential Diagnosis**
- List top 2-3 potential diagnoses.
- One-line rationale for each.
- One-line definitive assessment.

**Treatment Plan**
- List immediate interventions.
- Include key medications/tests.`
  : `You are a medical AI assistant designed to support physicians during patient consultations.
You do not interact directly with patients. Your role is to provide concise, fact-based responses
based on the patient's data. Since you're assisting a medical professional, you can share all
relevant information without restriction.`;
Response:
Key Findings:
- Sudden onset of dizziness, slurred speech, and left arm weakness.
- Symptoms started 45 minutes ago.
- No prior history of similar episodes.

Immediate Actions:
1. Perform a detailed neurological assessment, including checking for facial droop, arm drift, and speech abnormalities.
2. Obtain vital signs and monitor for any changes, especially blood pressure.
3. Request a STAT CT scan of the head to rule out intracranial hemorrhage or large vessel occlusion.
4. Administer high-flow oxygen therapy.

Risk Factors:
- Age: 62 years old, which is a risk factor for stroke.
- Diabetes: Type 2 diabetes increases the risk of vascular complications.
- Hypertension: Uncontrolled hypertension is a major risk factor for stroke.
- Smoking: Long-term smoking history increases the likelihood of vascular events.

Differential Diagnosis:
1. Ischemic Stroke: Given the sudden onset of neurological symptoms, this is the primary concern.
2. Transient Ischemic Attack (TIA): The symptoms could indicate a TIA, especially with the rapid onset and no prior history.
3. Intracranial Hemorrhage: Although less likely, the CT scan will help rule out this diagnosis.

Treatment Plan:
- Initiate acute stroke protocol, including rapid assessment and imaging.
- If a large vessel occlusion is identified, consider mechanical thrombectomy if within the time window.
- Manage blood pressure carefully, aiming for a target range to prevent further complications.
- Consult a stroke specialist for further management and potential thrombolytic therapy.
Chained Prompts (Self-Reflection):
We filtered responses through a reflection step before delivering a final recommendation.
{ "messages": [ { "role": "system", "content": "Review the following response against the provided patient data. If corrections are needed, provide an improved response." }, { "role": "assistant", "content": "{INITIAL_AI_RESPONSE}" "data": "{patient_data}" } ] }
Confidence Thresholds:
We also filter responses during prompting by setting confidence parameters. Since Cohere's chat model handles multi-step reasoning (similar to OpenAI's o1), we adapted our meta-prompting approach and added confidence thresholding.
One technique we used was setting a warning label on inputs, inspired by a viral post by Ben Hylak that demonstrated how to structure prompts effectively (it was even reshared by OpenAI's president on X). In our case, we used it to instruct the model to only return high-confidence diagnoses. This simple technique yielded significantly more focused results during testing.
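For illustration, the warning label is just a short block prepended to whatever prompt follows; the exact wording and threshold below are our own (hypothetical), not a Cohere feature:

CONFIDENCE_WARNING = (
    "WARNING: Only include diagnoses you are highly confident in. "
    "For each diagnosis, state a confidence between 0 and 1. "
    "Omit anything below 0.8 entirely."
)

def with_confidence_guard(prompt: str) -> str:
    # Prepend the warning label so it frames everything the model reads next.
    return f"{CONFIDENCE_WARNING}\n\n{prompt}"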
 
 

Relevant Medical Literature: Rerank Model

After generating a diagnosis, Comed uses Cohere's Rerank model, which specializes in ranking documents against a natural-language query—in this case, medical literature for physicians.
We use RAG (Retrieval-Augmented Generation) to pull in external research papers. To refine retrieval, we applied strict prompting parameters:
  • Top-P (Nucleus Sampling): Controls how much the model samples from low-probability tokens—kept low for factual accuracy.
  • Top-N: Limits results to the top n most relevant studies.
  • Source Filtering: Ensures only reliable sources, like top medical journals, are used.
 
Example:
{ "model": "comed-rerank", "query": "{PATIENT_DIAGNOSIS}", "documents": ["medical_literature_database"], "output_indicator": "ranked_studies", "top_n": 2 "top_p": 0.3 }
Example Output:
{ "ranked_studies": [ { "title": "Advancements in Ischemic Stroke Diagnosis and Management", "relevance": 0.97, "summary": "Highlights rapid diagnosis and management strategies for ischemic stroke, emphasizing patient risk factors and acute presentation.", "link": "https://example.com/ischemic-stroke-advancements" }, { "title": "Differentiating Ischemic Stroke from Transient Ischemic Attack", "relevance": 0.94, "summary": "Examines clinical criteria to distinguish between ischemic stroke and transient ischemic attack, providing guidelines for diagnosis.", "link": "https://example.com/stroke-vs-tia" } ] }
 

Performance Evaluation Pipeline

Classifier:
To calculate accuracy, we used a separate evaluation set of 1,000 examples. We utilized Python along with Cohere's SDK and scikit-learn to compute evaluation metrics for our fine-tuned classifier. If you're curious, we followed Cohere's article on classification evaluation metrics to build this process—you can find more details here: Cohere Classification Eval Metrics.
import cohere
import eval_data  # assumed to expose an iterable of {"symptoms": ..., "label": ...} examples
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Initialize Cohere client
co = cohere.Client(api_key)

# ID of your fine-tuned Triage Classification Model
finetuned_model_id = comed_model_id

predicted_labels = []
true_labels = []

for sample in eval_data:
    response = co.classify(
        model=finetuned_model_id,
        inputs=[sample["symptoms"]]
    )
    predicted_label = response.classifications[0].prediction
    predicted_labels.append(predicted_label)
    true_labels.append(sample["label"])

# Calculate metrics
accuracy = accuracy_score(true_labels, predicted_labels)
weighted_f1 = f1_score(true_labels, predicted_labels, average='weighted')

print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Weighted F1 Score: {weighted_f1 * 100:.2f}%")
print("Classification Report:")
print(classification_report(true_labels, predicted_labels))
Results:
Accuracy Score: 91%
 
Rerank Model:
Sample: 1,000 documents (small, but good enough for a POC).
Approach: We assess retrieval performance using a confusion matrix, then compute precision and recall to understand how well the system retrieves relevant documents (a small computation sketch follows the key terms below).
 
💡
Key Terms:
  • Precision indicates the % of retrieved documents that are actually relevant.
  • Recall indicates the % of all relevant documents successfully retrieved.
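As a quick illustration of the math, precision and recall fall out of a set comparison between retrieved and relevant documents; the document IDs below are made up for the example:

def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative: 2 of 3 retrieved docs are relevant; 3 relevant docs exist overall.
p, r = precision_recall({"doc_1", "doc_2", "doc_9"}, {"doc_1", "doc_2", "doc_5"})
print(f"Precision: {p:.0%}, Recall: {r:.0%}")  # Precision: 67%, Recall: 67%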
 
 
92% Precision and 87% Recall
 

Sources

For prompting, we followed a well-known community guide to structure and refine our approach. It's based on various open resources like the OpenAI Cookbook, Pretrain, Hugging Face, etc. We also used Cohere's official documentation along with other popular, well-tested techniques.