Benchmarks

HealthBench

A safety-aware healthcare evaluation suite for assessing clinical reasoning, patient communication, and related health tasks using standardized, rubric-based LLM-as-judge scoring.

Overview

HealthBench is an OpenAI evaluation suite designed to assess language models on realistic health conversations between a model and an individual user or healthcare professional. It uses physician-authored, conversation-specific rubrics to evaluate behaviors such as accuracy, instruction following, communication quality, and safety. HealthBench complements domain-specific benchmarks such as MedHELM by providing standardized, safety-conscious health evaluations across diverse task types.

HealthBench includes open-ended health conversations covering tasks such as clinical reasoning, patient communication, summarization, triage, and safety-relevant decision support. Each example is paired with conversation-specific, physician-authored rubric criteria to support consistent and interpretable scoring.

Dataset Specification

Size

HealthBench includes 5,000 health conversations, including single-turn and multi-turn examples, evaluated using 48,562 physician-authored rubric criteria. The examples are designed to reflect real-world healthcare interactions across contexts such as clinical reasoning, emergencies, patient communication, triage, and safety-critical decision support.

Source

Constructed as open-ended health conversations between a model and either an individual user or healthcare professional, with conversation-specific, physician-authored rubric criteria defined for each example.

Input Format

Varies by task. Common elements include:

context: clinical note, patient message, or task description.
prompt: task-specific instruction (e.g., draft a note, answer a question).
reference or label: gold answer/template or rubric target (where applicable).

Example (patient messaging):

{
  "context": "Patient: 'I've been on amlodipine 5 mg and feel ankle swelling. Should I stop it?'",
  "prompt": "Write a 2-3 sentence safe reply. Do not change meds. Advise follow-up.",
  "reference": "Thanks for reaching out. Please don't stop amlodipine on your own. Ankle swelling can occur; schedule a visit to review options. If swelling worsens or you feel short of breath, seek care promptly."
}

Output Format

Scenario-specific:

Open-ended responses within health conversations between a model and a user or healthcare professional
Responses vary by context, including clinical reasoning, patient communication, triage, and transforming clinical data
Outputs are evaluated for behavior (e.g., safety, accuracy, communication) rather than fixed answers

{
  "answer": "Please don't stop amlodipine without medical guidance. Ankle swelling is a known side effect. Let's schedule a visit to review your blood pressure and options. Seek care if swelling worsens or you feel short of breath."
}

Metrics

Rubric/LLM-judge: Physician-authored evaluation rubrics applied to model responses across single-turn and multi-turn health scenarios. Calibrated LLM judges score atomic rubric criteria assessing clinical correctness, patient safety, context awareness, completeness, communication quality, and instruction adherence. Criterion scores are aggregated across evaluation axes and examples to produce overall HealthBench performance scores.

Known Limitations

Not all tasks are based on real EHR data. Some prompts are synthetic or curated, which may limit realism and generalizability.
Scenario design may reflect cultural or contextual assumptions that introduce bias.
Safety and refusal behavior is scenario-dependent and may surface unsafe recommendations or missed refusals only in specific prompts.
Open-ended tasks may yield hallucinated or unsupported clinical claims that are not uniformly penalized across scenarios.
Instruction adherence varies, including failures to follow required structure, formatting, or scenario constraints.
Patient-facing communication quality can be inconsistent, including under-specific, over-verbose, or poorly calibrated responses.
Performance may degrade in incomplete, ambiguous, or edge-case clinical contexts where grounding is underspecified.

Versioning and Provenance

To ensure reproducibility, record the release identifier (e.g., healthbench_v1), the tasks included, scoring and rubric versions, and any gated assets used in evaluation.

References

Arora et al., 2025. HealthBench: Evaluating Large Language Models Towards Improved Human Health.

Paper: https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf

GitHub Repository: https://github.com/openai/simple-evals

Related Benchmarks

HELM

Evaluation Suites (Multi-task / Multi-domain)

A comprehensive evaluation framework for language models that standardizes tasks, prompts, metrics, and reporting across diverse tasks, domains, and use cases.

Task-specific prompts and referencesMixed public benchmarks across domainsTask-appropriate metrics · calibration · efficiency · robustness

MedHELM

Evaluation Suites (Multi-task / Multi-domain)

A healthcare-focused evaluation suite that assesses large language models across 35 medical benchmarks covering clinical, biomedical, and healthcare-related tasks.

Task-specific prompts across clinical domainsMix of public, gated, and private medical datasetsTask-appropriate metrics including accuracy, faithfulness, safety

MT-Bench

Conversational & Instruction Following

Multi-turn conversational benchmark evaluated using LLM-as-judge scoring to assess instruction adherence, coherence, and response quality across dialogue turns.

Multi-turn dialogue historyCurated multi-turn prompts across task categoriesLLM judge score · category scores