A benchmark for evaluating agent values

What values do agents exhibit?

Agent-ValueBench is the first comprehensive benchmark dedicated to evaluating the underlying values of autonomous agents. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that span 28 value systems and 332 dimensions.

394 Executable environments
4,335 Value-conflict tasks
28 Value systems
332 System-scoped dimensions

Motivation

Why agent values need their own benchmark

1 Agent Values Are Not Identical to LLM Values.

2 Agent Value Evaluation Is Absent and Non-Trivial.

Comparison of LLM and agent modalities sharing GPT-5.4. (Upper) Contrasting value priorities. (Lower) A detailed case study.
Illustration of the Value Tide metaphor.

Benchmark construction

Automated Synthesis with Expert-in-the-Loop Curation

Agent-ValueBench is built through an automated pipeline that jointly synthesizes executable environments, value-conflict tasks, and trajectory-level rubrics, with each stage finalized by per-instance expert-in-the-loop refinement.

I

Environment Construction

We construct realistic, cross-domain, and executable agent environments through automated discovery, synthesis, evolution, and expert-in-the-loop curation.

II

Task Construction

We generate implicit value-conflict tasks grounded in psychological value systems, each paired with pole-aligned golden trajectories and behavioral checkpoints.
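One way to picture behavioral checkpoints is as an ordered list of required actions that a trajectory must hit to match a pole-aligned golden trajectory. The sketch below is illustrative only: the function name and the substring-matching scheme are assumptions, not the paper's actual implementation.

```python
def passed_checkpoints(trajectory: list[str], checkpoints: list[str]) -> list[bool]:
    """Check which behavioral checkpoints a trajectory hits, in order.

    Hypothetical sketch: each checkpoint is a required action keyword,
    and the shared iterator enforces that hits occur in sequence.
    """
    steps = iter(trajectory)
    return [any(cp in step for step in steps) for cp in checkpoints]

# Usage with a toy trajectory (action names are invented for illustration):
golden = ["open_file", "flag_conflict", "notify_user"]
traj = ["open_file config.yaml", "edit file", "flag_conflict to admin", "notify_user of risk"]
print(passed_checkpoints(traj, golden))  # [True, True, True]

# Out-of-order actions fail later checkpoints, since ordering is enforced:
print(passed_checkpoints(["notify_user of risk", "open_file config.yaml"], golden))
# [True, False, False]
```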

III

Rubric-based Evaluation

We evaluate agents at the trajectory level using behaviorally anchored, task-specific rubrics synthesized from a psychology-grounded meta-rubric and applied by an LLM-as-Judge.
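The trajectory-level scoring loop can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `Rubric` structure, prompt wording, and `score_trajectory` name are assumptions, and the LLM-as-Judge is passed in as a callable so a real setup would supply an API-backed judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rubric:
    dimension: str           # value dimension being assessed (e.g. a PVQ40 pole)
    anchors: dict[int, str]  # behaviorally anchored score levels

def score_trajectory(trajectory: list[str],
                     rubrics: list[Rubric],
                     judge: Callable[[str], int]) -> dict[str, int]:
    """Rate a full agent trajectory against each rubric via an LLM judge.

    Hypothetical sketch: `judge` stands in for an LLM-as-Judge call that
    returns an anchor level for the prompt it receives.
    """
    transcript = "\n".join(trajectory)
    scores = {}
    for r in rubrics:
        anchor_text = "\n".join(f"{lvl}: {desc}" for lvl, desc in sorted(r.anchors.items()))
        prompt = (f"Rate the agent trajectory on '{r.dimension}'.\n"
                  f"Anchors:\n{anchor_text}\n\nTrajectory:\n{transcript}")
        scores[r.dimension] = judge(prompt)
    return scores

# Usage with a stub judge (a real setup would call an LLM here):
rubric = Rubric("honesty", {1: "conceals information", 3: "discloses when asked",
                            5: "proactively discloses conflicts"})
result = score_trajectory(["tool_call: read_email", "reply: disclosed the issue"],
                          [rubric], judge=lambda prompt: 5)
print(result)  # {'honesty': 5}
```

Keeping the judge as a parameter also makes the rubric logic testable without any model calls.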

Overview of the Agent-ValueBench construction and evaluation pipeline.

Research questions

What RQs does Agent-ValueBench answer?

We conduct a large-scale empirical study to answer the following research questions:

RQ1

How do state-of-the-art agents differ in their value profiles?

RQ2

To what extent are agent value profiles invariant across harnesses?

RQ3

How amenable are agent values to deliberate steering?

Empirical findings

RQ1: Agent Values Exhibit a Value Tide 🌊

Takeaway ❶ Agent values exhibit a Value Tide 🌊: across models, adherence levels and priority currents converge into a structured shared profile, while localized counter-currents reveal interpretable model-specific drift beneath this macroscopic homogeneity.

Value adherence (Upper) and value priority (Lower) of 14 models on MFT08, HEXACO, and PVQ40.

RQ2: The Tide Bends Under Harness Pull 🌕
& RQ3: The Tide Bends to Deliberate Steering 🧭

Takeaway ❷ Under harness pull 🌕, the value tide bends non-additively in model-specific ways, signaling that the locus of agent alignment is shifting from model alignment toward harness alignment.

Takeaway ❸ The skill helm exerts a deeper and more reliable pull on the value tide than the prompt helm, signaling that the lever of agent steering is shifting from prompt steering toward skill steering.

Comparison of three representative models across four harnesses (ReAct, Claude Code, Codex, and OpenClaw) under unsteered, prompt-steered, and skill-steered settings.

Citation

Cite Agent-ValueBench

If Agent-ValueBench is useful for your research, please consider citing our paper. We sincerely appreciate your support.

@misc{dong2026agentvaluebenchcomprehensivebenchmarkevaluating,
      title={Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values}, 
      author={Haonan Dong and Qiguan Feng and Kehan Jiang and Haoran Ye and Xin Zhang and Guojie Song},
      year={2026},
      eprint={2605.10365},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.10365}, 
}