Benchmark study · validated against real-world data

1 June 2026

The Surprising Boundary Between Psychology and Culture

We benchmarked synthetic respondents against real World Values Survey populations in the US and GB. They carry real, structured signal — but the variance collapses, and one finding redrew where we think the technology's edge actually lies.

What a benchmark against real human populations taught us about synthetic respondents.

For the past year we've been building synthetic respondents.

Not synthetic survey answers. Not AI summaries. Not personas.

Synthetic people — with different motivations, different priorities, different ways of interpreting the world, and different psychographic structures.

And the question we hear more than any other is exactly the right one:

How do you know they're not just averages produced by a language model?

It's a fair challenge. In fact, it's probably the most important challenge facing synthetic research today.

So we decided to test it. Not with a demo. Not with a client case study. Not with a marketing benchmark. With public data.

And the results taught us something we weren't expecting.

The test

We used the World Values Survey (WVS), Wave 7 — one of the most respected datasets in social science — as ground truth.

We generated synthetic populations for the United States and Great Britain, using demographic quotas aligned with national population structures. We then compared their responses against the real WVS distributions across five non-sensitive attitudes:

Interpersonal trust
Life satisfaction
Happiness
Attitudes toward competition
Importance of work

Our goal was not simply to reproduce averages. A model can land an average by accident. What matters is whether it reproduces variation.

Human populations disagree. They contain minorities, contradictions, and extremes. If synthetic populations collapse toward the center, they may look realistic while missing what actually makes people different.

So we evaluated four things: distributional convergence, mean alignment, variance preservation, and the differences between the two countries.

The numbers

Here are the actual results. No cherry-picking. No selective reporting. No hidden failures.

United States

Item	Synthetic mean	Human mean (WVS)	Convergence	Variance ratio
Trust	1.72	1.63	90.5%	0.87
Work importance	1.57	1.88	83.2%	0.37
Happiness	2.22	1.84	71.6%	0.48
Competition	4.07	3.31	50.0%	0.15
Life satisfaction	7.02	7.28	38.1%	0.06

Great Britain

Item	Synthetic mean	Human mean (WVS)	Convergence	Variance ratio
Trust	1.72	1.54	81.6%	0.80
Happiness	2.18	1.78	67.4%	0.38
Competition	4.25	3.78	62.9%	0.23
Work importance	1.85	2.05	59.1%	0.18
Life satisfaction	6.95	7.34	47.1%	0.13

How to read this

Convergence is the overlap between the synthetic and human answer distributions for an item, computed as 1 minus the total variation distance and shown as a percentage (higher is better; 100% means identical distributions). Variance ratio is synthetic variance divided by human variance (1.0 = same spread as real humans; below 1.0 = narrower).
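For readers who want to reproduce the two metrics, here is a minimal Python sketch of the definitions above. It is an illustration, not our production pipeline: the function names are ours, the toy answers are invented, and it assumes unweighted responses and population variance, neither of which this post specifies.

```python
from collections import Counter
from statistics import pvariance


def convergence(synthetic, human):
    """Return 1 minus the total variation distance between two answer lists."""
    syn_counts, hum_counts = Counter(synthetic), Counter(human)
    options = set(syn_counts) | set(hum_counts)
    # Total variation distance: half the summed absolute gap in answer shares.
    tvd = 0.5 * sum(
        abs(syn_counts[o] / len(synthetic) - hum_counts[o] / len(human))
        for o in options
    )
    return 1.0 - tvd


def variance_ratio(synthetic, human):
    """Return synthetic variance divided by human variance."""
    return pvariance(synthetic) / pvariance(human)


# Toy answers on a 1-4 scale, invented for illustration (not WVS data).
human = [1, 1, 2, 2, 3, 3, 4, 4]
synthetic = [2, 2, 2, 2, 2, 3, 3, 3]

print(convergence(synthetic, human))     # 0.5
print(variance_ratio(synthetic, human))  # 0.1875
```

In the toy data the synthetic answers sit entirely on options 2 and 3, so half of the distribution overlaps with the human one and the spread is under a fifth of the human spread. That is the same pattern, in miniature, as the life satisfaction rows in the tables.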

Scale directions differ by item: for trust and happiness, lower means more trusting or happier; for life satisfaction (1–10), higher means more satisfied; competition runs from 1 (good) to 10 (harmful); work importance runs from 1 (very important) to 4 (not at all important).

Average distributional convergence was approximately 65%. That is clearly above noise — the synthetic populations were not generating random answers. They were capturing real structure.
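The average can be checked directly from the convergence columns in the two tables. The sketch below only redoes that arithmetic; the variable names are ours.

```python
# Convergence percentages copied from the two tables above.
us = {"Trust": 90.5, "Work importance": 83.2, "Happiness": 71.6,
      "Competition": 50.0, "Life satisfaction": 38.1}
gb = {"Trust": 81.6, "Happiness": 67.4, "Competition": 62.9,
      "Work importance": 59.1, "Life satisfaction": 47.1}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


all_scores = list(us.values()) + list(gb.values())
us_avg = mean(us.values())
gb_avg = mean(gb.values())
overall = mean(all_scores)

print(f"US {us_avg:.2f}, GB {gb_avg:.2f}, overall {overall:.2f}")
print(f"lowest {min(all_scores)}, highest {max(all_scores)}")
```

The overall figure comes out at about 65%, but it averages over a wide range: from 38.1% (US life satisfaction) to 90.5% (US trust). The per-item rows are more informative than the headline number.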

But the benchmark also exposed two important limitations. And those limitations turned out to be the most valuable findings.

Finding 1: The variance collapsed

Look at the variance ratios.

Interpersonal trust preserved variance reasonably well (0.87 and 0.80). But most other variables did not. Life satisfaction is the clearest case: in the United States, the variance ratio was only 0.06. Real humans spread across the full scale; synthetic respondents clustered tightly around their mean of about 7.

The synthetic populations disagreed with each other less than real humans do.
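Variance ratios can be hard to read intuitively, because variance is in squared units. Taking the square root turns each ratio into a ratio of standard deviations, which is closer to how "spread" is usually pictured. A short sketch using four of the ratios from the tables:

```python
from math import sqrt

# Variance ratios copied from the tables above.
variance_ratios = {
    ("US", "Trust"): 0.87,
    ("GB", "Trust"): 0.80,
    ("US", "Life satisfaction"): 0.06,
    ("GB", "Life satisfaction"): 0.13,
}

# The square root of a variance ratio is the ratio of standard deviations.
sd_ratios = {key: sqrt(ratio) for key, ratio in variance_ratios.items()}

for (country, item), sd_ratio in sd_ratios.items():
    print(f"{country} {item}: synthetic SD is about {sd_ratio:.0%} of the human SD")
```

On these numbers, synthetic US life satisfaction answers have roughly a quarter of the human standard deviation, while synthetic trust answers keep around 90% of it.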

This is a known tendency of language models — they regress toward the most probable answer, producing narrower distributions, fewer extremes, fewer outliers, less disagreement. Knowing the mechanism doesn't excuse the result.

It doesn't invalidate the benchmark. But it reveals a genuine technical challenge: if synthetic respondents are going to become a serious research methodology, preserving human variance may be one of the most important problems to solve. It's one we're now working on directly.

Finding 2: We reproduced trust levels, but not trust differences

The second finding surprised us more.

Interpersonal trust is one of the classic examples of cross-cultural variation: in the WVS data, Great Britain reports higher trust than the United States, and the gap is well documented.

Our synthetic populations reproduced the overall trust level within each country reasonably well: that's the 90.5% and 81.6% convergence. But they failed to reproduce the gap between them. The synthetic US and GB populations returned the same mean trust score to two decimals (1.72 in both). The real populations did not (1.63 in the US, 1.54 in GB).
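To make the missing gap concrete, the mean scores can be converted into shares of trusting respondents. The sketch below assumes the trust item has only two answer options, coded 1 (trusting) and 2 (not trusting), as in the standard WVS question; that coding is our assumption here, and the function names are illustrative.

```python
# Mean trust scores copied from the tables above.
trust_means = {
    "US": {"synthetic": 1.72, "human": 1.63},
    "GB": {"synthetic": 1.72, "human": 1.54},
}


def share_trusting(mean_score):
    """Share choosing option 1, assuming answers are coded only 1 or 2."""
    return 2.0 - mean_score


def country_gap(source):
    """GB share trusting minus US share trusting for 'synthetic' or 'human'."""
    gb_share = share_trusting(trust_means["GB"][source])
    us_share = share_trusting(trust_means["US"][source])
    return gb_share - us_share


human_gap = country_gap("human")
synthetic_gap = country_gap("synthetic")
print(f"human gap: {human_gap:.2f}, synthetic gap: {synthetic_gap:.2f}")
```

Under that assumption, the human data shows a gap of about 9 percentage points in favour of Great Britain, and the synthetic data shows none. The same assumption also gives 1 minus the difference in shares of about 91% for the US and 82% for GB, close to the reported 90.5% and 81.6%; the small differences are consistent with the means being rounded to two decimals.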

At first glance, that looks like a failure. We think it reveals something more interesting.

The boundary between psychology and culture

The benchmark forced a more fundamental question: what is our system actually modelling?

The answer appears to be psychology more than culture.

Our architecture is designed to create variation between individuals — different motivations, priorities, psychographic structures, and ways of making sense of the world. What it models less strongly are the institutional and historical forces that shape entire societies.

Trust isn't purely a psychological variable. It's also cultural and institutional — it emerges from social norms, civic institutions, historical experience, collective expectations, shared narratives. Those forces exist above the individual level, and our benchmark suggests they should be treated as a separate modelling layer rather than assumed to emerge automatically from individual psychology.

We don't think this is an accidental quirk of one benchmark. It's a boundary condition. And understanding boundary conditions is part of taking a methodology seriously.

Why this matters more than it seems

If our goal had been to prove perfect population realism, this benchmark would be disappointing. That was never the goal, and it is not the most important thing we learned.

Most commercial decisions are not made between countries. They're made between customer types.

Different motivations, anxieties, priorities, and decision styles. A product team rarely needs the average trust score of a nation. It needs to understand why one segment embraces a proposition while another rejects it.

That is fundamentally a problem of psychological variation, not national averages — and this benchmark suggests psychological variation may be precisely where synthetic respondents are strongest.

That's the test we're running next: whether different psychographic profiles diverge the way established theory predicts. We have not published that result yet, so we won't claim it here — but we'll report it with the same transparency as this one, including if it disappoints us.

What this benchmark does not prove

This benchmark does not demonstrate predictive validity. It does not prove synthetic respondents can predict purchase intent, willingness to pay, brand choice, or product adoption. Nor does it demonstrate commercial usefulness. Those questions require prospective studies, parallel human-synthetic comparisons, and real-world decision outcomes.

It addresses a more basic question: do synthetic populations exhibit human-like structure? The answer appears to be partially, yes — with the important caveats above.

A better validation framework

This project also changed how we think about validation. Population realism shouldn't be the finish line. It should be the starting point. A more useful hierarchy:

Level 1 — Population realism. Do synthetic populations resemble real ones?
Level 2 — Psychographic realism. Do different psychological profiles diverge coherently?
Level 3 — Reproducibility. Do repeated runs produce stable results?
Level 4 — Human convergence. Do findings align with parallel human research?
Level 5 — Predictive utility. Do the resulting insights improve decisions?
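Level 3 is the easiest of these to operationalise. As a rough sketch, one could rerun the same synthetic study several times and check how far an item's mean moves between runs. Everything below is hypothetical: the function name, the tolerance, and the example values are placeholders invented for illustration, not results from this benchmark.

```python
from statistics import mean, pstdev


def run_stability(run_means, tolerance):
    """Summarise how much an item's mean moves across repeated runs."""
    spread = max(run_means) - min(run_means)
    return {
        "mean": mean(run_means),
        "sd": pstdev(run_means),
        "range": spread,
        # "Stable" here simply means the range stays within the tolerance.
        "stable": spread <= tolerance,
    }


# Placeholder run means, invented for illustration only.
example_runs = [3.0, 3.1, 2.9, 3.0]
summary = run_stability(example_runs, tolerance=0.25)
print(summary["stable"], round(summary["range"], 2))
```

The choice of tolerance is a judgment call that depends on the scale and on how the results will be used; a stricter check could compare full answer distributions between runs rather than means.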

Only the final two establish commercial value. But the first three establish credibility — and credibility is the resource synthetic research needs most right now.

A note on transparency

World Values Survey Wave 7 predates today's frontier models and may be partially represented within their training corpora. For that reason we do not treat this as proof of predictive validity — a model can score well by memory rather than fidelity. We treat it as a population-realism benchmark: useful and necessary, but not sufficient.

We're publishing the methodology, and we invite anyone building synthetic respondents to run the same test and post their own numbers — variance ratios included. The point isn't to win an argument. It's to give the category a shared, reproducible way to ask how close is close enough.

What we learned

We started this benchmark trying to answer a simple question: are synthetic respondents just averages?

The evidence suggests the answer is no: they carry real, structured signal. But it also revealed something more interesting: what they capture looks like individual-level psychology, with a narrower spread than real populations show, and they reconstruct the institutional and cultural forces that shape whole societies far less well.

That distinction matters. It tells us where the technology is strong, where it still needs work, and where the future of synthetic research may actually lie.

Synthetic research won't become credible because it gets everything right. It will become credible when we understand precisely where it gets things wrong.

References and further reading

Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis, 31(3), 337–351.
Park, J. S., Zou, C. Q., Shaw, A., et al. (2024). Generative Agent Simulations of 1,000 People. Stanford University.
Larooij, M., & Törnberg, P. (2025). Validation is the Central Challenge for Generative Social Simulation. AI Review (Springer).
Morocho, E. E. T., Cima, L., Cresci, S., et al. (2026). Assessing the Reliability of Persona-Conditioned LLMs as Synthetic Survey Respondents. WWW 2026 Companion.
Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., & Larson, J. M. (2024). Synthetic Replacements for Human Survey Data? The Perils of Large Language Models. Political Analysis, 32(4), 401–416.
Haerpfer, C., et al. (eds.). World Values Survey: Round Seven. WVS Association.

Want to run the same benchmark — or design a study around psychographic variation? Let's talk.

QualiSynth
