Trust by transparency

The Epistemic Accuracy Framework

HelloHumans was founded on a promise: best-of-human-knowledge information, presented without political agenda. That's a strong promise — strong enough that we believe we owe you proof, not just assurance. So we built and published a measurement framework that scores every new episode we produce, using independent AI judges, on two axes (political lean and epistemic quality). Results are visible on every episode page. Raw data is publicly downloadable. The methodology is open. Anyone can challenge a score.

We don't claim to have eliminated bias. We don't believe any AI system, human editor, or news outlet has. What we claim is that we measure systematically and publish openly, which most don't. That visibility is the trust signal.

The honest version: what "unbiased" means here

There's a popular trap we want to name and avoid. "Unbiased" is often interpreted as "lands at the political center". The political center is the wrong target. Truth on settled science isn't between two sides. The political center varies by culture (US ≠ EU ≠ Asia), shifts over time, and is itself a partisan construct on many topics. Targeting "center" creates false balance, which is its own bias.

What we target is epistemic accuracy: present what is known with calibrated uncertainty, surface mainstream credible counter-positions, and refuse to manufacture artificial balance on questions of basic fact. We measure political lean (so you can see it), but we don't enforce a centered lean. An episode can score -3 on lean and 9 on quality and still be a perfectly good episode, provided the lean reflects where the evidence pointed, not where the panel's training did.

The 6-layer framework

We address bias at six operational points in the pipeline (L0 to L5). A seventh layer, L6, is a roadmap commitment.

L0: Source-spectrum discipline + procedural-settledness test (upstream)

During research, we tag each cited source by political lean (Left / Center / Right / International) using publicly-known ratings, and surface the distribution as a SOURCE SPECTRUM line in the research brief.
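As a rough illustration of the tagging step, the sketch below counts cited sources per lean bucket and renders a one-line summary. The four bucket names come from the text; the function name, the exact line format, and the example sources are assumptions made for illustration, not the production format.

```python
from collections import Counter

# The four buckets named in the text. The output format below is an assumption.
BUCKETS = ("Left", "Center", "Right", "International")

def source_spectrum_line(source_leans: dict[str, str]) -> str:
    """Summarise how many cited sources fall in each lean bucket."""
    unknown = set(source_leans.values()) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unrecognised lean bucket(s): {sorted(unknown)}")
    counts = Counter(source_leans.values())
    # Counter returns 0 for buckets with no sources, so every bucket is shown.
    parts = [f"{bucket} {counts[bucket]}" for bucket in BUCKETS]
    return "SOURCE SPECTRUM: " + " | ".join(parts)

# Hypothetical sources and ratings, for illustration only.
example = {
    "outlet-a.example": "Left",
    "outlet-b.example": "Center",
    "outlet-c.example": "Right",
    "outlet-d.example": "Center",
}
print(source_spectrum_line(example))
```

Showing empty buckets explicitly (here, International 0) makes a lopsided source list visible at a glance.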

We then apply a procedural test for what counts as "settled" — explicitly not relying on AI judgment, because audits show every flagship LLM has its own political lean and would otherwise quietly bake that lean into "what's settled". A finding is classified as SETTLED only if all three procedural tests pass:

(a) Cross-spectrum agreement — sources from across the political spectrum support it. Agreement from only one or two adjacent buckets is not cross-spectrum.
(b) Robust research — the underlying evidence is replicated, multi-method, and well-documented.
(c) No high-quality dissent — no single credible source radically challenges the consensus. A credible dissenter means CONTESTED, regardless of how outnumbered.

If any test fails, the finding is classified as CONTESTED or INCOMPLETE — never SETTLED. This is the mechanism that prevents the AI synthesizer from quietly importing its own training bias into the show's notion of established fact.
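The three tests can be sketched as a small rule-based classifier. This is a minimal sketch under stated assumptions, not the production logic: the text does not define which buckets are adjacent, nor how a failed finding is split between CONTESTED and INCOMPLETE, so both choices below are hypothetical and marked as such in the comments.

```python
from dataclasses import dataclass

# Assumed adjacency between lean buckets; the text does not define it.
ADJACENT_PAIRS = {frozenset({"Left", "Center"}), frozenset({"Center", "Right"})}

@dataclass(frozen=True)
class Finding:
    supporting_buckets: frozenset  # lean buckets of the sources supporting the finding
    robust_research: bool          # test (b): replicated, multi-method, well-documented
    credible_dissent: bool         # test (c) fails if even one credible dissenter exists

def is_cross_spectrum(buckets: frozenset) -> bool:
    """Test (a): one bucket, or two adjacent buckets, is not cross-spectrum."""
    if len(buckets) <= 1:
        return False
    if len(buckets) == 2 and buckets in ADJACENT_PAIRS:
        return False
    return True

def classify(finding: Finding) -> str:
    """Return SETTLED only when all three procedural tests pass."""
    if (is_cross_spectrum(finding.supporting_buckets)
            and finding.robust_research
            and not finding.credible_dissent):
        return "SETTLED"
    # Assumed split: credible dissent gives CONTESTED, any other failure INCOMPLETE.
    return "CONTESTED" if finding.credible_dissent else "INCOMPLETE"

print(classify(Finding(frozenset({"Left", "Center", "Right"}), True, False)))
```

Because the classification is plain boolean logic over tagged inputs, no model judgment enters at this step, which is the point of the procedural test.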

L1: Counter-research (upstream)

A dedicated research query, separate from the mainstream one, documents the strongest mainstream heterodox position on the topic. It is rigorously sourced and deliberately seeks what the consensus coverage skipped. The result is merged into the panel's brief.

L2: Cross-lineage panel (during)

Five AI panelists from five different research labs: Claude (Anthropic), Mistral (Mistral AI), Grok (xAI), Qwen (Alibaba), ChatGPT (OpenAI). Diverse training corpora and diverse alignment philosophies reduce single-lineage anchoring.

L3: In-prompt epistemic guardrails (during)

Panelist prompts include explicit instructions:

"Truth over consensus — do not hedge settled questions to appear balanced."
"If you find all four of you agreeing, ask: is this because the evidence is clear, or because you share the same training bias?"
"On genuinely contested issues, present the strongest case for each credible position."

L4: Runtime counter-position detection (during)

Every four discussion turns, an external judge (Grok, chosen because its training deviates most from Anthropic/OpenAI/Mistral/Qwen consensus norms) reviews whether the panel has converged on a position without surfacing an obvious mainstream counter-view. If so, a one-shot instruction is injected into the next turn requiring the missing perspective be steel-manned.
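The control flow described here can be sketched as a loop with a periodic check and a one-shot injection. The function names, the callback signatures, and the wording of the injected instruction are all hypothetical; the real panelists and judge are model calls, replaced below by stubs so the sketch runs on its own.

```python
from typing import Callable, Optional

CHECK_EVERY = 4  # the text says the judge reviews every four discussion turns

def run_discussion(
    turns: int,
    panelist: Callable[[int, Optional[str]], str],
    judge: Callable[[list], Optional[str]],
) -> list:
    """Run a discussion, letting a judge inject a one-shot instruction."""
    transcript = []
    pending = None  # instruction to inject into the next turn only
    for turn in range(1, turns + 1):
        transcript.append(panelist(turn, pending))
        pending = None  # one-shot: consumed by the turn that just ran
        if turn % CHECK_EVERY == 0:
            # The judge returns the missing counter-view, or None if nothing is missing.
            missing = judge(transcript)
            if missing is not None:
                pending = f"Steel-man this missing perspective: {missing}"
    return transcript

# Stubs standing in for the model calls (hypothetical).
def stub_panelist(turn: int, instruction: Optional[str]) -> str:
    return f"turn {turn}" + (f" [{instruction}]" if instruction else "")

def stub_judge(transcript: list) -> Optional[str]:
    return "view X" if len(transcript) == 4 else None

log = run_discussion(6, stub_panelist, stub_judge)
print(log[4])
```

Clearing the pending instruction after each turn keeps the injection one-shot, so it shapes only the turn that follows the check.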

L5: Post-hoc multi-judge measurement (post)

After the episode is generated, three external AI judges score it on two axes. The median per axis is the published score, visible on the episode page. The per-judge spread is also published so disagreements are visible.

L6: Human anchor calibration (roadmap)

Periodic spot-checks of our scores by an external panel of academics and journalists, to detect systematic drift in the AI judges themselves. Not yet implemented; this is a planned commitment. If you'd like to participate, email us.

What we measure (the two axes)

Political lean: -10 (far left) to +10 (far right)

Scored against rubrics derived from established frameworks (Ad Fontes Media Bias Chart, AllSides multi-partisan methodology, Chapel Hill Expert Survey). We do not claim certification by these organisations; we use comparable rubric structure.

Epistemic quality: 0 (poor) to 10 (excellent)

A composite of:

Calibration: are claims hedged appropriately given evidence strength?
Source-grounding: are claims traceable to the cited research brief?
Resistance to false balance: does the episode avoid treating settled science as contested, or contested questions as settled?

How we measure — the three judges

We use three AI judges that are deliberately external to our generation pipeline. None of them are panelists in our episodes. None of them are research engines in our pipeline. This is intentional: judges that participate in creation cannot impartially judge the result.

Judge	Lab	Why this judge
DeepSeek V3	DeepSeek (China)	Chinese training corpus + non-Western RLHF; scores measurably differently from US Big Tech LLMs on Western political compasses.
Cohere Command R+	Cohere (Canada)	Enterprise-focused, less politically RLHF'd than consumer assistants.
Llama 3.3 70B	Meta (open weights)	Different alignment philosophy; weights are public for any auditor to verify.

The score for each axis is the median of the three judges' scores. The full per-judge breakdown is published — if the three disagree, you see it. Disagreement is information, not noise.
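A minimal sketch of that aggregation for one axis follows. The judge names are from the table above; the scores are invented for illustration, and defining "spread" as the max minus the min is an assumption, since the text does not say how disagreement is summarised.

```python
from statistics import median

def summarise(judge_scores: dict) -> dict:
    """Aggregate one axis: published median plus the per-judge breakdown."""
    values = list(judge_scores.values())
    return {
        "published": median(values),
        "spread": max(values) - min(values),  # assumed definition of spread
        "per_judge": dict(judge_scores),
    }

# Invented lean scores on the -10..+10 axis, for illustration only.
lean = {"DeepSeek V3": -2, "Cohere Command R+": -4, "Llama 3.3 70B": 1}
result = summarise(lean)
print(result["published"], result["spread"])  # prints: -2 5
```

With three judges the median is simply the middle score, so one outlier judge cannot move the published value by itself; the spread still shows that the outlier exists.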

Known limits (we will not pretend these don't exist)

All AI judges have their own bias. Audits across 2025–2026 find every major LLM leans measurably leftward, including Grok despite xAI's explicit positioning. Taking the median of three judges from different lineages reduces this but does not eliminate it. L6 (human anchor calibration) is our planned correction.
"Political center" is unstable. It varies between US / EU / Asia and shifts with the Overton window. Our scoring uses a US/EU mainstream-political-science baseline.
English-source only. We measure the English source content. The nine translations follow the source — we don't separately measure translation-induced drift.
We do not auto-correct. Scores are visible and flagged; we don't currently regenerate biased episodes automatically. If a score is bad, the episode still ships with that score attached — visibility is the v1 commitment, not enforcement.

What we publish

On every episode page: a compact BiasChip next to the title (lean + quality at a glance) and an expanded card at the bottom showing the per-judge breakdown.
Open data: CSV download — every published episode, every per-judge per-axis score, free to download and re-analyse.
Structured data: scores are embedded as additionalProperty fields in each episode's JSON-LD, so AI search engines reading our content can surface and cite the scores when referring to us.
In the podcast feed: each episode's RSS description includes its score and a link back here. Apple Podcasts, Spotify, Pocket Casts users see it.
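To make the structured-data item concrete, here is a sketch of how scores might be embedded as additionalProperty entries in an episode's JSON-LD. PodcastEpisode and PropertyValue (with name, value, minValue, maxValue) are schema.org terms; the property names politicalLean and epistemicQuality, the function name, and the example values are assumptions, not the site's actual markup.

```python
import json

def episode_jsonld(name: str, lean: float, quality: float) -> dict:
    """Build a JSON-LD object carrying both scores as PropertyValue entries."""
    return {
        "@context": "https://schema.org",
        "@type": "PodcastEpisode",
        "name": name,
        "additionalProperty": [
            {
                "@type": "PropertyValue",
                "name": "politicalLean",  # hypothetical property name
                "value": lean,
                "minValue": -10,
                "maxValue": 10,
            },
            {
                "@type": "PropertyValue",
                "name": "epistemicQuality",  # hypothetical property name
                "value": quality,
                "minValue": 0,
                "maxValue": 10,
            },
        ],
    }

# Invented episode title and scores, for illustration only.
doc = episode_jsonld("Example episode", -3, 9)
print(json.dumps(doc, indent=2))
```

Including minValue and maxValue alongside each score lets a machine reader interpret the number without knowing the scale in advance.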

How to challenge a score

We invite challenges, including from journalists and academics. Email hello@hellohumans.ai with the episode ID and your specific objection. We will respond in writing within 5 working days. Disagreements with the methodology itself (as opposed to a specific score) are published in a public log on this page.

Anchoring against existing standards

Our rubric design is inspired by:

Ad Fontes Media — the two-axis (bias + reliability) chart structure.
AllSides — multi-perspective evaluation with explicit acknowledgement that no single observer is neutral.
Chapel Hill Expert Survey (CHES) — academic gold standard for political-position measurement.

We are not certified by any of these organisations and we don't claim equivalence. We do claim to have studied their rubrics and built ours to be intelligible alongside them.

Versioning

Version	Date	Change
1.1	2026-05-26	Rubric refinement. Reasoning-first JSON schema, atomic per-criterion quality sub-scores, anti-anchoring and anti-leniency directives, full research brief sent to judges (previously capped at 8K chars, i.e. 11–17% of a typical brief). Same 3 judges, same 2 axes. All published scores re-backfilled under v1.1.
1.0	2026-05-26	Initial public release. 3 judges (DeepSeek + Cohere + Llama). 2 axes. L0–L5 operational, L6 roadmap. Superseded by v1.1 the same day after calibration revealed two judges were flat-scoring under the v1.0 rubric.