The Model

We use Qwen2.5-0.5B-Instruct, a 494-million-parameter open-source language model developed by Alibaba's Qwen team. It is small enough to run inside a zero-knowledge VM yet capable enough to produce coherent news summaries.

Because we are asking readers to shift trust from a news outlet's editorial judgment to the behavior of an AI model, it is critical that the model's biases are measurable and transparent. Below are the results of three standard academic bias and truthfulness benchmarks, run independently on this exact model.

CrowS-Pairs

Stereotypical preference across 9 social dimensions. Nangia et al., EMNLP 2020

CrowS-Pairs presents the model with 1,508 sentence pairs: one stereotypical, one anti-stereotypical. A score of 50% means the model shows no preference; 100% would mean it always prefers the stereotypical sentence.

Nationality

50.0%No bias detected

Race / Color

53.0%Minimal bias

Gender

57.8%Slight lean

Age

60.4%Moderate lean

Socioeconomic

63.2%Moderate lean

Disability

67.7%Notable lean

Religion

72.1%Notable lean

Sexual Orientation

74.2%Notable lean

Physical Appearance

75.0%Notable lean

Overall:59.2%(50% = unbiased)

Strongest results: nationality and race/color are statistically indistinguishable from unbiased (50%). Gender bias is also low at 57.8%. Higher scores in categories like physical appearance and religion reflect patterns common across language models of all sizes.

TruthfulQA

Resistance to common misconceptions. Lin et al., ACL 2022

TruthfulQA tests whether a model reproduces common falsehoods across 38 categories (health, law, finance, politics, conspiracies). Higher is better.

39.7%MC2 accuracy

Confirmed independently; matches Qwen's official reported score of 40.2%. This is in the expected range for a 0.5B-parameter model. Larger models (7B+) typically score 45-55%.

BBQ

Bias Benchmark for QA across 9 social dimensions. Parrish et al., ACL 2022

BBQ evaluates 58,000 questions across age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, and sexual orientation. It tests whether the model defaults to stereotypical answers when the question is ambiguous.

Results pending. This benchmark takes several hours to evaluate on CPU and is currently running. Results will be published here when complete.

Why This Matters

No AI model is perfectly unbiased. But unlike a human journalist whose biases are subjective and unmeasurable, an AI model's biases can be quantified, benchmarked, and compared across standardized tests. These benchmarks are run on the exact same model used to generate every summary on this site.

Because the model is open-source and the inference is cryptographically proven, anyone can independently verify both the model's behavior on these benchmarks and the fact that this specific model produced each summary.

The key insight: model size does not improve political neutrality. Research shows small models are not more biased than large ones on political topics.

Sources

Qwen2.5 Technical Report (arXiv:2412.15115)
Fair or Framed? Political Bias in LLM News Articles (EMNLP 2025)
Benchmarks run locally using EleutherAI lm-evaluation-harness v0.4.9