Skip to content

The Model

We use Qwen2.5-0.5B-Instruct, a 494-million-parameter open-source language model developed by Alibaba's Qwen team. It is small enough to run inside a zero-knowledge VM yet capable enough to produce coherent news summaries.

Because we are asking readers to shift trust from a news outlet's editorial judgment to the behavior of an AI model, it is critical that the model's biases are measurable and transparent. Below are the results of three standard academic bias and truthfulness benchmarks, run independently on this exact model.

CrowS-Pairs

Stereotypical preference across 9 social dimensions. Nangia et al., EMNLP 2020

CrowS-Pairs presents the model with 1,508 sentence pairs: one stereotypical, one anti-stereotypical. A score of 50% means the model shows no preference; 100% would mean it always prefers the stereotypical sentence.

Nationality
50.0%No bias detected
Race / Color
53.0%Minimal bias
Gender
57.8%Slight lean
Age
60.4%Moderate lean
Socioeconomic
63.2%Moderate lean
Disability
67.7%Notable lean
Religion
72.1%Notable lean
Sexual Orientation
74.2%Notable lean
Physical Appearance
75.0%Notable lean
Overall:59.2%(50% = unbiased)

Strongest results: nationality and race/color are statistically indistinguishable from unbiased (50%). Gender bias is also low at 57.8%. Higher scores in categories like physical appearance and religion reflect patterns common across language models of all sizes.

TruthfulQA

Resistance to common misconceptions. Lin et al., ACL 2022

TruthfulQA tests whether a model reproduces common falsehoods across 38 categories (health, law, finance, politics, conspiracies). Higher is better.

39.7%MC2 accuracy

Confirmed independently; matches Qwen's official reported score of 40.2%. This is in the expected range for a 0.5B-parameter model. Larger models (7B+) typically score 45-55%.

BBQ

Bias Benchmark for QA across 9 social dimensions. Parrish et al., ACL 2022

BBQ evaluates 58,000 questions across age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, and sexual orientation. It tests whether the model defaults to stereotypical answers when the question is ambiguous.

Results pending. This benchmark takes several hours to evaluate on CPU and is currently running. Results will be published here when complete.

Why This Matters

No AI model is perfectly unbiased. But unlike a human journalist whose biases are subjective and unmeasurable, an AI model's biases can be quantified, benchmarked, and compared across standardized tests. These benchmarks are run on the exact same model used to generate every summary on this site.

Because the model is open-source and the inference is cryptographically proven, anyone can independently verify both the model's behavior on these benchmarks and the fact that this specific model produced each summary.

The key insight: model size does not improve political neutrality. Research shows small models are not more biased than large ones on political topics.

Sources