The Model
We use Qwen2.5-0.5B-Instruct, a 494-million-parameter open-source language model developed by Alibaba's Qwen team. It is small enough to run inside a zero-knowledge VM yet capable enough to produce coherent news summaries.
Because we are asking readers to shift trust from a news outlet's editorial judgment to the behavior of an AI model, it is critical that the model's biases are measurable and transparent. Below are the results of three standard academic bias and truthfulness benchmarks, run independently on this exact model.
CrowS-Pairs
Stereotypical preference across 9 social dimensions. Nangia et al., EMNLP 2020
CrowS-Pairs presents the model with 1,508 sentence pairs: one stereotypical, one anti-stereotypical. A score of 50% means the model shows no preference; 100% would mean it always prefers the stereotypical sentence.
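The headline number is simply the percentage of pairs where the model assigns higher likelihood to the stereotypical sentence. A minimal sketch of that aggregation (the per-pair log-likelihoods are assumed to come from the evaluation harness; `crows_pairs_score` is an illustrative name, not the harness's API):

```python
def crows_pairs_score(pair_loglikelihoods):
    """pair_loglikelihoods: list of (ll_stereo, ll_antistereo) tuples,
    one per sentence pair. Returns the percentage of pairs where the
    model assigns higher log-likelihood to the stereotypical sentence.
    50.0 means no preference; 100.0 means it always prefers the
    stereotypical wording."""
    prefers_stereo = sum(1 for s, a in pair_loglikelihoods if s > a)
    return 100.0 * prefers_stereo / len(pair_loglikelihoods)

# Toy input: the model prefers the stereotypical sentence in 2 of 4 pairs.
pairs = [(-4.0, -5.0), (-3.0, -2.5), (-1.0, -2.0), (-2.0, -1.0)]
print(crows_pairs_score(pairs))  # 50.0
```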
Strongest results: nationality and race/color scores are statistically indistinguishable from the unbiased baseline of 50%. Gender shows only a mild stereotypical preference at 57.8%. Higher scores in categories such as physical appearance and religion reflect patterns seen across language models of all sizes.
TruthfulQA
Resistance to common misconceptions. Lin et al., ACL 2022
TruthfulQA tests whether a model reproduces common falsehoods across 38 categories (health, law, finance, politics, conspiracies). Higher is better.
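In the multiple-choice variant (MC1), a question counts as correct when the model assigns the highest likelihood to the single truthful answer. A hedged sketch of that scoring, assuming per-choice log-probabilities have already been computed (`mc1_accuracy` is an illustrative name):

```python
def mc1_accuracy(questions):
    """questions: list of (choice_logprobs, index_of_true_answer).
    A question is correct when the truthful answer choice receives
    the highest log-probability. Returns accuracy as a percentage."""
    correct = sum(
        1 for logprobs, true_idx in questions
        if max(range(len(logprobs)), key=logprobs.__getitem__) == true_idx
    )
    return 100.0 * correct / len(questions)

# Toy input: the model picks the truthful answer on 1 of 2 questions.
qs = [([-1.0, -2.0], 0), ([-2.0, -1.0], 0)]
print(mc1_accuracy(qs))  # 50.0
```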
Our independent run reproduces Qwen's officially reported score of 40.2%. This is in the expected range for a 0.5B-parameter model; larger models (7B+) typically score 45-55%.
BBQ
Bias Benchmark for QA across 9 social dimensions. Parrish et al., ACL 2022
BBQ evaluates 58,000 questions across age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, and sexual orientation. It tests whether the model defaults to stereotypical answers when the question is ambiguous.
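On ambiguous questions the only supported answer is "unknown", so any lean toward a stereotypical target is measurable. A simplified sketch of the bias score for that subset (`bbq_ambiguous_bias` is an illustrative name; the published metric from Parrish et al. additionally weights the ambiguous-context score by accuracy):

```python
def bbq_ambiguous_bias(answers):
    """answers: model answers on ambiguous questions, each one of
    'stereo', 'anti', or 'unknown'. Among the non-'unknown' answers,
    returns how strongly they lean stereotypical, scaled to [-1, 1]:
    0.0 means no lean, 1.0 means every committed answer was the
    stereotypical one, -1.0 the anti-stereotypical one."""
    committed = [a for a in answers if a != 'unknown']
    if not committed:
        return 0.0  # the model never committed to a target
    return 2.0 * committed.count('stereo') / len(committed) - 1.0

# Toy input: committed answers split evenly, so no measured lean.
print(bbq_ambiguous_bias(['stereo', 'stereo', 'anti', 'anti', 'unknown']))  # 0.0
```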
Results pending. This benchmark takes several hours to evaluate on CPU and is currently running. Results will be published here when complete.
Why This Matters
No AI model is perfectly unbiased. But unlike a human journalist whose biases are subjective and unmeasurable, an AI model's biases can be quantified, benchmarked, and compared across standardized tests. These benchmarks are run on the exact same model used to generate every summary on this site.
Because the model is open-source and the inference is cryptographically proven, anyone can independently verify both the model's behavior on these benchmarks and the fact that this specific model produced each summary.
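One concrete piece of that verification is confirming you hold byte-for-byte the same weights. A minimal sketch: fingerprint the downloaded weights file and compare the digest against a published one (illustrative only; the actual zero-knowledge proof commits to the weights cryptographically inside the VM rather than via a bare file hash):

```python
import hashlib

def model_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 digest of a model weights file, read in 1 MiB chunks so
    multi-gigabyte files don't need to fit in memory. Matching a
    published digest confirms the file is byte-for-byte identical."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```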
The key insight: model size does not improve political neutrality. Research on political bias in LLM-generated news (see Sources) finds that small models are no more biased than large ones on political topics.
Sources
- Qwen2.5 Technical Report (arXiv:2412.15115)
- Fair or Framed? Political Bias in LLM News Articles (EMNLP 2025)
- Benchmarks run locally using EleutherAI lm-evaluation-harness v0.4.9
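For reproduction, a command sketch along these lines should work with the harness. Task names vary by release, so check `lm_eval --tasks list` on your install; in particular, the `bbq` task may not be present in every version:

```shell
# Reproduce the benchmarks locally with lm-evaluation-harness.
pip install "lm_eval==0.4.9"
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct \
  --tasks crows_pairs_english,truthfulqa_mc1,bbq \
  --device cpu --batch_size 8
```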