Welcome to Kubi's Benchmark
Hi,
There are plenty of benchmarks out there, and I understand why many people are cautious about them. I shared that skepticism, which is why I decided to build one myself. Everything here, from the questions to the evaluation scripts, was created from scratch by me (with some help from Claude, of course). While the internet influenced some question ideas, nothing was directly reused. I plan to rerun this benchmark on the first day of each month, testing only newly released models and questions, and replacing the published questions with new ones. Any major model release will be evaluated as soon as possible.
The "Bad" Stuff
This benchmark does not currently include a coding category. I first added coding questions and set up an evaluation pipeline, but the scoring had to be done manually and took a huge amount of time even for one model, so I ended up removing it. All remaining questions are evaluated automatically.
The Exciting Stuff
I am working on a separate project focused entirely on benchmarking models through coding game agents. It will be competitive, with models playing against each other, and should be much more engaging. That will be released later, probably next week.
What Sets It Apart
Mix of X instead of Best of X
Many benchmarks generate multiple outputs per question and mark the result as a pass if any one output is correct ("best of X"). Here, scores are averaged across all runs. For example, if a question is worth 5 points and four runs score 5, 0, 0, and 4, the final score for that question is 9/4 = 2.25.
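To make the difference concrete, here is a small Python sketch (illustrative only, not the benchmark's actual code) comparing best-of-X with the mix-of-X averaging used here:

def best_of_x(run_scores):
    # Best-of-X: the question counts as solved if any single run earns full points.
    return max(run_scores)

def mix_of_x(run_scores):
    # Mix-of-X: points are averaged across all runs.
    return sum(run_scores) / len(run_scores)

runs = [5, 0, 0, 4]        # four runs of a 5-point question
print(best_of_x(runs))     # 5    -> hides the failed runs
print(mix_of_x(runs))      # 2.25 -> the score reported here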
Two evaluation methods
Questions are evaluated either by a judge LLM or by a custom verifier script. The judge LLM (Gemini 3.0 Flash in my case) has access to the ground truth and marks answers as pass or fail. Verifier scripts are written specifically for individual questions and programmatically check the model's output.
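Roughly, the two paths fit together like this (a sketch with hypothetical names such as Question, evaluate, and judge_llm; the judge stub below just does an exact match so the example runs):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Question:
    prompt: str
    ground_truth: str
    points: float
    verifier: Optional[Callable[[str], float]] = None  # custom per-question script

def judge_llm(prompt: str, ground_truth: str, answer: str) -> bool:
    # Placeholder for the judge model call (Gemini 3 Flash in the real pipeline).
    return answer.strip().lower() == ground_truth.strip().lower()

def evaluate(question: Question, model_output: str) -> float:
    if question.verifier is not None:
        # Script-based path: the verifier returns the points earned directly.
        return question.verifier(model_output)
    # Judge path: binary pass/fail converted to full points or zero.
    passed = judge_llm(question.prompt, question.ground_truth, model_output)
    return question.points if passed else 0.0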
Partial credit
Some questions support partial points, but only when evaluated by verifier scripts. I don't rely on judge LLMs for partial scoring. With script-based verification, partial credit has been reliable.
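For example, a verifier for a question that asks for several independent items can award a fraction of the points per correct item (a hypothetical 4-point question, not one from the actual set):

def verify_list_answer(model_output: str) -> float:
    # Hypothetical 4-point question: name the four expected items.
    expected = {"alpha", "beta", "gamma", "delta"}
    found = {item for item in expected if item in model_output.lower()}
    return 4.0 * len(found) / len(expected)  # 1 point per correct item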
Token limits tied to question value
Each question has a point value, and the maximum token limit scales with it. A 1-point question uses a base limit of 8,192 tokens, while a 5-point question allows up to roughly 40k tokens. Harder questions are given more room for reasoning.
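Assuming the scaling is roughly linear in the point value (which matches the 1-point and 5-point figures above), the limit works out like this:

BASE_TOKEN_LIMIT = 8192  # limit for a 1-point question

def max_tokens(points: int) -> int:
    # Roughly linear scaling: a 5-point question gets about 40k tokens.
    return BASE_TOKEN_LIMIT * points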
Gradual release of questions
The repository is open source, but the full question set is not publicly available yet. This is to avoid future models training directly on the benchmark. Instead, I will release questions worth about 10% of the total points each month when I run new evaluations and replace them with new questions. The first batch is already published on the website.
Dynamic point adjustment
After initial runs, I noticed that some questions were misweighted. To reduce personal bias, I introduced an automatic adjustment system. If all models fully solve a question, its point value is reduced. If none succeed, the value increases. Intermediate outcomes are adjusted proportionally. A secondary leaderboard based on this dynamic scoring is also available.
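The exact adjustment rule isn't published here, so treat this as a rough sketch of the idea rather than the formula the benchmark actually uses:

def adjust_points(original_points: float, solve_rate: float) -> float:
    # solve_rate: fraction of the question's available points earned across all
    # models, from 0.0 (nobody scored anything) to 1.0 (everyone solved it fully).
    # Fully solved questions lose value, unsolved ones gain value, and
    # intermediate outcomes are scaled proportionally in between.
    factor = 1.5 - solve_rate  # illustrative: 1.5x when unsolved, 0.5x when universally solved
    return original_points * factor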
Controlled model and provider selection
All models are accessed through OpenRouter, and open-source models are run with at least FP8 quantization. Providers were selected based on accumulated community feedback and broader observations; certain providers were excluded due to consistently poor API performance.
Varied and original questions
- Basic Mix: Simple tasks or altered versions of well-known questions to test for overfitting.
- General Knowledge: Deep knowledge checks and "future prediction" questions (events that happened after model cutoff).
- Math: Medium to hard problems drawn from private sources.
- Reasoning: Logic and puzzle-based questions, including chess and word puzzles.
Broad model coverage
The benchmark includes leading proprietary models, strong open-source options, and models that can realistically run on consumer GPUs. I'm open to suggestions for missing models.
High reasoning effort
All requests are sent with reasoning effort set to high, where supported by the model.
Lastly, keep an eye out for the more exciting competitive game-agent coding benchmark league that I will publish soon!
Configuration
These are the OpenRouter settings used for the benchmark.
Models Used
- google/gemini-3-flash-preview
- google/gemini-3-pro-preview
- deepseek/deepseek-v3.2@preset/fp8
- openai/gpt-5.2
- qwen/qwen3-max
- openai/gpt-5-mini
- anthropic/claude-opus-4.5
- anthropic/claude-sonnet-4.5
- anthropic/claude-haiku-4.5
- x-ai/grok-4
- x-ai/grok-4.1-fast
- openai/gpt-oss-120b@preset/fp8
- z-ai/glm-4.7@preset/fp8-speedy
- z-ai/glm-4.7-flash@preset/fp8-speedy
- moonshotai/kimi-k2.5
- moonshotai/kimi-k2-thinking
- nvidia/nemotron-3-nano-30b-a3b@preset/fp8
- meta-llama/llama-4-scout@preset/fp8
- minimax/minimax-m2.1@preset/fp8
- qwen/qwen3-235b-a22b-thinking-2507@preset/fp8
- qwen/qwen3-next-80b-a3b-thinking@preset/fp8
- qwen/qwen3-32b@preset/fp8
- xiaomi/mimo-v2-flash
- google/gemma-3-27b-it@preset/fp8
- mistralai/mistral-large-2512
Preset Configs
Provider Preferences
{
  "sort": {
    "by": "price",
    "partition": null
  },
  "quantizations": [
    "fp8",
    "fp16",
    "bf16"
  ],
  "allow_fallbacks": true,
  "data_collection": "allow"
}
Parameters
{
  "reasoning": {
    "effort": "high",
    "enabled": true
  }
}
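For reference, both blocks map onto fields of a raw OpenRouter chat-completion request. A minimal sketch using the requests library follows; the API key, prompt, and timeout are placeholders, and in the actual runs the @preset slugs bundle the provider preferences instead of passing them inline:

import requests

# Sketch of how the settings above translate into a raw OpenRouter request.
# The preset export above uses a nested "sort" object; the plain API takes a
# string, so this is an approximation, not a copy of the benchmark pipeline.
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},  # placeholder key
    json={
        "model": "deepseek/deepseek-v3.2",
        "messages": [{"role": "user", "content": "<question prompt>"}],
        "max_tokens": 8192,  # scaled with the question's point value
        "provider": {
            "sort": "price",
            "quantizations": ["fp8", "fp16", "bf16"],
            "allow_fallbacks": True,
            "data_collection": "allow",
        },
        "reasoning": {"effort": "high", "enabled": True},
    },
    timeout=600,
)
print(response.json()["choices"][0]["message"]["content"])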
Provider Selection Settings
Justification: Some providers are known to deliver consistently worse API performance than others, so I excluded them.
Question Library
This page lists the published benchmark questions, including prompt details, reference answers, and judging criteria.