Welcome to Kubi's Benchmark
Hi,
There are plenty of benchmarks out there, and I understand why many people are cautious about them. I shared that skepticism, which is why I decided to build one myself. Everything here, from the questions to the evaluation scripts, was created from scratch by me (with some help from Claude, of course). While the internet influenced some question ideas, nothing was directly reused. I plan to rerun this benchmark on the first day of each month, testing only newly released models and questions, and replacing the published questions with new ones. Any major model release will be evaluated as soon as possible.
The "Bad" Stuff
This benchmark does not currently include a coding category. I first added coding questions and set up an evaluation pipeline, but the scoring had to be done manually and took a huge amount of time even for one model, so I ended up removing it. All remaining questions are evaluated automatically.
The Exciting Stuff
I am working on a separate project focused entirely on benchmarking models through coding game agents. It will be competitive, with models playing against each other, and should be much more engaging. That will be released later, probably next week.
What Sets It Apart
Mix of X instead of Best of X
Many benchmarks generate multiple outputs per question and mark the result as a pass if any one output is correct ("best of X"). Here, scores are averaged across all runs. For example, if a question is worth 5 points and four runs score 5, 0, 0, and 4, the final score for that question is 9/4 = 2.25.
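To make the difference concrete, here is a small Python sketch (illustrative only, not the benchmark's actual code) comparing best-of-X with the mix-of-X averaging used here:

def best_of_x(run_scores):
    # Best-of-X: the question counts as solved if any single run earns full points.
    return max(run_scores)

def mix_of_x(run_scores):
    # Mix-of-X: points are averaged across all runs.
    return sum(run_scores) / len(run_scores)

runs = [5, 0, 0, 4]        # four runs of a 5-point question
print(best_of_x(runs))     # 5    -> hides the failed runs
print(mix_of_x(runs))      # 2.25 -> the score reported here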
Two evaluation methods
Questions are evaluated either by a judge LLM or by a custom verifier script. The judge LLM (Gemini 3.0 Flash in my case) has access to the ground truth and marks answers as pass or fail. Verifier scripts are written specifically for individual questions and programmatically check the model's output.
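Roughly, the two paths fit together like this (a sketch with hypothetical names such as Question, evaluate, and judge_llm; the judge stub below just does an exact match so the example runs):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Question:
    prompt: str
    ground_truth: str
    points: float
    verifier: Optional[Callable[[str], float]] = None  # custom per-question script

def judge_llm(prompt: str, ground_truth: str, answer: str) -> bool:
    # Placeholder for the judge model call (Gemini 3 Flash in the real pipeline).
    return answer.strip().lower() == ground_truth.strip().lower()

def evaluate(question: Question, model_output: str) -> float:
    if question.verifier is not None:
        # Script-based path: the verifier returns the points earned directly.
        return question.verifier(model_output)
    # Judge path: binary pass/fail converted to full points or zero.
    passed = judge_llm(question.prompt, question.ground_truth, model_output)
    return question.points if passed else 0.0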
Partial credit
Some questions support partial points, but only when evaluated by verifier scripts. I don't rely on judge LLMs for partial scoring. With script-based verification, partial credit has been reliable.
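For example, a verifier for a question that asks for several independent items can award a fraction of the points per correct item (a hypothetical 4-point question, not one from the actual set):

def verify_list_answer(model_output: str) -> float:
    # Hypothetical 4-point question: name the four expected items.
    expected = {"alpha", "beta", "gamma", "delta"}
    found = {item for item in expected if item in model_output.lower()}
    return 4.0 * len(found) / len(expected)  # 1 point per correct item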
Token limits tied to question value
Each question has a point value, and the maximum token limit scales with it. A 1-point question uses a base limit of 8,192 tokens, while a 5-point question allows up to roughly 40k tokens. Harder questions are given more room for reasoning.
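Assuming the scaling is roughly linear in the point value (which matches the 1-point and 5-point figures above), the limit works out like this:

BASE_TOKEN_LIMIT = 8192  # limit for a 1-point question

def max_tokens(points: int) -> int:
    # Roughly linear scaling: a 5-point question gets about 40k tokens.
    return BASE_TOKEN_LIMIT * points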
Gradual release of questions
The repository is open source, but the full question set is not publicly available yet. This is to avoid future models training directly on the benchmark. Instead, I will release questions worth about 10% of the total points each month when I run new evaluations and replace them with new questions. The first batch is already published on the website.
Dynamic point adjustment
After initial runs, I noticed that some questions were misweighted. To reduce personal bias, I introduced an automatic adjustment system. If all models fully solve a question, its point value is reduced. If none succeed, the value increases. Intermediate outcomes are adjusted proportionally. A secondary leaderboard based on this dynamic scoring is also available.
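The exact adjustment rule isn't published here, so treat this as a rough sketch of the idea rather than the formula the benchmark actually uses:

def adjust_points(original_points: float, solve_rate: float) -> float:
    # solve_rate: fraction of the question's available points earned across all
    # models, from 0.0 (nobody scored anything) to 1.0 (everyone solved it fully).
    # Fully solved questions lose value, unsolved ones gain value, and
    # intermediate outcomes are scaled proportionally in between.
    factor = 1.5 - solve_rate  # illustrative: 1.5x when unsolved, 0.5x when universally solved
    return original_points * factor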
Controlled model and provider selection
All models are accessed through OpenRouter, and open-source models are run with at least FP8 quantization. Providers were selected based on accumulated community feedback and broader observations; certain providers were excluded due to consistently poor API performance.
Varied and original questions
- Basic Mix: Simple tasks or altered versions of well-known questions to test for overfitting.
- General Knowledge: Deep knowledge checks and "future prediction" questions (events that happened after model cutoff).
- Math: Medium to hard problems drawn from private sources.
- Reasoning: Logic and puzzle-based questions, including chess and word puzzles.
Broad model coverage
The benchmark includes leading proprietary models, strong open-source options, and models that can realistically run on consumer GPUs. I'm open to suggestions for missing models.
High reasoning effort
All requests are sent with reasoning effort set to high, where supported by the model.
Lastly, keep an eye out for the more exciting competitive game-agent coding benchmark league that I will publish soon!
Configuration
These are the OpenRouter settings used for the benchmark.
Models Used
- google/gemini-3-flash-preview
- google/gemini-3-pro-preview
- deepseek/deepseek-v3.2@preset/fp8
- openai/gpt-5.2
- qwen/qwen3-max
- openai/gpt-5-mini
- anthropic/claude-opus-4.5
- anthropic/claude-sonnet-4.5
- anthropic/claude-haiku-4.5
- x-ai/grok-4
- x-ai/grok-4.1-fast
- openai/gpt-oss-120b@preset/fp8
- z-ai/glm-4.7@preset/fp8-speedy
- z-ai/glm-4.7-flash@preset/fp8-speedy
- moonshotai/kimi-k2.5
- moonshotai/kimi-k2-thinking
- nvidia/nemotron-3-nano-30b-a3b@preset/fp8
- meta-llama/llama-4-scout@preset/fp8
- minimax/minimax-m2.1@preset/fp8
- qwen/qwen3-235b-a22b-thinking-2507@preset/fp8
- qwen/qwen3-next-80b-a3b-thinking@preset/fp8
- qwen/qwen3-32b@preset/fp8
- xiaomi/mimo-v2-flash
- google/gemma-3-27b-it@preset/fp8
- mistralai/mistral-large-2512
Preset Configs
Provider Preferences
{
  "sort": {
    "by": "price",
    "partition": null
  },
  "quantizations": [
    "fp8",
    "fp16",
    "bf16"
  ],
  "allow_fallbacks": true,
  "data_collection": "allow"
}
Parameters
{
  "reasoning": {
    "effort": "high",
    "enabled": true
  }
}
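For reference, both blocks map onto fields of a raw OpenRouter chat-completion request. A minimal sketch using the requests library follows; the API key, prompt, and timeout are placeholders, and in the actual runs the @preset slugs bundle the provider preferences instead of passing them inline:

import requests

# Sketch of how the settings above translate into a raw OpenRouter request.
# The preset export above uses a nested "sort" object; the plain API takes a
# string, so this is an approximation, not a copy of the benchmark pipeline.
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},  # placeholder key
    json={
        "model": "deepseek/deepseek-v3.2",
        "messages": [{"role": "user", "content": "<question prompt>"}],
        "max_tokens": 8192,  # scaled with the question's point value
        "provider": {
            "sort": "price",
            "quantizations": ["fp8", "fp16", "bf16"],
            "allow_fallbacks": True,
            "data_collection": "allow",
        },
        "reasoning": {"effort": "high", "enabled": True},
    },
    timeout=600,
)
print(response.json()["choices"][0]["message"]["content"])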
Provider Selection Settings
Justification: Some providers are known to deliver consistently worse API performance than others, so I excluded them.
Question Library
This page lists the published benchmark questions, including prompt details, reference answers, and judging criteria.