Why this release?
You should've trained this a bit more until it reached the level of your other recent frontier SOTA models... This is nothing but noise as nobody will be using this for anything.
Yup, benchmarks are behind Qwen3.6 35B... despite this being a 10x bigger model.
Qwen models are good but are benchmaxxed, this model has better world knowledge than most Qwen models in my tests.
Can you share more details about your test? At the moment, I think the Qwen3.6 and Gemma4 models are superior, and this release is not a big improvement over them in any way... I'll run some benchmarks to see if what you say holds true, though. I'm also interested in how much of an impact different datasets have. But this is definitely a missed opportunity for Tencent to release something everyone would talk about, especially after they've switched to new hardware for training their models. That should have let them train a bit longer and release more compact models that are on par with Qwen or Gemma, creating the stir in the market that Chinese models are trying to achieve.
@mayankiit04 I wouldn't take Qwen's benchmarks at face value. They've been obsessed with test maxing since day one, and their tiny models post astonishingly high benchmark scores, such as on MMLU-Pro, yet when I tested the same domains they performed notably worse than larger models with lower MMLU-Pro scores.
So we have to be careful judging other models against Qwen, because if we do, other model makers will be pressured to also cheat or test max in order to appear competitive.
And frankly, it's astonishing how bad Qwen 3.5/3.6 35B/27B are at random tasks compared to other models, including much older ones like Llama 3. For example, when asked for a list of synonyms for a common word like "extra", it regularly includes the word itself in the list, especially with thinking disabled. No human would ever do that. And if you inspect the token pool, "extra" has a high probability at each step.
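If anyone wants to reproduce this, a quick sanity check over a model's output is enough; here's a minimal sketch (the example output list is hypothetical, not a logged Qwen response):

```python
def includes_query_word(query: str, synonyms: list[str]) -> bool:
    """Return True if a generated synonym list echoes the query word itself,
    which is the failure mode described above."""
    q = query.strip().lower()
    return any(s.strip().lower() == q for s in synonyms)

# Hypothetical parsed output from a model asked for synonyms of "extra":
model_output = ["additional", "extra", "supplementary", "spare"]
print(includes_query_word("extra", model_output))  # True -> the model echoed the word
```

Running this over a batch of common words makes it easy to count how often a given model exhibits the echo behavior, with and without thinking enabled.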
This break from explicit and implicit instruction following, especially relative to other models like Gemma 4, shows up across most tasks, and I suspect it's related to the obsession with test maxing. The model clearly favors accurate recovery of pre-packaged training data over adhering to the nuance of the user's prompt.
In short, Qwen models are anything but the gold standard. The latest Qwen 3.5 family is a notable improvement over Qwen 3, but its real-world performance still falls well short of what its benchmark scores suggest.