Models & Benchmarks

Releases, papers, SOTA benchmarks

Models & Benchmarks Feb 16, 2025 · James Hayes · ~7 min read

Shanghai AI Lab Surpasses DeepSeek in Math Reasoning Without Distilling R1, Using RL to Break Limits

Shanghai AI Lab beats DeepSeek on math reasoning using RL rather than R1 distillation. My read: for production models, the ops trade-off is compute cost against raw reasoning gains.

Models & Benchmarks Jan 30, 2025 · Amara Okonkwo · ~10 min read

Silicon Valley in Turmoil: DeepSeek Faces Backlash from OpenAI and Anthropic as U.S. Users Speak Out

I watched DeepSeek’s R1 drop, shaking up markets with open-source reasoning that rivals o1 at a fraction of the cost.

Models & Benchmarks Nov 06, 2024 · Elena Volkov · ~7 min read

Parameter-Free Access: CMU Uses Large Models to Automatically Optimize Vision-Language Prompts | CVPR'24

CMU researchers use large models to optimize vision-language prompts without parameter access, a method I find promising for black-box scenarios.

Models & Benchmarks Nov 06, 2024 · Amara Okonkwo · ~9 min read

Tencent Releases Largest Open-Source MoE Model: 389B Parameters, Free for Commercial Use, Outperforms Llama 3.1

Tencent's 389B MoE is out. Free for business. Beats Llama 3.1? I'll believe it when the unit economics work in the field.

Models & Benchmarks Oct 29, 2024 · Amara Okonkwo · ~13 min read

Training-Free Knowledge Editing for Large Models: More Efficient Absorption of New Data | EMNLP '24

I read the EMNLP '24 findings on training-free knowledge editing for large models. The approach promises efficient absorption of new data without retraining.

Models & Benchmarks Sep 20, 2024 · David Kowalski · ~10 min read

Three Large Models Team Up to Challenge o1: Real-World Test Shows 360+ Model Collaboration Eliminates Prompt Engineering

I watched three models team up to challenge o1. The real-world test suggests that 360+ model collaboration can eliminate manual prompt engineering.

Models & Benchmarks Sep 20, 2024 · Priya Sharma · ~13 min read

Tsinghua Team Breaks New Ground in AI 3D Generation with '3D Scaling Law'

Tsinghua's new 3D scaling law pushes the boundaries of AI 3D generation.

Models & Benchmarks Aug 16, 2024 · James Hayes · ~7 min read

Can Large Language Models Detect Contradictions in Prompts? Shanghai Jiao Tong University's Latest Research Reveals

I read SJTU's study on LLM contradiction detection. It targets prompt integrity, a niche ops concern for now.

Models & Benchmarks Jul 25, 2024 · Priya Sharma · ~10 min read

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry

I read that LLMs still fail basic counting tasks, such as counting the R's in "strawberry." This highlights persistent reasoning gaps and forces enterprises to verify model capabilities beyond marketing claims before deployment.

Models & Benchmarks Jul 16, 2024 · Elena Volkov · ~9 min read

Large Language Models Fail Basic Math: 9.11 vs. 9.9

I note LLMs still confuse version numbers with decimals, a basic reasoning flaw that suggests token-level processing overrides numerical logic.