Tencent Launches ArtifactsBench to Raise the Bar for Testing Creative AI Models

When it comes to evaluating AI-generated code, functional accuracy has long been the gold standard. But what about user experience—the look, feel, and usability that make digital tools actually enjoyable? Tencent thinks it's time for a change, and it just introduced a new benchmark, ArtifactsBench, to fill that gap.

ArtifactsBench is Tencent's answer to a growing problem: AI-generated applications that technically work but feel clunky, awkward, or visually off. Whether it's misaligned buttons, garish color choices, or jarring animations, many AIs today still struggle to grasp the finer points of good design. That’s where ArtifactsBench comes in—not as just another code test, but as a sophisticated evaluator of the user experience.

Here’s how it works: The benchmark tasks AI models with more than 1,800 creative challenges, from interactive web apps and charts to mini-games. Once the AI produces its output, ArtifactsBench automatically runs the code in a secure environment and captures screenshots as it executes, tracking animations, button clicks, and other interactive elements.
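The article doesn't include Tencent's harness code, but the render-and-capture step it describes can be pictured with a short sketch. The snippet below is a minimal illustration using Playwright to open a model-generated HTML artifact in a headless browser and grab screenshots at a few moments in time; the file paths, viewport size, and timings are assumptions for illustration, not ArtifactsBench's actual implementation.

```python
# Illustrative sketch only; not ArtifactsBench's real harness.
# Renders a model-generated HTML artifact headlessly and captures screenshots
# at several points in time so animations and interactions leave a visual trace.
from pathlib import Path
from playwright.sync_api import sync_playwright

def capture_artifact(html_path: str, out_dir: str, delays_ms=(0, 1000, 3000)):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    shots = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(Path(html_path).resolve().as_uri())
        for i, delay in enumerate(delays_ms):
            page.wait_for_timeout(delay)          # let animations progress
            shot = f"{out_dir}/frame_{i}.png"
            page.screenshot(path=shot, full_page=True)
            shots.append(shot)
        browser.close()
    return shots
```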

Then comes the unique part. The entire package—the prompt, the code, and the visual output—is passed to a Multimodal Large Language Model (MLLM) that serves as a judge. Unlike previous automated benchmarks that relied on limited checks or surface-level assessments, this MLLM uses a ten-point checklist to evaluate functionality, user experience, and even visual appeal.
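A judging step along these lines could be sketched as follows. The checklist wording, the scoring scale, and the `call_mllm` helper are hypothetical placeholders standing in for whatever multimodal model and rubric the benchmark actually uses; the point is only to show how prompt, code, and screenshots might be bundled for a rubric-based score.

```python
# Hypothetical sketch of an MLLM-as-judge step; the checklist text and the
# call_mllm() helper are placeholders, not ArtifactsBench's actual rubric or API.
CHECKLIST = [
    "Does the artifact implement every requirement in the prompt?",
    "Do interactive elements (buttons, inputs, animations) respond correctly?",
    "Is the layout free of overlapping, clipped, or misaligned components?",
    "Are colors, spacing, and typography visually coherent?",
    # ...remaining criteria covering robustness, accessibility, and polish
]

def judge_artifact(prompt: str, code: str, screenshots: list[str], call_mllm) -> float:
    """Ask a multimodal judge to score each checklist item from 0-10,
    then average the item scores into a single quality rating."""
    instructions = (
        "You are grading an AI-generated web artifact. For each criterion, "
        "reply with an integer score from 0 (fails) to 10 (excellent).\n\n"
        f"Task prompt:\n{prompt}\n\nGenerated code:\n{code}\n\nCriteria:\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(CHECKLIST))
    )
    scores = call_mllm(text=instructions, images=screenshots)  # e.g. [8, 9, 6, 7]
    return sum(scores) / len(scores)
```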

And the results speak volumes. ArtifactsBench scored a 94.4% alignment with WebDev Arena, a human-voted platform often considered the gold standard for assessing creative AI output. By comparison, previous automated benchmarks managed only around 69.4% consistency. ArtifactsBench also agreed with professional developers over 90% of the time—suggesting this automated judge might just have a solid sense of "taste."
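The article doesn't spell out how that alignment figure is computed; one common way to compare an automated benchmark against a human-voted leaderboard is pairwise ranking agreement, sketched below purely as an illustration. The function names and data are assumptions, and the reported 94.4% may rest on a different metric.

```python
# Illustration of one way "alignment" between two rankings could be measured:
# the fraction of model pairs that the automated benchmark orders the same way
# as the human-voted arena. The article's figures may use a different metric.
from itertools import combinations

def pairwise_agreement(auto_scores: dict[str, float],
                       human_scores: dict[str, float]) -> float:
    models = [m for m in auto_scores if m in human_scores]
    agree, total = 0, 0
    for a, b in combinations(models, 2):
        auto_diff = auto_scores[a] - auto_scores[b]
        human_diff = human_scores[a] - human_scores[b]
        if auto_diff * human_diff > 0:   # both rankings order this pair the same way
            agree += 1
        total += 1
    return agree / total if total else 0.0
```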

Tencent didn’t stop there. It tested over 30 top-performing AI models from around the world. Interestingly, generalist models outperformed specialized ones. For example, Qwen-2.5-Instruct, a general-purpose AI, beat both its code-specific and vision-specialized counterparts. The takeaway? Successful creative applications require more than code competence—they demand reasoning, flexible understanding, and an intuitive grasp of design principles.

The broader implication is clear: as AI tools become more embedded in how we build and experience digital content, their ability to produce not just working outputs but genuinely usable and appealing ones is becoming essential. Tencent hopes ArtifactsBench can serve as a reliable benchmark for tracking this kind of progress—where AIs aren’t just getting the job done, but doing it with style.