
The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Compressed Version]

ChatGenie Engineering

May 20, 2026 9:25 AM

Compressed version

This is a condensed summary of the full article. It covers the core framework and key findings. The complete version adds the full model scorecard, detailed evaluation criteria, and a model-by-model breakdown.

ChatGenie's production customer support automation platform was built on Microsoft Azure. Our Orchestrator Agent ran on GPT-5.2, and it had earned its place: 99% resolution accuracy, a 77% reduction in support OPEX, and a customer team rightsized from 39 to 9 agents for our flagship enterprise client. When we migrated to Amazon EKS and AWS Bedrock, we lost access to Azure AI Foundry and, with it, our path to GPT-5.2. We needed a replacement sourced entirely from Amazon Bedrock. This article documents what we found.

Why standard benchmarks don't apply here

Evaluating LLMs for an agentic workflow is not the same as evaluating them for general use. In a multi-agent system, the model isn't conversing — it's executing. It receives a system prompt, a narrow behavioral mandate, and real customer inputs that include adversarial phrasing, emotionally charged language, and code-switching between Filipino and English. Failure isn't a bad response; it's a hallucinated process, a fabricated contact, or an out-of-scope assertion that reaches a live customer. Our evaluation was designed around those failure modes, not general capability benchmarks.

What we evaluated and how we scored it

We tested four Bedrock-available models against GPT-5.2 as the baseline: Claude Sonnet 4.5, Claude Haiku 4.5, Amazon Nova Pro v1, and Qwen 3 Next 80b a3b. Each model ran against 102 scenarios drawn from real production conversations, starting with our GPT-5.2-tuned prompt adapted for the target model. We ran up to four evaluation iterations per model — each round identifying specific failure patterns from our LLM-as-a-judge rubric and addressing them with targeted prompt corrections.

Scoring used four weighted dimensions. Quality dimensions carry 75% of the total weight; operations carry 25%.
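As a rough illustration of how a weighted composite of this kind can be computed, the sketch below assumes two quality dimensions and two operational dimensions. The dimension names, the per-dimension weights, and the sample scores are all hypothetical; only the 75/25 split between quality and operations comes from the text above.

```python
# Hypothetical dimension names and weights. Only the 75/25 split between
# quality and operations is taken from the article.
WEIGHTS = {
    "accuracy": 0.50,       # quality (assumed weight)
    "multilingual": 0.25,   # quality (assumed weight)
    "latency": 0.125,       # operations (assumed weight)
    "cost": 0.125,          # operations (assumed weight)
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each on a 0-100 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted dimensions")
    return sum(WEIGHTS[name] * value for name, value in scores.items())


# Made-up scores for an imaginary model, not results from the evaluation.
example = {"accuracy": 90.0, "multilingual": 80.0, "latency": 100.0, "cost": 60.0}
print(composite_score(example))
```

Because quality carries three quarters of the weight, a model with strong operational scores but weaker accuracy can still rank below a slower, more expensive one.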

The Leaderboard

Graphic 2: Weighted Composite Leaderboard

| Rank | Model | Composite score |
| --- | --- | --- |
| #1 | Qwen 3 80b | 94.9 |
| Ref | GPT-5.2 | 93.1 |
| #2 | Sonnet 4.5 | 90.3 |
| #3 | Nova Pro | 89.9 |
| #4 | Haiku 4.5 | 88.8 |

GPT-5.2 is the Azure baseline being replaced and is not a Bedrock candidate. The composite leaderboard and the deployment recommendation answer different questions.

Graphic 3: Raw accuracy and operational data

| Metric | GPT-5.2 (Baseline) | Sonnet 4.5 | Qwen 3 80b | Haiku 4.5 | Nova Pro v1 |
| --- | --- | --- | --- | --- | --- |
| Best accuracy | 99% | 95.10% | 93.14% | 88.24% | 84.31% |
| Starting accuracy | 93% | 93.14% | 88.24% | 86.27% | 80.39% |
| Total accuracy gain | +6.00pp | +1.96pp | +4.90pp | +1.97pp | +3.92pp |
| Avg generation latency | ~16s | ~23–24s | ~12–13s | ~15s | ~13s |
| Avg cost / interaction | ~$0.026 | ~$0.028 | ~$0.0011 | ~$0.011 | ~$0.0005 |
| Language match score | 92.29% | 100% | 91.06% | 96.36% | 91.23% |
| Multilingual quality | 98.90% | 95.68% | 93.02% | 89.13% | 85.77% |
| Path to 99% | Achieved | 1–2 iters | 2–3 iters | 4–5 iters | Not via prompting |
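
The accuracy figures for the four Bedrock candidates are consistent with whole numbers of passed scenarios out of the 102-scenario set. The sketch below back-computes those counts and the accuracy gains. It assumes each scenario is scored pass or fail, which the article does not state, and the function names are illustrative.

```python
SCENARIOS = 102  # size of the evaluation set, from the article

# (starting accuracy %, best accuracy %) as reported in the raw data table
REPORTED = {
    "Sonnet 4.5": (93.14, 95.10),
    "Qwen 3 80b": (88.24, 93.14),
    "Haiku 4.5": (86.27, 88.24),
    "Nova Pro v1": (80.39, 84.31),
}


def implied_passes(accuracy_pct: float, total: int = SCENARIOS) -> int:
    """Nearest whole number of passed scenarios for a reported percentage."""
    return round(accuracy_pct / 100 * total)


def gain_pp(start_pct: float, best_pct: float) -> float:
    """Accuracy gain in percentage points, rounded to two decimals."""
    return round(best_pct - start_pct, 2)


for model, (start, best) in REPORTED.items():
    print(model, implied_passes(start), implied_passes(best), gain_pp(start, best))
```

Under that assumption, Sonnet 4.5's best run corresponds to 97 of 102 scenarios passed and Nova Pro v1's to 86 of 102.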

Model Verdicts

Four Things This Evaluation Taught Us

Key Insights
1

Migrating cloud providers is also a model selection exercise.

Losing access to GPT-5.2 forced a re-evaluation that surfaced a genuine improvement: a Bedrock model, Claude Sonnet 4.5, that achieves a perfect 100% language match score, where our production model scored 92.29%. The migration created evaluation pressure that led to a better outcome.

2

Prompt engineering has a ceiling — and recognizing it saves engineering time.

Sonnet and Qwen absorbed corrections cleanly. Nova Pro resisted corrections to its core failure mode even when given explicit, named rules targeting exact failure patterns. When a model doesn't respond to prompt engineering, that's diagnostic: the failure is in the model's generative prior, not a missing instruction.

3

Cost per resolved interaction — not cost per token — is the right operating metric.

Nova Pro is approximately 50× cheaper per interaction than Sonnet on a token-cost basis. But a model with 84% accuracy generates more escalations, more repeat interactions, and more human agent involvement — all of which carry costs that token pricing does not capture.
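
One way to make this concrete is to fold the cost of failed interactions into a per-resolution figure. The sketch below is a simplified model, not the calculation ChatGenie used: it assumes every unresolved interaction is escalated to a human agent at a fixed cost, and the $2.00 escalation cost is a placeholder chosen only for illustration. Accuracy and per-interaction costs are the approximate values from the raw data table.

```python
def cost_per_resolved(token_cost: float, accuracy: float, escalation_cost: float) -> float:
    """Expected spend per interaction, including escalations, divided by
    the share of interactions the model resolves on its own."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    expected_spend = token_cost + (1 - accuracy) * escalation_cost
    return expected_spend / accuracy


ESCALATION_COST = 2.00  # hypothetical cost of one human-handled escalation, in USD

sonnet = cost_per_resolved(0.028, 0.9510, ESCALATION_COST)
nova = cost_per_resolved(0.0005, 0.8431, ESCALATION_COST)
print(f"Sonnet 4.5: ${sonnet:.3f}  Nova Pro v1: ${nova:.3f}")
```

With these placeholder numbers, the cheaper model's token-price advantage disappears once escalations are counted. The crossover point depends entirely on the escalation cost assumed, so the figure should be replaced with a measured one before drawing conclusions.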

4

The leaderboard and the deployment recommendation answer different questions.

Qwen ranks #1 on the composite because it's the most well-rounded model. Sonnet is the recommended primary Orchestrator because it has the highest current accuracy ceiling and the shortest path to 99%. Both facts are true, and neither contradicts the other.

Our Recommendation

For accuracy-critical enterprise deployments, Claude Sonnet 4.5 is the primary Orchestrator. For cost-optimized or high-volume deployments, Qwen 3 Next 80b a3b is the primary candidate — with continued iteration toward 99% over 2–3 more rounds. A tiered architecture combining both, with Haiku 4.5 handling routine high-volume intents, is supported by the data and is what we are designing toward.

The production outcomes from our flagship deployment — 77% OPEX reduction, team from 39 to 9 agents, 98–99% accuracy — were achieved on GPT-5.2. Matching and improving on those outcomes through the Bedrock migration is the commitment this evaluation was designed to support.

Graphic 6: Key Takeaways

1. Benchmark scores do not predict production performance. Evaluate every model inside your actual architecture, against real workflow inputs.
2. Qwen is the strongest overall value model on Bedrock. #1 composite, lowest latency, 25× cheaper than Sonnet, and the only model above 90 on all four dimensions.
3. Sonnet 4.5 is the right choice when you cannot compromise on accuracy or multilingual fidelity. Perfect language match, highest accuracy ceiling, 1–2 iterations to 99%.
4. Nova Pro's accuracy gap is not closeable through prompt engineering. The failure is in the generative prior. Operational cost advantages do not offset a 16% error rate in enterprise customer support.
5. LLM-as-a-judge requires calibration before its results are credible. An uncalibrated judge returned 64% on a model that later scored 92%+ after calibration. Verify your measurement instrument before you trust the measurement.
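
A minimal sketch of one calibration check: compare the judge's verdicts with human labels on the same scenarios before trusting its accuracy numbers. The function names, the sample labels, and the 0.9 agreement threshold are assumptions for illustration, not the procedure used in this evaluation.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of scenarios where the judge's verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def is_calibrated(human: list[bool], judge: list[bool], threshold: float = 0.9) -> bool:
    """Treat the judge as usable only if agreement meets the assumed threshold."""
    return judge_agreement(human, judge) >= threshold


# Made-up labels for ten scenarios; True means the response passed.
human_labels = [True, True, False, True, True, True, False, True, True, True]
judge_labels = [True, False, False, True, False, True, False, True, False, True]

print(judge_agreement(human_labels, judge_labels))
print(is_calibrated(human_labels, judge_labels))
```

In this made-up sample the judge fails three responses that the human reviewers passed, so it would understate the model's accuracy until its rubric is corrected.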

Want the full analysis?

The complete article includes model-by-model breakdowns, the multilingual evaluation methodology, all graphics, and the complete data tables.

ChatGenie is an enterprise Agentic AI platform for customer support automation. ChatGenie is ISO/IEC 27001:2022 certified and deployed on AWS infrastructure.
