Compressed version
This is a condensed summary of the full article. It covers the core framework and key findings. For the complete analysis — including the full model scorecard, detailed evaluation criteria, and model-by-model breakdown — read the complete version here →
ChatGenie's production customer support automation platform was built on Microsoft Azure. Our Orchestrator Agent ran on GPT-5.2 — and it had earned its place: 99% resolution accuracy, a 77% reduction in support OPEX, and a customer team rightsized from 39 to 9 agents for our flagship enterprise client. When we migrated to Amazon EKS and AWS Bedrock, we lost access to Azure AI Foundry and with it, our path to GPT-5.2. We needed a replacement sourced entirely from Amazon Bedrock. This article documents what we found.
Why standard benchmarks don't apply here
Evaluating LLMs for an agentic workflow is not the same as evaluating them for general use. In a multi-agent system, the model isn't conversing — it's executing. It receives a system prompt, a narrow behavioral mandate, and real customer inputs that include adversarial phrasing, emotionally charged language, and code-switching between Filipino and English. Failure isn't a bad response; it's a hallucinated process, a fabricated contact, or an out-of-scope assertion that reaches a live customer. Our evaluation was designed around those failure modes, not general capability benchmarks.
"Failure isn't a bad response; it's a hallucinated process, a fabricated contact, or an out-of-scope assertion that reaches a live customer."
What we evaluated and how we scored it
We tested four Bedrock-available models against GPT-5.2 as the baseline: Claude Sonnet 4.5, Claude Haiku 4.5, Amazon Nova Pro v1, and Qwen 3 Next 80b a3b. Each model ran against 102 scenarios drawn from real production conversations, starting with our GPT-5.2-tuned prompt adapted for the target model. We ran up to four evaluation iterations per model — each round identifying specific failure patterns from our LLM-as-a-judge rubric and addressing them with targeted prompt corrections.
Scoring used four weighted dimensions. Quality dimensions carry 75% of the total weight; operations carry 25%.

The Leaderboard
Model Verdicts
.png)
Four Things This Evaluation Taught Us
Our Recommendation
For accuracy-critical enterprise deployments, Claude Sonnet 4.5 is the primary Orchestrator. For cost-optimized or high-volume deployments, Qwen 3 Next 80b a3b is the primary candidate — with continued iteration toward 99% over 2–3 more rounds. A tiered architecture combining both, with Haiku 4.5 handling routine high-volume intents, is supported by the data and is what we are designing toward.
The production outcomes from our flagship deployment — 77% OPEX reduction, team from 39 to 9 agents, 98–99% accuracy — were achieved on GPT-5.2. Matching and improving on those outcomes through the Bedrock migration is the commitment this evaluation was designed to support.
Want the full analysis?
Includes model-by-model breakdowns, the multilingual evaluation methodology, all graphics, and the complete data tables. read the complete article here →
ChatGenie is an enterprise Agentic AI platform for customer support automation. ChatGenie is ISO/IEC 27001:2022 certified and deployed on AWS infrastructure.




.png)




