The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Complete Version]

ChatGenie Engineering

May 20, 2026 9:25 AM

There is a version of this article that begins with a benchmark table. This is not that article.

When we set out to evaluate large language models for ChatGenie's core agentic workflow, we quickly realized that the benchmarks published by model providers — MMLU, HumanEval, MATH — told us almost nothing useful. We were not building a trivia assistant. We were running a production customer support automation system with real enterprise clients, live conversation volume, and service-level commitments that could not absorb experimental failure. The question we needed to answer was not "which model is smartest?" It was "which model behaves most reliably inside a constrained, multi-agent pipeline operating under production conditions?"

What forced us to ask that question was a platform migration. ChatGenie's production infrastructure was originally built on Microsoft Azure. Our Orchestrator Agent ran on GPT-4o — and by the time we began this evaluation, we had already upgraded that backbone to GPT-5.2, which reached 99% accuracy on our evaluation benchmark after just two prompt iterations. Then we moved to AWS. The migration to Amazon EKS and AWS Bedrock was the right architectural decision. But it came with an immediate consequence: we would no longer have access to Azure AI Foundry, and with it, our direct path to GPT-5.2.

That created a non-negotiable requirement: find a replacement that could match or meaningfully approach GPT-5.2's agentic performance, sourced entirely from models available through Amazon Bedrock. This article documents that evaluation — what we tested, how we scored it, and what we found. We are sharing it because the field needs more published accounts from teams who have actually deployed agentic AI at scale — not from teams who have tested it in a sandbox.

"The question was not 'which model is smartest?' It was 'which model behaves most reliably inside a constrained, multi-agent pipeline under production conditions?'"

‍

Why agentic evaluation is a different problem

Most published model comparisons optimize for general capability: reasoning quality, factual accuracy, coding performance. These are meaningful metrics for general-purpose assistants. They are incomplete metrics for agentic systems.

In a multi-agent architecture, a model is not asked to converse. It is being asked to execute. It receives a structured system prompt, a defined set of tools, a conversation history, and a narrow behavioral mandate. Its job is to follow instructions precisely, stay within the behavioral scope of its persona, and produce outputs that the downstream system can process without intervention. Deviation from this contract — hallucinated processes, fabricated contact details, inconsistent output structure — is not a minor quality issue. It is a system failure.

ChatGenie's production architecture pairs a Guard Agent, which handles pre-orchestration intent classification and safety filtering, with an Orchestrator Agent, which manages retrieval, tool routing, and response generation. Both agents are exposed to real customer inputs: adversarial phrasing, ambiguous requests, emotionally charged language, and code-switching between Filipino and English — often within the same message. These constraints defined our evaluation criteria before we tested a single model.

‍

GPT-5.2 as the incumbent baseline

GPT-5.2, running on Azure AI Foundry, was our production Orchestrator at the time of this evaluation. Its prompts had been originally written for GPT-4o, and even so, GPT-5.2 reached 93% accuracy out of the box on our 102-scenario evaluation benchmark — and 99% accuracy after just two targeted prompt re-alignment iterations. That convergence speed was the fastest of any model we evaluated.

GPT-5.2's average generation latency was approximately 16 seconds with an average per-interaction cost of approximately $0.026 (combining $0.021 average input cost and $0.005 average output cost, across an average of 12,007 input tokens and 365 output tokens per interaction). Its one meaningful gap was multilingual coverage: on our Filipino/Cebuano language-match evaluation, GPT-5.2 scored 92.29%, reflecting a default-to-Tagalog tendency on Cebuano inputs.

Every Bedrock candidate was assessed against the same question: how closely does this model approach GPT-5.2's 99% ceiling, and at what cost and latency profile?

‍

Our evaluation setup

Our candidate pool was bounded by a hard constraint: all models had to be available through Amazon Bedrock. We evaluated four candidates: Claude Sonnet 4.5 and Claude Haiku 4.5 (Anthropic, via Bedrock), Amazon Nova Pro v1 (Bedrock-native), and Qwen 3 Next 80b a3b (via Bedrock's third-party catalog). GPT-5.2 scores from production evaluation logs served as the reference baseline throughout.

Each model ran against a benchmark of 102 scenarios derived from real customer interactions in our flagship enterprise deployment — covering routine inquiries, escalation scenarios, edge cases, and adversarial inputs — all grounded in our actual knowledge base. We ran up to four evaluation iterations per model, updating the prompt after each run based on specific failure patterns surfaced by our LLM-as-a-judge rubric. This methodology mirrors the actual deployment process: prompts in production are living documents that evolve as edge cases surface.

The composite leaderboard

Among Bedrock-only candidates, the ranking is: Qwen first, Sonnet second, Nova Pro third, Haiku fourth. The composite leaderboard and the deployment recommendation answer different questions — the leaderboard measures overall value profile; the recommendation accounts for where each model's accuracy ceiling sits and how quickly it gets there.

‍

Graphic 2 — Weighted Composite Leaderboard

Weighted composite leaderboard

#1 Qwen 3 80b

94.9

Ref GPT-5.2

93.1

#2 Sonnet 4.5

90.3

#3 Nova Pro

89.9

#4 Haiku 4.5

88.8

GPT-5.2 is the Azure baseline being replaced and is not a Bedrock candidate. Instruction adherence and multilingual scores are proportional to the best performer in the cohort. Latency and cost are scored against absolute real-world thresholds — no model's score changes based on who else is in the cohort.

‍

Graphic 3 — Raw Evaluation Data

Raw evaluation data — all models

Metric	GPT-5.2 (Baseline)	Sonnet 4.5	Qwen 3 80b	Haiku 4.5	Nova Pro v1
Best accuracy	99%	95.10%	93.14%	88.24%	84.31%
Starting accuracy	93%	93.14%	88.24%	86.27%	80.39%
Total accuracy gain	+6.00pp	+1.96pp	+4.90pp	+1.97pp	+3.92pp
Avg generation latency	~16s	~23–24s	~12–13s	~15s	~13s
Avg cost / interaction	~$0.026	~$0.028	~$0.0011	~$0.011	~$0.0005
Language match score	92.29%	100%	91.06%	96.36%	91.23%
Multilingual quality	98.90%	95.68%	93.02%	89.13%	85.77%
Path to 99%	Achieved	1–2 iters	2–3 iters	4–5 iters	Not via prompting

‍

What we found: a model-by-model account

Claude Sonnet 4.5

Sonnet 4.5 was the strongest Bedrock candidate in our evaluation by a meaningful margin, and the model we recommend as the most defensible replacement for GPT-5.2 in a production agentic context.

Sonnet 4.5 entered the evaluation at 93.14% — already matching GPT-5.2's starting accuracy before any tuning. Over four iterations, it finished at 95.10%. Its seven initial failures clustered around three specific patterns: first-person promise of action (implying it would personally check a live status without a real API call), hallucinated contact details, and premature "no information" bailouts despite adjacent KB content. Each had a single, traceable root cause. The same playbook that took GPT-5.2 from 93% to 99% in two iterations applies directly. Sonnet 4.5 is on the same trajectory — assessed at 1–2 additional iterations to reach 99%.

Sonnet 4.5 achieved a perfect 100% language match score across both multilingual evaluation runs — the only model in the cohort to do so, and the clearest outperformance of GPT-5.2's 92.29% baseline. Its average per-interaction cost is approximately $0.028, with an average latency of ~23–24 seconds — the highest in the Bedrock cohort. A model that resolves more conversations correctly on the first turn compresses the effective per-resolution cost.

‍

Qwen 3 Next 80b a3b

Qwen was the most surprising performer in the evaluation — and the result that most challenged our prior assumptions going in.

Qwen entered at 88.24% and reached 93.14% across four iterations — a gain of 4.90 percentage points, the largest improvement of any model in the cohort. More importantly, every iteration produced measurable gains with no regressions. Its 14 initial failures were distributed across five distinct categories: verification hallucination, process invention, KB contradiction, wrong FAQ application, and promise of action. This variety is encouraging — it indicates a model that adapts to targeted corrections rather than one with a single deep-seated failure mode. Projection: 2–3 additional iterations to reach 99%.

Qwen's cost profile is where it most dramatically separates itself. At approximately $0.0011 per interaction and an average latency of ~12–13 seconds — the lowest in the cohort — it is approximately 25× cheaper than Sonnet 4.5 and 24× cheaper than GPT-5.2 at the interaction level. It is the only Bedrock model to score above 90 across all four evaluation dimensions simultaneously.

‍

Claude Haiku 4.5

Haiku challenges the assumption that you must sacrifice reliability for speed and cost. Its results also carry the most operationally specific lesson from the entire evaluation: for smaller models, prompt format matters as much as prompt content.

Haiku entered at 86.27% and moved to 88.24% across four iterations. Its first iteration produced no improvement — the added prompt corrections were correct in content but made the prompt too verbose, causing the model to lose focus on earlier rules. No change in score was recorded. Improvement resumed only after the team shifted to surgical, single-line corrections addressing one failure pattern at a time. Projection: 4–5 additional iterations with strict prompt discipline.

Haiku scored 96.36% on language match — second-best in the cohort, trailing only Sonnet 4.5's perfect score. At approximately $0.011 per interaction and ~15 seconds average latency, Haiku is the natural lower tier in a tiered architecture — handling high-volume, lower-complexity intents and routing escalation-risk inputs to Sonnet capacity.

‍

Amazon Nova Pro v1

Nova Pro entered with structural advantages — Bedrock-native, lowest cost in the cohort, and competitive latency. The evaluation data did not support its selection for a primary Orchestrator slot.

Nova Pro started at 80.39% and moved to 84.31% across four iterations. The more important observation is why the model resisted faster improvement: its core failure patterns persisted across iterations despite explicit guard rules naming the exact patterns with concrete CORRECT/WRONG example pairs. When a rule says do not invent processes not in the knowledge base, and the model continues inventing processes, that is not a prompt engineering problem. It is a model characteristic — a strong generative prior that produces plausible-sounding but fabricated processes, contact details, and policy assertions. The assessment is direct: Nova Pro would likely plateau around 86–88% even with further optimization. Reaching 99% would require fine-tuning or a fundamentally different approach.

Multilingual Evaluation: A Dimension That Requires Its Own Methodology

Our multilingual evaluation used a 100-scenario dataset — 50 Tagalog, 50 Cebuano — and ran it through a dedicated LLM-as-a-judge prompt. In the first run against GPT-5.2, it returned a score of 64%. That number is not a reflection of GPT-5.2's multilingual ability. Inspection revealed the judge was mislabeling languages. The 64% was a measurement of the judge's unreliability, not the model's.

It took three judge prompt iterations to produce reliable verdicts: explicit guidance that code-switching is a natural Philippine communication pattern; a tightened FAIL definition (only trigger when the response contains no meaningful use of the customer's language); and concrete annotated PASS/FAIL examples. After calibration, run-to-run variance across all models was under 1%.

The lesson applies beyond multilingual evaluation: any LLM-as-a-judge deployment requires iteration and calibration before its verdicts are usable. A single-pass evaluation through an uncalibrated judge produces noise formatted as a result.

‍

Graphic 5 — Multilingual Evaluation Results

Multilingual evaluation results

Model	Lang Match Run 1	Lang Match Run 2	Avg Match	Multilingual Quality
Claude Sonnet 4.5	100.00%	100.00%	100%	95.68%
Claude Haiku 4.5	96.00%	96.72%	96.36%	89.13%
GPT-5.2 (Baseline)	92.20%	92.38%	92.29%	98.90%
Nova Pro v1	91.19%	91.27%	91.23%	85.77%
Qwen 3 Next 80b	91.00%	91.11%	91.06%	93.02%

On language match, the Anthropic models lead significantly — Sonnet 4.5 perfectly, Haiku well above the GPT-5.2 baseline. On multilingual quality, GPT-5.2 retains its overall lead at 98.90%. No model shows a meaningful quality cliff on multilingual inputs.

‍

Four things this evaluation taught us

‍

Key Insights

Migrating cloud providers is also a model selection exercise.

Losing access to GPT-5.2 forced a re-evaluation that surfaced a genuine improvement: a Bedrock model that achieves a perfect multilingual score that our production model does not. The migration created evaluation pressure that led to a better outcome.

Prompt engineering has a ceiling — and recognizing it saves engineering time.

Sonnet and Qwen absorbed corrections cleanly. Nova Pro resisted corrections to its core failure mode even when given explicit, named rules targeting exact failure patterns. When a model doesn't respond to prompt engineering, that's diagnostic: the failure is in the model's generative prior, not a missing instruction.

Cost per resolved interaction — not cost per token — is the right operating metric.

Nova Pro is approximately 50× cheaper per interaction than Sonnet on a token-cost basis. But a model with 84% accuracy generates more escalations, more repeat interactions, and more human agent involvement — all of which carry costs that token pricing does not capture.

The leaderboard and the deployment recommendation answer different questions.

Qwen ranks #1 on the composite because it's the most well-rounded model. Sonnet is the recommended primary Orchestrator because it has the highest current accuracy ceiling and the shortest path to 99%. Both facts are true, and neither contradicts the other.

‍

Our recommendation

Two-track deployment recommendation

For accuracy-critical enterprise deployments: Claude Sonnet 4.5 as the primary Orchestrator, with continued prompt iteration until the 99% ceiling is reached within 1–2 additional rounds. Highest current accuracy of any Bedrock model, the only perfect multilingual score, and the clearest parallel to the methodology that got GPT-5.2 to 99%.

For cost-optimized or high-volume deployments: Qwen 3 Next 80b a3b as the primary model, with continued iteration toward 99% over 2–3 more rounds. At ~$0.001 per interaction — approximately 25× cheaper than Sonnet — and the fastest latency in the cohort.

For tiered architecture: Haiku 4.5 as the lower tier handling high-volume, lower-complexity intents, routing escalation-risk inputs to Sonnet capacity. The data supports this deployment pattern directly.

‍

Graphic 6 — Key Takeaways

Key takeaways

Evaluate inside your architecture, not in isolation. 102 real production scenarios through your actual pipeline reveals what no public benchmark captures.

Qwen is the strongest overall value model on Bedrock. #1 composite, lowest latency, 25× cheaper than Sonnet, and the only model above 90 on all four dimensions.

Sonnet 4.5 is the right choice when you cannot compromise on accuracy or multilingual fidelity. Perfect language match, highest accuracy ceiling, 1–2 iterations to 99%.

Nova Pro's accuracy gap is not closeable through prompt engineering. The failure is in the generative prior. Operational cost advantages do not offset a 16% error rate in enterprise customer support.

LLM-as-a-judge requires calibration before its results are credible. An uncalibrated judge returned 64% on a model that later scored 92%+ after calibration. Verify your measurement instrument before you trust the measurement.

The methodology is the durable artifact. The scores will change. The framework for generating them should not. We will publish results from the production Bedrock deployment once the migration is complete and sufficient data has accumulated to be statistically meaningful.

If you are building production agentic AI and working through your own model selection, we are happy to talk. Reach out at ragde@chatgenie.ph or connect at chatgenie.ph/book-a-call.

‍‍

ChatGenie is an enterprise Agentic AI platform for customer support automation. ChatGenie is ISO/IEC 27001:2022 certified and deployed on AWS infrastructure. Our flagship deployment with Angkas has demonstrated a 77% reduction in support OPEX and 98–99% resolution accuracy at production scale.

Back to Blog

The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Compressed Version]

May 20, 2026 9:25 AM

ChatGenie's production customer support automation platform was built on Microsoft Azure. Our Orchestrator Agent ran on GPT-5.2 — tuned from a GPT-4o baseline — and it had earned its place: 99% resolution accuracy, a 77% reduction in support OPEX, and a customer team rightsized from 39 to 9 agents for our flagship enterprise client. When we migrated to Amazon EKS and AWS Bedrock, we lost access to Azure AI Foundry and with it, our path to GPT-5.2. We needed a replacement sourced entirely from Amazon Bedrock. This article documents what we found.

The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Complete Version]

May 20, 2026 9:25 AM

How We Cut Latency by 50% by Simplifying Our Agentic Architecture

January 22, 2026 2:04 PM

When we first designed ChatGenie's agentic system for customer chat operations, we followed a principle that seemed intuitive: separate concerns into separate agents. Intent classification? That's one agent. Policy enforcement? Another agent. Response generation? Yet another.The result was a five-agent core chain that was clean, modular, and easy to reason about. It was also slow.Each agent in the chain required a separate LLM call. Five agents meant five round-trips to the model. In customer chat, where users expect near-instant responses, this cumulative latency was becoming a problem. Users would see typing indicators for seconds before receiving a response. Containment rates suffered as impatient users escalated to human agents.We needed to rethink our architecture.

View Blog

The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Complete Version]

ChatGenie Engineering

May 20, 2026 9:25 AM

Why agentic evaluation is a different problem

GPT-5.2 as the incumbent baseline

Our evaluation setup

The composite leaderboard

What we found: a model-by-model account

Claude Sonnet 4.5

Qwen 3 Next 80b a3b

Claude Haiku 4.5

Amazon Nova Pro v1

Multilingual Evaluation: A Dimension That Requires Its Own Methodology

Four things this evaluation taught us

Migrating cloud providers is also a model selection exercise.

Prompt engineering has a ceiling — and recognizing it saves engineering time.

Cost per resolved interaction — not cost per token — is the right operating metric.

The leaderboard and the deployment recommendation answer different questions.

Our recommendation

Two-track deployment recommendation

Key takeaways

Related Post

The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Compressed Version]

The Agentic AI Evaluation Playbook: How We Compared GPT-5.2, Claude Sonnet 4.5, and Qwen for Enterprise Deployment [Complete Version]

How We Cut Latency by 50% by Simplifying Our Agentic Architecture

Sign Up For our Newsletter

Sign Up For our Newsletter