There is a version of this article that begins with a benchmark table. This is not that article.
When we set out to evaluate large language models for ChatGenie's core agentic workflow, we quickly realized that the benchmarks published by model providers — MMLU, HumanEval, MATH — told us almost nothing useful. We were not building a trivia assistant. We were running a production customer support automation system with real enterprise clients, live conversation volume, and service-level commitments that could not absorb experimental failure. The question we needed to answer was not "which model is smartest?" It was "which model behaves most reliably inside a constrained, multi-agent pipeline operating under production conditions?"
What forced us to ask that question was a platform migration. ChatGenie's production infrastructure was originally built on Microsoft Azure. Our Orchestrator Agent ran on GPT-4o — and by the time we began this evaluation, we had already upgraded that backbone to GPT-5.2, which reached 99% accuracy on our evaluation benchmark after just two prompt iterations. Then we moved to AWS. The migration to Amazon EKS and AWS Bedrock was the right architectural decision. But it came with an immediate consequence: we would no longer have access to Azure AI Foundry, and with it, our direct path to GPT-5.2.
That created a non-negotiable requirement: find a replacement that could match or meaningfully approach GPT-5.2's agentic performance, sourced entirely from models available through Amazon Bedrock. This article documents that evaluation — what we tested, how we scored it, and what we found. We are sharing it because the field needs more published accounts from teams who have actually deployed agentic AI at scale — not from teams who have tested it in a sandbox.
"The question was not 'which model is smartest?' It was 'which model behaves most reliably inside a constrained, multi-agent pipeline under production conditions?'"
Why agentic evaluation is a different problem
Most published model comparisons optimize for general capability: reasoning quality, factual accuracy, coding performance. These are meaningful metrics for general-purpose assistants. They are incomplete metrics for agentic systems.
In a multi-agent architecture, a model is not asked to converse. It is being asked to execute. It receives a structured system prompt, a defined set of tools, a conversation history, and a narrow behavioral mandate. Its job is to follow instructions precisely, stay within the behavioral scope of its persona, and produce outputs that the downstream system can process without intervention. Deviation from this contract — hallucinated processes, fabricated contact details, inconsistent output structure — is not a minor quality issue. It is a system failure.
ChatGenie's production architecture pairs a Guard Agent, which handles pre-orchestration intent classification and safety filtering, with an Orchestrator Agent, which manages retrieval, tool routing, and response generation. Both agents are exposed to real customer inputs: adversarial phrasing, ambiguous requests, emotionally charged language, and code-switching between Filipino and English — often within the same message. These constraints defined our evaluation criteria before we tested a single model.
GPT-5.2 as the incumbent baseline
.png)
GPT-5.2, running on Azure AI Foundry, was our production Orchestrator at the time of this evaluation. Its prompts had been originally written for GPT-4o, and even so, GPT-5.2 reached 93% accuracy out of the box on our 102-scenario evaluation benchmark — and 99% accuracy after just two targeted prompt re-alignment iterations. That convergence speed was the fastest of any model we evaluated.
GPT-5.2's average generation latency was approximately 16 seconds with an average per-interaction cost of approximately $0.026 (combining $0.021 average input cost and $0.005 average output cost, across an average of 12,007 input tokens and 365 output tokens per interaction). Its one meaningful gap was multilingual coverage: on our Filipino/Cebuano language-match evaluation, GPT-5.2 scored 92.29%, reflecting a default-to-Tagalog tendency on Cebuano inputs.
Every Bedrock candidate was assessed against the same question: how closely does this model approach GPT-5.2's 99% ceiling, and at what cost and latency profile?
Our evaluation setup
Our candidate pool was bounded by a hard constraint: all models had to be available through Amazon Bedrock. We evaluated four candidates: Claude Sonnet 4.5 and Claude Haiku 4.5 (Anthropic, via Bedrock), Amazon Nova Pro v1 (Bedrock-native), and Qwen 3 Next 80b a3b (via Bedrock's third-party catalog). GPT-5.2 scores from production evaluation logs served as the reference baseline throughout.
Each model ran against a benchmark of 102 scenarios derived from real customer interactions in our flagship enterprise deployment — covering routine inquiries, escalation scenarios, edge cases, and adversarial inputs — all grounded in our actual knowledge base. We ran up to four evaluation iterations per model, updating the prompt after each run based on specific failure patterns surfaced by our LLM-as-a-judge rubric. This methodology mirrors the actual deployment process: prompts in production are living documents that evolve as edge cases surface.

The composite leaderboard
Among Bedrock-only candidates, the ranking is: Qwen first, Sonnet second, Nova Pro third, Haiku fourth. The composite leaderboard and the deployment recommendation answer different questions — the leaderboard measures overall value profile; the recommendation accounts for where each model's accuracy ceiling sits and how quickly it gets there.
What we found: a model-by-model account
Claude Sonnet 4.5
.png)
Sonnet 4.5 was the strongest Bedrock candidate in our evaluation by a meaningful margin, and the model we recommend as the most defensible replacement for GPT-5.2 in a production agentic context.
Sonnet 4.5 entered the evaluation at 93.14% — already matching GPT-5.2's starting accuracy before any tuning. Over four iterations, it finished at 95.10%. Its seven initial failures clustered around three specific patterns: first-person promise of action (implying it would personally check a live status without a real API call), hallucinated contact details, and premature "no information" bailouts despite adjacent KB content. Each had a single, traceable root cause. The same playbook that took GPT-5.2 from 93% to 99% in two iterations applies directly. Sonnet 4.5 is on the same trajectory — assessed at 1–2 additional iterations to reach 99%.
Sonnet 4.5 achieved a perfect 100% language match score across both multilingual evaluation runs — the only model in the cohort to do so, and the clearest outperformance of GPT-5.2's 92.29% baseline. Its average per-interaction cost is approximately $0.028, with an average latency of ~23–24 seconds — the highest in the Bedrock cohort. A model that resolves more conversations correctly on the first turn compresses the effective per-resolution cost.
Qwen 3 Next 80b a3b
.png)
Qwen was the most surprising performer in the evaluation — and the result that most challenged our prior assumptions going in.
Qwen entered at 88.24% and reached 93.14% across four iterations — a gain of 4.90 percentage points, the largest improvement of any model in the cohort. More importantly, every iteration produced measurable gains with no regressions. Its 14 initial failures were distributed across five distinct categories: verification hallucination, process invention, KB contradiction, wrong FAQ application, and promise of action. This variety is encouraging — it indicates a model that adapts to targeted corrections rather than one with a single deep-seated failure mode. Projection: 2–3 additional iterations to reach 99%.
Qwen's cost profile is where it most dramatically separates itself. At approximately $0.0011 per interaction and an average latency of ~12–13 seconds — the lowest in the cohort — it is approximately 25× cheaper than Sonnet 4.5 and 24× cheaper than GPT-5.2 at the interaction level. It is the only Bedrock model to score above 90 across all four evaluation dimensions simultaneously.
Claude Haiku 4.5
.png)
Haiku challenges the assumption that you must sacrifice reliability for speed and cost. Its results also carry the most operationally specific lesson from the entire evaluation: for smaller models, prompt format matters as much as prompt content.
Haiku entered at 86.27% and moved to 88.24% across four iterations. Its first iteration produced no improvement — the added prompt corrections were correct in content but made the prompt too verbose, causing the model to lose focus on earlier rules. No change in score was recorded. Improvement resumed only after the team shifted to surgical, single-line corrections addressing one failure pattern at a time. Projection: 4–5 additional iterations with strict prompt discipline.
Haiku scored 96.36% on language match — second-best in the cohort, trailing only Sonnet 4.5's perfect score. At approximately $0.011 per interaction and ~15 seconds average latency, Haiku is the natural lower tier in a tiered architecture — handling high-volume, lower-complexity intents and routing escalation-risk inputs to Sonnet capacity.
Amazon Nova Pro v1
.png)
Nova Pro entered with structural advantages — Bedrock-native, lowest cost in the cohort, and competitive latency. The evaluation data did not support its selection for a primary Orchestrator slot.
Nova Pro started at 80.39% and moved to 84.31% across four iterations. The more important observation is why the model resisted faster improvement: its core failure patterns persisted across iterations despite explicit guard rules naming the exact patterns with concrete CORRECT/WRONG example pairs. When a rule says do not invent processes not in the knowledge base, and the model continues inventing processes, that is not a prompt engineering problem. It is a model characteristic — a strong generative prior that produces plausible-sounding but fabricated processes, contact details, and policy assertions. The assessment is direct: Nova Pro would likely plateau around 86–88% even with further optimization. Reaching 99% would require fine-tuning or a fundamentally different approach.
.png)
Multilingual Evaluation: A Dimension That Requires Its Own Methodology
.png)
Our multilingual evaluation used a 100-scenario dataset — 50 Tagalog, 50 Cebuano — and ran it through a dedicated LLM-as-a-judge prompt. In the first run against GPT-5.2, it returned a score of 64%. That number is not a reflection of GPT-5.2's multilingual ability. Inspection revealed the judge was mislabeling languages. The 64% was a measurement of the judge's unreliability, not the model's.
It took three judge prompt iterations to produce reliable verdicts: explicit guidance that code-switching is a natural Philippine communication pattern; a tightened FAIL definition (only trigger when the response contains no meaningful use of the customer's language); and concrete annotated PASS/FAIL examples. After calibration, run-to-run variance across all models was under 1%.
The lesson applies beyond multilingual evaluation: any LLM-as-a-judge deployment requires iteration and calibration before its verdicts are usable. A single-pass evaluation through an uncalibrated judge produces noise formatted as a result.
Four things this evaluation taught us
Our recommendation
Two-track deployment recommendation
For accuracy-critical enterprise deployments: Claude Sonnet 4.5 as the primary Orchestrator, with continued prompt iteration until the 99% ceiling is reached within 1–2 additional rounds. Highest current accuracy of any Bedrock model, the only perfect multilingual score, and the clearest parallel to the methodology that got GPT-5.2 to 99%.
For cost-optimized or high-volume deployments: Qwen 3 Next 80b a3b as the primary model, with continued iteration toward 99% over 2–3 more rounds. At ~$0.001 per interaction — approximately 25× cheaper than Sonnet — and the fastest latency in the cohort.
For tiered architecture: Haiku 4.5 as the lower tier handling high-volume, lower-complexity intents, routing escalation-risk inputs to Sonnet capacity. The data supports this deployment pattern directly.
The methodology is the durable artifact. The scores will change. The framework for generating them should not. We will publish results from the production Bedrock deployment once the migration is complete and sufficient data has accumulated to be statistically meaningful.
If you are building production agentic AI and working through your own model selection, we are happy to talk. Reach out at ragde@chatgenie.ph or connect at chatgenie.ph/book-a-call.
ChatGenie is an enterprise Agentic AI platform for customer support automation. ChatGenie is ISO/IEC 27001:2022 certified and deployed on AWS infrastructure. Our flagship deployment with Angkas has demonstrated a 77% reduction in support OPEX and 98–99% resolution accuracy at production scale.




.png)




