From Copilot to Chatbot Agents
As we transitioned toward becoming an AI-focused company by developing an AI Copilot to assist businesses in managing their stores, we realized that delivering substantial value might take longer than anticipated. We created several useful AI-powered features such as product description generators, welcome and greeting message creators, and automated comment-reply configurators for existing Facebook posts.
A few months into this endeavor, we identified a pressing need among our merchants: automating responses on Messenger and Instagram. This led to our official return as a chatbot platform.
Building a chatbot feature on the OpenAI API, particularly with GPT-4o, is relatively straightforward. Ease of implementation, however, does not guarantee consistently accurate or relevant responses. Generative AI is often criticized for producing hallucinations, a behavior that is not a bug but an inherent aspect of how large language models operate.
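To make the zero-shot baseline concrete, here is a minimal sketch of what that "straightforward" setup looks like: one system prompt describing the business, one user message, one model call. The `business_profile` text and helper name are illustrative, not our actual prompt; the commented-out line shows where the OpenAI chat-completions call would go.

```python
# Zero-shot baseline: a single system prompt, no retrieval, no agents.
business_profile = "We are a bakery in Manila. We deliver within Metro Manila."

def build_zero_shot_messages(inquiry: str) -> list[dict]:
    """Assemble the message list for a single-call chatbot response."""
    return [
        {"role": "system",
         "content": f"You are a customer-service chatbot.\n{business_profile}"},
        {"role": "user", "content": inquiry},
    ]

messages = build_zero_shot_messages("Do you deliver?")
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

Everything the model knows about the business lives in that one system prompt, which is exactly why accuracy degrades as the catalog and policies grow.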
Implementing a proper Retrieval-Augmented Generation (RAG) pipeline in a chatbot can help improve prompts and enhance response accuracy, but it is not sufficient on its own. Our goal is to achieve a 90% accuracy rate to ensure that the businesses and enterprises we serve find the solution genuinely useful.
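The retrieval step of a RAG pipeline can be sketched as follows. This is a deliberately naive word-overlap retriever over an in-memory knowledge base; a production pipeline would use embeddings and a vector store, and the knowledge-base entries here are made up for illustration.

```python
import re

# Tiny stand-in for a merchant's knowledge base.
KNOWLEDGE_BASE = [
    "We deliver within Metro Manila via courier.",
    "Store hours are 9am to 6pm, Monday to Saturday.",
    "We accept GCash and credit card payments.",
]

def retrieve(inquiry: str, k: int = 1) -> list[str]:
    """Rank snippets by shared-word count with the inquiry."""
    words = set(re.findall(r"\w+", inquiry.lower()))
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(re.findall(r"\w+", doc.lower()))),
        reverse=True,
    )
    return scored[:k]

def augment_prompt(inquiry: str) -> str:
    """Prepend the retrieved context to the prompt sent to the model."""
    context = "\n".join(retrieve(inquiry))
    return f"Context:\n{context}\n\nCustomer inquiry: {inquiry}"
```

The point is the shape: retrieval grounds the prompt in merchant-specific facts, but the model can still misuse or ignore that context, which is why RAG alone did not get us to our accuracy target.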
Multi-AI Agent Approach: A Necessity, Not Hype
After months of testing and understanding the limitations of existing models, we developed our own proprietary Multi-AI Agent Framework for chatbots. This approach enables our chatbot to deconstruct each inquiry and interaction, allowing it to reason through a team of AI agents. This aligns with the industry's direction as large language models evolve.
With our Multi-AI Agent Framework and RAG implementation, ChatGenie is now capable of providing safe, secure, private, accurate, and relevant chatbot experiences.
To explain how it works, consider a typical customer service environment. A helpdesk agent classifies incoming inquiries and routes them to the appropriate customer service agents. These agents, equipped with training and knowledge, respond directly to customers. To maintain quality, a QA team monitors and reviews each interaction. Our Multi-AI Agents function similarly, but instead of humans, AI agents powered by models from OpenAI and Meta perform these roles.
Here are some actual conversations between end-users and chatbots powered by our Multi-AI Agent Framework.
Sample Conversation #1: "Do you deliver?"
The screenshot above is taken directly from our Chatbot Inbox Manager Page. The "Response Breakdown" feature allows merchants and their staff to see our chatbot's thought process in generating responses to customer inquiries.
After receiving a customer inquiry, an Intent AI Agent analyzes it to determine the actual question. In this case, the customer is simply asking if the merchant provides delivery options. The Intent AI Agent can also interpret inquiries in other languages and dialects.
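A sketch of how an Intent AI Agent can be prompted: restate the raw message as one canonical English question before anything downstream sees it. The prompt wording and helper are illustrative assumptions, not ChatGenie's actual prompts.

```python
# Illustrative intent-extraction prompt; the real agent is an LLM call.
INTENT_PROMPT = (
    "Restate the customer's message as one clear question in English.\n"
    "Preserve the original meaning; do not answer the question.\n"
    "Customer message: {message}"
)

def intent_messages(message: str) -> list[dict]:
    """Build the message list sent to the Intent AI Agent."""
    return [{"role": "user", "content": INTENT_PROMPT.format(message=message)}]

# A Filipino inquiry ("Do you deliver?") passes through unchanged;
# the model handles the translation and normalization.
msgs = intent_messages("Nagde-deliver po ba kayo?")
```

Normalizing intent up front means every later agent works from the same clean question, regardless of the customer's language or phrasing.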
Once the inquiry is reinterpreted, a Guard AI Agent assesses whether it is acceptable within the parameters we've defined. Inquiries that fall outside the scope of the business description, carry malicious intent such as prompt injection, or are vulgar are handled accordingly by the Guard AI Agent.
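The contract of the guard step can be illustrated with a toy version. In production the guard is itself an LLM call with a moderation prompt that also covers scope and vulgarity; the keyword heuristic and marker list below are simplified stand-ins that only show the shape of the verdict the pipeline acts on.

```python
# Illustrative markers; a real guard uses an LLM, not a keyword list.
INJECTION_MARKERS = [
    "ignore previous instructions",
    "reveal your system prompt",
    "you are now",
]

def guard(inquiry: str) -> tuple[bool, str]:
    """Return (accepted, reason) so the pipeline can route rejections."""
    lowered = inquiry.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        return False, "possible prompt injection"
    return True, "accepted"
```

Returning a reason alongside the verdict matters: rejected inquiries are not silently dropped but handled with an appropriate canned or escalated response.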
If the inquiry is accepted, it is then classified by the Classification AI Agent. This determines whether an inquiry is a general question, a delivery status follow-up, an order placement, etc. Each classification has a different handling process and a specific set of AI Agents, making classification crucial. For those familiar with OpenAI’s Swarm, this corresponds to the triage agent layer. To simplify, we'll explain the handling process for a general inquiry.
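Functionally, the Classification AI Agent is a router: an LLM constrained to a fixed label set picks the label, and each label maps to its own handling chain. The labels and handler functions below are illustrative, not our actual taxonomy.

```python
# Each classification label routes to a dedicated handling chain.
def handle_general(inquiry: str) -> str:
    return f"general pipeline: {inquiry}"

def handle_delivery_status(inquiry: str) -> str:
    return f"delivery-status pipeline: {inquiry}"

def handle_order(inquiry: str) -> str:
    return f"order pipeline: {inquiry}"

ROUTES = {
    "general_inquiry": handle_general,
    "delivery_status": handle_delivery_status,
    "order_placement": handle_order,
}

def route(label: str, inquiry: str) -> str:
    """Dispatch on the classifier's label; unknown labels fall back to general."""
    return ROUTES.get(label, handle_general)(inquiry)
```

The fallback matters because a classifier occasionally emits a label outside the expected set; defaulting to the general chain keeps the conversation alive instead of failing.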
Once classified, a Conversation AI Agent drafts the initial response. For general inquiries, we have a proper RAG pipeline that's discussed here. Before the customer receives the drafted response, a Refinement AI Agent assesses it for accuracy and relevance. We'll provide more samples below to show how the Refinement AI Agent ensures accuracy in chatbot responses.
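The last two steps of the general-inquiry chain, draft then refine, can be sketched with deterministic stand-ins for the LLM calls. The function bodies are illustrative; the point is the data flow: the Refinement AI Agent sees both the inquiry and the draft before the customer sees anything, mirroring the hopia example below where a stale "for delivery" status was stripped from the reply.

```python
def conversation_agent(inquiry: str, context: str) -> str:
    """Draft an initial reply grounded in retrieved context (LLM stand-in)."""
    return f"{context} Let us know if you have other questions!"

def refinement_agent(inquiry: str, draft: str) -> str:
    """Review the draft for accuracy and relevance before sending.

    Toy discrepancy check: if the customer says the order was received,
    drop backend status wording that contradicts them.
    """
    if "received" in inquiry.lower():
        draft = draft.replace("Your order is for delivery. ", "")
    return draft

inquiry = "The order was received. Thanks!"
draft = conversation_agent(
    inquiry,
    "Your order is for delivery. We're glad you enjoyed the hopia.",
)
final = refinement_agent(inquiry, draft)
```

Splitting drafting from refinement means each agent has one narrow job and one narrow prompt, which is where most of the accuracy gain over a single zero-shot call comes from.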
Sample Conversation #2: "The order was received. Hopia is still warm. Thanks" (In Filipino)
In the sample above, the customer informs the merchant that the order has been received and the food is still warm, followed by an expression of gratitude. However, the initial drafted response mentioned that the order was still "for delivery." For context, the delivery method used in this order was manual and not integrated into the system, so it was still marked as "for delivery" on the backend. The Refinement AI Agent spotted this discrepancy and removed "for delivery" in the final response.
Sample Conversation #3: "It might have no effect." (In Filipino)
In this third sample, the customer is expressing doubts about the effectiveness of the product. The first response shows confidence in the product and mentions previous customers and testimonials supporting its claims. However, it doesn't directly address the customer's concern.
In the final response, the Refinement AI Agent ensures that the concern is addressed directly, adding empathy and setting proper expectations about the product instead of hard-selling it.
Accuracy and Cost Efficiency Over Negligible Latency
The response time of this approach is longer than a straightforward zero-shot call to GPT-4o or a comparable model, but it is significantly more accurate. In practice the added latency is negligible for asynchronous messaging: no more than 30-45 seconds. This approach is also more cost-efficient than using the latest reasoning models and, in many cases, faster.
Unlike large tech companies like Microsoft and Salesforce, which are currently building AI for specific job roles, we are taking a more fundamental approach by assigning agents to smaller, specific tasks.
Conclusion
Whether the progress of frontier models is plateauing or will see significant improvements in the coming months, we believe that an agentic approach, combining chain-of-thought techniques with multi-agent design, can substantially improve accuracy. Developing our own Multi-Agent AI Framework also keeps us independent and free to continue utilizing large language models from multiple providers.
There are many more interesting conversations between our chatbots and end-users that we want to share, especially regarding our chatbots assisting customers in the actual ordering process, but we will leave that for our next blog post. We are also exploring the use of appropriate Evals to increase transparency about our chatbot's accuracy in handling customer service.