Accuracy of AI-Generated Responses
We evaluated three LLMs (Anthropic's Claude 3 Sonnet, Claude 3 Opus, and a Mistral AI model) against the same set of roleplay scenarios and user prompts. Response accuracy was scored against predefined criteria such as contextual relevance, grammatical correctness, and goal completion, and a detailed accuracy chart was produced to compare the models and support data-driven decisions.
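A minimal sketch of how such a per-model accuracy table can be aggregated (the function and field names here are illustrative, not the project's actual code): each model's response to a scenario is marked pass/fail on each criterion, then the pass rates are rolled up per model.

```python
# Illustrative sketch: aggregate pass/fail scores into a per-model
# accuracy table. CRITERIA mirrors the criteria named in the text.
CRITERIA = ["contextual_relevance", "grammatical_correctness", "goal_completion"]

def accuracy_table(results):
    """results: {model: [{criterion: bool, ...}, ...]} -> {model: {criterion: pct}}"""
    table = {}
    for model, scored_runs in results.items():
        table[model] = {
            c: round(100 * sum(run[c] for run in scored_runs) / len(scored_runs), 1)
            for c in CRITERIA
        }
    return table

# Toy data for two scenarios per model (invented for the example):
results = {
    "claude-3-sonnet": [
        {"contextual_relevance": True, "grammatical_correctness": True, "goal_completion": False},
        {"contextual_relevance": True, "grammatical_correctness": True, "goal_completion": True},
    ],
    "mistral-large": [
        {"contextual_relevance": True, "grammatical_correctness": False, "goal_completion": True},
        {"contextual_relevance": False, "grammatical_correctness": True, "goal_completion": True},
    ],
}

table = accuracy_table(results)
print(table)
```

A table in this shape maps directly onto a grouped bar chart, one group per model and one bar per criterion.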
Seamless Integration of Multiple LLMs
We built a Streamlit UI that lets users easily switch between LLMs. The backend was implemented in Python with FastAPI, and each LLM is called asynchronously to minimize latency. The solution was deployed on AWS ECS, with infrastructure that scales according to demand to prevent performance bottlenecks.
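The asynchronous fan-out can be sketched as follows; `call_model` is a stand-in for the real provider SDK calls (an assumption, since the actual client code is not shown), but the concurrency pattern with `asyncio.gather` is the standard way to overlap the network round trips:

```python
import asyncio

async def call_model(name: str, prompt: str) -> str:
    # Stand-in for a real provider API call; the sleep simulates
    # the network round trip to the LLM service.
    await asyncio.sleep(0.1)
    return f"{name}: reply to {prompt!r}"

async def query_all(prompt: str, models: list[str]) -> dict[str, str]:
    # Fire all provider calls concurrently so total latency is roughly
    # the slowest single call, not the sum of all calls.
    replies = await asyncio.gather(*(call_model(m, prompt) for m in models))
    return dict(zip(models, replies))

out = asyncio.run(query_all(
    "Order a coffee in French",
    ["claude-3-sonnet", "claude-3-opus", "mistral-large"],
))
print(out)
```

Inside a FastAPI route the same `query_all` coroutine would simply be awaited, since FastAPI endpoints can be `async def`.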
Data Storage and Retrieval for Performance Analysis
AWS OpenSearch was used to store all conversation logs, user feedback, and LLM responses. It provided a robust, scalable store with fast retrieval for generating the accuracy charts and performance reports, and the indexes were designed for fast querying so that even large datasets could be processed efficiently.
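One plausible index design for this kind of workload is sketched below (field names are illustrative, not taken from the project): keyword and boolean fields for the model and criteria keep aggregations cheap, while the prompt and response stay full-text searchable. The dicts are in the shape accepted by the `opensearch-py` client's `indices.create` and `search` calls.

```python
# Illustrative index mapping for conversation logs (field names assumed):
LOG_MAPPING = {
    "mappings": {
        "properties": {
            "model":          {"type": "keyword"},   # exact-match, cheap to aggregate
            "scenario_id":    {"type": "keyword"},
            "goal_completed": {"type": "boolean"},
            "grammar_ok":     {"type": "boolean"},
            "user_prompt":    {"type": "text"},      # full-text searchable
            "llm_response":   {"type": "text"},
            "timestamp":      {"type": "date"},
        }
    }
}

# Aggregation that could feed a per-model accuracy chart: group by model,
# average the boolean goal flag (true counts as 1) to get a completion rate.
ACCURACY_QUERY = {
    "size": 0,
    "aggs": {
        "by_model": {
            "terms": {"field": "model"},
            "aggs": {"goal_rate": {"avg": {"field": "goal_completed"}}},
        }
    },
}
```

With mappings like this, chart generation is a single aggregation query rather than a scan over raw logs, which is what keeps large datasets fast to process.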
Providing Actionable Feedback to Users
We implemented a feedback mechanism that analyzes user interactions across all LLMs, focusing on goal completion, grammatical accuracy, and vocabulary improvement. Feedback was generated with a common evaluation framework built on LangChain and LangSmith and cross-referenced against each LLM's outputs to keep it uniform.
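The project builds this on LangChain and LangSmith; the plain-Python sketch below only illustrates the underlying idea of a single shared rubric applied identically to every model's output so the feedback stays comparable (the helper names and keyword-based goal check are invented for the example):

```python
def evaluate(response: str, goal_keywords: list[str]) -> dict:
    """Score one response against the shared rubric (illustrative check)."""
    words = response.lower().split()
    return {
        # Goal counts as completed if every required keyword appears.
        "goal_completion": all(k.lower() in words for k in goal_keywords),
        # Crude proxy for vocabulary range: distinct word count.
        "vocabulary_size": len(set(words)),
    }

def feedback_report(responses: dict[str, str], goal_keywords: list[str]) -> dict:
    # The same rubric runs over every LLM's output, which is what makes
    # the resulting feedback uniform and cross-comparable.
    return {model: evaluate(text, goal_keywords) for model, text in responses.items()}

report = feedback_report(
    {
        "claude-3-opus": "je voudrais un cafe s'il vous plait",
        "mistral-large": "un cafe",
    },
    goal_keywords=["cafe"],
)
print(report)
```

In the real system an LLM-as-judge evaluator would replace the keyword check, but the fan-out structure, one fixed rubric applied per model, is the same.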
Tech Stack
Python, FastAPI, Streamlit, LangChain, LangSmith, AWS ECS, AWS OpenSearch, Anthropic Claude 3 (Sonnet and Opus), Mistral AI