What we learned from building AI monitoring tools
Generative AI apps are designed to be slightly unpredictable, which is why it's critical to carefully monitor them in production. But when we reviewed AI observability software that could help us improve monitoring, we ran into several limitations.
1. Data privacy risks. Every new external vendor adds risk to our customers’ data, so we were hesitant to bring in third parties with whom we did not have a direct relationship.
2. Usability for non-engineers. Most monitoring tools are designed for technical users. While that’s helpful for debugging engineering issues, it makes it difficult to loop in less technical subject matter experts to help us interpret the results.
3. Lack of context. Much of the focus from AI companies has been on tuning LLMs to recall objectively verifiable facts. As a result, many monitoring tools review messages in isolation, because their goal is simply to ensure that questions like “What is the capital of New York?” received “Albany” as the response. But our use case requires a much more subjective evaluation of style and intent. The LLM may have generated an answer that looks good to us, but did the user need to make changes afterward? Did they take other actions in the app that point to their final product? Most of today’s tooling doesn’t show us this valuable context during evaluation (a rough sketch of what a more context-aware review record could capture follows this list).
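To illustrate the kind of signal an isolated, exact-match check misses, here is a minimal sketch in Python. The record fields and the difflib-based “acceptance” proxy are our own illustrative assumptions, not the schema of any particular monitoring tool.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class GenerationReview:
    """One LLM generation paired with what the user actually did with it."""
    prompt: str                     # what the user asked for
    generated_text: str             # what the model produced
    final_text: str                 # what the user ultimately kept or shipped
    followup_actions: list[str] = field(default_factory=list)  # e.g. "edited", "exported"

    def acceptance_ratio(self) -> float:
        """Rough proxy for usefulness: how much of the draft survived the user's edits."""
        return SequenceMatcher(None, self.generated_text, self.final_text).ratio()


# An exact-match check works for "What is the capital of New York?" -> "Albany",
# but for subjective output we care how far the user moved away from the draft.
review = GenerationReview(
    prompt="Draft a follow-up email to the client",
    generated_text="Hi team, just checking in on the proposal we discussed.",
    final_text="Hi Dana, following up on the proposal we sent over on Tuesday.",
    followup_actions=["edited", "sent"],
)
print(f"{review.acceptance_ratio():.0%} of the draft survived the user's edits")
```

A single number like this proves nothing on its own, but trended across many cases it points reviewers toward the generations worth a closer look.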
As a result, we decided to gather feedback on our first release manually, by asking our team members to fill out a survey. But as we roll out to external users, we’re investing in a more scalable approach to evaluation. Along with addressing the gaps above, here are a few more key features we’re working on getting right.
Behavior-focused analytics
Instead of looking at LLM traces in isolation, we pair them with key user analytics. In each case we review, we aim to tell a story: What actions did the user take leading up to the LLM generation? What’s the full message history? What manual changes did they make afterward? By giving our review team a narrative flow, we make it much easier to understand the user’s overall intent and the usefulness of the LLM output.
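As a concrete sketch, here is one way a reviewable case could be assembled from a generic product event log; the Event shape, field names, and 30-minute window are illustrative assumptions rather than a description of our actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """A single user or model event, as it might appear in a product analytics log."""
    timestamp: datetime
    kind: str     # e.g. "user_action", "user_message", "llm_generation", "manual_edit"
    detail: str


def build_case_timeline(events: list[Event], generation_time: datetime,
                        window: timedelta = timedelta(minutes=30)) -> list[Event]:
    """Gather everything that happened around one LLM generation into a single
    chronological story: the actions leading up to it, the message history,
    and any manual changes the user made afterwards."""
    nearby = [e for e in events if abs(e.timestamp - generation_time) <= window]
    return sorted(nearby, key=lambda e: e.timestamp)
```

Reviewers then read the case top to bottom as a timeline rather than as a lone prompt/response pair.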
Collaborative reviewing
Because we deal with expert systems, we often have just one or two people with the relevant experience to review a case. We’re making it easy for multiple reviewers to discuss the data and surface nuances that wouldn’t otherwise be obvious.
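A minimal sketch of what that discussion could attach to, assuming a simple case-plus-comments model (the class and field names here are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewComment:
    author: str
    text: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class CaseReview:
    """A reviewed case plus its running discussion, so the one or two people with
    the right domain experience leave their reasoning where the next reviewer sees it."""
    case_id: str
    comments: list[ReviewComment] = field(default_factory=list)

    def add_comment(self, author: str, text: str) -> None:
        self.comments.append(ReviewComment(author=author, text=text))


review = CaseReview(case_id="case-0042")
review.add_comment("domain-expert", "The draft misses the hedging language we require here.")
review.add_comment("engineer", "Noted -- the same pattern shows up in two other cases this week.")
```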
Queue-based workflows
Data is only useful if you look at it, so we’re focused on creating a workflow that allows us to review new cases efficiently. We also know our subject matter experts are busy, so we’re designing tools that let us tag them into cases for which we know they can provide unique insights.
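Here is a rough sketch of the queue-plus-tagging idea, assuming a simple in-memory FIFO; the class names and the tagging rule are our own assumptions for illustration:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuedCase:
    case_id: str
    topic: str
    tagged_experts: list[str] = field(default_factory=list)


class ReviewQueue:
    """A first-in, first-out queue of cases awaiting review, with the ability to
    tag a busy subject matter expert only on the cases where their insight matters."""

    def __init__(self) -> None:
        self._queue: deque[QueuedCase] = deque()

    def enqueue(self, case: QueuedCase) -> None:
        self._queue.append(case)

    def tag_expert(self, case_id: str, expert: str) -> None:
        for case in self._queue:
            if case.case_id == case_id:
                case.tagged_experts.append(expert)

    def next_for(self, reviewer: str) -> QueuedCase | None:
        """Serve the oldest case that is either untagged or tagged for this reviewer."""
        for case in self._queue:
            if not case.tagged_experts or reviewer in case.tagged_experts:
                self._queue.remove(case)
                return case
        return None


queue = ReviewQueue()
queue.enqueue(QueuedCase("case-0042", topic="clinical summaries"))
queue.enqueue(QueuedCase("case-0043", topic="contract drafting"))
queue.tag_expert("case-0043", "legal-sme")

print(queue.next_for("generalist"))   # gets case-0042
print(queue.next_for("legal-sme"))    # gets case-0043
```

A real implementation would persist the queue and handle contention, but the shape captures the workflow: most cases go to whoever is available, and a busy expert only sees the cases tagged for them.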
Understanding how products get used in the wild can be the difference between generative AI being an interesting side project and being a business-transforming asset. We’re excited to be innovating not only through the products we create, but also through the infrastructure we use to support them.
Interested in building something with us? Schedule a chat!