How we used GenAI to make querying unfamiliar data easier without letting the LLM take the wheel
Last year, large language models (LLMs) like GPT-3.5 made huge leaps in capability. It's now possible to use them for tasks that previously required extensive human effort. However, while LLMs are fast, their answers aren't always reliable.
Striking a balance between leveraging their power and ensuring they don't drown us in false information remains an open challenge.
What does that look like in practice?
In this article, we’ll walk through one such LLM implementation on the Faros AI platform and share what we learned as we balanced the pragmatic benefits with ethical cautions.
At Faros AI, our data platform for software engineering is all about providing insights into how teams and organizations are functioning, and how they can be improved. A key component of actionable insights is developing a deep understanding of what the data is showing you.
But there is a reason data scientists and analysts are paid quite well! Understanding data can be difficult and takes a lot of effort. For that reason, we focused our initial efforts with LLMs on making it easier for users to make sense of their data.
First came Lighthouse AI Chart Explainer, a feature based on the understanding that, while a picture may be worth a thousand words, a caption certainly doesn't hurt. We now explain every chart in natural language, making it easier to understand metrics and act on them more confidently.
Our next addition was a more complex undertaking. Lighthouse AI Query Helper uses GenAI to take a natural language question from a user (like ‘How many Sev1 incidents are open for my team?’) and guide them through building a query that retrieves the answer.
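To make that flow concrete, here is a minimal sketch of the pattern: retrieve the relevant schema context, assemble a prompt, and surface the model’s proposal for the user to review. The table definitions and function names below are hypothetical illustrations, not the actual Faros AI implementation.

```python
# A minimal sketch of the natural-language-to-query pattern, not Faros AI's
# actual implementation. TABLE_SCHEMAS and build_prompt are hypothetical
# names used only for illustration.

TABLE_SCHEMAS = {
    "incidents": "incidents(id, severity, status, team, created_at, resolved_at)",
    "builds": "builds(id, repo, status, started_at, finished_at)",
}


def build_prompt(question: str, relevant_tables: list[str]) -> str:
    """Assemble a prompt from the user's question and retrieved schema context."""
    schema_context = "\n".join(TABLE_SCHEMAS[t] for t in relevant_tables)
    return (
        "You help users build queries over software engineering data.\n"
        "Use only the tables and columns listed below.\n\n"
        f"Tables:\n{schema_context}\n\n"
        f"Question: {question}\n"
        "Proposed query:"
    )


if __name__ == "__main__":
    prompt = build_prompt(
        "How many Sev1 incidents are open for my team?",
        relevant_tables=["incidents"],
    )
    # The prompt would be sent to an LLM, and the proposed query shown to the
    # user for review and editing; it is never executed automatically.
    print(prompt)
```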
Below, I'll cover our experience building this capability responsibly. I'll describe:
It has been said before, but is definitely worth saying again, that there are many issues with LLMs. These issues include but are not limited to:
Among these, bias, privacy, and misinformation are the most addressable in user-facing applications.
How can we ensure LLMs don't generate harmful, biased, or misleading content? How do we maintain privacy? These require thoughtful, responsible development.
With careful monitoring, content filtering, and transparency, risks may be mitigated but not completely eliminated. There are still many open ethical questions that need further research.
So given all these concerns, what are some appropriate use cases for LLMs?
At Faros, we incorporate LLMs to aid human understanding of data — not to fully automate or replace human judgment. Our goal is to guide and inform users without removing the steps that are best reviewed by a human.
We sought use cases where LLMs can make it easier for users to answer business-critical questions about software engineering, without needing to understand where the data lives and how it is structured.
The fact that we store the data in a standardized format enables canonical metrics and comparisons to industry benchmarks. However, there are always nuances and one-off questions that standardized metrics do not capture. The ability to query the data is critical to finding answers to questions unique to each organization.
Lighthouse AI Query Helper guides users in querying data to answer natural language questions, like “What is the build failure rate on my repo for the last month?”.
Query Helper provides:
So how did we develop this tool and make sure it was working as intended?
While generative language models are new on the scene, the principles of deploying AI remain the same:
Defining good metrics and having a crisp definition of what you are solving is key to this process, but how do we define the right metrics to evaluate a multi-purpose tool like an LLM?
While there are legions of benchmarks used to evaluate LLM performance and crown one model the best, they don’t necessarily tell you how an LLM will perform on your specific task. For our use case, how a model performed on the bar exam was irrelevant. What matters is its effectiveness on our task, which we need to measure and evaluate in situ.
In building Lighthouse AI Query Helper, we found that the following steps helped us define quantitative measures that matched our perception of performance:
Unfortunately, the first two steps are hard and time-consuming. And we were on a deadline!
While searching for shortcuts, it might be tempting to offload the evaluation of the LLM to — you guessed it — an LLM. However, to us, that felt a lot like feeding pigs bacon, something that never ends well. We did not offload the whole process to the LLMs and allow the LLMs to judge their brethren!
Instead, we went with a compromise, leveraging LLMs to make creating evaluation data easier, as I describe below.
We started with a small set of hand-written gold examples of good questions and answers. With this data, we carefully experimented with the format of the responses and the metrics used to evaluate how close the LLM came to our examples’ format and content. We looked at every single response to make the judgment on which metrics we should use, so it was a good thing that our starting data was small.
We then scaled this process up by using existing user queries as examples of how to answer questions. An LLM served as an assistant for this step, reformatting the answers from raw queries into the exact format we needed for our Query Helper. With a small amount of editing and quality control, we ended up with a substantial amount of gold data that we could use to test and evaluate different prompt and retrieval formulations for our task.
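As a rough sketch of that bootstrapping step, the loop looks something like this: an LLM restates each existing user query in the target format, and every candidate is flagged for human review before it becomes gold data. The function names here are hypothetical stand-ins for our real tooling.

```python
# A sketch of the gold-data bootstrapping step; reformat_with_llm and
# build_gold_set are hypothetical stand-ins for the real tooling.
import json


def reformat_with_llm(raw_query: str) -> dict:
    """Ask an LLM to restate a raw user query as a question/answer example
    in the exact format Query Helper expects. Placeholder implementation."""
    raise NotImplementedError


def build_gold_set(user_queries: list[str], output_path: str) -> None:
    """Turn existing user queries into candidate gold examples for human review."""
    candidates = []
    for raw_query in user_queries:
        example = reformat_with_llm(raw_query)
        example["needs_review"] = True  # every candidate gets a human pass
        candidates.append(example)
    with open(output_path, "w") as f:
        json.dump(candidates, f, indent=2)
```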
The metrics we focused on during the evaluation were:
We used these metrics and our gold data to evaluate zero-shot prompting, n-shot prompting with static examples, n-shot prompting with relevant examples, and the level of detail and specificity in our retrieved table information.
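The evaluation loop itself is simple: run each prompt and retrieval configuration over the same gold set and score the outputs. Here is a hedged sketch; score() is a stand-in for the metrics above, and the strategy callables are whichever prompt variants are being compared.

```python
# A sketch of the evaluation harness; score() stands in for the real metrics.
from statistics import mean


def score(predicted: str, gold: str) -> float:
    """Stand-in metric: exact match. Swap in whatever measures format and
    content agreement with the gold answer."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def evaluate(strategy, gold_examples: list[dict]) -> float:
    """strategy(question) -> model answer, for one prompt/retrieval configuration."""
    return mean(score(strategy(ex["question"]), ex["answer"]) for ex in gold_examples)


# Usage: compare zero-shot, n-shot with static examples, and n-shot with
# retrieved relevant examples over the same gold data.
# results = {name: evaluate(fn, gold_examples) for name, fn in strategies.items()}
```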
Not surprisingly, the content included in the prompt made a big difference in how well the LLMs performed our task.
Our key findings were:
We tested prompts across multiple LLMs, starting with OpenAI. However, API latency and outage issues led us to try AWS Bedrock. Surprisingly, the specific LLM mattered less than prompt engineering. Performance differences between models were minor and inconsistent in our tests. However, response latency varied greatly between models.
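For illustration, latency can be compared across providers with a small timing harness like the sketch below; the provider callables and model names are placeholders, not our production setup.

```python
# A sketch of a latency comparison across LLM providers; the provider
# callables are placeholders to be wired to OpenAI, Bedrock, etc.
import time
from statistics import median


def measure_latency(generate, prompts: list[str]) -> float:
    """generate(prompt) -> response text for one model behind one provider;
    returns the median wall-clock latency in seconds over the prompts."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return median(timings)


# providers = {"openai-gpt-3.5-turbo": ..., "bedrock-claude-instant-v1": ...}
# latencies = {name: measure_latency(fn, eval_prompts) for name, fn in providers.items()}
```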
In summary, careful prompt design that considered relevance and brevity was more important than LLM selection for our task, but latency was a key factor for user experience. In the end, we decided that anthropic-claude-instant-v1 provided the best customer experience for our use case, based on the latency of its responses and the quality of its answers. So that is what we shipped to customers.
After shipping, we shifted our focus to real-world deployment: closely observing interactions, how queries were resolved, and how close users' final queries were to the AI's proposals. This feedback loop will guide refinements and, potentially, in-house fine-tuned models. Stay tuned to hear how it went.
While impressive, LLMs have limitations and risks requiring careful consideration. The most responsible path forward balances pragmatic benefits and ethical cautions, not pushing generation capabilities beyond what AI can reliably deliver today.
In closing, restraint is wise with this exciting technology. Here is my advice:
What are your thoughts on leveraging LLMs responsibly? I'm happy to discuss more. Please share any feedback!
About the author: Leah McGuire has spent the last two decades working on information representation, processing, and modeling. She started her career as a computational neuroscientist studying sensory integration and then transitioned into data science and engineering. Leah worked on developing AutoML for Salesforce Einstein and contributed to open-sourcing some of the foundational pieces of the Einstein modeling products. Throughout her career, she has focused on making it easier to learn from datasets that are expensive to generate and collect. This focus has influenced her work across many fields, including professional networking, sales and service, biotech, and engineering observability. Leah currently works at Faros AI, where she develops the platform’s native AI capabilities.
Global enterprises trust Faros AI to accelerate their engineering operations. Give us 30 minutes of your time and see it for yourself.