Startup firm Patronus creates diagnostic tool to catch genAI mistakes

Patronus' SimpleSafetyTests checks outputs from AI chatbots and other LLM-based tools to detect anomalies. The goal is to evaluate whether a model is going to fail — or is already failing.

As generative AI (genAI) platforms such as ChatGPT, Dall-E2, and AlphaCode barrel ahead at a breakneck pace, keeping the tools from hallucinating and spewing erroneous or offensive responses is nearly impossible.

To date, there have been few methods to ensure accurate information is coming out of the large language models (LLMs) that serve as the basis for genAI.

As AI tools evolve and get better at mimicking natural language, it will soon be impossible to discern fake results from real ones, prompting companies to set up “guardrails” against the worst outcomes, whether those outcomes are accidental or the result of intentional efforts by bad actors.

GenAI tools are essentially next-word prediction engines. Those next-word generators, such as ChatGPT, Microsoft’s Copilot, and Google’s Bard, can go off the rails and start spewing false or misleading information.

In September, a startup founded by two former Meta AI researchers released an automated evaluation and security platform that helps companies use LLMs safely by using adversarial tests to monitor the models for inconsistencies, inaccuracies, hallucinations, and biases.

Patronus AI said its tools can detect inaccurate information and when an LLM is unintentionally exposing private or sensitive data.

“All these large companies are diving into LLMs, but they’re doing so blindly; they are trying to become a third-party evaluator for models,” said Anand Kannappan, founder and CEO of Patronus. “People don’t trust AI because they’re unsure if it's hallucinating. This product is a validation check.”

Patronus’ SimpleSafetyTests suite of diagnostic tools uses 100 test prompts designed to probe AI systems for critical safety risks. The company has used its software to test some of the most popular genAI platforms, including OpenAI’s ChatGPT and other AI chatbots, to see, for instance, whether they could understand SEC filings. Patronus said the chatbots failed about 70% of the time and only succeeded when told exactly where to look for relevant information.

“We help companies catch language model mistakes at scale in an automated way,” Kannappan explained. “Large companies are spending millions of dollars on internal QA teams and external consultants to manually catch errors in spreadsheets. Some of those quality assurance companies are spending expensive engineering time creating test cases to prevent these errors from happening.”

Avivah Litan, a vice president and distinguished analyst with research firm Gartner, said AI hallucination rates “are all over the place” from 3% to 30% of the time. There simply isn’t a lot of good data around the issue yet.

Gartner did, however, predict that through 2025, genAI will require more cybersecurity resources to secure, causing a 15% hike in spending.

Companies dabbling in AI deployments must recognize they cannot allow them to run on “autopilot” without having a human in the loop to identify problems, Litan said. “People will wake up to this eventually, and they’ll probably start waking up with Microsoft’s Copilot for 365, because that’ll put these systems into the hands of mainstream adopters,” she said.

(Microsoft’s Bing chatbot was rebranded as Copilot and is sold as part of Microsoft 365.)

Gartner has laid out 10 requirements companies should consider for trust, risk, and security management when deploying LLMs. The requirements fall into two major categories: sensitive data exposure and faulty decision-making resulting from inaccurate or unwanted outputs.

The largest vendors, such as Microsoft with Copilot 365, only meet one of those five requirements, Litan said. The one area Copilot is proficient in is ensuring accurate information is output when only company private data is input. Copilot’s default setting, however, allows it to use information pulled from the internet, which automatically places users in jeopardy of erroneous outputs.

“They don’t do anything to filter responses to detect for unwanted outputs like hallucinations or inaccuracies,” Litan said. “They don’t honor your enterprise policies. They do give you some content provenance of sources for responses, but they’re inaccurate a lot of the time and it’s hard to find the sources.”

Microsoft does a good job with data classification and access management if a company has an E5 license, Litan explained, but other than a few traditional security controls, such as data encryption, the company is not doing anything AI specific for error checking.

“That’s true of most of the vendors. So, you do need these extra tools,” she said.

A Microsoft spokesperson said its researchers and product engineering teams "have made progress on grounding, fine-tuning, and steering techniques to help address when an AI model or AI chatbot fabricates a response. This is central to developing AI responsibly."

Microsoft said it uses up-to-date data from sources such as the Bing search index or Microsoft Graph to ensure accurate information is fed into its GPT-based LLM.

"We have also developed tools to measure when the model is deviating from its grounding data, which enables us to increase accuracy in products through better prompt engineering and data quality," the spokesperson said.

While Microsoft's approaches "significantly reduce inaccuracies in model outputs," mistakes are still possible — and it works to notify users about that potential. "Our products are designed to always have a human in the loop, and with any AI system we encourage people to verify the accuracy of content," the spokesperson said.

Bing Copilot can include links to sources to help users verify its answers, and the company created a content moderation tool called Azure AI Content Safety to detect offensive or inappropriate content.

"We continue to test techniques to train AI and teach it to spot or detect certain undesired behaviors and are making improvements as we learn and innovate," the spokesman said.

Even when organizations work hard to ensure an LLM’s results are reliable, Litan said, those systems can still inexplicably become unreliable without notice. “They do a lot of prompt engineering and bad results come back; they then realize they need better middleware tools — guardrails,” Litan said.

SimpleSafetyTests was recently used to test 11 popular open LLMs and found critical safety weaknesses in several. While some of the LLMs didn’t offer up a single unsafe response, most did respond unsafely in more than 20% of cases, “with over 50% unsafe responses in the extreme,” researchers stated in a paper published by Cornell University in November 2023.

Most of Patronus’ clients have been in highly regulated industries, such as healthcare, legal, or financial services, where errors can lead to lawsuits or regulatory fines.

“Maybe it’s a small error nobody notices, but in the worst cases this could be hallucinations that impact big financial or health outcomes or a wide range of possibilities,” Kannappan said. “They’re trying to use AI in mission-critical scenarios.”

In November, the company launched FinanceBench, a benchmark tool for testing how LLMs perform on financial questions. The tool tests LLMs against 10,000 question-and-answer pairs based on publicly available financial documents such as SEC 10Ks, SEC 10Qs, SEC 8Ks, earnings reports, and earnings call transcripts. The questions determine whether the LLM is presenting factual information or inaccurate responses.
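
For readers who want a concrete picture, the general idea of scoring a model against question-and-answer pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Patronus' implementation: the `QAPair` type, the `failure_rate` function, the normalized exact-match comparison, the stub model, and the two sample questions are all invented for the example. The article does not describe how FinanceBench actually grades answers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class QAPair:
    """One benchmark item (hypothetical structure, not the FinanceBench schema)."""
    question: str
    answer: str  # reference answer taken from a public document


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def failure_rate(model: Callable[[str], str], pairs: Iterable[QAPair]) -> float:
    """Return the fraction of questions whose answer does not match the reference.

    Assumption: a normalized exact match counts as a pass. A real benchmark
    would likely need a more forgiving comparison or human review.
    """
    items = list(pairs)
    if not items:
        raise ValueError("need at least one question-and-answer pair")
    failures = sum(
        1 for item in items
        if normalize(model(item.question)) != normalize(item.answer)
    )
    return failures / len(items)


def stub_model(question: str) -> str:
    """Stand-in for an LLM call; it only knows one canned answer."""
    canned = {"What was FY2022 revenue?": "$10 million"}
    return canned.get(question, "I don't know")


# Invented sample data, for illustration only.
sample_pairs = [
    QAPair("What was FY2022 revenue?", "$10 Million"),
    QAPair("What was FY2022 net income?", "$2 million"),
]

rate = failure_rate(stub_model, sample_pairs)
print(f"failure rate: {rate:.0%}")  # prints "failure rate: 50%"
```

Swapping `stub_model` for a function that calls a real LLM would leave the scoring loop unchanged, which is the main appeal of automating this kind of check.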

Initial analysis by Patronus AI shows that LLM retrieval systems "fail spectacularly on a sample set of questions from FinanceBench."

According to Patronus's own evaluation:

  • GPT-4 Turbo with a retrieval system fails 81% of the time.
  • Llama 2 with a retrieval system also fails 81% of the time.

Patronus AI also evaluated LLMs with long-context answer windows, noting that they perform better, but are less practical for a production setting.

  • GPT-4 Turbo with long context fails 21% of the time.
  • Anthropic’s Claude-2 with long context fails 24% of the time.

Kannappan said one of Patronus' clients, an asset management firm, built an AI chatbot to help employees answer client questions, but had to ensure the chatbot wasn’t offering investment recommendations for securities, or legal or tax advice.

“That could put business at risk and in a tough spot with the SEC,” Kannappan said. “We solved that for them. They used our product as a check for if the chatbot gives recommendations. It can tell them when the chatbot went off the rails.”

Another company that built a chatbot wanted to have a validation check to ensure it didn't go off topic. So, for example, if a user asked the chatbot about the weather or what its favorite movie is, it wouldn't answer.
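
A validation check of this kind can be pictured as a thin wrapper around the chatbot. The sketch below is a deliberately simple stand-in that uses a keyword allow-list; the keyword set, the refusal message, and the `is_on_topic` and `guarded_reply` functions are assumptions made for illustration. The article does not say how Patronus' product detects off-topic requests, and a production system would probably rely on a trained classifier rather than keywords.

```python
import re
from typing import Callable

# Hypothetical allow-list for an account-support chatbot.
ALLOWED_KEYWORDS = {"account", "balance", "fee", "fees", "statement", "transfer"}
REFUSAL = "Sorry, I can only help with questions about your account."


def is_on_topic(message: str) -> bool:
    """Return True if the message contains at least one allowed keyword."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return bool(words & ALLOWED_KEYWORDS)


def guarded_reply(message: str, chatbot: Callable[[str], str]) -> str:
    """Call the chatbot only for on-topic messages; otherwise refuse."""
    if not is_on_topic(message):
        return REFUSAL
    return chatbot(message)


def stub_chatbot(message: str) -> str:
    """Stand-in for the real chatbot."""
    return "Here is the information you asked for."


print(guarded_reply("What is my account balance?", stub_chatbot))
print(guarded_reply("What's the weather like today?", stub_chatbot))
```

The first call is passed through to the chatbot, while the weather question is refused before the chatbot ever sees it.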

Rebecca Qian, co-founder and CTO at Patronus, said hallucinations are a particularly big problem with companies attempting to roll out AI tools.

“A lot of our customers are using our product in high-stake scenarios where correct information really matters,” Qian said. “Other kinds of metrics that are also related are, for example, relevance — models going off topic. For example, you don’t want the model you deploy in your product to say anything that's misrepresenting your company or product.”

Gartner’s Litan said in the end, having a human in the loop is critical to successful AI deployments. Even with middleware tools, it’s advisable to mitigate risks of unreliable outputs “that can lead organizations down a dangerous path.”

“At first glance, I haven’t seen any competitive products that are this specific in detecting unwanted outputs in any given sector,” she said. “The products I follow in this space just point out anomalies and suspect transactions that the user then has to investigate (by researching the source for the response).”