How We Build Trustworthy AI for CFOs

What we've learned from working with customers and from our research into pushing the limits of agent reliability and observability

The limitations of today’s AI agents

AI agents show incredible promise for tackling complex processes previously thought to be un-automatable. But the early waves of the AI agent revolution have also created a new challenge: how do you trust an inherently probabilistic system performing complex analysis and supporting crucial decisions?

While engineers might be comfortable reviewing code or logs to verify an AI's work, CFOs and finance teams need complete confidence in outputs without digging deep into the technical machinery. This creates a unique challenge for AI systems in enterprise finance: the stakes are incredibly high, the users need perfect clarity on the decision process, but we can't expose the technical complexity behind the scenes.

After working with a wide variety of finance teams, we've found the problem boils down to two core requirements: (1) plain-old consistent accuracy in the results (obvious) and (2) ease of use and proper transparency through the interface (a bit less obvious). Interestingly, this makes the challenge equal parts an AI research problem and a user experience (UX) problem.

We let users peek into Sapien’s glass-box brain

Rather than presenting AI responses as a black box, we've built our system around a complete reasoning tree that tracks every analytical step at runtime. This tree structure, which functions similarly to Monte Carlo Tree Search approaches, allows us to compose complex analyses from simpler components while maintaining a clear record of the decision process.
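To make the idea of a reasoning tree concrete, here is a minimal sketch in Python. It is an illustration only, not Sapien's actual implementation: the ReasoningNode class, its fields, and the revenue figures are all hypothetical, and the search aspect of Monte Carlo Tree Search is left out entirely. It only shows how each analytical step can be recorded as a node so that a complex analysis is composed from simpler, inspectable pieces.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ReasoningNode:
    """One analytical step recorded at runtime (hypothetical structure)."""
    description: str
    result: Optional[Any] = None
    children: List["ReasoningNode"] = field(default_factory=list)

    def add_step(self, description: str, result: Any = None) -> "ReasoningNode":
        """Attach a sub-step and return it so callers can nest further steps."""
        child = ReasoningNode(description, result)
        self.children.append(child)
        return child

    def count_steps(self) -> int:
        """Number of recorded steps in this subtree, including this node."""
        return 1 + sum(child.count_steps() for child in self.children)


# Compose a larger analysis from simpler recorded steps (made-up numbers).
root = ReasoningNode("Explain quarter-over-quarter revenue growth")
load = root.add_step("Load revenue by quarter")
load.add_step("Read Q2 revenue", 100.0)
load.add_step("Read Q3 revenue", 112.3)
root.add_step("Compute growth rate", (112.3 - 100.0) / 100.0)

print(root.count_steps())
```

Because every step lives in the tree, nothing the analysis did is lost; the open question is how to show it, which is a presentation problem rather than a storage problem.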

However, the key difficulty here is taking this complex underlying structure and making it quickly interpretable to users. To enable this, we built a translation layer between our AI agents and the UI that users interact with. This layer linearizes the tree so finance teams see an easy-to-follow sequence of steps rather than a complex branching structure. The goal is to match how people actually think through an analysis: alongside linearization, the layer prioritizes and summarizes content so that the most essential details are surfaced to the end user.
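One simple way to picture linearization with prioritization is a depth-first walk that flattens the tree into numbered lines and hides low-priority branches. The sketch below assumes a hypothetical Step type with a made-up integer priority score; the real translation layer also produces summaries, which are not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    summary: str
    priority: int = 0  # hypothetical score: higher means more essential
    children: List["Step"] = field(default_factory=list)


def linearize(root: Step, min_priority: int = 0) -> List[str]:
    """Flatten the tree depth-first (parent before children) into numbered lines.

    A step below min_priority is hidden together with its whole subtree.
    """
    lines: List[str] = []

    def visit(node: Step, depth: int) -> None:
        if node.priority < min_priority:
            return
        lines.append(f"{'  ' * depth}{len(lines) + 1}. {node.summary}")
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return lines


tree = Step("Revenue grew 12.3% quarter over quarter", 3, [
    Step("Pulled revenue from the general ledger", 2, [
        Step("Normalized currency codes", 0),
    ]),
    Step("Computed growth from Q2 to Q3", 2),
])

for line in linearize(tree, min_priority=1):
    print(line)
```

Raising or lowering min_priority is the knob that trades completeness for brevity: with min_priority=0 the housekeeping step about currency codes reappears in the output.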

Moreover, every number in Sapien’s analyses comes with precise citations back to source data. When Sapien says revenue grew 12.3%, you can click through to see exactly which cells, database entries, or calculations produced that figure. This isn't just about transparency—it's about building the confidence that comes from perfect verifiability.
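As a sketch of what number-level citations can look like in code, the example below carries source references along with every computed value. The SourceRef and CitedValue types, the spreadsheet-style location strings, and the revenue figures are all assumptions made for illustration, not Sapien's data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SourceRef:
    location: str  # hypothetical locator, e.g. a spreadsheet cell
    value: float


@dataclass(frozen=True)
class CitedValue:
    value: float
    formula: str                   # human-readable description of the calculation
    sources: Tuple[SourceRef, ...]  # every input that produced the value


def growth_rate(previous: SourceRef, current: SourceRef) -> CitedValue:
    """Compute growth and keep the exact inputs attached to the result."""
    rate = (current.value - previous.value) / previous.value
    return CitedValue(rate, "(current - previous) / previous", (previous, current))


q2 = SourceRef("revenue.xlsx!B7", 100_000.0)
q3 = SourceRef("revenue.xlsx!C7", 112_300.0)
growth = growth_rate(q2, q3)

print(f"Revenue grew {growth.value:.1%}")
for ref in growth.sources:
    print(f"  from {ref.location} = {ref.value:,.0f}")
```

The design point is that the citation is created at the same moment as the number, so a figure can never reach the user without its provenance.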

The kicker here is that users don’t have to take answers at face value. Unlike any other tool finance teams have worked with, Sapien lets them challenge every assumption. If an analyst disagrees with where Sapien is pulling data from or how it’s thinking about a question, all they have to do is spell that out and Sapien can course-correct. This ability to disagree and follow up means that even when Sapien gets something wrong, it can get back on track and learn for next time.

The challenge, of course, is presenting all of this information without overwhelming users. Creating an interface that balances complete transparency with usability remains one of our core design challenges. We're continuously iterating on how to expose this verification layer while maintaining a clean, intuitive interface.

We use verifiable tools to minimize hallucinations

Of course, a clear view of wrong answers doesn't help anyone. Our approach to reliability therefore has multiple layers, all rooted in the way we build our agents.

Most centrally, we use LLMs a lot less than you might think. LLMs are really good at understanding context and knowing what to do next, but they’re awful at actually doing that thing reliably. So we only use LLMs in a few main areas:

  1. Data contextualization, as discussed in another of our blog posts here
  2. Query understanding and problem breakdown
  3. Routing between deterministic tools and answer reflection

We can then minimize hallucinations and ensure verifiability in a few key ways.

First, we constrain our code generation system to use verifiable tools. Rather than letting the language model run free with calculations, it primarily routes between verified nodes or through tested code paths. This dramatically reduces the chance of errors, particularly crucial when small mistakes in financial calculations can compound into major issues.
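A minimal sketch of this routing pattern follows, assuming a hypothetical registry of deterministic tools. The tool names and the run_tool helper are invented for this example. The language model's only job would be to pick a tool name; the arithmetic itself runs in ordinary tested code, and anything outside the registry is rejected.

```python
from typing import Callable, Dict, List


def total(values: List[float]) -> float:
    return sum(values)


def average(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical registry: only functions listed here can ever be executed.
VERIFIED_TOOLS: Dict[str, Callable[[List[float]], float]] = {
    "total": total,
    "average": average,
}


def run_tool(tool_name: str, values: List[float]) -> float:
    """Run a model-selected tool, but only if it is in the verified registry."""
    if tool_name not in VERIFIED_TOOLS:
        raise ValueError(f"Unknown or unverified tool: {tool_name!r}")
    return VERIFIED_TOOLS[tool_name](values)


# In a real system the tool name would come from the model's routing decision.
print(run_tool("total", [1.0, 2.0, 3.0]))
print(run_tool("average", [2.0, 4.0]))
```

With this shape, a hallucinated tool name fails loudly instead of silently producing a number, and each registered function can be unit-tested on its own.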

Second, we treat every interaction as a learning opportunity through human-in-the-loop feedback. When analysts validate or correct our work, that feedback helps refine Sapien’s approach for their specific company (while maintaining utmost data security). This isn't just about catching errors: it's about learning what makes a good analysis from the perspective of experienced financial professionals, which builds towards our vision of a robust financial reward function.

Third, we constrain the solution space through structured workflows. These workflows, which formed the core of our initial product offering, ensure our system approaches problems in ways that align with established financial practices. Key processes can be outlined and executed effectively without giving LLMs too much free rein early on, which drives ROI faster than previously thought possible. Furthermore, correctness is often easy to demonstrate here by backtesting on previous weeks, months, or quarters and comparing the results.
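Backtesting a workflow can be as simple as replaying it over past periods and comparing against the figures that were actually reported. The sketch below assumes a hypothetical gross-margin workflow, made-up historical numbers, and an arbitrary tolerance; it illustrates the idea and does not describe how Sapien runs its backtests.

```python
from typing import Callable, Dict, List, Tuple

Inputs = Dict[str, float]


def gross_margin(inputs: Inputs) -> float:
    """Hypothetical workflow under test: (revenue - COGS) / revenue."""
    return (inputs["revenue"] - inputs["cogs"]) / inputs["revenue"]


def backtest(
    workflow: Callable[[Inputs], float],
    history: Dict[str, Tuple[Inputs, float]],
    tolerance: float = 0.01,
) -> List[str]:
    """Return the periods where the workflow disagrees with the recorded result."""
    failures = []
    for period, (inputs, expected) in history.items():
        if abs(workflow(inputs) - expected) > tolerance:
            failures.append(period)
    return failures


# Made-up history: period -> (inputs, value reported at the time).
history = {
    "2024-Q1": ({"revenue": 200.0, "cogs": 120.0}, 0.40),
    "2024-Q2": ({"revenue": 250.0, "cogs": 150.0}, 0.40),
    "2024-Q3": ({"revenue": 300.0, "cogs": 150.0}, 0.45),
}

print(backtest(gross_margin, history))
```

A non-empty result does not say which side is wrong: a flagged period means either the workflow or the historical record needs a closer look.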

Finally, we maintain a comprehensive evaluation framework that lets us continuously test and improve our system's performance across different types of financial analyses. This connects to our broader work on company representation while ensuring we're not just technically accurate, but providing genuinely useful insights.
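To illustrate testing "across different types of financial analyses", here is a small sketch that scores evaluation cases per analysis type. The case format, the type labels, and the tolerance are assumptions for this example; a real framework would also need to judge usefulness, which a numeric comparison cannot capture.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical case format: (analysis_type, expected_value, produced_value)
EvalCase = Tuple[str, float, float]


def accuracy_by_type(cases: List[EvalCase], tolerance: float = 1e-6) -> Dict[str, float]:
    """Fraction of cases within tolerance, reported separately per analysis type."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for analysis_type, expected, produced in cases:
        total[analysis_type] += 1
        if abs(expected - produced) <= tolerance:
            passed[analysis_type] += 1
    return {name: passed[name] / total[name] for name in total}


cases = [
    ("variance", 10.0, 10.0),
    ("variance", 5.0, 5.5),
    ("forecast", 3.0, 3.0),
]

print(accuracy_by_type(cases))
```

Reporting per type rather than as a single score makes it visible when one kind of analysis regresses while the overall average still looks healthy.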

We iterate closely with customers to build trust

Ultimately, earning and maintaining trust requires excellence in both our AI systems and how we present them, but it's also about working closely with our end users. The key here is rapid iteration: understanding how each persona interacts with the product, and continuing to build confidence while keeping Sapien as intuitive as possible.

This work is made a lot easier by speaking every single week with analysts, VPs, and CFOs who continue to be excited about Sapien and who provide invaluable feedback on how they can most effectively trust it and integrate it seamlessly into their work.

This dual challenge of technical reliability and user trust has shaped everything we build at Sapien. And as we continue pushing the boundaries of what AI can do in finance, maintaining and strengthening this trust remains our north star, driving both our technical development and our interface design.