How Sapien Learns Company Representations

A technical dive into how Sapien understands companies from the inside and builds intuitions across massive data silos.

Running a company is hard.

“Can AI run a company?” is the core question behind the work we’re doing at Sapien. But, in order to run a company, you need to understand the company.

This is an undertaking that no previous solution was built for. “Understanding” is also an area where even the most cutting-edge AI research currently underperforms. This shortcoming is glaring: how can you make capital allocation decisions, synthesize models, or really do any form of analysis for a company if the deep context behind how the company operates is missing?

This has made our most crucial technical goal a simple one to articulate, and a lofty one to accomplish. We seek to learn the representation of a company: building the embedding model for company financials that harnesses massive amounts of data across distinct, siloed sources to understand how a company operates.

What’s in a question?

Let’s take a seemingly simple question a CFO might ask their analyst: “What were our profit drivers?” Answering this question is not just about pulling numbers from SharePoint or NetSuite.

We need to localize, process, and understand demand across data silos, from CRMs to ERPs. We need to consolidate this information and connect it to context on margin changes and growth for different products. And we need to dig into what actually contributed to these changes—was it a greater focus on sales, or a push by a certain rep? Or was it a cut in headcount, or just a general decrease in overhead costs? And this is just scratching the surface.

And while identifying all of this information, we’re looking at more than just the numbers and their location. Across documents and systems of record, each number carries formatting and context built to guide humans in analysis. One number is marked as yellow, another is bolded, another has an annotation a few cells over. Each of these aspects needs to be contextually interpreted. And then this understanding is channeled into analysis and formats that CFOs and analysts are used to seeing: their comfortable graphs and Excel models.

But even once we figure out profit drivers, we need to understand our results by cross-checking with existing beliefs about the company. Did we come to similar conclusions last month? If so, are we sure these should be consistent? And if not, did we make any errors? If we didn’t—is this an important anomaly to flag for the team? And going all the way back to first principles, what is this company’s definition for profit? And what KPIs do they care about?

This is a lot. And to be done effectively, every step needs to be completely informed. This is the distinction between an amateur analyst and one that’s been on the job for years, developing deep intuitions about the company. Without this, Sapien becomes another “BI without the I” tool collecting dust on the shelf rather than a superhuman coworker.

So how do we go about building a true understanding engine for companies?

We plug into all the places a company lives

It starts with all the different data stores that house information today. The truth is that they’re a mess, but the main areas that cover the majority of company information are:

  1. Arbitrary files: Excels (the most important, by far), PDFs, PowerPoints, CSVs…
  2. ERPs: NetSuite, Microsoft Dynamics, SAP, Workday…
  3. CRMs: Salesforce, Hubspot…
  4. Data Lakes: Snowflake, BigQuery…

Processing each of these presents its own complex challenge, but the central principle behind our approach is that we don’t just care about numbers: we care about the story behind each number.

Thus, the way we interact with each of these modalities, and ultimately build up our company representation, is all about crafting the story for each individual number so that Sapien is armed with all the context it needs to execute complex tasks.

And we then contextualize every data point

We do this by plugging directly into these sources of truth, whether uploaded files or API integrations, to have immediate access to the information as it changes. To understand these inputs effectively, we build an initial parsing layer that is specifically tailored for each type of data. The goal of these systems is to harmonize arbitrary formats into more standardized and fully contextualized DataFrames and JSONs that can then be processed structurally downstream. We aim to represent the entire data layout of a company; and given a query, we aim to return the relevant data to answer it. What does this mean?

It means our process looks like this:

  1. Integrate directly into the data source via API
  2. Localize each atomic unit, often either a table or specific route
  3. Compute the contextual information and hierarchy, both around and within it
  4. Construct an embedding of the atomic unit using both content and context
  5. Insert the fully processed atomic unit into the graph representation
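The steps above can be sketched in code. This is purely illustrative: the names (`AtomicUnit`, `KnowledgeGraph`, `toy_embed`) are invented for this sketch, and the toy embedding stands in for whatever real model is used.

```python
from dataclasses import dataclass, field

@dataclass
class AtomicUnit:
    source: str          # e.g. "netsuite" or "2024 Projections.xlsx"
    locator: str         # table range or API route (step 2: localization)
    content: str         # the raw values
    context: str = ""    # surrounding labels, notes, hierarchy (step 3)

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic stand-in for a real embedding model call.
    return [float((hash(text) >> (4 * i)) & 0xF) for i in range(dim)]

@dataclass
class KnowledgeGraph:
    nodes: list = field(default_factory=list)

    def insert(self, unit: AtomicUnit, vector: list[float]) -> None:
        self.nodes.append((unit, vector))

def ingest(units: list[AtomicUnit], graph: KnowledgeGraph) -> None:
    for unit in units:                              # step 2: localized units
        text = f"{unit.context}\n{unit.content}"    # step 3: attach context
        vector = toy_embed(text)                    # step 4: embed content + context
        graph.insert(unit, vector)                  # step 5: insert into the graph

graph = KnowledgeGraph()
ingest([AtomicUnit("2024 Projections.xlsx", "Sheet1!A1:F20", "738.26",
                   "Margin / 05-2024 / Product ABC-123 (thousands)")], graph)
```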

The key process here is localizing, contextualizing, and arranging all the atomic units. For example, in an Excel, each value is connected to a row, column, table, sheet, Excel, and directory. More specifically:

  • The number $738.26
  • In cell E6 (not actually explicitly defined, but rather a formula linked to 3 other sheets)
  • And this cell might be in a row called “Margin”
  • In a column called “05-2024”
  • In a table called “Product ABC-123 specifics” (which notes that all values are in thousands)
  • In a sheet called “Plant A financials” (that has notes describing how projections were calculated)
  • In an Excel called “2024 Projections”, last modified on January 3, 2024.

For Excels, our agents look into every table, row, column, and cell in this manner to build up a hierarchical representation of all numbers contextualized with this information. See SpreadsheetLLM for a recent attempt at this problem. And we approach API integrations similarly, with agents that explore the key routes through unique SQL requests to split up underlying information into smaller tables. It takes a bit longer to fully map out the topology of this data due to rate limits and the sheer volume and complexity of slicing.
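To make the hierarchy above concrete, here is a minimal sketch of flattening one cell and its surrounding layers into a single textual “story” that travels with the number. The function and field names are invented for illustration, not Sapien’s actual schema.

```python
def cell_story(value: str, row: str, column: str, table: str,
               table_note: str, sheet: str, workbook: str,
               modified: str) -> str:
    """Render one cell plus its surrounding hierarchy as a textual story."""
    return (
        f"{value} | row: {row} | column: {column} | "
        f"table: {table} ({table_note}) | sheet: {sheet} | "
        f"workbook: {workbook}, last modified {modified}"
    )

story = cell_story(
    value="$738.26",
    row="Margin",
    column="05-2024",
    table="Product ABC-123 specifics",
    table_note="values in thousands",
    sheet="Plant A financials",
    workbook="2024 Projections",
    modified="2024-01-03",
)
```

A string like this, rather than the bare number, is what downstream systems would reason over.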

Some exciting directions to further improve this include:

  • Convolutional architectures for Excel parsing (treating cells as high-dimensional pixels)
  • Learning a unique set of heuristics for each company that parses Excels just for their format
  • Fine-tuned SQL agents (especially for the most used integrations)
  • HITL onboarding (ask questions to users to better understand data fringes)

So we can build the graph that ties them all together

With all of this context, we map out relationships between key concepts both semantically and hierarchically to enable fast retrieval of key information given a query. These relationships can then be further adapted as the system learns new connections between concepts over time.

Part of this work entails constructing contextual embeddings based not just on the data itself but a textual representation of all of this context (before it was made cool by Anthropic). An exciting direction we’ve started digging into here is fine-tuning our own verticalized embedding model. Through our unique access to aggregated and anonymized pairwise comparison data from users, we can build a better understanding of conceptual relationships.
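The core idea of a contextual embedding can be sketched in a few lines: embed a textual description of the context alongside the data itself. `toy_embed` is a deterministic placeholder for a real embedding model, assumed here for illustration.

```python
def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Placeholder for a call to an actual embedding model.
    return [float((hash(text) >> (4 * i)) & 0xF) for i in range(dim)]

def contextual_embedding(content: str, context: str) -> list[float]:
    # Prepend the context so the vector encodes where the data lives
    # and what it means, not just its raw value.
    return toy_embed(f"Context: {context}\nData: {content}")

vec = contextual_embedding(
    content="738.26",
    context="Margin, 05-2024, Product ABC-123, values in thousands",
)
```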

Once these embeddings are created, we can store a reference to data access (i.e. cell location ranges for Excels, functions for integrations) as a node in our knowledge graph. This enables Sapien to search for information across a company relevant to key questions and to have a structured & verifiable method to access the data directly, without actually storing it.
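One way a node like this might look, as a sketch: the graph stores how to fetch the data (a cell range, an integration function), not the data itself. `DataNode` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataNode:
    concept: str                 # e.g. "Plant A margin, May 2024"
    vector: list[float]          # contextual embedding used for retrieval
    fetch: Callable[[], dict]    # verifiable accessor into the source of truth

# An Excel-backed node might fetch a cell range on demand:
node = DataNode(
    concept="Plant A margin, May 2024",
    vector=[0.1, 0.2, 0.3],
    fetch=lambda: {"range": "'Plant A financials'!E6", "value": 738.26},
)
value = node.fetch()["value"]    # the data is pulled only at access time
```

Because the node holds a reference rather than a copy, the answer always reflects the source of truth at query time.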

And some exciting additional directions here are:

  • Fine-tuned embedding models (leveraging unique pairwise-comparison HITL data)
  • Data memory cache (enabling faster retrieval for regularly used data paths/concepts, much like L1/L2/L3 CPU caches)
  • Ahead-of-time graph creation (if you learn someone cares about a certain type of data, precompute the function to get that data and store that route as a node so it’s a direct pull)
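As one illustration of the data-memory-cache direction, frequently used data paths could be memoized so repeat pulls skip the integration round trip. `fetch_path` is a stand-in for a real integration call; this is a sketch of the idea, not the implementation.

```python
from functools import lru_cache

calls = []  # tracks actual (non-cached) fetches, for illustration

@lru_cache(maxsize=128)
def fetch_path(path: str) -> str:
    calls.append(path)              # only runs on a cache miss
    return f"data for {path}"       # placeholder for a real API pull

fetch_path("netsuite/gl/margin")    # first pull hits the source
fetch_path("netsuite/gl/margin")    # repeat pull is served from cache
```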

And even bring in the implicit knowledge beyond these sources

Building a comprehensive company representation goes beyond our data distribution graph—it requires capturing the implicit knowledge that makes analysts invaluable. While many teams work on data graphs for specific domains, there's a distinct challenge in understanding a company's workflows, templates, and informal knowledge that often exists only in scattered documents or verbal communications. For instance, effectively cross-checking financial results requires the kind of intuition typically gained from sitting in on quarterly board meetings and developing a feel for what makes sense and what doesn't.

Our solution? Library learning.

What does that mean? It’s the simple idea that when you do something of note, you generalize it, store it in a library, and then pull from that library to apply it to other queries and tasks when applicable. See DreamCoder and Voyager for good references.
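The loop can be sketched in miniature: solve, generalize, store, reuse. Everything here is illustrative; the expensive path stands in for LLM planning, and the stored workflow stands in for a generalized abstraction.

```python
library: dict[str, str] = {}   # task signature -> generalized workflow

def solve(task: str) -> str:
    if task in library:                      # reuse a stored abstraction
        return library[task]
    workflow = f"plan-from-scratch({task})"  # expensive path (e.g. LLM planning)
    library[task] = workflow                 # generalize + store for next time
    return workflow

first = solve("monthly margin forecast")     # planned from scratch
second = solve("monthly margin forecast")    # pulled from the library
```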

To address this challenge, we build up company-specific libraries of key concepts and workflows that help Sapien understand what data matters for particular questions. These structured workflows guide our system to take the right analytical steps—like pulling specific data points known to be pertinent for certain analyses—which then feeds into our code generation suite to produce verifiable analyses and graphs in the company's preferred format. This combination of formal data structures and implicit knowledge creates a company representation that goes beyond just numbers, enabling Sapien to develop the same deep context and intuition that experienced analysts bring to their work.

We learn libraries in three key forms:

  1. Concepts: Aimed at capturing the implicit thoughts that analysts learn on the job or from reading various sources. For example: having priors to know what KPIs are important for a specific team.
  2. Code: When we generate code, we grow and use our library functions to aid in program synthesis. For example: learning and reusing a function that generates graphs in the company format.
  3. Trees: At runtime, we get feedback on our reasoning trace that creates an MCTS-like tree, letting us generalize common workflows to avoid replanning each time. For example: asking to update the same forecast many times just with a slightly different scenario.
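The “Code” form, for instance, might look like the following sketch: a verified helper is registered once and preferred over fresh generation in later synthesis. The names (`register`, `synthesize`, `company_chart`) are invented for illustration.

```python
code_library: dict[str, str] = {}

def register(name: str, source: str) -> None:
    # Store a verified, reusable function in the code library.
    code_library[name] = source

def synthesize(task: str) -> str:
    # Prefer a verified library function over generating code from scratch.
    if "graph" in task and "company_chart" in code_library:
        return code_library["company_chart"]
    return f"# newly generated code for: {task}"

register("company_chart",
         "def company_chart(df): ...  # house style: titles, units, colors")

program = synthesize("graph Q2 margins")
```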

By growing these libraries, we increasingly reduce our reliance on LLMs for tasks like deep planning or code generation, instead referring to these library functions. Human feedback helps verify these functions, ensuring the building blocks behind all our processes are correct. This significantly improves both performance and confidence in our results.

Going further beyond verifiability, every library becomes unique to a company and allows Sapien to learn their specific templates and workflows. However, this approach requires strong feedback mechanisms, which we address through human-in-the-loop (HITL) processes. Our ultimate goal is to use this to build the financial reward function, which you can read more about here. See Let’s Verify Step by Step for insights on how you build such a reward model.

Some promising directions in this area include:

  • Employing typing constraints to restrict the program synthesis search space in code generation
  • Utilizing a mixture of online and offline learning for library updates
  • Continuously updating the concept/workflow library through implicit & explicit user feedback (intermediate signals to improve our financial reward model)
  • Human-in-the-loop onboarding (not just from the data lens, but from the prioritization/intuitions lens)

And there’s still a lot more to do

The company representation is the heartbeat of what we’re building at Sapien, essential for driving value for customers and the most ambitious technical problem we’re solving. By breaking down the problem into these pieces and verticalizing, we’ve been able to move incredibly fast to get to our current form of the embedding model for company financials.

But, as discussed here, there are countless exciting directions to explore next and limitless improvements to be made in building the AI that understands the world’s largest companies. If these problems excite you, and you’re interested in learning more about how we turn this cutting-edge research into value-driving products, shoot us a note at talent@sapien.team or check out how you can get involved here.