Vijay Raina is a titan in the SaaS and software engineering space, known for his deep technical insights into enterprise architecture and behavioral modeling. Currently specializing in the intersection of machine learning and human intent, Vijay has spent years dissecting why the current wave of Large Language Models, while impressive at generating prose, often stumbles when faced with the gritty reality of predicting what a person will do next. With a background rooted in pure mathematics and a career built on solving the complex puzzles of data science, he offers a unique perspective on the “agentic” future of technology. In this conversation, we explore the nuances of behavioral foundation models, the technical debt of legacy heuristics, and the pursuit of privacy-preserving intelligence that goes beyond simple text synthesis.
This discussion centers on the fundamental limitations of using general-purpose language models to predict human intent, emphasizing that skill at next-token prediction does not automatically translate into high-stakes decision-making. We delve into the architectural differences between language-based transformers and behavior-centric models that add graph neural networks to handle high-cardinality data such as hashed user IDs and specific product interactions. The conversation also covers the infrastructure challenges of running real-time behavioral predictions at millions of queries per second, the shift toward “agentic” workflows that require a dedicated decision layer, and the importance of privacy-preserving machine learning techniques such as differential privacy and homomorphic encryption in earning consumer trust.
General-purpose language models are highly effective at synthesizing text, but intent prediction often involves decision-making under uncertainty. Why do you believe the inductive bias of standard LLMs falls short when it comes to forecasting human behavior?
The most fascinating thing about models is that they will do exactly what you train them to do, which, ironically, is also their greatest weakness. With Large Language Models, the training objective, and the inductive bias that comes with it, is built around ingesting all the text in the world, and perhaps all the video, to become exceptionally good at predicting the next token in a sequence. While this allows for emergent behaviors, like a model writing a rap in the style of Shakespeare, it doesn’t naturally give the model the ability to forecast or to make complex decisions under uncertainty. Decision-making is fundamentally different from synthesis: it requires an understanding of expected value and conditional probabilities that a conversational string doesn’t always capture. I often think back to my days studying pure math, where every problem was a mini-puzzle, and I see intent prediction as a similar series of puzzles that require more than context; they require a base representation of behavior itself. LLMs are phenomenal tools for synthesizing information within a context, but without the right specialized layers, they lack the “decision muscle” that real-world forecasting demands.
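To make the distinction concrete, here is a minimal Python sketch of decision-making by expected value, assuming a model already supplies P(outcome | action, context) for each candidate action. The action names, probabilities, and payoffs are hypothetical illustrations, not figures from the interview.

```python
from typing import Dict, Tuple

# P(outcome | action, context) for each candidate action, plus the payoff of each
# outcome under that action. All numbers used below are illustrative, not real data.
ConditionalProbs = Dict[str, Dict[str, float]]
Payoffs = Dict[str, Dict[str, float]]


def expected_value(outcome_probs: Dict[str, float], payoffs: Dict[str, float]) -> float:
    """E[payoff | action] = sum over outcomes of P(outcome | action) * payoff(outcome)."""
    total = sum(outcome_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities must sum to 1, got {total}")
    return sum(p * payoffs[outcome] for outcome, p in outcome_probs.items())


def choose_action(probs: ConditionalProbs, payoffs: Payoffs) -> Tuple[str, Dict[str, float]]:
    """Pick the action with the highest expected value and return every EV for inspection."""
    evs = {action: expected_value(p, payoffs[action]) for action, p in probs.items()}
    best = max(evs, key=evs.get)
    return best, evs


if __name__ == "__main__":
    probs = {
        "show_offer": {"purchase": 0.10, "no_purchase": 0.90},
        "do_nothing": {"purchase": 0.04, "no_purchase": 0.96},
    }
    payoffs = {
        # Showing the offer costs 1.0 when it does not convert (hypothetical cost).
        "show_offer": {"purchase": 20.0, "no_purchase": -1.0},
        "do_nothing": {"purchase": 20.0, "no_purchase": 0.0},
    }
    best, evs = choose_action(probs, payoffs)
    print(best, evs)
```

With these illustrative numbers, showing the offer wins (an expected value of about 1.1 versus 0.8) even though it usually fails, because the payoff of the rare success outweighs the small cost of the common failure. The hard part in practice is estimating the conditional probabilities, which is the job a behavioral model is meant to do.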
You’ve described your work as building a foundation model of behavior rather than text. How does this shift in focus change the way you approach data collection and training compared to the models we see from major frontier labs?
The distinction lies in the data and the training objective, which are the two biggest pillars of any model. For a behavioral foundation model, we aren’t just scraping the public internet; we are working with proprietary, sensitive, and often identifiable data, even if it is tied to an anonymous browser session or a hashed ID. We treat these behaviors as tokens, but the cardinality is vastly different—where a language model might deal with 300,000 to 500,000 base tokens, the number of distinct human behaviors in the world is easily three orders of magnitude larger. Our goal isn’t to create a chatbot experience that feels pleasant or “correct” in a subjective conversation; instead, we want to create a base representation that is broadly predictive of future actions. This means that even if a model encounters a product or a scenario it hasn’t seen in its training set, it can still perform effectively, much like how an LLM can write about a topic it didn’t specifically study by drawing on its foundational understanding of language. It’s about moving away from the “parlor tricks” of world models, like knowing McDonald’s and Burger King are similar, and moving toward a system that understands the underlying “operating system” of human choice.
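For a rough sense of what that cardinality means for storage, the sketch below computes the size of a dense embedding table, assuming 128-dimensional vectors stored as 32-bit floats and a 500,000-token text vocabulary scaled up a thousandfold. It also shows the hashing trick, one common way (not necessarily the one used here) to give unseen behaviors a bounded token id; `behavior_token` and its arguments are hypothetical names.

```python
import hashlib


def embedding_table_bytes(vocab_size: int, dim: int, bytes_per_param: int = 4) -> int:
    """Memory for a dense embedding table: one dim-sized float vector per token."""
    return vocab_size * dim * bytes_per_param


def behavior_token(event_type: str, item_id: str, num_buckets: int) -> int:
    """Map an open-ended behavior (event type + item) to a bounded token id.

    This is the "hashing trick": unseen behaviors still get a valid id, at the
    cost of occasional collisions between unrelated behaviors.
    """
    key = f"{event_type}:{item_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


if __name__ == "__main__":
    text_vocab = 500_000                  # assumed size of a large text vocabulary
    behavior_vocab = text_vocab * 1_000   # "three orders of magnitude larger"
    for name, size in [("text", text_vocab), ("behavior", behavior_vocab)]:
        gb = embedding_table_bytes(size, dim=128) / 1e9
        print(f"{name}: {size:,} tokens -> {gb:,.3f} GB at dim=128, fp32")
    print(behavior_token("add_to_cart", "sku-123", num_buckets=10_000_000))
```

Under these assumptions the text table needs 256 MB while the behavior table needs 256 GB, which is why high-cardinality vocabularies push designs toward hashing or toward computing representations rather than storing one per token.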
In your architecture, you utilize both large-scale transformers and graph neural networks. Could you explain why a purely language-based transformer isn’t sufficient for handling the identity and connectivity challenges inherent in behavioral data?
While it’s true that “attention is all you need” for many tasks, the reality of behavioral AI requires us to look at identity in a way that standard language models don’t. We use large-scale transformers for the sequence modeling, but we also integrate graph neural networks because we are constantly dealing with anonymous identifiers that need to be connected while strictly respecting privacy. In a graph world, you have questions about identity and the relationship between different nodes—such as a user moving from one website to another—that are better served by a GNN. This allows us to handle discrete, high-cardinality tokens like specific hashed user IDs or product codes that would overwhelm a standard embedding table. We have to worry about the literal training process, running PyTorch on distributed GPUs in our data centers, and the sheer amount of electricity required to map these complex relationships. By combining these architectures, we can create a model that is not just reactive to text but is deeply aware of the structural connections between different behavioral signals.
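The core operation of a graph neural network is message passing: each node updates its representation using its neighbors’. The following minimal sketch shows one round of mean aggregation over a hypothetical graph linking hashed session IDs to the products they touched. It is a generic GraphSAGE-style illustration with fixed scalar weights, not the architecture described in the interview; the node names and the `self_weight` parameter are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Vector = List[float]


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets from (node, node) interaction edges."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def mean_vector(vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def message_passing_layer(
    features: Dict[str, Vector], adj: Dict[str, Set[str]], self_weight: float = 0.5
) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing.

    Each node's new vector mixes its own vector with the mean of its neighbors'.
    Real GNN layers (e.g. GraphSAGE) apply learned weight matrices and a
    nonlinearity; fixed scalar weights keep this sketch dependency-free.
    """
    updated: Dict[str, Vector] = {}
    for node, h in features.items():
        neighbors = [features[n] for n in adj.get(node, set()) if n in features]
        if not neighbors:
            updated[node] = list(h)  # isolated node keeps its own vector
            continue
        agg = mean_vector(neighbors)
        updated[node] = [self_weight * a + (1.0 - self_weight) * b for a, b in zip(h, agg)]
    return updated


if __name__ == "__main__":
    # Hypothetical hashed session IDs linked to the products they interacted with.
    features = {
        "session:a1f3": [0.0, 0.0],
        "session:9bc2": [0.0, 0.0],
        "product:shoe": [1.0, 0.0],
        "product:sock": [0.0, 1.0],
    }
    edges = [("session:a1f3", "product:shoe"), ("session:a1f3", "product:sock"),
             ("session:9bc2", "product:shoe")]
    out = message_passing_layer(features, build_adjacency(edges))
    for node, vec in out.items():
        print(node, vec)
```

After one round, the session that touched both products sits between them at [0.25, 0.25], even though its own starting vector carried no information; stacking layers lets signals travel further across the graph.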
The distinction between inductive and transductive models is crucial in your field. How do you manage the “cold start” problem for new users or behaviors when your system is operating at such a high cardinality?
This is one of the most significant challenges I’ve encountered since my time working on recommender systems. A transductive model is essentially frozen: it only has representations for the entities it saw during training, so anything new has to be brought in with heuristics. Human behavior, however, changes rapidly. Users take new paths and diverge from their previous patterns, and new websites and franchises open every day. To address this, we focus on architectures that are inductive by definition, which lets us add new nodes or user representations without retraining from scratch. Our current models may stay transductive on the behavior side, where further investment hits diminishing returns, but we spend a massive amount of time on the user side to make sure we can handle these shifts in real time. In the language world, you might have a vocabulary of hundreds of thousands of tokens; in our world, the number of behaviors is vastly larger, and we have to map new behaviors onto our core user representation without breaking the bank or the latency budget.
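The difference between the two approaches can be sketched in a few lines of Python. This is a hypothetical illustration, assuming a fixed (transductive) table of behavior vectors and a user representation computed by averaging the vectors of the user’s known behaviors; a production user encoder would be a learned sequence model rather than a plain mean.

```python
from typing import Dict, List, Optional

Vector = List[float]


class TransductiveUserTable:
    """One learned vector per user ID seen in training; unseen users get nothing."""

    def __init__(self, table: Dict[str, Vector]):
        self.table = table

    def embed(self, user_id: str) -> Optional[Vector]:
        return self.table.get(user_id)  # None for a cold-start user


class InductiveUserEncoder:
    """Computes a user vector from the user's behaviors, so new users are covered.

    The behavior table itself stays fixed (transductive on the behavior side);
    the user side is a function of observed behaviors, not a stored lookup.
    """

    def __init__(self, behavior_table: Dict[str, Vector], dim: int):
        self.behavior_table = behavior_table
        self.dim = dim

    def embed(self, behaviors: List[str]) -> Vector:
        known = [self.behavior_table[b] for b in behaviors if b in self.behavior_table]
        if not known:
            return [0.0] * self.dim  # nothing recognizable yet: neutral vector
        return [sum(v[i] for v in known) / len(known) for i in range(self.dim)]


if __name__ == "__main__":
    behaviors = {"view:shoe": [1.0, 0.0], "cart:shoe": [1.0, 1.0]}
    transductive = TransductiveUserTable({"user_1": [0.3, 0.3]})
    inductive = InductiveUserEncoder(behaviors, dim=2)
    print(transductive.embed("user_999"))
    print(inductive.embed(["view:shoe", "cart:shoe", "view:brand_new_item"]))
```

The transductive table has nothing for `user_999`, while the inductive encoder produces a vector for a brand-new user from the behaviors it recognizes and simply skips the one it has never seen.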
With the rise of “agentic” AI, many are looking for a decision layer that can act on a user’s behalf. What does a sophisticated decision layer look like when it’s integrated with a behavioral model versus a standard chatbot?
An agentic process with a true decision layer doesn’t just “will” an answer into existence through conversation; it reaches for specialized tools to forecast outcomes. Today, we see people using LLMs for coding or personal assistants, which is economically valuable, but these tasks don’t always require heavy decision-making—often you’re just telling the model exactly what to do, like “add this to my calendar.” The hard decisions, the ones that matter, are things like asking how much to spend on a marketing campaign today. To solve that, the agent needs to predict an expected value, which requires a specialized model that has been trained on the conditional probabilities of those specific outcomes. We see ourselves as that decision layer—the part of the stack that handles the forecasting and inference, regardless of whether the final product experience is a chatbot or a high-velocity API. It’s about moving past the “belief offloading” where we give up our decision-making to tools that aren’t actually equipped to handle the math of uncertainty.
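As a sketch of what such a decision layer does, the snippet below chooses today’s campaign spend by maximizing expected profit, value × E[conversions | spend] − spend, over a list of candidate budgets. The `saturating_forecast` function is a hypothetical stand-in for a trained behavioral model, and its constants are invented for illustration.

```python
import math
from typing import Callable, Iterable, Tuple


def saturating_forecast(spend: float, max_conversions: float = 200.0, scale: float = 1000.0) -> float:
    """Hypothetical stand-in for a behavioral model: expected conversions at a given spend.

    Conversions saturate as spend grows, giving diminishing returns.
    """
    return max_conversions * (1.0 - math.exp(-spend / scale))


def decide_budget(
    candidates: Iterable[float],
    forecast: Callable[[float], float],
    value_per_conversion: float,
) -> Tuple[float, float]:
    """Return (spend, expected profit) maximizing value * E[conversions | spend] - spend."""
    best_spend, best_ev = 0.0, -math.inf
    for spend in candidates:
        ev = value_per_conversion * forecast(spend) - spend
        if ev > best_ev:
            best_spend, best_ev = spend, ev
    return best_spend, best_ev


if __name__ == "__main__":
    spend, ev = decide_budget([0, 500, 1000, 2000, 4000, 8000], saturating_forecast, 50.0)
    print(f"spend {spend} today, expected profit {ev:.1f}")
```

With these made-up constants the layer picks 2,000 from the candidate list; the next candidate, 4,000, overshoots the point where an extra unit of spend stops returning at least a unit of expected value. Swapping in a real forecasting model changes the numbers, not the decision logic.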
Operating at a scale of millions of queries per second presents massive infrastructure hurdles. What specific optimization strategies do you employ to ensure high-velocity inferencing without burning through your entire margin?
When you are running millions of queries per second to decide whether to show an ad on a publisher’s page or what creative to display, you have to be incredibly smart about your inferencing stack. Our philosophy is to maximize margin later, but we can’t ignore the fact that we’d run out of money if we weren’t efficient. We use two primary strategies: first, we pre-compute as much as possible, trading memory for latency by having embedding lookup tables ready to go instead of generating everything on the fly. Second, we lean heavily into batching and queuing; even though it’s difficult to build, it’s often not much slower to process 500 requests at once than it is to process one. We also have to be mindful of our latency budget—sometimes we might even use a “backup” heuristic if the main model is taking too long, because an expected value slightly above zero is better than a timed-out request. It’s a constant balancing act between the software business model of adoption and the hard reality of GPU costs.
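The three techniques named above (precomputed lookups, batched scoring, and a heuristic fallback when the latency budget runs out) can be sketched together with Python’s `asyncio`. Everything here is a toy assumption: the embedding values, the `fallback_score` of 0.01, and `simulated_latency_s`, which stands in for real model and queueing delay.

```python
import asyncio
from typing import Dict, List

# Precomputed offline and held in memory: trades memory for latency.
# Toy vectors; a production table would hold millions of rows.
USER_EMBEDDINGS: Dict[str, List[float]] = {"u1": [1.0, 0.0], "u2": [0.5, 0.5]}
CREATIVE_EMBEDDINGS: Dict[str, List[float]] = {"c1": [0.2, 0.4]}


def score_batch(user_ids: List[str], creative_id: str) -> List[float]:
    """Score a whole batch with table lookups plus a dot product per request.

    On a GPU this would be one batched matrix multiply, which is why many
    requests can cost little more than one.
    """
    creative = CREATIVE_EMBEDDINGS[creative_id]
    zero = [0.0] * len(creative)  # unknown users fall back to a zero vector
    return [
        sum(u * c for u, c in zip(USER_EMBEDDINGS.get(uid, zero), creative))
        for uid in user_ids
    ]


async def model_score(user_ids: List[str], creative_id: str, simulated_latency_s: float) -> List[float]:
    await asyncio.sleep(simulated_latency_s)  # stand-in for queueing + model latency
    return score_batch(user_ids, creative_id)


async def score_with_fallback(
    user_ids: List[str],
    creative_id: str,
    budget_s: float,
    simulated_latency_s: float,
    fallback_score: float = 0.01,
) -> List[float]:
    """Return model scores if they arrive within the latency budget, else a heuristic."""
    try:
        return await asyncio.wait_for(
            model_score(user_ids, creative_id, simulated_latency_s), timeout=budget_s
        )
    except asyncio.TimeoutError:
        # A small positive expected value beats a timed-out request.
        return [fallback_score] * len(user_ids)


if __name__ == "__main__":
    fast = asyncio.run(score_with_fallback(["u1", "u2"], "c1", budget_s=0.05, simulated_latency_s=0.0))
    slow = asyncio.run(score_with_fallback(["u1", "u2"], "c1", budget_s=0.05, simulated_latency_s=0.2))
    print(fast, slow)
```

`asyncio.wait_for` cancels the slow call once the budget expires, so the caller always gets an answer in time; the design choice is to return a small positive score rather than nothing.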
Privacy is a significant hurdle when dealing with sensitive behavioral patterns. How are you incorporating advanced concepts like differential privacy and homomorphic encryption to maintain trust while still delivering accurate predictions?
Privacy isn’t just a checkbox for us; it’s a first-class citizen of our technical stack. We are heavily invested in the nascent world of privacy-preserving machine learning because companies have to trust us with their most sensitive data. We explore differential privacy, which mathematically bounds how much any single person’s data can change the result of a query, alongside related protections such as k-anonymity: if someone queries our dataset to find a specific person, they will always find a pool of at least, say, 25 people, and that ambiguity protects the individual. Even more exciting is homomorphic encryption, which goes back to my math roots: it allows us to compute on, and ultimately train models over, encrypted data without ever “seeing” the raw data itself. We don’t have the luxury of a massive social network where everything is tied to a single email ID, so we have to win by having the best algorithmic privacy measures in the industry. It’s about moving beyond simple security into hard math and science that we can eventually publish and share with the broader community.
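Two of these ideas can be sketched with the Python standard library. The first function checks k-anonymity with k = 25, the pool size mentioned above; the second answers a counting query with the Laplace mechanism, a standard way to achieve differential privacy. The records, field names, and epsilon value are hypothetical, and homomorphic encryption is omitted because it needs a dedicated cryptographic library.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]


def is_k_anonymous(records: List[Record], quasi_identifiers: Sequence[str], k: int = 25) -> bool:
    """k-anonymity: every combination of quasi-identifier values appears at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(
    records: List[Record], predicate: Callable[[Record], bool], epsilon: float, rng: random.Random
) -> float:
    """Epsilon-differentially private count via the Laplace mechanism.

    Assuming each person contributes one record, adding or removing a person
    changes the count by at most 1 (sensitivity 1), so noise scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    records = ([{"region": "west", "age_band": "25-34"}] * 30
               + [{"region": "east", "age_band": "35-44"}] * 10)
    print(is_k_anonymous(records, ["region", "age_band"], k=25))  # the 10-record group fails
    rng = random.Random(7)
    print(dp_count(records, lambda r: r["region"] == "west", epsilon=1.0, rng=rng))
```

The two guarantees are different: k-anonymity constrains what the released data looks like, while differential privacy constrains how much any one person can shift a query’s answer, and a smaller epsilon means more noise and stronger protection.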
What is your forecast for behavioral AI?
I believe we are heading toward a future where the distinction between “searching” for an answer and “predicting” a need will entirely disappear. Within the next few years, I expect behavioral models to become the silent engines behind every digital interaction, moving far beyond ads into areas like fraud prevention and hyper-personalized healthcare. We will see a shift where AI doesn’t just talk to us but anticipates the trade-offs we are willing to make, managing our “personalization budget” in real time. However, the real breakthrough won’t be in the size of the models, but in our ability to perform this level of prediction while maintaining total data sovereignty for the user through the privacy-preserving techniques we discussed today. The “wild west” of data scraping is ending, and the era of the high-trust, high-accuracy decision layer is just beginning.
