As companies move from experimenting with generative AI to deploying production-ready agents, a familiar problem keeps resurfacing: reliability. Large language models are powerful, but they are also unpredictable. A prompt that works perfectly once may fail the next time, forcing developers to add layers of retries, checks, and fallbacks just to keep systems running.
Over time, this defensive coding creates a mess. Business logic becomes tangled with error handling and inference strategies, making systems harder to maintain, test, and scale. New research from Asari AI, MIT CSAIL, and Caltech argues that this does not have to be the norm. Their work points to a cleaner architectural approach that separates what an AI agent should do from how it navigates uncertainty.
The core problem with today’s AI agents
Most agent-based systems mix two very different concerns in the same code. One is workflow logic, the clear sequence of steps needed to complete a task such as translating code, reviewing documents, or generating reports. The other is inference strategy, the method used to cope with uncertainty, like sampling multiple answers, refining outputs, or running verification loops.
When these are combined, even small changes become expensive. Switching from a simple “best-of-N” sampling approach to a more accurate beam search often means rewriting large parts of the agent’s control flow. As a result, many teams avoid experimenting with better strategies because the engineering cost is too high.
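A minimal sketch of that coupling, with a stub standing in for a real model call. The names fake_llm, score, and translate_tangled are hypothetical, and the verifier is deliberately trivial; the point is only that the sampling strategy sits inside the workflow loop, so changing the strategy means editing the loop.

```python
import random

def fake_llm(prompt: str, rng: random.Random) -> str:
    """Stand-in for a model call (assumption): sometimes returns a 'better' answer."""
    return prompt.upper() if rng.random() < 0.5 else prompt

def score(output: str) -> int:
    """Hypothetical verifier: counts uppercase characters."""
    return sum(ch.isupper() for ch in output)

def translate_tangled(steps: list[str], n: int = 4, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    results = []
    for step in steps:                                        # workflow logic
        candidates = [fake_llm(step, rng) for _ in range(n)]  # inference strategy
        best = max(candidates, key=score)                     # inference strategy
        results.append(best)                                  # workflow logic
    return results

outputs = translate_tangled(["parse file", "emit code"])
```

Moving this function from best-of-N to beam search would require restructuring the loop itself, which is the engineering cost described above.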
According to the researchers, this tight coupling limits reliability improvements and slows down innovation, especially in enterprise environments where stability and auditability matter.
A new model for scalable agent design
To address this, the research introduces a programming model called Probabilistic Angelic Nondeterminism, or PAN, along with a Python framework named ENCOMPASS. Developers write the ideal or “happy path” of an agent’s workflow, while inference-time decisions are handled separately by a runtime engine.
Using a simple construct called branchpoint(), developers mark where an LLM is invoked and where uncertainty may arise. The code itself assumes success. At runtime, ENCOMPASS interprets these markers and explores different execution paths using search algorithms such as depth-first search, beam search, or Monte Carlo tree search.
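To make the idea concrete, here is a toy illustration of angelic nondeterminism in plain Python. It is not the ENCOMPASS API: the Path class, the Reject exception, the run_dfs explorer, and the branchpoint(options) signature are all assumptions made for this sketch, and the fixed option lists stand in for candidate LLM outputs. The workflow is written as if every choice succeeds, and a separate explorer re-runs it with different choices using depth-first backtracking.

```python
class Reject(Exception):
    """Raised by a workflow when the current path is a dead end."""

class Path:
    """Records the choices made during one run of the workflow (hypothetical helper)."""
    def __init__(self, prefix):
        self.prefix = list(prefix)  # choices fixed in advance for this run
        self.trace = []             # (chosen index, number of options) per branchpoint

    def branchpoint(self, options):
        i = len(self.trace)
        idx = self.prefix[i] if i < len(self.prefix) else 0
        self.trace.append((idx, len(options)))
        return options[idx]

def run_dfs(program):
    """Depth-first exploration by replay: re-run the program with the next untried choice."""
    prefix = []
    while True:
        path = Path(prefix)
        try:
            return program(path)
        except Reject:
            trace = path.trace
            # Drop trailing branchpoints whose options are exhausted.
            while trace and trace[-1][0] + 1 >= trace[-1][1]:
                trace.pop()
            if not trace:
                raise  # every path was rejected
            idx, _ = trace.pop()
            prefix = [i for i, _ in trace] + [idx + 1]

def workflow(path):
    # Happy-path code: reads as if each choice were the right one.
    x = path.branchpoint([1, 2, 3])
    y = path.branchpoint([10, 20])
    if x * y != 40:
        raise Reject
    return x, y

result = run_dfs(workflow)  # (2, 20)
```

Swapping run_dfs for a different explorer would leave workflow untouched, which is the separation the researchers describe.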
This design creates what the authors call “program-in-control” agents. Instead of letting the model decide the entire flow, the program defines the structure, and the model is used only for specific subtasks. For enterprises, this approach offers greater predictability, easier auditing, and clearer boundaries between logic and inference.
Why decoupling logic and search matters
By treating inference as a search problem rather than hard-coded control flow, teams can swap strategies without rewriting business logic. A workflow that works with simple sampling can later use beam search or best-first search to improve accuracy, all without changing the workflow code itself.
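A small sketch of what swapping strategies can look like when the workflow exposes only "expand a state" and "score a state". Everything here is hypothetical: the TREE and SCORE tables are made-up data standing in for model outputs and a verifier, and beam_search is a textbook version that assumes all final states sit at the same depth. It is not taken from the framework.

```python
from functools import partial

# Hypothetical search space: each state lists the candidate next states.
TREE = {"start": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
SCORE = {"start": 0.0, "a": 0.6, "b": 0.5,
         "a1": 0.6, "a2": 0.7, "b1": 0.9, "b2": 0.4}

def expand(state):
    """Workflow side: which states can follow this one."""
    return TREE.get(state, [])

def score(state):
    """Workflow side: how promising a state looks."""
    return SCORE[state]

def beam_search(root, expand, score, width):
    """Keep the `width` best states at each depth; width=1 is greedy search."""
    beam = [root]
    while True:
        children = [c for s in beam for c in expand(s)]
        if not children:
            return max(beam, key=score)
        beam = sorted(children, key=score, reverse=True)[:width]

# The strategy is configuration; expand and score never change.
STRATEGIES = {
    "greedy": partial(beam_search, width=1),
    "beam": partial(beam_search, width=2),
}

def run_agent(strategy_name):
    return STRATEGIES[strategy_name]("start", expand, score)

greedy_result = run_agent("greedy")  # "a2"
beam_result = run_agent("beam")      # "b1"
```

In this toy data, greedy search commits to the branch that looks best early and ends at a score of 0.7, while a beam of width 2 keeps the second branch alive and reaches 0.9.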
This flexibility makes experimentation cheaper. Teams can test different reliability strategies, measure performance, and adjust based on cost or accuracy requirements. Over time, this reduces technical debt and makes agent systems easier to evolve.
Real-world impact in complex workflows
The researchers demonstrated the approach using a Java-to-Python code translation agent. The task involved translating files, generating inputs, and validating results by running the code. In a traditional setup, adding advanced search required building a state machine and manually managing variables, which obscured the core logic.
With ENCOMPASS, the same workflow stayed linear and readable. Search strategies were added simply by inserting branchpoint() calls before LLM invocations. The results showed that fine-grained beam search, applied at both file and method levels, outperformed simpler sampling approaches. Importantly, this was achieved without increasing code complexity.
The findings also suggested better scaling behavior. Performance improved in proportion to the logarithm of inference cost, meaning each multiplicative increase in compute spending bought a roughly constant gain in results.
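As a worked illustration of what logarithmic scaling implies, the sketch below uses a made-up curve. The coefficients a and b are hypothetical and are not figures from the research; only the shape of the relationship reflects the reported finding.

```python
import math

def predicted_score(cost: float, a: float = 0.30, b: float = 0.05) -> float:
    """Hypothetical log-linear curve: score = a + b * log2(cost)."""
    return a + b * math.log2(cost)

def gain(cost_from: float, cost_to: float) -> float:
    """Improvement in predicted score when moving between two cost levels."""
    return predicted_score(cost_to) - predicted_score(cost_from)

gain_1_to_2 = gain(1, 2)    # doubling from 1 unit of compute
gain_8_to_16 = gain(8, 16)  # doubling from 8 units of compute
```

Under this curve, both doublings yield the same gain (b), even though the second one costs eight times as much in absolute terms.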
Cost control and performance tuning
Inference costs remain a major concern for AI leaders managing budgets. The research compared traditional refinement loops, where models critique and rewrite their own output, with search-based approaches. In one case study, a best-first search achieved similar performance to multiple refinement cycles but at a lower cost per task.
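For readers unfamiliar with best-first search, a generic sketch follows. It is a textbook version built on Python's heapq, not the researchers' implementation; the numeric toy problem and the idea of counting expansions as a rough proxy for model calls are assumptions made for illustration.

```python
import heapq

def best_first_search(root, expand, score, is_goal, budget):
    """Always expand the highest-scoring open state; stop at a goal or when the budget is spent."""
    frontier = [(-score(root), 0, root)]  # min-heap, so scores are negated
    counter = 1                           # tie-breaker that keeps insertion order
    expansions = 0                        # proxy for the number of model calls
    while frontier and expansions < budget:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, expansions
        expansions += 1
        for child in expand(state):
            heapq.heappush(frontier, (-score(child), counter, child))
            counter += 1
    return None, expansions

# Toy problem (hypothetical): reach TARGET from 1 using "+1" or "*2".
TARGET = 10
found, cost = best_first_search(
    root=1,
    expand=lambda s: [s + 1, s * 2],
    score=lambda s: -abs(TARGET - s),
    is_goal=lambda s: s == TARGET,
    budget=50,
)
# found == 10, cost == 5
```

The budget parameter is where a cost ceiling would be enforced: with a budget of 3, the same call gives up and returns None after three expansions.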
This shows that inference strategy is not just a technical choice but a financial one. By externalizing it, teams can tune systems based on context. Internal tools might favor faster, cheaper strategies, while customer-facing applications can justify more thorough searches. Both can run on the same codebase.
Practical considerations and limitations
The framework is designed to complement existing tools like LangChain rather than replace them. It operates at the control-flow level, not at the prompt or tool-integration layer.
That said, it does not eliminate the need for careful engineering. Developers still need to identify where uncertainty exists and define how success is measured. In objective tasks like code translation, tests can validate outputs. In more subjective areas, such as summarization or creative writing, reliable scoring remains a challenge.
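For the objective case, a success measure can be as simple as a pass rate over test inputs. The sketch below is one possible scorer under that assumption; pass_rate, reference, and the candidate functions are hypothetical names, and Python callables stand in for an original program and its translation.

```python
def pass_rate(candidate, reference, inputs):
    """Fraction of inputs on which the candidate agrees with the reference."""
    passed = 0
    for x in inputs:
        try:
            if candidate(x) == reference(x):
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(inputs)

def reference(x):
    """Stand-in for the behavior of the original program."""
    return x * x

def good(x):
    """A faithful candidate translation."""
    return x ** 2

def buggy(x):
    """A candidate that agrees only on some inputs."""
    return x * 2

inputs = [0, 1, 2, 3]
good_score = pass_rate(good, reference, inputs)    # 1.0
buggy_score = pass_rate(buggy, reference, inputs)  # 0.5
```

No comparably simple function exists for judging a summary or a piece of creative writing, which is why scoring in those areas remains the harder problem.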
There are also operational concerns. Because the system may explore multiple execution paths, developers must manage external side effects carefully to avoid duplicate actions, such as repeated database writes or API calls.
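One common way to handle this, sketched here under assumptions, is to record side effects while exploring and execute them only for the path that is finally chosen. EffectBuffer and explore are hypothetical names and are not part of the framework; a Python list stands in for a database.

```python
class EffectBuffer:
    """Collects intended side effects so they can be run later, once."""
    def __init__(self):
        self._pending = []

    def defer(self, fn, *args):
        """Record an action instead of performing it immediately."""
        self._pending.append((fn, args))

    def commit(self):
        """Perform the recorded actions in order, then clear the buffer."""
        for fn, args in self._pending:
            fn(*args)
        self._pending.clear()

def explore(candidates, score, database):
    """Try every candidate path, but write to the database only for the winner."""
    runs = []
    for cand in candidates:
        buf = EffectBuffer()
        buf.defer(database.append, cand)  # the would-be database write
        runs.append((score(cand), cand, buf))
    _, best, best_buf = max(runs, key=lambda r: r[0])
    best_buf.commit()
    return best

database = []
winner = explore(["aa", "bbbb", "c"], score=len, database=database)
# winner == "bbbb", database == ["bbbb"]
```

Three paths were explored, but only one write reached the database. Effects that cannot be deferred, such as calls to third-party APIs that must happen mid-path, would need a different safeguard, for example idempotency keys.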
What this means for AI agent scalability
The approach behind PAN and ENCOMPASS reflects a broader software engineering principle: modularity. As AI agents become part of core business operations, they need the same engineering discipline that is applied to traditional systems.
Embedding probabilistic behavior directly into application logic makes systems fragile and hard to govern. Separating inference strategy from workflow logic allows teams to optimize, test, and audit each independently. It also supports better governance, since changes to AI behavior can be applied consistently without rewriting every agent.
As inference-time compute grows and workflows become more complex, architectures that isolate uncertainty will likely scale better than those that let it spread across the codebase.