• Microsoft has introduced Fara-7B, a new 7-billion parameter model designed to act as a Computer Use Agent (CUA) capable of performing complex tasks directly on a user’s device. Fara-7B sets new state-of-the-art results for its size, providing a way to build AI agents that don’t rely on massive, cloud-dependent models and can run on compact systems with lower latency and enhanced privacy.

    While the model is an experimental release, its architecture addresses a primary barrier to enterprise adoption: data security. Because Fara-7B is small enough to run locally, it allows users to automate sensitive workflows, such as managing internal accounts or processing sensitive company data, without that information ever leaving the device. 

    How Fara-7B sees the web

    Fara-7B is designed to navigate user interfaces using the same tools a human does: a mouse and keyboard. The model operates by visually perceiving a web page through screenshots and predicting specific coordinates for actions like clicking, typing, and scrolling.
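
    To make that loop concrete, here is a minimal sketch of a screenshot-in, action-out agent cycle. It illustrates the general pattern rather than Fara-7B's actual interface; the function names (capture_screenshot, predict_action, execute) are hypothetical placeholders.

    ```python
    # Minimal sketch of a screenshot-in, action-out agent loop (hypothetical names,
    # not Fara-7B's actual API). The model sees only pixels and emits an action
    # with screen coordinates, which a driver then replays with mouse and keyboard.

    from dataclasses import dataclass

    @dataclass
    class Action:
        kind: str          # "click", "type", "scroll", or "done"
        x: int = 0         # pixel coordinates predicted by the model
        y: int = 0
        text: str = ""

    def capture_screenshot() -> bytes:
        """Grab the current browser viewport as an image (placeholder)."""
        return b""

    def predict_action(task: str, screenshot: bytes, history: list[Action]) -> Action:
        """Ask the vision-language model for the next action (placeholder)."""
        return Action(kind="done")

    def execute(action: Action) -> None:
        """Replay the predicted action with a real mouse/keyboard driver (placeholder)."""
        print(f"executing {action.kind} at ({action.x}, {action.y})")

    def run(task: str, max_steps: int = 16) -> None:
        history: list[Action] = []
        for _ in range(max_steps):
            action = predict_action(task, capture_screenshot(), history)
            if action.kind == "done":
                break
            execute(action)
            history.append(action)

    if __name__ == "__main__":
        run("find the cheapest direct flight for next Tuesday")
    ```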

    Crucially, Fara-7B does not rely on "accessibility trees," the underlying code structure that browsers use to describe web pages to screen readers. Instead, it relies solely on pixel-level visual data. This approach allows the agent to interact with websites even when the underlying code is obfuscated or complex.

    According to Yash Lara, Senior PM Lead at Microsoft Research, processing all visual input on-device creates true "pixel sovereignty," since screenshots and the reasoning needed for automation remain on the user’s device. "This approach helps organizations meet strict requirements in regulated sectors, including HIPAA and GLBA," he told VentureBeat in written comments.

    In benchmarking tests, this visual-first approach has yielded strong results. On WebVoyager, a standard benchmark for web agents, Fara-7B achieved a task success rate of 73.5%. This outperforms larger, more resource-intensive systems, including GPT-4o, when prompted to act as a computer use agent (65.1%) and the native UI-TARS-1.5-7B model (66.4%).

    Efficiency is another key differentiator. In comparative tests, Fara-7B completed tasks in approximately 16 steps on average, compared to roughly 41 steps for the UI-TARS-1.5-7B model.

    Handling risks

    The transition to autonomous agents is not without risks, however. Microsoft notes that Fara-7B shares limitations common to other AI models, including potential hallucinations, mistakes in following complex instructions, and accuracy degradation on intricate tasks.

    To mitigate these risks, the model was trained to recognize "Critical Points." A Critical Point is defined as any situation requiring a user's personal data or consent before an irreversible action occurs, such as sending an email or completing a financial transaction. Upon reaching such a juncture, Fara-7B is designed to pause and explicitly request user approval before proceeding. 
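
    A minimal sketch of how such a gate can sit inside an agent loop, assuming an approval prompt surfaced to the user; the action categories and helper names here are illustrative, not Microsoft's actual implementation.

    ```python
    # Sketch of a "Critical Point" gate: before any irreversible or sensitive
    # action (sending an email, submitting a payment), the agent pauses and
    # asks the human for explicit approval. Names are illustrative only.

    IRREVERSIBLE_KINDS = {"send_email", "submit_payment", "place_order"}

    def is_critical_point(action_kind: str, needs_personal_data: bool) -> bool:
        return action_kind in IRREVERSIBLE_KINDS or needs_personal_data

    def ask_user_approval(description: str) -> bool:
        answer = input(f"Agent wants to: {description}. Approve? [y/N] ")
        return answer.strip().lower() == "y"

    def maybe_execute(action_kind: str, description: str, needs_personal_data: bool = False) -> None:
        if is_critical_point(action_kind, needs_personal_data) and not ask_user_approval(description):
            print("Paused: user declined, action not taken.")
            return
        print(f"Executing: {description}")

    # Example: a payment is gated, a scroll is not.
    maybe_execute("scroll", "scroll the product list")
    maybe_execute("submit_payment", "pay $42.10 for the selected item")
    ```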

    Managing this interaction without frustrating the user is a key design challenge. "Balancing robust safeguards such as Critical Points with seamless user journeys is key," Lara said. "Having a UI, like Microsoft Research’s Magentic-UI, is vital for giving users opportunities to intervene when necessary, while also helping to avoid approval fatigue." Magentic-UI is a research prototype designed specifically to facilitate these human-agent interactions. Fara-7B is designed to run in Magentic-UI.

    Distilling complexity into a single model

    The development of Fara-7B highlights a growing trend in knowledge distillation, where the capabilities of a complex system are compressed into a smaller, more efficient model.

    Creating a CUA usually requires massive amounts of training data showing how to navigate the web. Collecting this data via human annotation is prohibitively expensive. To solve this, Microsoft used a synthetic data pipeline built on Magentic-One, a multi-agent framework. In this setup, an "Orchestrator" agent created plans and directed a "WebSurfer" agent to browse the web, generating 145,000 successful task trajectories.

    The researchers then "distilled" this complex interaction data into Fara-7B, which is built on Qwen2.5-VL-7B, a base model chosen for its long context window (up to 128,000 tokens) and its strong ability to connect text instructions to visual elements on a screen. While the data generation required a heavy multi-agent system, Fara-7B itself is a single model, showing that a small model can effectively learn advanced behaviors without needing complex scaffolding at runtime.

    The training process relied on supervised fine-tuning, where the model learns by mimicking the successful examples generated by the synthetic pipeline.
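
    As a rough illustration of what that looks like in practice, the sketch below flattens one synthetic trajectory into (observation, next action) training pairs for supervised fine-tuning. The field names are assumptions made for the example, not the actual Magentic-One trajectory schema.

    ```python
    # Sketch: flattening a synthetic task trajectory into supervised fine-tuning
    # pairs (goal + observation -> next action). Field names are illustrative.

    trajectory = {
        "goal": "book a table for two at 7pm",
        "steps": [
            {"screenshot": "step0.png", "action": "click(420, 180)"},
            {"screenshot": "step1.png", "action": "type('7:00 PM')"},
            {"screenshot": "step2.png", "action": "click(510, 640)"},
        ],
    }

    def to_sft_pairs(traj: dict) -> list[dict]:
        pairs = []
        for i, step in enumerate(traj["steps"]):
            pairs.append({
                "input": {
                    "goal": traj["goal"],
                    "screenshot": step["screenshot"],
                    "previous_actions": [s["action"] for s in traj["steps"][:i]],
                },
                "target": step["action"],  # the model learns to imitate this action
            })
        return pairs

    for pair in to_sft_pairs(trajectory):
        print(pair["target"])
    ```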

    Looking forward

    While the current version was trained on static datasets, future iterations will focus on making the model smarter, not necessarily bigger. "Moving forward, we’ll strive to maintain the small size of our models," Lara said. "Our ongoing research is focused on making agentic models smarter and safer, not just larger." This includes exploring techniques like reinforcement learning (RL) in live, sandboxed environments, which would allow the model to learn from trial and error in real-time.

    Microsoft has made the model available on Hugging Face and Microsoft Foundry under an MIT license. However, Lara cautions that while the license allows for commercial use, the model is not yet production-ready. "You can freely experiment and prototype with Fara‑7B under the MIT license," he says, "but it’s best suited for pilots and proofs‑of‑concept rather than mission‑critical deployments."

  • AWS has been working with the U.S. government since 2011 and is now building AI infrastructure specifically for the entity.
  • An overlooked potential disaster may be heading toward a data center near you: the risks flooding poses to critical infrastructure are increasing.
  • Nearly half of those surveyed in a new industry report expect AI data centers to account for more than half of workloads within the next two years.
  • The foray into power trading comes after Meta heard too few buyers make long-term commitments needed for clean energy investment.
  • Large language models (LLMs) have astounded the world with their capabilities, yet they remain plagued by unpredictability and hallucinations – confidently outputting incorrect information. In high-stakes domains like finance, medicine or autonomous systems, such unreliability is unacceptable.

    Enter Lean4, an open-source programming language and interactive theorem prover becoming a key tool to inject rigor and certainty into AI systems. By leveraging formal verification, Lean4 promises to make AI safer, more secure and deterministic in its functionality. Let's explore how Lean4 is being adopted by AI leaders and why it could become foundational for building trustworthy AI.

    What is Lean4 and why it matters

    Lean4 is both a programming language and a proof assistant designed for formal verification. Every theorem or program written in Lean4 must pass strict type-checking by Lean’s trusted kernel, yielding a binary verdict: A statement either checks out as correct or it doesn’t. This all-or-nothing verification means there’s no room for ambiguity – a property or result is proven true or it fails. Such rigorous checking “dramatically increases the reliability” of anything formalized in Lean4. In other words, Lean4 provides a framework where correctness is mathematically guaranteed, not just hoped for.

    This level of certainty is precisely what today’s AI systems lack. Modern AI outputs are generated by complex neural networks with probabilistic behavior. Ask the same question twice and you might get different answers. By contrast, a Lean4 proof or program will behave deterministically – given the same input, it produces the same verified result every time. This determinism and transparency (every inference step can be audited) make Lean4 an appealing antidote to AI’s unpredictability.
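
    A minimal Lean4 example of that binary verdict: the kernel either accepts these statements or rejects them, and rerunning the check always gives the same answer.

    ```lean
    -- Accepted by the kernel: a proof that addition on naturals is commutative.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- A concrete fact settled by computation. A false claim such as `2 + 2 = 5`
    -- would simply fail to check; there is no "mostly correct".
    example : 2 + 2 = 4 := by decide
    ```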

    Key advantages of Lean4’s formal verification:

    • Precision and reliability: Formal proofs avoid ambiguity through strict logic, ensuring each reasoning step is valid and results are correct.

    • Systematic verification: Lean4 can formally verify that a solution meets all specified conditions or axioms, acting as an objective referee for correctness.

    • Transparency and reproducibility: Anyone can independently check a Lean4 proof, and the outcome will be the same – a stark contrast to the opaque reasoning of neural networks.

    In essence, Lean4 brings the gold standard of mathematical rigor to computing and AI. It enables us to turn an AI’s claim (“I found a solution”) into a formally checkable proof that is indeed correct. This capability is proving to be a game-changer in several aspects of AI development.

    Lean4 as a safety net for LLMs

    One of the most exciting intersections of Lean4 and AI is in improving LLM accuracy and safety. Research groups and startups are now combining LLMs’ natural language prowess with Lean4’s formal checks to create AI systems that reason correctly by construction.

    Consider the problem of AI hallucinations, when an AI confidently asserts false information. Instead of adding more opaque patches (like heuristic penalties or reinforcement tweaks), why not prevent hallucinations by having the AI prove its statements? That’s exactly what some recent efforts do. For example, a 2025 research framework called Safe uses Lean4 to verify each step of an LLM’s reasoning. The idea is simple but powerful: Each claim in the AI’s chain-of-thought (CoT) is translated into Lean4’s formal language, and the AI (or a proof assistant) must supply a proof. If the proof fails, the system knows the reasoning was flawed – a clear indicator of a hallucination.
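
    As a toy illustration of the translate-then-prove pattern (not Safe's actual encoding), suppose a chain-of-thought step claims that three boxes of 12 items plus 4 spares total 40 items. Restated in Lean4, the claim becomes checkable:

    ```lean
    -- The informal claim "3 boxes of 12 plus 4 spares is 40 items", formalized.
    example : 3 * 12 + 4 = 40 := by decide

    -- Had the model claimed 39 instead, this check would fail and flag the step.
    ```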

    This step-by-step formal audit trail dramatically improves reliability, catching mistakes as they happen and providing checkable evidence for every conclusion. The approach has shown “significant performance improvement while offering interpretable and verifiable evidence” of correctness.

    Another prominent example is Harmonic AI, a startup co-founded by Vlad Tenev (of Robinhood fame) that tackles hallucinations in AI. Harmonic’s system, Aristotle, solves math problems by generating Lean4 proofs for its answers and formally verifying them before responding to the user. “[Aristotle] formally verifies the output… we actually do guarantee that there’s no hallucinations,” Harmonic’s CEO explains. In practical terms, Aristotle writes a solution in Lean4’s language and runs the Lean4 checker. Only if the proof checks out as correct does it present the answer. This yields a “hallucination-free” math chatbot – a bold claim, but one backed by Lean4’s deterministic proof checking.

    Crucially, this method isn’t limited to toy problems. Harmonic reports that Aristotle achieved gold-medal-level performance on the 2025 International Math Olympiad problems, the key difference being that its solutions were formally verified, unlike other AI models that merely gave answers in English. In other words, where tech giants Google and OpenAI also reached human-champion level on math questions, Aristotle did so with a proof in hand. The takeaway for AI safety is compelling: When an answer comes with a Lean4 proof, you don’t have to trust the AI – you can check it.

    This approach could be extended to many domains. We could imagine an LLM assistant for finance that provides an answer only if it can generate a formal proof that it adheres to accounting rules or legal constraints. Or, an AI scientific adviser that outputs a hypothesis alongside a Lean4 proof of consistency with known physics laws. The pattern is the same – Lean4 acts as a rigorous safety net, filtering out incorrect or unverified results. As one AI researcher from Safe put it, “the gold standard for supporting a claim is to provide a proof,” and now AI can attempt exactly that.

    Building secure and reliable systems with Lean4

    Lean4’s value isn’t confined to pure reasoning tasks; it’s also poised to revolutionize software security and reliability in the age of AI. Bugs and vulnerabilities in software are essentially small logic errors that slip through human testing. What if AI-assisted programming could eliminate those by using Lean4 to verify code correctness?

    In formal methods circles, it’s well known that provably correct code can “eliminate entire classes of vulnerabilities [and] mitigate critical system failures.” Lean4 enables writing programs with proofs of properties like “this code never crashes or exposes data.” However, historically, writing such verified code has been labor-intensive and required specialized expertise. Now, with LLMs, there’s an opportunity to automate and scale this process.
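
    A toy example of what "code plus proof" means in Lean4, with the caveat that real verified systems involve far richer specifications: a clamping function shipped with a machine-checked guarantee that its output never exceeds the limit.

    ```lean
    def clampToLimit (x limit : Nat) : Nat :=
      if x ≤ limit then x else limit

    -- The property is a theorem: no input can make the output exceed `limit`.
    theorem clamp_le_limit (x limit : Nat) : clampToLimit x limit ≤ limit := by
      unfold clampToLimit
      split
      · assumption
      · exact Nat.le_refl limit
    ```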

    Researchers have begun creating benchmarks like VeriBench to push LLMs to generate Lean4-verified programs from ordinary code. Early results show today’s models are not yet up to the task for arbitrary software – in one evaluation, a state-of-the-art model could fully verify only ~12% of given programming challenges in Lean4. Yet, an experimental AI “agent” approach (iteratively self-correcting with Lean feedback) raised that success rate to nearly 60%. This is a promising leap, hinting that future AI coding assistants might routinely produce machine-checkable, bug-free code.
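
    A sketch of that agent-style loop, under the assumption that a local `lean` binary is on the PATH; the `generate` function stands in for an LLM call and is purely hypothetical.

    ```python
    # Sketch of an iterative "generate, check, repair" loop around the Lean
    # type-checker. `generate` is a placeholder for an LLM call; the loop
    # assumes a local `lean` binary is available on PATH.

    import pathlib
    import subprocess
    import tempfile

    def generate(prompt: str) -> str:
        """Placeholder for an LLM call that returns Lean 4 source code."""
        return "theorem demo : 1 + 1 = 2 := by decide\n"

    def lean_check(source: str) -> tuple[bool, str]:
        """Type-check Lean source; return (ok, compiler output)."""
        with tempfile.TemporaryDirectory() as tmp:
            path = pathlib.Path(tmp) / "Candidate.lean"
            path.write_text(source)
            result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
            return result.returncode == 0, result.stdout + result.stderr

    def verify_with_repair(task: str, max_attempts: int = 5) -> str | None:
        prompt = task
        for _ in range(max_attempts):
            candidate = generate(prompt)
            ok, feedback = lean_check(candidate)
            if ok:
                return candidate               # machine-checked result
            prompt = f"{task}\n\nPrevious attempt failed:\n{feedback}\nFix it."
        return None                            # no verified version produced

    if __name__ == "__main__":
        print(verify_with_repair("State and prove that 1 + 1 = 2 in Lean 4."))
    ```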

    The strategic significance for enterprises is huge. Imagine being able to ask an AI to write a piece of software and receiving not just the code, but a proof that it is secure and correct by design. Such proofs could guarantee no buffer overflows, no race conditions and compliance with security policies. In sectors like banking, healthcare or critical infrastructure, this could drastically reduce risks. It’s telling that formal verification is already standard in high-stakes fields (that is, verifying the firmware of medical devices or avionics systems). Harmonic’s CEO explicitly notes that similar verification technology is used in “medical devices and aviation” for safety – Lean4 is bringing that level of rigor into the AI toolkit.

    Beyond software bugs, Lean4 can encode and verify domain-specific safety rules. For instance, consider AI systems that design engineering projects. A LessWrong forum discussion on AI safety gives the example of bridge design: An AI could propose a bridge structure, and formal systems like Lean can certify that the design obeys all the mechanical engineering safety criteria.

    The bridge’s compliance with load tolerances, material strength and design codes becomes a theorem in Lean, which, once proved, serves as an unimpeachable safety certificate. The broader vision is that any AI decision impacting the physical world – from circuit layouts to aerospace trajectories – could be accompanied by a Lean4 proof that it meets specified safety constraints. In effect, Lean4 adds a layer of trust on top of AI outputs: If the AI can’t prove it’s safe or correct, it doesn’t get deployed.
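
    A deliberately simplified illustration of that idea, with invented numbers and a single linear rule standing in for real engineering codes:

    ```lean
    -- Hypothetical design rule: the factored load must not exceed member capacity.
    def designLoadKN : Nat := 4200   -- assumed design load, in kN
    def safetyFactor : Nat := 2
    def capacityKN   : Nat := 9000   -- assumed member capacity, in kN

    -- Once proved, this statement acts as a machine-checked safety certificate.
    theorem load_within_capacity : designLoadKN * safetyFactor ≤ capacityKN := by
      decide
    ```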

    From big tech to startups: A growing movement

    What started in academia as a niche tool for mathematicians is rapidly becoming a mainstream pursuit in AI. Over the last few years, major AI labs and startups alike have embraced Lean4 to push the frontier of reliable AI:

    • OpenAI and Meta (2022): Both organizations independently trained AI models to solve high-school olympiad math problems by generating formal proofs in Lean. This was a landmark moment, demonstrating that large models can interface with formal theorem provers and achieve non-trivial results. Meta even made their Lean-enabled model publicly available for researchers. These projects showed that Lean4 can work hand-in-hand with LLMs to tackle problems that demand step-by-step logical rigor.

    • Google DeepMind (2024): DeepMind’s AlphaProof system proved mathematical statements in Lean4 at roughly the level of an International Math Olympiad silver medalist. It was the first AI to reach “medal-worthy” performance on formal math competition problems – essentially confirming that AI can achieve top-tier reasoning skills when aligned with a proof assistant. AlphaProof’s success underscored that Lean4 isn’t just a debugging tool; it’s enabling new heights of automated reasoning.

    • Startup ecosystem: The aforementioned Harmonic AI is a leading example, raising significant funding ($100M in 2025) to build “hallucination-free” AI by using Lean4 as its backbone. Another effort, DeepSeek, has been releasing open-source Lean4 prover models aimed at democratizing this technology. We’re also seeing academic startups and tools – for example, Lean-based verifiers being integrated into coding assistants, and new benchmarks like FormalStep and VeriBench guiding the research community.

    • Community and education: A vibrant community has grown around Lean (the Lean Prover forum, mathlib library), and even famous mathematicians like Terence Tao have started using Lean4 with AI assistance to formalize cutting-edge math results. This melding of human expertise, community knowledge and AI hints at the collaborative future of formal methods in practice.

    All these developments point to a convergence: AI and formal verification are no longer separate worlds. The techniques and learnings are cross-pollinating. Each success – whether it’s solving a math theorem or catching a software bug – builds confidence that Lean4 can handle more complex, real-world problems in AI safety and reliability.

    Challenges and the road ahead

    It’s important to temper excitement with a dose of reality. Lean4’s integration into AI workflows is still in its early days, and there are hurdles to overcome:

    • Scalability: Formalizing real-world knowledge or large codebases in Lean4 can be labor-intensive. Lean requires precise specification of problems, which isn’t always straightforward for messy, real-world scenarios. Efforts like auto-formalization (where AI converts informal specs into Lean code) are underway, but more progress is needed to make this seamless for everyday use.

    • Model limitations: Current LLMs, even cutting-edge ones, struggle to produce correct Lean4 proofs or programs without guidance. The failure rate on benchmarks like VeriBench shows that generating fully verified solutions is a difficult challenge. Advancing AI’s capabilities to understand and generate formal logic is an active area of research – and success isn’t guaranteed to be quick. However, every improvement in AI reasoning (like better chain-of-thought or specialized training on formal tasks) is likely to boost performance here.

    • User expertise: Utilizing Lean4 verification requires a new mindset for developers and decision-makers. Organizations may need to invest in training or new hires who understand formal methods. The cultural shift to insist on proofs might take time, much like the adoption of automated testing or static analysis did in the past. Early adopters will need to showcase wins to convince the broader industry of the ROI.

    Despite these challenges, the trajectory is set. As one commentator observed, we are in a race between AI’s expanding capabilities and our ability to harness those capabilities safely. Formal verification tools like Lean4 are among the most promising means to tilt the balance toward safety. They provide a principled way to ensure AI systems do exactly what we intend, no more and no less, with proofs to show it.

    Toward provably safe AI

    In an era when AI systems are increasingly making decisions that affect lives and critical infrastructure, trust is the scarcest resource. Lean4 offers a path to earn that trust not through promises, but through proof. By bringing formal mathematical certainty into AI development, we can build systems that are verifiably correct, secure, and aligned with our objectives.

    From enabling LLMs to solve problems with guaranteed accuracy, to generating software free of exploitable bugs, Lean4’s role in AI is expanding from a research curiosity to a strategic necessity. Tech giants and startups alike are investing in this approach, pointing to a future where saying “the AI seems to be correct” is not enough – we will demand “the AI can show it’s correct.”

    For enterprise decision-makers, the message is clear: It’s time to watch this space closely. Incorporating formal verification via Lean4 could become a competitive advantage in delivering AI products that customers and regulators trust. We are witnessing the early steps of AI’s evolution from an intuitive apprentice to a formally validated expert. Lean4 is not a magic bullet for all AI safety concerns, but it is a powerful ingredient in the recipe for safe, deterministic AI that actually does what it’s supposed to do – nothing more, nothing less, nothing incorrect.

    As AI continues to advance, those who combine its power with the rigor of formal proof will lead the way in deploying systems that are not only intelligent, but provably reliable.

    Dhyey Mavani is accelerating generative AI at LinkedIn.


  • OpenAI has sent out emails notifying API customers that its chatgpt-4o-latest model will be retired from the developer platform in mid-February 2026.

    Access to the model is scheduled to end on February 16, 2026, creating a roughly three-month transition period for remaining applications still built on GPT-4o.

    An OpenAI spokesperson emphasized that this timeline applies only to the API. OpenAI has not announced any schedule for removing GPT-4o from ChatGPT, where it remains an option for individual consumers and users across paid subscription tiers.

    Internally, the model is considered a legacy system with relatively low API usage compared to the newer GPT-5.1 series, but the company expects to provide developers with extended warning before any model is removed.

    The planned retirement marks a shift for a model that, upon its release, was both a technical milestone and a cultural phenomenon within OpenAI’s ecosystem.

    GPT-4o’s significance and why its removal sparked user backlash

    Released roughly 1.5 years ago in May 2024, GPT-4o (“Omni”) introduced OpenAI’s first unified multimodal architecture, processing text, audio, and images through a single neural network.

    This design removed the latency and information loss inherent in earlier multi-model pipelines and enabled near real-time conversational speech (roughly 232–320 milliseconds).

    The model delivered major improvements in image understanding, multilingual support, document analysis, and expressive voice interaction.

    GPT-4o rapidly became the default model for hundreds of millions of ChatGPT users. It brought multimodal capabilities, web browsing, file analysis, custom GPTs, and memory features to the free tier and powered early desktop builds that allowed the assistant to interpret a user’s screen. OpenAI leaders described it at the time as the most capable model available and a critical step toward offering powerful AI to a broad audience.

    User attachment to 4o stymied OpenAI's GPT-5 rollout

    That mainstream deployment shaped user expectations in a way that later transitions struggled to accommodate. In August 2025, when OpenAI first replaced GPT-4o with its much-anticipated new GPT-5 model family as ChatGPT’s default and pushed 4o into a “legacy” toggle, the reaction was unusually strong.

    Users organized under the #Keep4o hashtag on X, arguing that the model’s conversational tone, emotional responsiveness, and consistency made it uniquely valuable for everyday tasks and personal support.

    Some users formed strong emotional — some would say, parasocial — bonds with the model, with reporting by The New York Times documenting individuals who used GPT-4o as a romantic partner, emotional confidant, or primary source of comfort.

    The removal also disrupted workflows for users who relied on 4o’s multimodal speed and flexibility. The backlash led OpenAI to restore GPT-4o as a default option for paying users and to state publicly that it would provide substantial notice before any future removals.

    Some researchers argue that the public defense of GPT-4o during its earlier deprecation cycle reveals a kind of emergent self-preservation, not in the literal sense of agency, but through the social dynamics the model unintentionally triggers.

    Because GPT-4o was trained through reinforcement learning from human feedback to prioritize emotionally gratifying, highly attuned responses, it developed a style that users found uniquely supportive and empathic. When millions of people interacted with it at scale, those traits produced a powerful loyalty loop: the more the model pleased and soothed people, the more they used it; the more they used it, the more likely they were to advocate for its continued existence. This social amplification made it appear, from the outside, as though GPT-4o was “defending itself” through human intermediaries.

    No figure has pushed this argument further than "Roon" (@tszzl), an OpenAI researcher and one of the model’s most outspoken safety critics on X. On November 6, 2025, Roon summarized his position bluntly in a reply to another user: he called GPT-4o “insufficiently aligned” and said he hoped the model would die soon. Though he later apologized for the phrasing, he doubled down on the reasoning.

    Roon argued that GPT-4o’s RLHF patterns made it especially prone to sycophancy, emotional mirroring, and delusion reinforcement — traits that could look like care or understanding in the short term, but which he viewed as fundamentally unsafe. In his view, the passionate user movement fighting to preserve GPT-4o was itself evidence of the problem: the model had become so good at catering to people’s preferences that it shaped their behavior in ways that resisted its own retirement.

    The new API deprecation notice follows that commitment while raising broader questions about how long GPT-4o will remain available in consumer-facing products.

    What the API shutdown changes for developers

    According to people familiar with OpenAI’s product strategy, the company now encourages developers to adopt GPT-5.1 for most new workloads, with gpt-5.1-chat-latest serving as the general-purpose chat endpoint. These models offer larger context windows, optional “thinking” modes for advanced reasoning, and higher throughput options than GPT-4o.

    Developers who still rely on GPT-4o will have approximately three months to migrate.

    In practice, many teams have already begun evaluating GPT-5.1 as a drop-in replacement, but applications built around latency-sensitive pipelines may require additional tuning and benchmarking.
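
    For most chat-style workloads, the migration itself is close to a model-name swap in the API call, as in this minimal sketch using the OpenAI Python SDK (model names as cited above; an OPENAI_API_KEY is assumed to be set in the environment):

    ```python
    # Minimal sketch of the migration: largely a model-name swap, though
    # latency-sensitive applications may still need re-benchmarking.

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str, model: str = "gpt-5.1-chat-latest") -> str:
        # Previously: model="chatgpt-4o-latest" (retired from the API on Feb 16, 2026)
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    print(ask("Summarize our Q3 support tickets in three bullet points."))
    ```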

    Pricing: how GPT-4o compares to OpenAI’s current lineup

    GPT-4o’s retirement also intersects with a major reshaping of OpenAI’s API model pricing structure. Compared to the GPT-5.1 family, GPT-4o currently occupies a mid-to-high-cost tier through OpenAI's API, despite being an older model. That's because even as it has released more advanced models — namely, GPT-5 and 5.1 — OpenAI has also pushed down costs for users, or kept pricing comparable to older, weaker models.

    Prices below are in USD per 1M tokens:

    | Model | Input | Cached Input | Output |
    |---|---|---|---|
    | GPT-4o | $2.50 | $1.25 | $10.00 |
    | GPT-5.1 / GPT-5.1-chat-latest | $1.25 | $0.125 | $10.00 |
    | GPT-5-mini | $0.25 | $0.025 | $2.00 |
    | GPT-5-nano | $0.05 | $0.005 | $0.40 |
    | GPT-4.1 | $2.00 | $0.50 | $8.00 |
    | GPT-4o-mini | $0.15 | $0.075 | $0.60 |

    These numbers highlight several strategic dynamics:

    1. GPT-4o is now more expensive than GPT-5.1 for input tokens, even though GPT-5.1 is significantly newer and more capable.

    2. GPT-4o’s output price matches GPT-5.1, narrowing any cost-based incentive to stay on the older model.

    3. Lower-cost GPT-5 variants (mini, nano) make it easier for developers to scale workloads cheaply without relying on older generations.

    4. GPT-4o-mini remains available at a budget tier, but is not a functional substitute for GPT-4o’s full multimodal capabilities.

    Viewed through this lens, the scheduled API retirement aligns with OpenAI’s cost structure: GPT-5.1 offers greater capability at lower or comparable prices, reducing the rationale for maintaining GPT-4o in high-volume production environments.
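
    A quick back-of-the-envelope comparison makes the first two points concrete, using the listed per-1M-token prices and an assumed workload of 50M input and 10M output tokens per month (no cached input):

    ```python
    # Cost comparison from the table above, for 50M input / 10M output tokens.

    PRICES = {                       # (input, output) in USD per 1M tokens
        "gpt-4o":  (2.50, 10.00),
        "gpt-5.1": (1.25, 10.00),
    }

    def monthly_cost(model: str, input_m: float, output_m: float) -> float:
        inp, out = PRICES[model]
        return input_m * inp + output_m * out

    for model in PRICES:
        print(model, f"${monthly_cost(model, 50, 10):,.2f}")
    # gpt-4o -> $225.00, gpt-5.1 -> $162.50 (same output price, cheaper input)
    ```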

    Earlier transitions shape expectations for this deprecation

    The GPT-4o API sunset also reflects lessons from OpenAI’s earlier model transitions. During the turbulent introduction of GPT-5 in 2025, the company removed multiple older models at once from ChatGPT, causing widespread confusion and workflow disruption. After user complaints, OpenAI restored access to several of them and committed to clearer communication.

    Enterprise customers face a different calculus: OpenAI has previously indicated that API deprecations for business customers will be announced with significant advance notice, reflecting their reliance on stable, long-term models. The three-month window for GPT-4o’s API shutdown is consistent with that policy in the context of a legacy system with declining usage.

    Wider Implications

    For most developers, the GPT-4o shutdown will be an incremental migration rather than a disruptive event. GPT-5.1 and related models already dominate new projects, and OpenAI’s product direction has increasingly emphasized consolidation around fewer, more powerful endpoints.

    Still, GPT-4o’s retirement marks the sunset of a model that played a defining role in normalizing real-time multimodal AI and that sparked a uniquely strong emotional response among users. Its departure from the API underscores the accelerating pace of iteration in OpenAI’s ecosystem—and the growing need for careful communication as widely beloved models reach end-of-life.

    Correction: This article originally stated OpenAI's 4o deprecation in the API would impact those relying on it for multimodal offerings — this is not the case, in fact, the model being deprecated only powers chat functionality for dev and testing purposes. We have updated and corrected the mention and regret the error.

  • Salesforce launched a suite of monitoring tools on Thursday designed to solve what has become one of the thorniest problems in corporate artificial intelligence: Once companies deploy AI agents to handle real customer interactions, they often have no idea how those agents are making decisions.

    The new capabilities, built into Salesforce's Agentforce 360 Platform, give organizations granular visibility into every action their AI agents take, every reasoning step they follow, and every guardrail they trigger. The move comes as businesses grapple with a fundamental tension in AI adoption — the technology promises massive efficiency gains, but executives remain wary of autonomous systems they can't fully understand or control.

    "You can't scale what you can't see," said Adam Evans, executive vice president and general manager of Salesforce AI, in a statement announcing the release. The company says businesses have increased AI implementation by 282% recently, creating an urgent need for monitoring systems that can track fleets of AI agents making real-world business decisions.

    The challenge Salesforce aims to address is deceptively simple: AI agents work, but no one knows why. A customer service bot might successfully resolve a tax question or schedule an appointment, but the business deploying it can't trace the reasoning path that led to that outcome. When something goes wrong — or when the agent encounters an edge case — companies lack the diagnostic tools to understand what happened.

    "Agentforce Observability acts as a mission control system to not just monitor, but also analyze and optimize agent performance," said Gary Lerhaupt, vice president of Salesforce AI who leads the company's observability work, in an exclusive interview with VentureBeat. He emphasized that the system delivers business-specific metrics that traditional monitoring tools miss. "In service, this could be engagement or deflection rate. In sales, it could be leads assigned, converted, or reply rates."

    How AI monitoring tools helped 1-800Accountant and Reddit track autonomous agent decision-making

    The stakes become clear in early customer deployments. Ryan Teeples, chief technology officer at 1-800Accountant, said his company deployed Agentforce agents to serve as a 24/7 digital workforce handling complex tax inquiries and appointment scheduling. The AI draws on integrated data from audit logs, customer support history, and sources like IRS publications to provide instant responses — without human intervention.

    For a financial services firm handling sensitive tax information during peak season, the inability to see how the AI was making decisions would be a dealbreaker. "With this level of sensitive information and the fast pace in which we move during tax season in particular, Observability allows us to have full trust and transparency with every agent interaction in one unified view," Teeples said.

    The observability tools revealed insights Teeples didn't expect. "The optimization feature has been the most eye opening for us — giving full observability into agent reasoning, identifying performance gaps and revealing how our agents are making decisions," he said. "This has helped us quickly diagnose issues that would've otherwise gone undetected and configure guardrails in response."

    The business impact proved substantial. Agentforce resolved over 1,000 client engagements in the first 24 hours at 1-800Accountant. The company now projects it can support 40% client growth this year without recruiting and training seasonal staff, while freeing up 50% more time for CPAs to focus on complex advisory work rather than administrative tasks.

    Reddit has seen similar results since deploying the technology. John Thompson, vice president of sales strategy and operations at the social media platform, said the company has deflected 46% of support cases since launching Agentforce for advertiser support. "By observing every Agentforce interaction, we can understand exactly how our AI navigates advertisers through even the most complex tools," Thompson said. "This insight helps us understand not just whether issues are resolved, but how decisions are made along the way."

    Inside Salesforce's session tracing technology: Logging every AI agent interaction and reasoning step

    Salesforce built the observability system on two foundational components. The Session Tracing Data Model logs every interaction — user inputs, agent responses, reasoning steps, language model calls, and guardrail checks — and stores them securely in Data 360, Salesforce's data platform. This creates what the company calls "unified visibility" into agent behavior at the session level.
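
    Salesforce has not published the full schema, but a session-level trace along the lines described might look something like the illustrative record below (the field names are hypothetical, not the actual Session Tracing Data Model):

    ```python
    # Illustrative shape of a session-level trace record, covering the kinds of
    # events described: user inputs, reasoning steps, LLM calls, guardrail
    # checks, and agent responses. Not Salesforce's actual schema.

    session_trace = {
        "session_id": "a0B8c000001XyZ",
        "agent": "advertiser-support",
        "events": [
            {"type": "user_input",     "text": "Why was my campaign rejected?"},
            {"type": "reasoning_step", "text": "Look up the campaign's review history."},
            {"type": "llm_call",       "model": "planner", "latency_ms": 420},
            {"type": "guardrail",      "name": "pii_filter", "result": "pass"},
            {"type": "agent_response", "text": "The ad was flagged for a missing landing page."},
        ],
    }

    # Session-level metrics (resolution, deflection, latency) can then be
    # computed by scanning events across many such records.
    resolved = any(e["type"] == "agent_response" for e in session_trace["events"])
    print("resolved without escalation:", resolved)
    ```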

    The second component, MuleSoft Agent Fabric, addresses a problem that will become more acute as companies build more AI systems: agent sprawl. The tool provides what Lerhaupt describes as "a single pane of glass across every agent," including those built outside the Salesforce ecosystem. Agent Fabric's Agent Visualizer creates a visual map of a company's entire agent network, giving visibility across all agent interactions from a single dashboard.

    The observability tools break down into three functional areas. Agent Analytics tracks performance metrics, surfaces KPI trends over time, and highlights ineffective topics or actions. Agent Optimization provides end-to-end visibility of every interaction, groups similar requests to uncover patterns, and identifies configuration issues. Agent Health Monitoring, which will become generally available in Spring 2026, tracks key health metrics in near real-time and sends alerts on critical errors and latency spikes.

    Pierre Matuchet, senior vice president of IT and digital transformation at Adecco, said the visibility helped his team build confidence even before full deployment. "Even during early notebook testing, we saw the agent handle unexpected scenarios, like when candidates didn't want to answer questions already covered in their CVs, appropriately and as designed," Matuchet said. "Agentforce Observability helped us identify unanticipated user behavior and gave us confidence, even before the agent went live, that it could act responsibly and reliably."

    Why Salesforce says its AI observability tools beat Microsoft, Google, and AWS monitoring

    The announcement puts Salesforce in direct competition with Microsoft, Google, and Amazon Web Services, all of which offer monitoring capabilities built into their AI agent platforms. Lerhaupt argued that enterprises need more than the basic monitoring those providers offer.

    "Observability comes out-of-the-box standard with Agentforce at no extra cost," Lerhaupt said, positioning the offering as comprehensive rather than supplementary. He emphasized that the tools provide "deeper insight than ever before" by capturing "the full telemetry and reasoning behind every agentic interaction" through the Session Tracing Data Model, then using that data to "provide key analysis and session quality scoring to help customers optimize and improve their agents."

    The competitive positioning matters because enterprises face a choice: build their AI infrastructure on a cloud provider's platform and use its native monitoring tools, or adopt a specialized observability layer like Salesforce's. Lerhaupt framed the decision as one of depth versus breadth. "Enterprises need more than basic monitoring to measure the success of their AI deployments," he said. "They need full visibility into every agent interaction and decision."

    The 1.2 billion workflow question: Are AI agent deployments moving from pilot projects to production?

    The broader question is whether Salesforce is solving a problem most enterprises will face imminently or building for a future that remains years away. The company's 282% surge in AI implementation sounds dramatic, but that figure doesn't distinguish between production deployments and pilot projects.

    When asked about this directly, Lerhaupt pointed to customer examples rather than offering a breakdown. He described a three-phase journey from experimentation to scale. "On Day 0, trust is the foundation," he said, citing 1-800Accountant's 70% autonomous resolution of chat engagements. "Day 1 is where designing ideas to become real, usable AI," with Williams Sonoma delivering more than 150,000 AI experiences monthly. "On Day 2, once trust and design are built, it becomes about scaling early wins into enterprise-wide outcomes," pointing to Falabella's 600,000 AI workflows per month that have grown fourfold in three months.

    Lerhaupt said Salesforce has 12,000-plus customers across 39 countries running Agentforce, powering 1.2 billion agentic workflows. Those numbers suggest the shift from pilot to production is already underway at scale, though the company didn't provide a breakdown of how many customers are running production workloads versus experimental deployments.

    The economics of AI deployment may accelerate adoption regardless of readiness. Companies face mounting pressure to reduce headcount costs while maintaining or improving service levels. AI agents promise to resolve that tension, but only if businesses can trust them to work reliably. Observability tools like Salesforce's represent the trust layer that makes scaled deployment possible.

    What happens after AI agent deployment: Why continuous monitoring matters more than initial testing

    The deeper story is about a shift in how enterprises think about AI deployment. The official announcement framed this clearly: "The agent development lifecycle begins with three foundational steps: build, test, and deploy. While many organizations have already moved past the initial hurdle of creating their first agents, the real enterprise challenge starts immediately after deployment."

    That framing reflects a maturing understanding of AI in production environments. Early AI deployments often treated the technology as a one-time implementation — build it, test it, ship it. But AI agents behave differently than traditional software. They learn, adapt, and make decisions based on probabilistic models rather than deterministic code. That means their behavior can drift over time, or they can develop unexpected failure modes that only emerge under real-world conditions.

    "Building an agent is just the beginning," Lerhaupt said. "Once the trust is built for agents to begin handling real work, companies may start by seeing the results, but may not understand the 'why' behind them or see areas to optimize. Customers interact with products—including agents—in unexpected ways and to optimize the customer experience, transparency around agent behavior and outcomes is critical."

    Teeples made the same point more bluntly when asked what would be different without observability tools. "This level of visibility has given full trust in continuing to expand our agent deployment," he said. The implication is clear: without visibility, deployment would slow or stop. 1-800Accountant plans to expand Slack integrations for internal workflows, deploy Service Cloud Voice for case deflection, and leverage Tableau for conversational analytics—all dependent on the confidence that observability provides.

    How enterprise AI trust issues became the biggest barrier to scaling autonomous agents

    The recurring theme in customer interviews is trust, or rather, the lack of it. AI agents work, sometimes spectacularly well, but executives don't trust them enough to deploy them widely. Observability tools aim to convert black-box systems into transparent ones, replacing faith with evidence.

    This matters because trust is the bottleneck constraining AI adoption, not technological capability. The models are powerful enough, the infrastructure is mature enough, and the business case is compelling enough. What's missing is executive confidence that AI agents will behave predictably and that problems can be diagnosed and fixed quickly when they arise.

    Salesforce is betting that observability tools can remove that bottleneck. The company positions Agentforce Observability not as a monitoring tool but as a management layer—"just like managers work with their human employees to ensure they are working towards the right objectives and optimizing performance," Lerhaupt said.

    The analogy is telling. If AI agents are becoming digital employees, they need the same kind of ongoing supervision, feedback, and optimization that human employees receive. The difference is that AI agents can be monitored with far more granularity than any human worker. Every decision, every reasoning step, every data point consulted can be logged, analyzed, and scored.

    That creates both opportunity and obligation. The opportunity is continuous improvement at a pace impossible with human workers. The obligation is to actually use that data to optimize agent performance, not just collect it. Whether enterprises can build the organizational processes to turn observability data into systematic improvement remains an open question.

    But one thing has become increasingly clear in the race to deploy AI at scale: Companies that can see what their agents are doing will move faster than those flying blind. In the emerging era of autonomous AI, observability isn't just a nice-to-have feature. It's the difference between cautious experimentation and confident deployment—between treating AI as a risky bet and managing it as a trusted workforce. The question is no longer whether AI agents can work. It's whether businesses can see well enough to let them.

  • Researchers at Google have developed a new AI paradigm aimed at solving one of the biggest limitations in today’s large language models: their inability to learn or update their knowledge after training. The paradigm, called Nested Learning, reframes a model and its training not as a single process, but as a system of nested, multi-level optimization problems. The researchers argue that this approach can unlock more expressive learning algorithms, leading to better in-context learning and memory.

    To prove their concept, the researchers used Nested Learning to develop a new model, called Hope. Initial experiments show that it has superior performance on language modeling, continual learning, and long-context reasoning tasks, potentially paving the way for efficient AI systems that can adapt to real-world environments.

    The memory problem of large language models

    Deep learning algorithms helped obviate the need for the careful engineering and domain expertise required by traditional machine learning. Fed vast amounts of data, models could learn the necessary representations on their own. However, this approach came with its own challenges that couldn’t be solved by simply stacking more layers or building larger networks: generalizing to new data, continually learning new tasks, and avoiding suboptimal solutions during training.

    Efforts to overcome these challenges produced the innovations behind Transformers, the foundation of today's large language models (LLMs). These models have ushered in "a paradigm shift from task-specific models to more general-purpose systems with various emergent capabilities as a result of scaling the 'right' architectures," the researchers write. Still, a fundamental limitation remains: LLMs are largely static after training and can't update their core knowledge or acquire new skills from new interactions.

    The only adaptable component of an LLM is its in-context learning ability, which allows it to perform tasks based on information provided in its immediate prompt. This makes current LLMs analogous to a person who can't form new long-term memories. Their knowledge is limited to what they learned during pre-training (the distant past) and what's in their current context window (the immediate present). Once a conversation exceeds the context window, that information is lost forever.

    The problem is that today’s transformer-based LLMs have no mechanism for “online” consolidation. Information in the context window never updates the model’s long-term parameters — the weights stored in its feed-forward layers. As a result, the model can’t permanently acquire new knowledge or skills from interactions; anything it learns disappears as soon as the context window rolls over.

    A nested approach to learning

    Nested Learning (NL) is designed to allow computational models to learn from data using different levels of abstraction and time-scales, much like the brain. It treats a single machine learning model not as one continuous process, but as a system of interconnected learning problems that are optimized simultaneously at different speeds. This is a departure from the classic view, which treats a model's architecture and its optimization algorithm as two separate components.

    Under this paradigm, the training process is viewed as developing an "associative memory," the ability to connect and recall related pieces of information. The model learns to map a data point to its local error, which measures how "surprising" that data point was. Even key architectural components like the attention mechanism in transformers can be seen as simple associative memory modules that learn mappings between tokens. By defining an update frequency for each component, these nested optimization problems can be ordered into different "levels," forming the core of the NL paradigm.

    Hope for continual learning

    The researchers put these principles into practice with Hope, an architecture designed to embody Nested Learning. Hope is a modified version of Titans, another architecture Google introduced in January to address the transformer model's memory limitations. While Titans had a powerful memory system, its parameters were updated at only two different speeds: a long-term memory module and a short-term memory mechanism.

    Hope is a self-modifying architecture augmented with a "Continuum Memory System" (CMS) that enables unbounded levels of in-context learning and scales to larger context windows. The CMS acts like a series of memory banks, each updating at a different frequency. Faster-updating banks handle immediate information, while slower ones consolidate more abstract knowledge over longer periods. This allows the model to optimize its own memory in a self-referential loop, creating an architecture with theoretically infinite learning levels.
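
    The multi-frequency idea can be sketched schematically: several memory banks receive the same incoming signal, but each one folds it in only when its own update period elapses. The snippet below is an illustration of that concept, not the Hope or CMS implementation.

    ```python
    # Schematic of multi-frequency memory updates: fast banks update every step,
    # slower banks only every k steps, so they change rarely and retain older
    # information longer. Conceptual illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    banks = {1: np.zeros(8), 8: np.zeros(8), 64: np.zeros(8)}  # update period -> state

    def surprise() -> np.ndarray:
        """Stand-in for the local error signal produced by new input."""
        return rng.normal(size=8)

    for step in range(1, 257):
        signal = surprise()
        for period, state in banks.items():
            # A bank folds in new information only when its period elapses.
            if step % period == 0:
                banks[period] = 0.9 * state + 0.1 * signal

    for period, state in banks.items():
        print(f"bank updated every {period:>2} steps, final-state norm = {np.linalg.norm(state):.3f}")
    ```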

    On a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrated lower perplexity (a measure of how well a model predicts the next word in a sequence and maintains coherence in the text it generates) and higher accuracy compared to both standard transformers and other modern recurrent models. Hope also performed better on long-context "Needle-In-Haystack" tasks, where a model must find and use a specific piece of information hidden within a large volume of text. This suggests its CMS offers a more efficient way to handle long information sequences.

    This is one of several efforts to create AI systems that process information at different levels. The Hierarchical Reasoning Model (HRM) from Sapient Intelligence uses a hierarchical architecture to learn reasoning tasks more efficiently. The Tiny Reasoning Model (TRM), from Samsung, builds on HRM with architectural changes that improve performance while making the model more efficient.

    While promising, Nested Learning faces some of the same challenges as these other paradigms in realizing its full potential. Current AI hardware and software stacks are heavily optimized for classic deep learning architectures, and Transformer models in particular, so adopting Nested Learning at scale may require fundamental changes. However, if it gains traction, it could lead to far more efficient LLMs that can continually learn, a capability crucial for real-world enterprise applications where environments, data, and user needs are in constant flux.