Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes.
While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the “meaning” of data in the context of a given business is a different story. Enterprise data is mostly siloed in disparate systems, in structured and unstructured forms, and needs to be analyzed through a domain-specific business lens.
As an example, the term “customer” may refer to a different group of people in a sales CRM system than in a finance system, which may reserve the term for paying clients. One department might define “product” as a SKU; another might treat it as a product family; a third as a marketing bundle.
Data about “product sales” thus varies in meaning without agreed-upon relationships and definitions. For agents to combine data from multiple systems, they must understand these different representations. Agents need to know what the data means in context and how to find the right data for the right process. Moreover, schema changes and data quality issues introduced during collection add further ambiguity, leaving agents unsure how to act when they encounter such situations.
Furthermore, classification of data into categories like PII (personally identifiable information) needs to be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires data to be labelled correctly and agents to understand and respect that classification. Hence, building a cool demo with agents is very much doable, but putting them into production on real business data is a different story altogether.
The ontology-based source of truth
Building effective agentic solutions requires an ontology-based single source of truth. An ontology is a business definition of concepts, their hierarchy and relationships. It defines terms with respect to business domains, helps establish a single source of truth for data, captures uniform field names and applies classifications to fields.
An ontology may be domain-specific (healthcare or finance) or organization-specific, based on internal structures. Defining an ontology upfront is time-consuming, but it can help standardize business processes and lay a strong foundation for agentic AI.
An ontology may be realized using common queryable formats like a triplestore. More complex business rules with multi-hop relations could use a labelled property graph like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Financial Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, they usually need to be customized to capture the specifics of an enterprise.
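For illustration, here is a minimal sketch of loading a few such concepts and classifications into a labelled property graph with the Neo4j Python driver. The connection details, node labels and property names are illustrative assumptions, not prescribed by FIBO, UMLS or any specific enterprise ontology.

```python
# Minimal sketch: loading a few ontology concepts and relationships into Neo4j.
# URI, credentials, labels and property names are illustrative assumptions.
from neo4j import GraphDatabase

ONTOLOGY_STATEMENTS = [
    # "Customer" as the finance department defines it: a paying client.
    "MERGE (c:Concept {name: 'Customer', domain: 'Finance', "
    "definition: 'A party with at least one paid invoice'})",
    # "Product" as a SKU, rolled up into a product family.
    "MERGE (p:Concept {name: 'Product', domain: 'Merchandising', "
    "definition: 'A sellable SKU'})",
    "MERGE (f:Concept {name: 'ProductFamily', domain: 'Merchandising'})",
    "MATCH (p:Concept {name: 'Product'}), (f:Concept {name: 'ProductFamily'}) "
    "MERGE (p)-[:PART_OF]->(f)",
    # Field-level classification that agents must respect (e.g., PII under GDPR/CCPA).
    "MERGE (e:Field {name: 'email'}) SET e.classification = 'PII'",
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for stmt in ONTOLOGY_STATEMENTS:
        session.run(stmt)
driver.close()
```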
Getting started with ontology
Once implemented, an ontology can be the driving force for enterprise agents. We can now prompt AI to follow the ontology and use it to discover data and relationships. If needed, we can have an agentic layer serve key details of the ontology itself and discover data. Business rules and policies can be implemented in this ontology for agents to adhere to. This is an excellent way to ground your agents and establish guardrails based on real business context.
Agents designed in this manner and tuned to follow an ontology can stick to guardrails and avoid hallucinations that can be caused by the large language models (LLMs) powering them. For example, a business policy may define that unless all documents associated with a loan have their verified flags set to “true,” the loan status should be kept in a “pending” state. Agents can work within this policy, determine what documents are needed and query the knowledge base.
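As a sketch of how such a policy could be enforced against the graph, the query below only releases a loan from “pending” when every attached document is verified. The Loan and Document labels, the HAS_DOCUMENT relationship and the verified flag are hypothetical names for illustration, not taken from any particular implementation.

```python
# Minimal sketch of enforcing the loan-document policy against the ontology graph.
# Node labels, relationship types and the `verified` flag are hypothetical.
from neo4j import GraphDatabase

POLICY_QUERY = """
MATCH (l:Loan {id: $loan_id})-[:HAS_DOCUMENT]->(d:Document)
WITH count(d) AS total,
     sum(CASE WHEN d.verified = true THEN 1 ELSE 0 END) AS verified
RETURN total > 0 AND total = verified AS all_verified
"""

def loan_status(session, loan_id: str) -> str:
    """Return 'ready-for-approval' only if every attached document is verified."""
    record = session.run(POLICY_QUERY, loan_id=loan_id).single()
    return "ready-for-approval" if record and record["all_verified"] else "pending"

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    print(loan_status(session, "LN-1042"))
```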
Here's an example implementation:
(Original figure by Author)
As illustrated, structured and unstructured data are processed by a document intelligence (DocIntel) agent, which populates a Neo4j database based on an ontology of the business domain. A data discovery agent then finds and queries the right data in Neo4j and passes it to other agents handling business process execution. Inter-agent communication happens over a popular protocol like A2A (agent to agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses of these agents.
With this method, we can reduce hallucinations by requiring agents to follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents automatically comply with, and control hallucinations by defining rules for the whole system rather than for individual entities. For example, if an agent hallucinates an individual “customer,” the connected data for that “customer” will not be verifiable during data discovery, so we can easily detect the anomaly and eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.
Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents directions to orchestrate complex business processes.
Dattaraj Rao is innovation and R&D architect at Persistent Systems.
As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.
Why observability secures the future of enterprise AI
The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road.
Yet, beneath the excitement, most leaders admit they can’t trace how AI decisions are made, whether they helped the business, or if they broke any rule.
Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet, 6 months later, auditors found that 18% of critical cases were misrouted, without a single alert or trace. The root cause wasn’t bias or bad data. It was invisible. No observability, no accountability.
If you can’t observe it, you can’t trust it. And unobserved AI will fail in silence.
Visibility isn’t a luxury; it’s the foundation of trust. Without it, AI becomes ungovernable.
Start with outcomes, not models
Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics. That’s backward.
Flip the order:
- Define the outcome first. What’s the measurable business goal?
  - Deflect 15% of billing calls
  - Reduce document review time by 60%
  - Cut case-handling time by two minutes
- Design telemetry around that outcome, not around “accuracy” or “BLEU score.”
- Select prompts, retrieval methods and models that demonstrably move those KPIs.
At one global insurer, for instance, reframing success as “minutes saved per claim” instead of “model precision” turned an isolated pilot into a company-wide roadmap.
A 3-layer telemetry model for LLM observability
Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:
a) Prompts and context: What went in
- Log every prompt template, variable and retrieved document.
- Record model ID, version, latency and token counts (your leading cost indicators).
- Maintain an auditable redaction log showing what data was masked, when and by which rule.
b) Policies and controls: The guardrails
- Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.
- Store policy reasons and risk tier for each deployment.
- Link outputs back to the governing model card for transparency.
c) Outcomes and feedback: Did it work?
- Gather human ratings and edit distances from accepted answers.
- Track downstream business events: case closed, document approved, issue resolved.
- Measure the KPI deltas: call time, backlog, reopen rate.
All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved.
Diagram © SaiKrishna Koorapati (2025). Created specifically for this article; licensed to VentureBeat for publication.
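As a rough sketch, a single trace record tying the three layers together might look like the following; the field names are illustrative assumptions rather than a published schema.

```python
# A sketch of one trace record linking the three telemetry layers through a shared
# trace ID. Field names are illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class LLMTrace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Layer a) prompts and context: what went in
    prompt_template: str = ""
    model_id: str = ""
    latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    redaction_rules_applied: list[str] = field(default_factory=list)
    # Layer b) policies and controls: the guardrails
    safety_filters_passed: bool = True
    citations_present: bool = False
    risk_tier: str = "low"
    # Layer c) outcomes and feedback: did it work?
    human_rating: Optional[int] = None
    business_event: Optional[str] = None      # e.g. "case_closed"
    kpi_delta_minutes: Optional[float] = None
```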
Apply SRE discipline: SLOs and error budgets for AI
Site reliability engineering (SRE) transformed software operations; now it’s AI’s turn.
Define three “golden signals” for every critical workflow:
| Signal | Target SLO | When breached |
| --- | --- | --- |
| Factuality | ≥ 95% verified against source of record | Fallback to verified template |
| Safety | ≥ 99.9% pass toxicity/PII filters | Quarantine and human review |
| Usefulness | ≥ 80% accepted on first pass | Retrain or rollback prompt/model |
If hallucinations or refusals exceed the error budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage.
This isn’t bureaucracy; it’s reliability applied to reasoning.
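A minimal sketch of such a gate is below, applying the same thresholds as the table above to per-response scores (a simplification of the aggregate SLOs); the scoring functions themselves are assumed to exist elsewhere.

```python
# Sketch of routing a generated answer based on the three golden signals.
# Thresholds mirror the SLO table above; scores are assumed to come from upstream evaluators.
def route_response(factuality: float, safety: float, usefulness: float) -> str:
    """Decide what to do with a generated answer based on golden-signal scores."""
    if safety < 0.999:       # Safety: >= 99.9% pass toxicity/PII filters
        return "quarantine_for_human_review"
    if factuality < 0.95:    # Factuality: >= 95% verified against source of record
        return "fallback_to_verified_template"
    if usefulness < 0.80:    # Usefulness: >= 80% accepted on first pass
        return "flag_for_prompt_or_model_rollback"
    return "deliver_to_user"
```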
Build the thin observability layer in two agile sprints
You don’t need a six-month roadmap; just focus and two short sprints.
Sprint 1 (weeks 1-3): Foundations
- Version-controlled prompt registry
- Redaction middleware tied to policy
- Request/response logging with trace IDs (see the sketch after this list)
- Basic evaluations (PII checks, citation presence)
- Simple human-in-the-loop (HITL) UI
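Here is a hedged sketch of the redaction-plus-logging slice of that thin layer; the regex rules and the call_model callable are illustrative assumptions, not a specific product's API.

```python
# Sketch of Sprint 1's thin layer: policy-driven redaction plus request/response
# logging keyed by a trace ID. Rules and the call_model callable are illustrative.
import json, re, time, uuid

REDACTION_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Mask known PII patterns and record which rules fired (the redaction log)."""
    applied = []
    for rule, pattern in REDACTION_RULES.items():
        if pattern.search(text):
            text = pattern.sub(f"[REDACTED:{rule}]", text)
            applied.append(rule)
    return text, applied

def logged_call(prompt: str, call_model) -> str:
    trace_id = str(uuid.uuid4())
    clean_prompt, applied_rules = redact(prompt)
    start = time.time()
    response = call_model(clean_prompt)       # call_model is your LLM client
    record = {
        "trace_id": trace_id,
        "prompt": clean_prompt,
        "redaction_rules": applied_rules,
        "response": response,
        "latency_ms": round((time.time() - start) * 1000, 1),
    }
    print(json.dumps(record))                 # replace with your log sink
    return response
```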
Sprint 2 (weeks 4-6): Guardrails and KPIs
- Offline test sets (100–300 real examples)
- Policy gates for factuality and safety
- Lightweight dashboard tracking SLOs and cost
- Automated token and latency tracker
In 6 weeks, you’ll have the thin layer that answers 90% of governance and product questions.
Make evaluations continuous (and boring)
Evaluations shouldn’t be heroic one-offs; they should be routine.
- Curate test sets from real cases; refresh 10–20% monthly.
- Define clear acceptance criteria shared by product and risk teams.
- Run the suite on every prompt/model/policy change and weekly for drift checks.
- Publish one unified scorecard each week covering factuality, safety, usefulness and cost.
When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
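A minimal sketch of what such a routine eval run could look like in CI follows; the test-set format and the generate/evaluate helpers are assumptions, not a specific framework's API.

```python
# Sketch of a "boring" eval run wired into CI: replay a curated test set on every
# prompt/model/policy change and fail the build if any signal drops below target.
import json

def run_eval_suite(test_set_path: str, generate, evaluate) -> dict:
    """Average the three golden signals across a JSONL test set of real cases."""
    with open(test_set_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    totals = {"factuality": 0.0, "safety": 0.0, "usefulness": 0.0}
    for case in cases:
        answer = generate(case["prompt"])
        scores = evaluate(answer, case["expected"])   # returns the three signals
        for signal in totals:
            totals[signal] += scores[signal]
    return {signal: value / len(cases) for signal, value in totals.items()}

def check_against_slos(scorecard: dict) -> None:
    """Fail the pipeline when the weekly scorecard dips below the SLO targets."""
    slos = {"factuality": 0.95, "safety": 0.999, "usefulness": 0.80}
    for signal, target in slos.items():
        assert scorecard[signal] >= target, f"{signal} below SLO: {scorecard[signal]:.3f} < {target}"
```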
Apply human oversight where it matters
Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.
- Route low-confidence or policy-flagged responses to experts.
- Capture every edit and reason as training data and audit evidence.
- Feed reviewer feedback back into prompts and policies for continuous improvement.
At one health-tech firm, this approach cut false positives by 22% and produced a retrainable, compliance-ready dataset in weeks.
Cost control through design, not hope
LLM costs grow non-linearly. Budgets won’t save you; architecture will.
- Structure prompts so deterministic sections run before generative ones.
- Compress and rerank context instead of dumping entire documents.
- Cache frequent queries and memoize tool outputs with TTL.
- Track latency, throughput and token use per feature.
When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.
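As an illustration of the caching point above, here is a small sketch of TTL-based memoization for model or tool calls; it is a simplification, not a production cache.

```python
# Sketch of cost control by design: memoize repeated calls for a time-to-live window
# so identical queries stop burning tokens. Purely illustrative.
import functools, hashlib, time

def ttl_cache(ttl_seconds: float):
    """Memoize a function's results for ttl_seconds, keyed by its arguments."""
    def decorator(func):
        store: dict[str, tuple[float, object]] = {}
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(repr((args, kwargs)).encode()).hexdigest()
            cached = store.get(key)
            if cached and time.time() - cached[0] < ttl_seconds:
                return cached[1]                  # cache hit: no tokens spent
            result = func(*args, **kwargs)
            store[key] = (time.time(), result)
            return result
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=300)
def call_llm(prompt: str) -> str:
    ...  # your actual model or tool call goes here
```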
The 90-day playbook
Within 3 months of adopting observable AI principles, enterprises should see:
- 1–2 production AI assists with HITL for edge cases
- Automated evaluation suite for pre-deploy and nightly runs
- Weekly scorecard shared across SRE, product and risk
- Audit-ready traces linking prompts, policies and outcomes
At a Fortune 100 client, this structure reduced incident time by 40% and aligned product and compliance roadmaps.
Scaling trust through observability
Observable AI is how you turn AI from experiment to infrastructure.
With clear telemetry, SLOs and human feedback loops:
- Executives gain evidence-backed confidence.
- Compliance teams get replayable audit chains.
- Engineers iterate faster and ship safely.
- Customers experience reliable, explainable AI.
Observability isn’t an add-on layer; it’s the foundation for trust at scale.
SaiKrishna Koorapati is a software engineering leader.
Agent memory remains a problem that enterprises want to fix, as agents forget some instructions or conversations the longer they run.
Anthropic believes it has solved this issue for its Claude Agent SDK, developing a two-fold solution that allows an agent to work across different context windows.
“The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before,” Anthropic wrote in a blog post. “Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions.”
Anthropic engineers proposed a two-fold approach for its Agent SDK: An initializer agent to set up the environment, and a coding agent to make incremental progress in each session and leave artifacts for the next.
The agent memory problem
Since agents are built on foundation models, they remain constrained by the limited, although continually growing, context windows. For long-running agents, this could create a larger problem, leading the agent to forget instructions and behave abnormally while performing a task. Enhancing agent memory becomes essential for consistent, business-safe performance.
Several methods have emerged over the past year, all attempting to bridge the gap between context windows and agent memory. LangChain’s LangMem SDK, Memobase and OpenAI’s Swarm are examples of such offerings. Research on agentic memory has also exploded recently, with proposed frameworks like Memp and the Nested Learning Paradigm from Google offering new alternatives to enhance memory.
Many of the current memory frameworks are open source and can ideally adapt to different large language models (LLMs) powering agents. Anthropic’s approach improves its Claude Agent SDK.
How it works
Anthropic identified that even though the Claude Agent SDK has context management capabilities, meaning it “should be possible for an agent to continue to do useful work for an arbitrarily long time,” that was not sufficient. The company said in its blog post that a model like Opus 4.5 running the Claude Agent SDK can “fall short of building a production-quality web app if it’s only given a high-level prompt, such as 'build a clone of claude.ai.'”
The failures manifested in two patterns, Anthropic said. First, the agent tried to do too much, causing the model to run out of context in the middle. The agent then has to guess what happened and cannot pass clear instructions to the next agent. The second failure occurs later on, after some features have already been built. The agent sees progress has been made and just declares the job done.
Anthropic researchers broke down the solution: Setting up an initial environment to lay the foundation for features and prompting each agent to make incremental progress towards a goal, while still leaving a clean slate at the end.
This is where the two-part solution for Anthropic's agent comes in. The initializer agent sets up the environment, logging what agents have done and which files have been added. The coding agent then makes incremental progress in each session and leaves structured updates for the next.
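Anthropic has not published this as a drop-in recipe, but the general pattern can be sketched as follows: each session reads the structured progress left by the previous one, does a bounded chunk of work and writes an update for the next. The progress.json file and its fields below are assumptions for illustration, not the Claude Agent SDK's actual artifacts.

```python
# Minimal sketch of the long-running agent pattern described above: read prior
# progress, do one incremental feature, leave structured updates for the next session.
# File name and fields are illustrative assumptions, not Anthropic's implementation.
import json
from pathlib import Path

PROGRESS_FILE = Path("progress.json")

def load_progress() -> dict:
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    # Initializer-agent step: set up the environment and the feature backlog once.
    return {"features": [], "completed": [], "notes": []}

def run_session(pick_feature, implement, run_tests) -> None:
    state = load_progress()
    feature = pick_feature(state)        # choose one incremental goal for this context window
    if feature is None:
        return                           # nothing left to do; avoid declaring early victory
    summary = implement(feature)         # coding-agent work within this session
    if run_tests():                      # testing tools catch bugs not obvious from code alone
        state["completed"].append(feature)
    state["notes"].append({"feature": feature, "summary": summary})
    PROGRESS_FILE.write_text(json.dumps(state, indent=2))
```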
“Inspiration for these practices came from knowing what effective software engineers do every day,” Anthropic said.
The researchers said they added testing tools to the coding agent, improving its ability to identify and fix bugs that weren’t obvious from the code alone.
Future research
Anthropic noted that its approach is “one possible set of solutions in a long-running agent harness.” However, this is just the beginning stage of what could become a wider research area for many in the AI space.
The company said its experiments to boost long-term memory for agents haven’t yet shown whether a single general-purpose coding agent or a multi-agent structure works best across contexts.
Its demo also focused on full-stack web app development, so other experiments should focus on generalizing the results across different tasks.
“It’s likely that some or all of these lessons can be applied to the types of long-running agentic tasks required in, for example, scientific research or financial modeling,” Anthropic said.
Hello, dear readers. Happy belated Thanksgiving and Black Friday!
This year has felt like living inside a permanent DevDay. Every week, some lab drops a new model, a new agent framework, or a new “this changes everything” demo. It’s overwhelming. But it’s also the first year I’ve felt like AI is finally diversifying — not just one or two frontier models in the cloud, but a whole ecosystem: open and closed, giant and tiny, Western and Chinese, cloud and local.
So for this Thanksgiving edition, here’s what I’m genuinely thankful for in AI in 2025 — the releases that feel like they’ll matter in 12–24 months, not just during this week’s hype cycle.
1. OpenAI kept shipping strong: GPT-5, GPT-5.1, Atlas, Sora 2 and open weights
As the company that undeniably birthed the "generative AI" era with its viral hit product ChatGPT in late 2022, OpenAI arguably had among the hardest tasks of any AI company in 2025: continue its growth trajectory even as well-funded competitors like Google with its Gemini models and other startups like Anthropic fielded their own highly competitive offerings.
Thankfully, OpenAI rose to the challenge and then some. Its headline act was GPT-5, unveiled in August as the next frontier reasoning model, followed in November by GPT-5.1 with new Instant and Thinking variants that dynamically adjust how much “thinking time” they spend per task.
In practice, GPT-5’s launch was bumpy. VentureBeat documented early math and coding failures and a cooler-than-expected community reaction in “OpenAI’s GPT-5 rollout is not going smoothly," but OpenAI quickly course-corrected based on user feedback, and as a daily user of the model, I'm personally pleased and impressed with it.
At the same time, enterprises actually using the models are reporting solid gains. ZenDesk Global, for example, says GPT-5-powered agents now resolve more than half of customer tickets, with some customers seeing 80–90% resolution rates. That’s the quiet story: these models may not always impress the chattering classes on X, but they’re starting to move real KPIs.
On the tooling side, OpenAI finally gave developers a serious AI engineer with GPT-5.1-Codex-Max, a new coding model that can run long, agentic workflows and is already the default in OpenAI’s Codex environment. VentureBeat covered it in detail in “OpenAI debuts GPT-5.1-Codex-Max coding model and it already completed a 24-hour task internally.”
Then there’s ChatGPT Atlas, a full browser with ChatGPT baked into the chrome itself — sidebar summaries, on-page analysis, and search tightly integrated into regular browsing. It’s the clearest sign yet that “assistant” and “browser” are on a collision course.
On the media side, Sora 2 turned the original Sora video demo into a full video-and-audio model with better physics, synchronized sound and dialogue, and more control over style and shot structure, plus a dedicated Sora app with a full-fledged social networking component, allowing any user to create their own TV network in their pocket.
Finally — and maybe most symbolically — OpenAI released gpt-oss-120B and gpt-oss-20B, open-weight MoE reasoning models under an Apache 2.0–style license. Whatever you think of their quality (and early open-source users have been loud about their complaints), this is the first time since GPT-2 that OpenAI has put serious weights into the public commons.
2. China’s open-source wave goes mainstream
If 2023–24 was about Llama and Mistral, 2025 belongs to China’s open-weight ecosystem.
A study from MIT and Hugging Face found that China now slightly leads the U.S. in global open-model downloads, largely thanks to DeepSeek and Alibaba’s Qwen family.
Highlights:
- DeepSeek-R1 dropped in January as an open-source reasoning model rivaling OpenAI’s o1, with MIT-licensed weights and a family of distilled smaller models. VentureBeat has followed the story from its release to its cybersecurity impact to performance-tuned R1 variants.
- Kimi K2 Thinking from Moonshot, a “thinking” open-source model that reasons step by step with tools, very much in the o1/R1 mold, is positioned as the best open reasoning model in the world so far.
- Z.ai shipped GLM-4.5 and GLM-4.5-Air as “agentic” models, open-sourcing base and hybrid reasoning variants on GitHub.
- Baidu’s ERNIE 4.5 family arrived as a fully open-sourced, multimodal MoE suite under Apache 2.0, including a 0.3B dense model and visual “Thinking” variants focused on charts, STEM and tool use.
- Alibaba’s Qwen3 line — including Qwen3-Coder, large reasoning models, and the Qwen3-VL series released over the summer and fall of 2025 — continues to set a high bar for open weights in coding, translation and multimodal reasoning.
VentureBeat has been tracking these shifts, including Chinese math and reasoning models like Light-R1-32B and Weibo’s tiny VibeThinker-1.5B, which beat DeepSeek baselines on shoestring training budgets.
If you care about open ecosystems or on-premise options, this is the year China’s open-weight scene stopped being a curiosity and became a serious alternative.
3. Small and local models grow up
Another thing I’m thankful for: we’re finally getting good small models, not just toys.
Liquid AI spent 2025 pushing its Liquid Foundation Models (LFM2) and LFM2-VL vision-language variants, designed from day one for low-latency, device-aware deployments — edge boxes, robots, and constrained servers, not just giant clusters. The newer LFM2-VL-3B targets embedded robotics and industrial autonomy, with demos planned at ROSCon.
On the big-tech side, Google’s Gemma 3 line made a strong case that “tiny” can still be capable. Gemma 3 spans from 270M parameters up through 27B, all with open weights and multimodal support in the larger variants.
The standout is Gemma 3 270M, a compact model purpose-built for fine-tuning and structured text tasks — think custom formatters, routers, and watchdogs — covered both in Google’s developer blog and community discussions in local-LLM circles.
These models may never trend on X, but they’re exactly what you need for privacy-sensitive workloads, offline workflows, thin-client devices, and “agent swarms” where you don’t want every tool call hitting a giant frontier LLM.
4. Meta + Midjourney: aesthetics as a service
One of the stranger twists this year: Meta partnered with Midjourney instead of simply trying to beat it.
In August, Meta announced a deal to license Midjourney’s “aesthetic technology” — its image and video generation stack — and integrate it into Meta’s future models and products, from Facebook and Instagram feeds to Meta AI features.
VentureBeat covered the partnership in “Meta is partnering with Midjourney and will license its technology for future models and products,” raising the obvious question: does this slow or reshape Midjourney’s own API roadmap? We're still awaiting a definitive answer, but Midjourney's previously stated plans for an API release have yet to materialize, suggesting that it has.
For creators and brands, though, the immediate implication is simple: Midjourney-grade visuals start to show up in mainstream social tools instead of being locked away in a Discord bot. That could normalize higher-quality AI art for a much wider audience — and force rivals like OpenAI, Google, and Black Forest Labs to keep raising the bar.
5. Google’s Gemini 3 and Nano Banana Pro
Google tried to answer GPT-5 with Gemini 3, billed as its most capable model yet, with better reasoning, coding, and multimodal understanding, plus a new Deep Think mode for slow, hard problems.
VentureBeat’s coverage, “Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI,” framed it as a direct shot at frontier benchmarks and agentic workflows.
But the surprise hit is Nano Banana Pro (Gemini 3 Pro Image), Google’s new flagship image generator. It specializes in infographics, diagrams, multi-subject scenes, and multilingual text that actually renders legibly across 2K and 4K resolutions.
In the world of enterprise AI — where charts, product schematics, and “explain this system visually” images matter more than fantasy dragons — that’s a big deal.
6. Wild cards I’m keeping an eye on
A few more releases I’m thankful for, even if they don’t fit neatly into one bucket:
- Black Forest Labs’ Flux.2 image models, which launched just earlier this week with ambitions to challenge both Nano Banana Pro and Midjourney on quality and control. VentureBeat dug into the details in “Black Forest Labs launches Flux.2 AI image models to challenge Nano Banana Pro and Midjourney."
- Anthropic’s Claude Opus 4.5, a new flagship that aims for cheaper, more capable coding and long-horizon task execution, covered in “Anthropic’s Claude Opus 4.5 is here: Cheaper AI, infinite chats, and coding skills that beat humans."
- A steady drumbeat of open math/reasoning models — from Light-R1 to VibeThinker and others — that show you don’t need $100M training runs to move the needle.
Last thought (for now)
If 2024 was the year of “one big model in the cloud,” 2025 is the year the map exploded: multiple frontiers at the top, China taking the lead in open models, small and efficient systems maturing fast, and creative ecosystems like Midjourney getting pulled into big-tech stacks.
I’m thankful not just for any single model, but for the fact that we now have options — closed and open, local and hosted, reasoning-first and media-first. For journalists, builders, and enterprises, that diversity is the real story of 2025.
Happy holidays and best to you and your loved ones!
Researchers at the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agentic tasks beyond well-defined problems such as math and coding.
Their framework, Agent-R1, is compatible with popular RL algorithms and shows considerable improvement on reasoning tasks that require multiple retrieval stages and multi-turn interactions with tools.
The framework is built on a redefinition of the RL paradigm that takes into account the dynamic nature of agentic applications that require interacting with evolving environments and imperfect information. This framing is much more similar to real-world applications and can have important uses for agentic tasks in enterprise settings.
Rethinking reinforcement learning for agents
RL has become a cornerstone of training LLMs for well-defined reasoning tasks. In areas like mathematics and coding, the model receives a clear signal: The answer is either right or wrong. This makes it relatively straightforward to reward or penalize its behavior.
But this approach struggles with agentic tasks that require models to work in interactive environments, develop dynamic memories across conversations, perform multi-step reasoning and respond to unpredictable feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.
To address these challenges, the University of Science and Technology researchers revisited the fundamental framework of RL, known as the Markov Decision Process (MDP). An MDP models decision-making using four key components: a state space (the set of possible states an agent can be in); an action space (what the agent can do); a state transition probability (the state to which an action will likely lead); and a reward function (whether the outcome is good or bad). The paper proposes extending this framework to better suit LLM agents.
In the new formulation, the state space is expanded to include not just the current state (the current sequence of tokens generated by the model) but the entire history of interactions and environmental feedback. Actions are still fundamentally about generating text, but specific sequences of text can now trigger external tools, like an API call. State transitions become unpredictable, or "stochastic," because the outcome depends not just on the tokens the model predicts but also on the environment's response, which depends on external factors. Finally, the reward system becomes more granular, incorporating intermediate "process rewards" for successfully completing steps along the way, rather than just a single reward at the very end. This provides more frequent and precise guidance to the agent during training.
This last bit is especially important and addresses the “sparse reward” problem that most RL frameworks face. When the agent receives a single reward signal based on the final outcome, it does not learn from the right and wrong intermediate steps it has taken along the way. Process rewards solve this problem by providing feedback signals on these intermediate steps, making the learning process much more efficient.
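A toy sketch of the idea: intermediate steps earn partial credit in addition to the final outcome reward. The weights and step checks below are illustrative, not the paper's actual reward function.

```python
# Sketch of process rewards: score intermediate steps (useful tool calls, relevant
# retrievals) in addition to the final outcome, instead of one sparse end reward.
def trajectory_reward(steps: list[dict], final_answer: str, gold_answer: str) -> float:
    reward = 0.0
    for step in steps:
        if step.get("tool_call_succeeded"):
            reward += 0.1    # process reward: a well-formed, successful intermediate action
        if step.get("retrieved_relevant_doc"):
            reward += 0.2    # process reward: found supporting evidence along the way
    if final_answer.strip().lower() == gold_answer.strip().lower():
        reward += 1.0        # outcome reward: the final answer matches the reference
    return reward
```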
“These extensions are crucial for enabling reinforcement learning algorithms to train sophisticated Agents capable of complex, multi-step reasoning and interaction within dynamic environments,” the researchers write in their paper.
The Agent-R1 framework
Based on the extended MDP definition, the researchers developed Agent-R1, a flexible and user-friendly training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, allowing for seamless integration with diverse environments.
The most significant difference lies in the "rollout phase," where the agent generates responses. In single-turn RL, the model generates a response once. In multi-turn RL, the process involves a series of complex back-and-forth interactions.
Agent-R1 achieves this flexible multi-turn rollout with two core modules: Tool and ToolEnv. The Tool module acts as an executor for specific actions such as calling an API or accessing a database. When invoked, a Tool performs its action and returns the direct, raw outcome. In contrast, the ToolEnv module is the orchestrator and interpreter. It takes the output from the Tool and determines how that outcome affects the agent's state and the overall task progress. ToolEnv manages state transitions, calculates reward signals based on tool outcomes and packages the new state information for the agent.
In short, when an action is complete, the Tool reports "what happened," while ToolEnv dictates "what this outcome means for the agent and the task."
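A hedged sketch of that division of labor in Python appears below; the class and method names are illustrative, not Agent-R1's actual API.

```python
# Sketch of the Tool / ToolEnv split: the Tool reports "what happened," while ToolEnv
# decides what that outcome means for the agent's state and reward. Names are illustrative.
class SearchTool:
    def execute(self, query: str) -> list[str]:
        """Perform the raw action (e.g., hit a retrieval index) and return raw results."""
        return ["...retrieved passage..."]   # placeholder for a real retriever

class ToolEnv:
    def __init__(self, tools: dict):
        self.tools = tools
        self.history: list[dict] = []

    def step(self, tool_name: str, tool_input: str) -> tuple[str, float]:
        """Run a tool, fold its outcome into the state, and compute a process reward."""
        raw = self.tools[tool_name].execute(tool_input)
        reward = 0.2 if raw else 0.0                       # intermediate signal for a useful call
        observation = f"<result>{' '.join(raw)}</result>"  # appended to the agent's context
        self.history.append({"tool": tool_name, "input": tool_input, "output": raw})
        return observation, reward

env = ToolEnv(tools={"search": SearchTool()})
obs, r = env.step("search", "Which film won Best Picture in 1998, and who directed it?")
```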
Agent-R1 in action
The researchers tested Agent-R1 on the challenging task of multi-hop question answering, which requires complex reasoning, information retrieval across multiple documents and multi-step decision-making. They trained Qwen2.5-3B-Instruct on QA datasets and evaluated its performance on the HotpotQA and 2WikiMultihopQA datasets. They also tested it on the Musique dataset, which was out of the domain of tasks the agent was trained on.
They compared various RL algorithms trained with Agent-R1 against two baselines: Naive RAG, a single-pass retrieval method where an LLM answers based on one set of retrieved documents, and Base Tool Call, which uses the model's native function-calling ability without specialized RL training.
The results demonstrated that all RL-trained agents substantially outperformed the baselines. GRPO, an RL algorithm used in advanced reasoning models like DeepSeek-R1, delivered the best overall performance.
“These results robustly validate Agent-R1’s efficacy in training powerful LLM agents via end-to-end RL, showing consistent, substantial gains over baselines across diverse datasets and RL algorithms,” the researchers write.
These findings can be significant for the enterprise, where there is a strong push to apply RL and reasoning beyond well-defined domains. A framework designed to handle messy, multi-turn interactions with users and dynamic environments can pave the way for new agents capable of solving complex problems in real-world settings.
“We hope Agent-R1 provides a foundation for future work on scalable and unified RL training for agentic LLMs,” the researchers conclude.
This weekend, Andrej Karpathy, the former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he did not want to read it alone. He wanted to read it accompanied by a committee of artificial intelligences, each offering its own perspective, critiquing the others, and eventually synthesizing a final answer under the guidance of a "Chairman."
To make this happen, Karpathy wrote what he called a "vibe code project" — a piece of software written quickly, largely by AI assistants, intended for fun rather than function. He posted the result, a repository called "LLM Council," to GitHub with a stark disclaimer: "I’m not going to support it in any way... Code is ephemeral now and libraries are over."
Yet, for technical decision-makers across the enterprise landscape, looking past the casual disclaimer reveals something far more significant than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy has sketched a reference architecture for the most critical, undefined layer of the modern software stack: the orchestration middleware sitting between corporate applications and the volatile market of AI models.
As companies finalize their platform investments for 2026, LLM Council offers a stripped-down look at the "build vs. buy" reality of AI infrastructure. It demonstrates that while the logic of routing and aggregating AI models is surprisingly simple, the operational wrapper required to make it enterprise-ready is where the true complexity lies.
How the LLM Council works: Four AI models debate, critique, and synthesize answers
To the casual observer, the LLM Council web application looks almost identical to ChatGPT. A user types a query into a chat box. But behind the scenes, the application triggers a sophisticated, three-stage workflow that mirrors how human decision-making bodies operate.
First, the system dispatches the user’s query to a panel of frontier models. In Karpathy’s default configuration, this includes OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. These models generate their initial responses in parallel.
In the second stage, the software performs a peer review. Each model is fed the anonymized responses of its counterparts and asked to evaluate them based on accuracy and insight. This step transforms the AI from a generator into a critic, forcing a layer of quality control that is rare in standard chatbot interactions.
Finally, a designated "Chairman LLM" — currently configured as Google’s Gemini 3 — receives the original query, the individual responses, and the peer rankings. It synthesizes this mass of context into a single, authoritative answer for the user.
Karpathy noted that the results were often surprising. "Quite often, the models are surprisingly willing to select another LLM's response as superior to their own," he wrote on X (formerly Twitter). He described using the tool to read book chapters, observing that the models consistently praised GPT-5.1 as the most insightful while rating Claude the lowest. However, Karpathy’s own qualitative assessment diverged from his digital council; he found GPT-5.1 "too wordy" and preferred the "condensed and processed" output of Gemini.
FastAPI, OpenRouter, and the case for treating frontier models as swappable components
For CTOs and platform architects, the value of LLM Council lies not in its literary criticism, but in its construction. The repository serves as a primary document showing exactly what a modern, minimal AI stack looks like in late 2025.
The application is built on a "thin" architecture. The backend uses FastAPI, a modern Python framework, while the frontend is a standard React application built with Vite. Data storage is handled not by a complex database, but by simple JSON files written to the local disk.
The linchpin of the entire operation is OpenRouter, an API aggregator that normalizes the differences between various model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, and Anthropic. The application does not know or care which company provides the intelligence; it simply sends a prompt and awaits a response.
This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped by editing a single line in a configuration file — specifically the COUNCIL_MODELS list in the backend code — the architecture protects the application from vendor lock-in. If a new model from Meta or Mistral tops the leaderboards next week, it can be added to the council in seconds.
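A condensed sketch of that dispatch-review-synthesize loop against OpenRouter's OpenAI-compatible endpoint is shown below. The model identifiers, prompts and structure are illustrative assumptions, not Karpathy's actual LLM Council code.

```python
# Sketch of the three council stages via OpenRouter's OpenAI-compatible chat API.
# Model ids and prompts are assumptions; swap entries in COUNCIL_MODELS to change the panel.
import os, requests

COUNCIL_MODELS = ["openai/gpt-5.1", "google/gemini-3-pro-preview",
                  "anthropic/claude-sonnet-4.5", "x-ai/grok-4"]   # ids are assumptions
CHAIRMAN = "google/gemini-3-pro-preview"

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

def council(query: str) -> str:
    # Stage 1: every council member answers the query independently.
    answers = [ask(m, query) for m in COUNCIL_MODELS]
    numbered = "\n\n".join(f"Response {i + 1}:\n{a}" for i, a in enumerate(answers))
    # Stage 2: each member reviews the anonymized answers of its counterparts.
    reviews = [ask(m, f"Rank these responses to '{query}' by accuracy and insight:\n{numbered}")
               for m in COUNCIL_MODELS]
    # Stage 3: the Chairman synthesizes query, answers and rankings into one reply.
    return ask(CHAIRMAN, f"Question: {query}\n\nAnswers:\n{numbered}\n\nPeer reviews:\n"
                         + "\n\n".join(reviews) + "\n\nWrite the final, authoritative answer.")
```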
What's missing from prototype to production: Authentication, PII redaction, and compliance
While the core logic of LLM Council is elegant, it also serves as a stark illustration of the gap between a "weekend hack" and a production system. For an enterprise platform team, cloning Karpathy’s repository is merely step one of a marathon.
A technical audit of the code reveals the missing "boring" infrastructure that commercial vendors sell for premium prices. The system lacks authentication; anyone with access to the web interface can query the models. There is no concept of user roles, meaning a junior developer has the same access rights as the CIO.
Furthermore, the governance layer is nonexistent. In a corporate environment, sending data to four different external AI providers simultaneously triggers immediate compliance concerns. There is no mechanism here to redact Personally Identifiable Information (PII) before it leaves the local network, nor is there an audit log to track who asked what.
Reliability is another open question. The system assumes the OpenRouter API is always up and that the models will respond in a timely fashion. It lacks the circuit breakers, fallback strategies, and retry logic that keep business-critical applications running when a provider suffers an outage.
These absences are not flaws in Karpathy’s code — he explicitly stated he does not intend to support or improve the project — but they define the value proposition for the commercial AI infrastructure market.
Companies like LangChain, AWS Bedrock, and various AI gateway startups are essentially selling the "hardening" around the core logic that Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable enterprise platform.
Why Karpathy believes code is now "ephemeral" and traditional software libraries are obsolete
Perhaps the most provocative aspect of the project is the philosophy under which it was built. Karpathy described the development process as "99% vibe-coded," implying he relied heavily on AI assistants to generate the code rather than writing it line-by-line himself.
"Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like," he wrote in the repository’s documentation.
This statement marks a radical shift in software engineering capability. Traditionally, companies build internal libraries and abstractions to manage complexity, maintaining them for years. Karpathy is suggesting a future where code is treated as "promptable scaffolding" — disposable, easily rewritten by AI, and not meant to last.
For enterprise decision-makers, this poses a difficult strategic question. If internal tools can be "vibe coded" in a weekend, does it make sense to buy expensive, rigid software suites for internal workflows? Or should platform teams empower their engineers to generate custom, disposable tools that fit their exact needs for a fraction of the cost?
When AI models judge AI: The dangerous gap between machine preferences and human needs
Beyond the architecture, the LLM Council project inadvertently shines a light on a specific risk in automated AI deployment: the divergence between human and machine judgment.
Karpathy’s observation that his models preferred GPT-5.1, while he preferred Gemini, suggests that AI models may have shared biases. They might favor verbosity, specific formatting, or rhetorical confidence that does not necessarily align with human business needs for brevity and accuracy.
As enterprises increasingly rely on "LLM-as-a-Judge" systems to evaluate the quality of their customer-facing bots, this discrepancy matters. If the automated evaluator consistently rewards "wordy and sprawled" answers while human customers want concise solutions, the metrics will show success while customer satisfaction plummets. Karpathy’s experiment suggests that relying solely on AI to grade AI is a strategy fraught with hidden alignment issues.
What enterprise platform teams can learn from a weekend hack before building their 2026 stack
Ultimately, LLM Council acts as a Rorschach test for the AI industry. For the hobbyist, it is a fun way to read books. For the vendor, it is a threat, proving that the core functionality of their products can be replicated in a few hundred lines of code.
But for the enterprise technology leader, it is a reference architecture. It demystifies the orchestration layer, showing that the technical challenge is not in routing the prompts, but in governing the data.
As platform teams head into 2026, many will likely find themselves staring at Karpathy’s code, not to deploy it, but to understand it. It proves that a multi-model strategy is not technically out of reach. The question remains whether companies will build the governance layer themselves or pay someone else to wrap the "vibe code" in enterprise-grade armor.
It's not just Google's Gemini 3, Nano Banana Pro, and Anthropic's Claude Opus 4.5 we have to be thankful for this year around the Thanksgiving holiday here in the U.S.
No, today the German AI startup Black Forest Labs released FLUX.2, a new image generation and editing system complete with four different models designed to support production-grade creative workflows.
FLUX.2 introduces multi-reference conditioning, higher-fidelity outputs, and improved text rendering, and it expands the company’s open-core ecosystem with both commercial endpoints and open-weight checkpoints.
While Black Forest Labs launched with, and made its name on, open-source text-to-image models in its Flux family, today's release includes one fully open-source component: the Flux.2 VAE, available now under the Apache 2.0 license.
Four other models of varying sizes and uses round out the family: Flux.2 [Pro] and Flux.2 [Flex] remain proprietary hosted offerings; Flux.2 [Dev] is an open-weight downloadable model that requires a commercial license obtained directly from Black Forest Labs for any commercial use; and Flux.2 [Klein], an upcoming open-source model, will also be released under Apache 2.0 when available.
But the open-source Flux.2 VAE, or variational autoencoder, is important and useful to enterprises for several reasons. This is a module that compresses images into a latent space and reconstructs them back into high-resolution outputs; in Flux.2, it defines the latent representation used across the four model variants (see below), enabling higher-quality reconstructions, more efficient training and 4-megapixel editing.
Because this VAE is open and freely usable, enterprises can adopt the same latent space used by BFL’s commercial models in their own self-hosted pipelines, gaining interoperability between internal systems and external providers while avoiding vendor lock-in.
The availability of a fully open, standardized latent space also enables practical benefits beyond media-focused organizations. Enterprises can use an open-source VAE as a stable, shared foundation for multiple image-generation models, allowing them to switch or mix generators without reworking downstream tools or workflows.
Standardizing on a transparent, Apache-licensed VAE supports auditability and compliance requirements, ensures consistent reconstruction quality across internal assets, and allows future models trained for the same latent space to function as drop-in replacements.
This transparency also enables downstream customization such as lightweight fine-tuning for brand styles or internal visual templates—even for organizations that do not specialize in media but rely on consistent, controllable image generation for marketing materials, product imagery, documentation, or stock-style visuals.
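As a sketch of that shared-latent-space workflow, the snippet below round-trips an image through an Apache-licensed VAE using Hugging Face diffusers. The repository id is a placeholder assumption; consult Black Forest Labs' actual release for the correct weights and any FLUX.2-specific loader class.

```python
# Hedged sketch: encode an image into a shared latent space and decode it back using
# an open VAE via diffusers. The repo id below is a placeholder assumption.
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.2-VAE")   # hypothetical repo id

image = Image.open("product_shot.png").convert("RGB")
pixels = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0      # scale to [-1, 1]
pixels = pixels.permute(2, 0, 1).unsqueeze(0)                          # (1, 3, H, W)

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()   # compress into the shared latent space
    recon = vae.decode(latents).sample                  # reconstruct back to pixel space

print(latents.shape)  # downstream models trained on this latent space can consume `latents`
```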
The announcement positions FLUX.2 as an evolution of the FLUX.1 family, with an emphasis on reliability, controllability, and integration into existing creative pipelines rather than one-off demos.
A Shift Toward Production-Centric Image Models
FLUX.2 extends the prior FLUX.1 architecture with more consistent character, layout, and style adherence across up to ten reference images.
The system maintains coherence at 4-megapixel resolutions for both generation and editing tasks, enabling use cases such as product visualization, brand-aligned asset creation, and structured design workflows.
The model also improves prompt following across multi-part instructions while reducing failure modes related to lighting, spatial logic, and world knowledge.
In parallel, Black Forest Labs continues to follow an open-core release strategy. The company provides hosted, performance-optimized versions of FLUX.2 for commercial deployments, while also publishing inspectable open-weight models that researchers and independent developers can run locally. This approach extends a track record begun with FLUX.1, which became the most widely used open image model globally.
Model Variants and Deployment Options
Flux.2 arrives with 5 variants as follows:
- Flux.2 [Pro]: This is the highest-performance tier, intended for applications that require minimal latency and maximal visual fidelity. It is available through the BFL Playground, the FLUX API, and partner platforms. The model aims to match leading closed-weight systems in prompt adherence and image quality while reducing compute demand.
- Flux.2 [Flex]: This version exposes parameters such as the number of sampling steps and the guidance scale. The design enables developers to tune the trade-offs between speed, text accuracy, and detail fidelity. In practice, this enables workflows where low-step previews can be generated quickly before higher-step renders are invoked.
- Flux.2 [Dev]: The most notable release for the open ecosystem is this 32-billion-parameter open-weight checkpoint, which integrates text-to-image generation and image editing into a single model. It supports multi-reference conditioning without requiring separate modules or pipelines. The model can run locally using BFL’s reference inference code or optimized fp8 implementations developed in partnership with NVIDIA and ComfyUI. Hosted inference is also available via FAL, Replicate, Runware, Verda, TogetherAI, Cloudflare, and DeepInfra.
- Flux.2 [Klein]: Coming soon, this size-distilled model will be released under Apache 2.0 and is intended to offer improved performance relative to comparable models of the same size trained from scratch. A beta program is currently open.
- Flux.2 – VAE: Released under the enterprise-friendly (even for commercial use) Apache 2.0 license, the updated variational autoencoder provides the latent space that underpins all Flux.2 variants. The VAE emphasizes an optimized balance between reconstruction fidelity, learnability, and compression rate—a long-standing challenge for latent-space generative architectures.
Benchmark Performance
Black Forest Labs published two sets of evaluations highlighting FLUX.2’s performance relative to other open-weight and hosted image-generation models. In head-to-head win-rate comparisons across three categories—text-to-image generation, single-reference editing, and multi-reference editing—FLUX.2 [Dev] led all open-weight alternatives by a substantial margin.
It achieved a 66.6% win rate in text-to-image generation (vs. 51.3% for Qwen-Image and 48.1% for Hunyuan Image 3.0), 59.8% in single-reference editing (vs. 49.3% for Qwen-Image and 41.2% for FLUX.1 Kontext), and 63.6% in multi-reference editing (vs. 36.4% for Qwen-Image). These results reflect consistent gains over both earlier FLUX.1 models and contemporary open-weight systems.
A second benchmark compared model quality using ELO scores against approximate per-image cost. In this analysis, FLUX.2 [Pro], FLUX.2 [Flex], and FLUX.2 [Dev] cluster in the upper-quality, lower-cost region of the chart, with ELO scores in the ~1030–1050 band while operating in the 2–6 cent range.
By contrast, earlier models such as FLUX.1 Kontext [max] and Hunyuan Image 3.0 appear significantly lower on the ELO axis despite similar or higher per-image costs. Only proprietary competitors like Nano Banana 2 reach higher ELO levels, but at noticeably elevated cost. According to BFL, this positions FLUX.2’s variants as offering strong quality–cost efficiency across performance tiers, with FLUX.2 [Dev] in particular delivering near–top-tier quality while remaining one of the lowest-cost options in its class.
Pricing via API and Comparison to Nano Banana Pro
A pricing calculator on BFL’s site indicates that FLUX.2 [Pro] is billed at roughly $0.03 per megapixel of combined input and output. A standard 1024×1024 (1 MP) generation costs $0.030, and higher resolutions scale proportionally. The calculator also counts input images toward total megapixels, suggesting that multi-image reference workflows will have higher per-call costs.
By contrast, Google’s Gemini 3 Pro Image Preview aka "Nano Banana Pro," currently prices image output at $120 per 1M tokens, resulting in a cost of $0.134 per 1K–2K image (up to 2048×2048) and $0.24 per 4K image. Image input is billed at $0.0011 per image, which is negligible compared to output costs.
While Gemini’s model uses token-based billing, its effective per-image pricing places 1K–2K images at more than 4× the cost of a 1 MP FLUX.2 [Pro] generation, and 4K outputs at roughly 8× the cost of a similar-resolution FLUX.2 output if scaled proportionally.
In practical terms, the available data suggests that FLUX.2 [Pro] currently offers significantly lower per-image pricing, particularly for high-resolution outputs or multi-image editing workflows, whereas Gemini 3 Pro’s preview tier is positioned as a higher-cost, token-metered service with more variability depending on resolution.
Technical Design and the Latent Space Overhaul
FLUX.2 is built on a latent flow matching architecture, combining a rectified flow transformer with a vision-language model based on Mistral-3 (24B). The VLM contributes semantic grounding and contextual understanding, while the transformer handles spatial structure, material representation, and lighting behavior.
A major component of the update is the re-training of the model’s latent space. The FLUX.2 VAE integrates advances in semantic alignment, reconstruction quality, and representational learnability drawn from recent research on autoencoder optimization. Earlier models often faced trade-offs in the learnability–quality–compression triad: highly compressed spaces increase training efficiency but degrade reconstructions, while wider bottlenecks can reduce the ability of generative models to learn consistent transformations.
According to BFL’s research data, the FLUX.2 VAE achieves lower LPIPS distortion than the FLUX.1 and SD autoencoders while also improving generative FID. This balance allows FLUX.2 to support high-fidelity editing—an area that typically demands reconstruction accuracy—and still maintain competitive learnability for large-scale generative training.
Capabilities Across Creative Workflows
The most significant functional upgrade is multi-reference support. FLUX.2 can ingest up to ten reference images and maintain identity, product details, or stylistic elements across the output. This feature is relevant for commercial applications such as merchandising, virtual photography, storyboarding, and branded campaign development.
The system’s typography improvements address a persistent challenge for diffusion- and flow-based architectures. FLUX.2 is able to generate legible fine text, structured layouts, UI elements, and infographic-style assets with greater reliability. This capability, combined with flexible aspect ratios and high-resolution editing, broadens the use cases where text and image jointly define the final output.
FLUX.2 enhances instruction following for multi-step, compositional prompts, enabling more predictable outcomes in constrained workflows. The model exhibits better grounding in physical attributes—such as lighting and material behavior—reducing inconsistencies in scenes requiring photoreal equilibrium.
Ecosystem and Open-Core Strategy
Black Forest Labs continues to position its models within an ecosystem that blends open research with commercial reliability. The FLUX.1 open models helped establish the company’s reach across both the developer and enterprise markets, and FLUX.2 expands this structure: tightly optimized commercial endpoints for production deployments and open, composable checkpoints for research and community experimentation.
The company emphasizes transparency through published inference code, open-weight VAE release, prompting guides, and detailed architectural documentation. It also continues to recruit talent in Freiburg and San Francisco as it pursues a longer-term roadmap toward multimodal models that unify perception, memory, reasoning, and generation.
Background: Flux and the Formation of Black Forest Labs
Black Forest Labs (BFL) was founded in 2024 by Robin Rombach, Patrick Esser, and Andreas Blattmann, the original creators of Stable Diffusion. Their move from Stability AI came at a moment of turbulence for the broader open-source generative AI community, and the launch of BFL signaled a renewed effort to build accessible, high-performance image models. The company secured $31 million in seed funding led by Andreessen Horowitz, with additional support from Brendan Iribe, Michael Ovitz, and Garry Tan, providing early validation for its technical direction.
BFL’s first major release, FLUX.1, introduced a 12-billion-parameter architecture available in Pro, Dev, and Schnell variants. It quickly gained a reputation for output quality that matched or exceeded closed-source competitors such as Midjourney v6 and DALL·E 3, while the Dev and Schnell versions reinforced the company’s commitment to open distribution. FLUX.1 also saw rapid adoption in downstream products, including xAI’s Grok 2, and arrived amid ongoing industry discussions about dataset transparency, responsible model usage, and the role of open-source distribution. BFL published strict usage policies aimed at preventing misuse and non-consensual content generation.
In late 2024, BFL expanded the lineup with Flux 1.1 Pro, a proprietary high-speed model delivering sixfold generation speed improvements and achieving leading ELO scores on Artificial Analysis. The company launched a paid API alongside the release, enabling configurable integrations with adjustable resolution, model choice, and moderation settings at pricing that began at $0.04 per image.
Partnerships with TogetherAI, Replicate, FAL, and Freepik broadened access and made the model available to users without the need for self-hosting, extending BFL’s reach across commercial and creator-oriented platforms.
These developments unfolded against a backdrop of accelerating competition in generative media.
Implications for Enterprise Technical Decision Makers
The FLUX.2 release carries distinct operational implications for enterprise teams responsible for AI engineering, orchestration, data management, and security. For AI engineers responsible for model lifecycle management, the availability of both hosted endpoints and open-weight checkpoints enables flexible integration paths.
FLUX.2’s multi-reference capabilities and expanded resolution support reduce the need for bespoke fine-tuning pipelines when handling brand-specific or identity-consistent outputs, lowering development overhead and accelerating deployment timelines. The model’s improved prompt adherence and typography performance also reduce iterative prompting cycles, which can have a measurable impact on production workload efficiency.
Teams focused on AI orchestration and operational scaling benefit from the structure of FLUX.2’s product family. The Pro tier offers predictable latency characteristics suitable for pipeline-critical workloads, while the Flex tier enables direct control over sampling steps and guidance parameters, aligning with environments that require strict performance tuning.
Open-weight access for the Dev model facilitates the creation of custom containerized deployments and allows orchestration platforms to manage the model under existing CI/CD practices. This is particularly relevant for organizations balancing cutting-edge tooling with budget constraints, as self-hosted deployments offer cost control at the expense of in-house optimization requirements.
Data engineering stakeholders gain advantages from the model’s latent architecture and improved reconstruction fidelity. High-quality, predictable image representations reduce downstream data-cleaning burdens in workflows where generated assets feed into analytics systems, creative automation pipelines, or multimodal model development.
Because FLUX.2 consolidates text-to-image and image-editing functions into a single model, it simplifies integration points and reduces the complexity of data flows across storage, versioning, and monitoring layers. For teams managing large volumes of reference imagery, the ability to incorporate up to ten inputs per generation may also streamline asset management processes by shifting more variation handling into the model rather than external tooling.
For security teams, FLUX.2’s open-core approach introduces considerations related to access control, model governance, and API usage monitoring. Hosted FLUX.2 endpoints allow for centralized enforcement of security policies and reduce local exposure to model weights, which may be preferable for organizations with stricter compliance requirements.
Conversely, open-weight deployments require internal controls for model integrity, version tracking, and inference-time monitoring to prevent misuse or unapproved modifications. The model’s handling of typography and realistic compositions also reinforces the need for established content governance frameworks, particularly where generative systems interface with public-facing channels.
Across these roles, FLUX.2’s design emphasizes predictable performance characteristics, modular deployment options, and reduced operational friction. For enterprises with lean teams or rapidly evolving requirements, the release offers a set of capabilities aligned with practical constraints around speed, quality, budget, and model governance.
FLUX.2 marks a substantial iterative improvement in Black Forest Labs’ generative image stack, with notable gains in multi-reference consistency, text rendering, latent space quality, and structured prompt adherence. By pairing fully managed offerings with open-weight checkpoints, BFL maintains its open-core model while extending its relevance to commercial creative workflows. The release demonstrates a shift from experimental image generation toward more predictable, scalable, and controllable systems suited for operational use.