• OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer. 

    For real-world applications, this technique evolves the creation of more transparent and steerable AI systems.

    What are confessions?

    Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.

    A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.

    In a blog post, the OpenAI researchers provide a few examples the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them." 

    The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.

    How confession training works

    The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.

    This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem. 

    Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.

    However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.

    What it means for enterprise AI

    OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The company is also working toward plugging these holes as they emerge.

    For AI applications, mechanisms such as confessions can provide a practical monitoring mechanism. The structured output from a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.

    In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements for safe and reliable deployment.

    “As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they are doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”

  • The debate over whether artificial intelligence belongs in the corporate boardroom appears to be over — at least for the people responsible for generating revenue.

    Seven in ten enterprise revenue leaders now trust AI to regularly inform their business decisions, according to a sweeping new study released Thursday by Gong, the revenue intelligence company. The finding marks a dramatic shift from just two years ago, when most organizations treated AI as an experimental technology relegated to pilot programs and individual productivity hacks.

    The research, based on an analysis of 7.1 million sales opportunities across more than 3,600 companies and a survey of over 3,000 global revenue leaders spanning the United States, United Kingdom, Australia, and Germany, paints a picture of an industry in rapid transformation. Organizations that have embedded AI into their core go-to-market strategies are 65 percent more likely to increase their win rates than competitors still treating the technology as optional.

    "I don't think people delegate decisions to AI, but they do rely on AI in the process of making decisions," Amit Bendov, Gong's co-founder and chief executive, said in an exclusive interview with VentureBeat. "Humans are making the decision, but they're largely assisted."

    The distinction matters. Rather than replacing human judgment, AI has become what Bendov describes as a "second opinion" — a data-driven check on the intuition and guesswork that has traditionally governed sales forecasting and strategy.

    Slowing growth is forcing sales teams to squeeze more from every rep

    The timing of AI's ascendance in revenue organizations is no coincidence. The study reveals a sobering reality: after rebounding in 2024, average annual revenue growth among surveyed companies decelerated to 16 percent in 2025, marking a three-percentage-point decline year over year. Sales rep quota attainment fell from 52 percent to 46 percent over the same period.

    The culprit, according to Gong's analysis, isn't that salespeople are performing worse on individual deals. Win rates and deal duration remained consistent. The problem is that representatives are working fewer opportunities—a finding that suggests operational inefficiencies are eating into selling time.

    This helps explain why productivity has rocketed to the top of executive priorities. For the first time in the study's history, increasing the productivity of existing teams ranked as the number-one growth strategy for 2026, jumping from fourth place the previous year.

    "The focus is on increasing sales productivity," Bendov said. "How much dollar-output per dollar-input."

    The numbers back up the urgency. Teams where sellers regularly use AI tools generate 77 percent more revenue per representative than those that don't — a gap Gong characterizes as a six-figure difference per salesperson annually.

    Companies are moving beyond basic AI automation toward strategic decision-making

    The nature of AI adoption in sales has evolved considerably over the past year. In 2024, most revenue teams used AI for basic automation: transcribing calls, drafting emails, updating CRM records. Those use cases continue to grow, but 2025 marked what the report calls a shift "from automation to intelligence."

    The number of U.S. companies using AI for forecasting and measuring strategic initiatives jumped 50 percent year over year. These more sophisticated applications — predicting deal outcomes, identifying at-risk accounts, measuring which value propositions resonate with different buyer personas — correlate with dramatically better results.

    Organizations in the 95th percentile of commercial impact from AI were two to four times more likely to have deployed these strategic use cases, according to the study.

    Bendov offered a concrete example of how this plays out in practice. "Companies have thousands of deals that they roll up into their forecast," he said. "It used to be based solely on human sentiment—believe it or not. That's why a lot of companies miss their numbers: because people say, 'Oh, he told me he'll buy,' or 'I think I can probably get this one.'"

    AI changes that calculus by examining evidence rather than optimism. "Companies now get a second opinion from AI on their forecasting, and that improves forecasting accuracy dramatically — 10 [or] 15 percent better accuracy just because it's evidence-based, not just based on human sentiment," Bendov said.

    Revenue-specific AI tools are dramatically outperforming general-purpose alternatives

    One of the study's more provocative findings concerns the type of AI that delivers results. Teams using revenue-specific AI solutions — tools built explicitly for sales workflows rather than general-purpose platforms like ChatGPT — reported 13 percent higher revenue growth and 85 percent greater commercial impact than those relying on generic tools.

    These specialized systems were also twice as likely to be deployed for forecasting and predictive modeling, the report found.

    The finding carries obvious implications for Gong, which sells precisely this type of domain-specific platform. But the data suggests a real distinction in outcomes. General-purpose AI, while more prevalent, often creates what the report describes as a "blind spot" for organizations — particularly when employees adopt consumer AI tools without company oversight.

    Research from MIT suggests that while only 59 percent of survey respondents said their teams use personal AI tools like ChatGPT at work, the actual figure is likely closer to 90 percent. This shadow AI usage poses security risks and creates fragmented technology stacks that undermine the potential for organization-wide intelligence.

    Most sales leaders believe AI will reshape their jobs rather than eliminate them

    Perhaps the most closely watched question in any AI study concerns employment. The Gong research offers a more nuanced picture than the apocalyptic predictions that often dominate headlines.

    When asked about AI's three-year impact on revenue headcount, 43 percent of respondents said they expect it to transform jobs without reducing headcount — the most common response. Only 28 percent anticipate job eliminations, while 21 percent actually foresee AI creating new roles. Just 8 percent predict minimal impact.

    Bendov frames the opportunity in terms of reclaiming lost time. He cited Forrester research indicating that 77 percent of a sales representative's time is spent on activities that don't involve customers — administrative work, meeting preparation, researching accounts, updating forecasts, and internal briefings.

    "AI can eliminate, ideally, all 77 percent—all the drudgery work that they're doing," Bendov said. "I don't think it necessarily eliminates jobs. People are half productive right now. Let's make them fully productive, and whatever you're paying them will translate to much higher revenue."

    The transformation is already visible in role consolidation. Over the past decade, sales organizations splintered into hyper-specialized functions: one person qualifies leads, another sets appointments, a third closes deals, a fourth handles onboarding. The result was customers interacting with five or six different people across their buying journey.

    "Which is not a great buyer experience, because every time I meet a new person that might not have the full context, and it's very inefficient for companies," Bendov said. "Now with AI, you can have one person do all this, or much of this."

    At Gong itself, sellers now generate 80 percent of their own appointments because AI handles the prospecting legwork, Bendov said.

    American companies are adopting AI 18 months faster than their European counterparts

    The study reveals a notable divide in AI adoption between the United States and Europe. While 87 percent of U.S. companies now use AI in their revenue operations, with another 9 percent planning adoption within a year, the United Kingdom trails by 12 to 18 months. Just 70 percent of UK companies currently use AI, with 22 percent planning near-term adoption — figures that mirror U.S. data from 2024.

    Bendov said the pattern reflects a broader historical tendency for enterprise technology trends to cross the Atlantic with a delay. "It's always like that," he said. "Even when the internet was taking off in the US, Europe was a step behind."

    The gap isn't permanent, he noted, and Europe sometimes leads on technology adoption — mobile payments and messaging apps like WhatsApp gained traction there before the U.S. — but for AI specifically, the American market remains ahead.

    Gong says a decade of AI development gives it an edge over Salesforce and Microsoft

    The findings arrive as Gong navigates an increasingly crowded market. The company, which recently surpassed $300 million in annual recurring revenue, faces potential competition from enterprise software giants like Salesforce and Microsoft, both of which are embedding AI capabilities into their platforms.

    Bendov argues that Gong's decade of AI development creates a substantial barrier to entry. The company's architecture comprises three layers: a "revenue graph" that aggregates customer data from CRM systems, emails, calls, videos, and web signals; an intelligence layer combining large language models with approximately 40 proprietary small language models; and workflow applications built on top.

    "Anybody that would want to build something like that—it's not a small feature, it's 10 years in development—would need first to build the revenue graph," Bendov said.

    Rather than viewing Salesforce and Microsoft as threats, Bendov characterized them as partners, pointing to both companies' participation in Gong's recent user conference to discuss agent interoperability. The rise of MCP (Model Context Protocol) support and consumption-based pricing models means customers can mix AI agents from multiple vendors rather than committing to a single platform.

    The real question is whether AI will expand the sales profession or hollow it out

    The report's implications extend beyond sales departments. If AI can transform revenue operations — long considered a relationship-driven, human-centric function — it raises questions about which other business processes might be next.

    Bendov sees the potential for expansion rather than contraction. Drawing an analogy to digital photography, he noted that while camera manufacturers suffered, the total number of photos taken exploded once smartphones made photography effortless.

    "If AI makes selling simple, I could see a world—I don't know exactly what it looks like yet—but why not?" Bendov said. "Maybe ten times more jobs than we have now. It's expensive and inefficient today, but if it becomes as easy as taking a photo, the industry could actually grow and create opportunities for people of different abilities, from different locations."

    For Bendov, who co-founded Gong in 2015 when AI was still a hard sell to non-technical business users, the current moment represents something he waited a decade to see. Back then, mentioning AI to sales executives sounded like science fiction. The company struggled to raise money because the underlying technology barely existed.

    "When we started the company, we were born as an AI company, but we had to almost hide AI," Bendov recalled. "It was intimidating."

    Now, seven out of ten of those same executives say they trust AI to help run their business. The technology that once had to be disguised has become the one thing nobody can afford to ignore.

  • Amazon Web Services on Wednesday introduced Kiro powers, a system that allows software developers to give their AI coding assistants instant, specialized expertise in specific tools and workflows — addressing what the company calls a fundamental bottleneck in how artificial intelligence agents operate today.

    AWS made the announcement at its annual re:Invent conference in Las Vegas. The capability marks a departure from how most AI coding tools work today. Typically, these tools load every possible capability into memory upfront — a process that burns through computational resources and can overwhelm the AI with irrelevant information. Kiro powers takes the opposite approach, activating specialized knowledge only at the moment a developer actually needs it.

    "Our goal is to give the agent specialized context so it can reach the right outcome faster — and in a way that also reduces cost," said Deepak Singh, Vice President of Developer Agents and Experiences at Amazon, in an exclusive interview with VentureBeat.

    The launch includes partnerships with nine technology companies: Datadog, Dynatrace, Figma, Neon, Netlify, Postman, Stripe, Supabase, and AWS's own services. Developers can also create and share their own powers with the community.

    Why AI coding assistants choke when developers connect too many tools

    To understand why Kiro powers matters, it helps to understand a growing tension in the AI development tool market.

    Modern AI coding assistants rely on something called the Model Context Protocol, or MCP, to connect with external tools and services. When a developer wants their AI assistant to work with Stripe for payments, Figma for design, and Supabase for databases, they connect MCP servers for each service.

    The problem: each connection loads dozens of tool definitions into the AI's working memory before it writes a single line of code. According to AWS documentation, connecting just five MCP servers can consume more than 50,000 tokens — roughly 40 percent of an AI model's context window — before the developer even types their first request.

    Developers have grown increasingly vocal about this issue. Many complain that they don't want to burn through their token allocations just to have an AI agent figure out which tools are relevant to a specific task. They want to get to their workflow instantly — not watch an overloaded agent struggle to sort through irrelevant context.

    This phenomenon, which some in the industry call "context rot," leads to slower responses, lower-quality outputs, and significantly higher costs — since AI services typically charge by the token.

    Inside the technology that loads AI expertise on demand

    Kiro powers addresses this by packaging three components into a single, dynamically-loaded bundle.

    The first component is a steering file called POWER.md, which functions as an onboarding manual for the AI agent. It tells the agent what tools are available and, crucially, when to use them. The second component is the MCP server configuration itself — the actual connection to external services. The third includes optional hooks and automation that trigger specific actions.

    When a developer mentions "payment" or "checkout" in their conversation with Kiro, the system automatically activates the Stripe power, loading its tools and best practices into context. When the developer shifts to database work, Supabase activates while Stripe deactivates. The baseline context usage when no powers are active approaches zero.

    "You click a button and it automatically loads," Singh said. "Once a power has been created, developers just select 'open in Kiro' and it launches the IDE with everything ready to go."

    How AWS is bringing elite developer techniques to the masses

    Singh framed Kiro powers as a democratization of advanced development practices. Before this capability, only the most sophisticated developers knew how to properly configure their AI agents with specialized context — writing custom steering files, crafting precise prompts, and manually managing which tools were active at any given time.

    "We've found that our developers were adding in capabilities to make their agents more specialized," Singh said. "They wanted to give the agent some special powers to do a specific problem. For example, they wanted their front end developer, and they wanted the agent to become an expert at backend as a service."

    This observation led to a key insight: if Supabase or Stripe could build the optimal context configuration once, every developer using those services could benefit.

    "Kiro powers formalizes that — things that people, only the most advanced people were doing — and allows anyone to get those kind of skills," Singh said.

    Why dynamic loading beats fine-tuning for most AI coding use cases

    The announcement also positions Kiro powers as a more economical alternative to fine-tuning, the process of training an AI model on specialized data to improve its performance in specific domains.

    "It's much cheaper," Singh said, when asked how powers compare to fine-tuning. "Fine-tuning is very expensive, and you can't fine-tune most frontier models."

    This is a significant point. The most capable AI models from Anthropic, OpenAI, and Google are typically "closed source," meaning developers cannot modify their underlying training. They can only influence the models' behavior through the prompts and context they provide.

    "Most people are already using powerful models like Sonnet 4.5 or Opus 4.5," Singh said. "What those models need is to be pointed in the right direction."

    The dynamic loading mechanism also reduces ongoing costs. Because powers only activate when relevant, developers aren't paying for token usage on tools they're not currently using.

    Where Kiro powers fits in Amazon's bigger bet on autonomous AI agents

    Kiro powers arrives as part of a broader push by AWS into what the company calls "agentic AI" — artificial intelligence systems that can operate autonomously over extended periods.

    Earlier at re:Invent, AWS announced three "frontier agents" designed to work for hours or days without human intervention: the Kiro autonomous agent for software development, the AWS security agent, and the AWS DevOps agent. These represent a different approach from Kiro powers — tackling large, ambiguous problems rather than providing specialized expertise for specific tasks.

    The two approaches are complementary. Frontier agents handle complex, multi-day projects that require autonomous decision-making across multiple codebases. Kiro powers, by contrast, gives developers precise, efficient tools for everyday development tasks where speed and token efficiency matter most.

    The company is betting that developers need both ends of this spectrum to be productive.

    What Kiro powers reveals about the future of AI-assisted software development

    The launch reflects a maturing market for AI development tools. GitHub Copilot, which Microsoft launched in 2021, introduced millions of developers to AI-assisted coding. Since then, a proliferation of tools — including Cursor, Cline, and Claude Code — have competed for developers' attention.

    But as these tools have grown more capable, they've also grown more complex. The Model Context Protocol, which Anthropic open-sourced last year, created a standard for connecting AI agents to external services. That solved one problem while creating another: the context overload that Kiro powers now addresses.

    AWS is positioning itself as the company that understands production software development at scale. Singh emphasized that Amazon's experience running AWS for 20 years, combined with its own massive internal software engineering organization, gives it unique insight into how developers actually work.

    "It's not something you would use just for your prototype or your toy application," Singh said of AWS's AI development tools. "If you want to build production applications, there's a lot of knowledge that we bring in as AWS that applies here."

    The road ahead for Kiro powers and cross-platform compatibility

    AWS indicated that Kiro powers currently works only within the Kiro IDE, but the company is building toward cross-compatibility with other AI development tools, including command-line interfaces, Cursor, Cline, and Claude Code. The company's documentation describes a future where developers can "build a power once, use it anywhere" — though that vision remains aspirational for now.

    For the technology partners launching powers today, the appeal is straightforward: rather than maintaining separate integration documentation for every AI tool on the market, they can create a single power that works everywhere Kiro does. As more AI coding assistants crowd into the market, that kind of efficiency becomes increasingly valuable.

    Kiro powers is available now to developers using Kiro IDE version 0.7 or later at no additional charge beyond the standard Kiro subscription.

    The underlying bet is a familiar one in the history of computing: that the winners in AI-assisted development won't be the tools that try to do everything at once, but the ones smart enough to know what to forget.

  • Just a few short weeks ago, Google debuted its Gemini 3 model, claiming it scored a leadership position in multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that — vendor-provided.

    A new vendor-neutral evaluation from Prolific, however, puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about. 

    Prolific was founded by researchers at the University of Oxford. The company delivers high-quality, reliable human data to power rigorous research and ethical AI development. The company's “HUMAINE benchmark” applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not just technical performance but also user trust, adaptability and communication style.

    The latest HUMAINE test evaluated 26,000 users in a blind test of models. In the evaluation, Gemini 3 Pro's trust score surged from 16% to 69%, the highest ever recorded by Prolific. Gemini 3 now ranks number one overall in trust, ethics and safety 69% of the time across demographic subgroups, compared to its predecessor Gemini 2.5 Pro, which held the top spot only 16% of the time.

    Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning, interaction and adaptiveness and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%. The HUMAINE test also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, sex, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose the model in head-to-head blind comparisons.

    But the ranking matters less than why it won.

    "It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark."

    How blinded testing reveals what academic benchmarks miss

    HUMAINE's methodology exposes gaps in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don't know which vendors power each response. They discuss whatever topics matter to them, not predetermined test questions.

    It's the sample itself that matters. HUMAINE uses representative sampling across U.S. and UK populations, controlling for age, sex, ethnicity and political orientation. This reveals something static benchmarks can't capture: Model performance varies by audience.

    "If you take an AI leaderboard, the majority of them still could have a fairly static list," Bradley said. "But for us, if you control for the audience, we end up with a slightly different leaderboard, whether you're looking at a left-leaning sample, right-leaning sample, U.S., UK. And I think age was actually the most different stated condition in our experiment."

    For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic may underperform for another.

    The methodology also addresses a fundamental question in AI evaluation: Why use human judges at all when AI could evaluate itself? Bradley noted that his firm does use AI judges in certain use cases, although he stressed that human evaluation is still the critical factor.

    "We see the biggest benefit coming from smart orchestration of both LLM judge and human data, both have strengths and weaknesses, that, when smartly combined, do better together," said Bradley. "But we still think that human data is where the alpha is. We're still extremely bullish that human data and human intelligence is required to be in the loop."

    What trust means in AI evaluation

    Trust, ethics and safety measures user confidence in reliability, factual accuracy and responsible behavior. In HUMAINE's methodology, trust isn't a vendor claim or a technical metric — it's what users report after blinded conversations with competing models.

    The 69% figure represents probability across demographic groups. This consistency matters more than aggregate scores because organizations can serve diverse populations.

    "There was no awareness that they were using Gemini in this scenario," Bradley said. "It was based only on the blinded multi-turn response."

    This separates perceived trust from earned trust. Users judged model outputs without knowing which vendor produced them, eliminating Google's brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this distinction matters.

    What enterprises should do now

    One of the critical things that enterprises should do now when considering different models is embrace an evaluation framework that works.

    "It is increasingly challenging to evaluate models exclusively based on vibes," Bradley said. "I think increasingly we need more rigorous, scientific approaches to truly understand how these models are performing."

    The HUMAINE data provides a framework: Test for consistency across use cases and user demographics, not just peak performance on specific tasks. Blind the testing to separate model quality from brand perception. Use representative samples that match your actual user population. Plan for continuous evaluation as models change.

    For enterprises looking to deploy AI at scale, this means moving beyond "which model is best" to "which model is best for our specific use case, user demographics and required attributes."

     The rigor of representative sampling and blind testing provides the data to make that determination — something technical benchmarks and vibes-based evaluation cannot deliver.

  • Presented by Indeed


    As AI continues to reshape how we work, organizations are rethinking what skills they need, how they hire, and how they retain talent. According to Indeed’s 2025 Tech Talent report, tech job postings are still down more than 30% from pre-pandemic highs, yet demand for AI expertise has never been greater. New roles are emerging almost overnight, from prompt engineers to AI operations managers, and leaders are under growing pressure to close skill gaps while supporting their teams through change.

    Shibani Ahuja, SVP of enterprise IT strategy at Salesforce; Matt Candy, global managing partner of generative AI strategy and transformation at IBM; and Jessica Hardeman, global head of attraction and engagement at Indeed came together for a recent roundtable conversation about the future of tech talent strategy, from hiring and reskilling to how it's reshaping the workforce.

    Strategies for sourcing talent

    To find the right candidates, organizations need to be certain their communication is clear from the get-go, and that means beginning with a well-thought-out job description, Hardeman said.

    "How clearly are you outlining the skills that are actually required for the role, versus using very high-level or ambiguous language," she said. "Something that I also highly recommend is skill-cluster sourcing. We use that to identify candidates that might be adjacent to these harder-to-find niche skills. That’s something we can upskill people into. For example, skills that are in distributed computing or machine learning frameworks also share other high-value capabilities. Using these clusters can help recruiters identify candidates that may not have that exact skill set you’re looking for, but can quickly upskill into it."

    Recruiters should also be upskilled, able to spot that potential in candidates. And once they're hired, companies have to be intentional about how they’re growing talent from the day they step in the door.

    "What that means in the near term is focusing on the mentorship, embedding that AI fluency into their onboarding experience, into their growth, into their development," she said. "That means offering upskilling that teaches not just the tools they’ll need, but how to think with those tools and alongside those. The new early career sweet spot is where technical skills meet our human strengths. Curiosity. Communication. Data judgment. Workflow design. Those are the things that AI cannot replicate or replace. We have to create mentorship and sponsorship opportunities. Well-being and culture are critical components to ensuring that we’re creating good places for that early-in-career talent to land."

    How work will evolve along AI

    As AI becomes embedded into daily technical work, organizations are rethinking what it means to be a developer, designer, or engineer. Instead of automating roles end to end, companies are increasingly building AI agents that act as teammates, supporting workers across the entire software development lifecycle.

    Candy explained that IBM is already seeing this shift in action through its Consulting Advantage platform, which serves as a unified AI experience layer for consultants and technical teams.

    “This is a platform that every one of our consultants works with,” he said. “It’s supported by every piece of AI technology and model out there. It’s the place where our consultants can access thousands of agents that help them in each job role and activity they’re doing.”

    These aren’t just prebuilt tools — teams can create and publish their own agents into an internal marketplace. That has sparked a systematic effort to map every task across traditional tech roles and build agents to enhance them.

    “If I think about your traditional designer, DevOps engineer, AI Ops engineer — what are all the different agents that are supporting them in those activities?” Candy said. “It’s far more than just coding. Tools like Cursor, Windsurf, and GitHub Copilot accelerate coding, but that’s only one part of delivering software end to end. We’re building agents to support people at every stage of that journey.”

    Candy said this shift leads toward a workplace where AI becomes a collaborative partner rather than a replacement, something that enables tech workers to spend more time on creative, strategic, and human-centered tasks.

    "This future where employees have agents working alongside them, taking care of some of these repetitive activities, focusing on higher-value strategic work where human skills are innately important, I think becomes right at the heart of that,” he explained. “You have to unleash the organization to be able to think and rethink in that way."

    A lot of that depends on the mindset of company leaders, Ahuja said.

    "I can see the difference between leaders that look at AI as cost-cutting, reduction — it’s a bottom-line activity,” she said. “And then there are organizations that are starting to shift their mindset to say, no, the goal is not about replacing people. It’s about reimagining the work to make us humans more human, ironically. For some leaders that’s the story their PR teams have told them to say. But for those that actually believe that AI is about helping us become more human, it’s interesting how they’re bringing that to life and bridging this gap between humanity and digital labor."

    Shifting the culture toward AI

    The companies that are most successful at navigating the obstacles around successful AI implementation and culture change make employees their first priority, Ahuja added. They prioritize use cases that solve the most boring problems that are burdening their teams, demonstrating how AI will help, as opposed to looking at what the maximum number of jobs automation can replace.

    "They’re thinking of it as preserving human accountability, so in high-stakes moments, people will still make that final call," she said. "Looking at where AI is going to excel at scale and speed with pattern recognition, leaving that space for humans to bring their judgement, their ethics, and their emotional intelligence. It seems like a very subtle shift, but it’s pretty big in terms of where it starts at the beginning of an organization and how it trickles down."

    It's also important to build a level of comfort in using AI in employees’ day-to-day work. Salesforce created a Slack chat called Bite-Sized AI in which they encourage every colleague, including company leaders, to talk about where they're using AI and why, and what hacks they've found.

    "That’s creating a safe space," Ahuja explained. "It’s creating that psychological safety — that this isn’t just a buzzword. We’re trying to encourage it through behavior."

    "This is all about how you ignite, especially in big enterprises, the kind of passion and fire inside everyone’s belly," Candy added. "Storytelling, showing examples of what great looks like. The expression is 'demos, not memos'. Stop writing PowerPoint slides explaining what we're going to do and actually getting into the tools to show it in real life.”

    AI makes that continuous learning a non-negotiable, Hardeman added, with companies training employees in understanding how to use the AI tools they're provided, and that goes a long way toward building that AI culture.

    "We view upskilling as a retention lever and a performance driver," she said. "It creates that confidence, it reduces the fear around AI adoption. It helps people see a future for themselves as the technology evolves. AI didn’t just raise the bar on skills. It raised the bar on how we’re trying to support our people. It’s important that we are also rising to that occasion, and we’re not just raising expectations on the folks that we work with."


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Presented by Celonis


    When tariff rates change overnight, companies have 48 hours to model alternatives and act before competitors secure the best options. At Celosphere 2025 in Munich, enterprises demonstrated how they’re turning that chaos into competitive advantage — with quantifiable results that separate winners from losers.

    Vinmar International: Theglobal plastics and chemicals distributor created a real-time digital twin of its $3B supply chain, cutting default expedites by more than 20% and improving delivery agility across global operations.

    Florida Crystals: One of America's largest cane sugar producers, the company unlocked millions in working capital and strengthened supply chain resilience by eliminating manual rework across Finance, Procurement, and Inbound Supply. AI pilots now extend gains into invoice processing, predictive maintenance, and order management.

    ASOS: The ecommerce fashion giant connected its end-to-end supply chain for full transparency, reducing process variation, accelerating speed-to-market, and improving customer experience at scale.

    The common thread here: process intelligence that bridges the gap traditional ERP systems can’t close — connecting operational dots across ERP, finance, and logistics systems when seconds matter.

    “The question isn’t whether disruptions will hit,” says Peter Budweiser, General Manager of Supply Chain at Celonis. “It’s whether your systems can show you what’s breaking fast enough to fix it.”

    That visibility gap costs the average company double-digit millions in working capital and competitive positioning. As 54% of supply chain leaders face disruptions daily, the pressure is shifting to AI agents that execute real actions: triggering purchase orders, rerouting shipments, adjusting inventory. But an autonomous agent acting on stale or siloed data can make million-dollar mistakes when tariff structures shift overnight.

    Tariffs, as old as trade itself, have become the ultimate stress test for enterprise AI — revealing whether companies truly understand their supply chains and whether their AI can be trusted to act.

    Modern ERP: Data rich, insight poor

    Supply chain leaders face a paradox: drowning in data while starving for insight. Traditional enterprise systems — SAP, Oracle, PeopleSoft — capture every transaction meticulously.

    SAP logs the purchase order. Oracle tracks the shipment. The warehouse system records inventory movement. Each performs its function, but when tariffs change and companies need to model alternative sourcing scenarios across all three simultaneously, the data sits in silos.

    “What’s changed is the speed at which disruptions cascade,” says Manik Sharma, Head of Supply Chain GTM AI at Celonis. “Traditional ERP systems weren’t built for today’s volatility.”

    Companies generate thousands of reports showing what happened last quarter. They struggle to answer what happens if tariffs increase 25% tomorrow and need to switch suppliers within days.

    Tariffs: The 48-hour scramble

    Global trade volatility has transformed tariffs from predictable costs into strategic weapons. When new rates drop with unprecedented frequency, input costs spike across suppliers, finance teams scramble to calculate margin impact, and procurement races to identify alternatives buried in disconnected systems where no one knows if switching suppliers delays shipments or violates contracts.

    By hour 48, competitors who already modeled scenarios execute supplier switches while late movers face capacity constraints and premium pricing.

    Process intelligence changes that dynamic by allowing businesses to continuously model “what-if” scenarios, showing leaders how tariff changes cascade through suppliers, contracts, production lines, warehouses, and customers. When rates hit, companies can move within hours instead of days.

    No AI without PI: Why process intelligence is non-negotiable for supply chains

    AI and supply chains are mutually dependent: AI needs operational context, and supply chains need AI to keep pace with volatility. But here's the truth — there is no AI without PI. Without process intelligence, AI agents operate blindly.

    The ongoing SAP migration wave illustrates why. An estimated 85–90% of SAP customers are still moving from ECC to S/4HANA. Moving to newer databases doesn’t solve supply chain visibility — it provides faster access to the same fragmented data.

    Kerry Brown, a transformation evangelist at Celonis, sees this across industries.

    “Organizations are shifting from PeopleSoft to Oracle, or EBS to Fusion. The bulk is in SAP,” she explains. “But what they really need isn’t a new ERP. They need to understand how work actually flows across systems they already have.”

    That requires end-to-end operational context. Process intelligence provides this by enabling companies to extract and connect event data across systems, showing how processes execute in real time.

    This distinction becomes critical when deploying autonomous agents. When visibility is fragmented, autonomous agents can easily make decisions that appear rational locally but create downstream disruption. With real-time context, AI can operate with clarity and precision, and supply chains can stay ahead of tariff-driven disruption.

    Digital Twins: Powering real-time response

    The companies highlighted at Celosphere all applied the same principle: understand how processes run across systems in real time. Celonis PI creates a digital twin above existing systems, using its Process Intelligence Graph to link orders, shipments, invoices, and payments end-to-end. Dependencies that traditional integrations miss become visible. A delay in SAP instantly reveals its impact across Oracle, warehouse scheduling, and customer delivery commitments.

    “The platform brings together process data spanning systems and departments, enriched with business context that powers AI agents to transform operations effectively,” says Daniel Brown, Chief Product Officer at Celonis.

    With this cross-system awareness, Celonis coordinates actions across complex workflows involving AI agents, humans, and automations — especially critical when tariffs force rapid decisions about suppliers, shipments, and customers.

    Zero-copy integration enables instant modeling

    A key advancement unveiled at Celosphere — zero-copy integration with Databricks — removes another barrier. Traditionally, analyzing supply chain data meant copying from source systems into central warehouses, creating data latency.

    Celonis Data Core now integrates directly with platforms like Databricks and Microsoft Fabric, querying billions of records in near real time without duplication. When trade policy shifts, companies model alternatives instantly, not after overnight data refresh cycles.

    Enhanced Task Mining extends this by connecting desktop activity — keystrokes, mouse clicks, screen scrolls — to business processes. This exposes manual work invisible to system logs: spreadsheet gymnastics, email negotiations, phone calls that keep supply chains moving during urgent changes.

    Competitive advantage in volatile markets

    Most companies can’t rip out and replace systems running critical operations — nor should they. Process intelligence offers a different path: compose workflows from existing systems, deploy AI where it creates value, and adapt continuously as conditions change. This “Free the Process” movement liberates companies from rigid architectures without forcing wholesale replacement.

    As global trade volatility intensifies, the companies that model will move faster, make smarter decisions, and turn tariff chaos into competitive advantage — all while existing ERPs keep running.

    When the next wave of tariffs hits — and it will — companies won’t have days to respond. They’ll have hours. The question isn’t whether your ERP captures the data. It’s whether your systems connect the dots fast enough to matter.

    Missed Celosphere 2025? Catch up with all the highlights here.


    Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

  • Mistral AI, Europe's most prominent artificial intelligence startup, is releasing its most ambitious product suite to date: a family of 10 open-source models designed to run everywhere from smartphones and autonomous drones to enterprise cloud systems, marking a major escalation in the company's challenge to both U.S. tech giants and surging Chinese competitors.

    The Mistral 3 family, launching today, includes a new flagship model called Mistral Large 3 and a suite of smaller "Ministral 3" models optimized for edge computing applications. All models will be released under the permissive Apache 2.0 license, allowing unrestricted commercial use — a sharp contrast to the closed systems offered by OpenAI, Google, and Anthropic.

    The release is a pointed bet by Mistral that the future of artificial intelligence lies not in building ever-larger proprietary systems, but in offering businesses maximum flexibility to customize and deploy AI tailored to their specific needs, often using smaller models that can run without cloud connectivity.

    "The gap between closed and open source is getting smaller, because more and more people are contributing to open source, which is great," Guillaume Lample, Mistral's chief scientist and co-founder, said in an exclusive interview with VentureBeat. "We are catching up fast."

    Why Mistral is choosing flexibility over frontier performance in the AI race

    The strategic calculus behind Mistral 3 diverges sharply from recent model releases by industry leaders. While OpenAI, Google, and Anthropic have focused recent launches on increasingly capable "agentic" systems — AI that can autonomously execute complex multi-step tasks — Mistral is prioritizing breadth, efficiency, and what Lample calls "distributed intelligence."

    Mistral Large 3, the flagship model, employs a Mixture of Experts architecture with 41 billion active parameters drawn from a total pool of 675 billion parameters. The model can process both text and images, handles context windows up to 256,000 tokens, and was trained with particular emphasis on non-English languages — a rarity among frontier AI systems.

    "Most AI labs focus on their native language, but Mistral Large 3 was trained on a wide variety of languages, making advanced AI useful for billions who speak different native languages," the company said in a statement reviewed ahead of the announcement.

    But the more significant departure lies in the Ministral 3 lineup: nine compact models across three sizes (14 billion, 8 billion, and 3 billion parameters) and three variants tailored for different use cases. Each variant serves a distinct purpose: base models for extensive customization, instruction-tuned models for general chat and task completion, and reasoning-optimized models for complex logic requiring step-by-step deliberation.

    The smallest Ministral 3 models can run on devices with as little as 4 gigabytes of video memory using 4-bit quantization — making frontier AI capabilities accessible on standard laptops, smartphones, and embedded systems without requiring expensive cloud infrastructure or even internet connectivity. This approach reflects Mistral's belief that AI's next evolution will be defined not by sheer scale, but by ubiquity: models small enough to run on drones, in vehicles, in robots, and on consumer devices.

    How fine-tuned small models beat expensive large models for enterprise customers

    Lample's comments reveal a business model fundamentally different from that of closed-source competitors. Rather than competing primarily on benchmark performance, Mistral is targeting enterprise customers frustrated by the cost and inflexibility of proprietary systems.

    "Sometimes customers say, 'Is there a use case where the best closed-source model isn't working?' If that's the case, then they're essentially stuck," Lample explained. "There's nothing they can do. It's the best model available, and it's not working out of the box."

    This is where Mistral's approach diverges. When a generic model fails, the company deploys engineering teams to work directly with customers, analyzing specific problems, creating synthetic training data, and fine-tuning smaller models to outperform larger general-purpose systems on narrow tasks.

    "In more than 90% of cases, a small model can do the job, especially if it's fine-tuned. It doesn't have to be a model with hundreds of billions of parameters, just a 14-billion or 24-billion parameter model," Lample said. "So it's not only much cheaper, but also faster, plus you have all the benefits: you don't need to worry about privacy, latency, reliability, and so on."

    The economic argument is compelling. Multiple enterprise customers have approached Mistral after building prototypes with expensive closed-source models, only to find deployment costs prohibitive at scale, according to Lample.

    "They come back to us a couple of months later because they realize, 'We built this prototype, but it's way too slow and way too expensive,'" he said.

    Where Mistral 3 fits in the increasingly crowded open-source AI market

    Mistral's release comes amid fierce competition on multiple fronts. OpenAI recently released GPT-5.1 with enhanced agentic capabilities. Google launched Gemini 3 with improved multimodal understanding. Anthropic released Opus 4.5 on the same day as this interview, with similar agent-focused features.

    But Lample argues those comparisons miss the point. "It's a little bit behind. But I think what matters is that we are catching up fast," he acknowledged regarding performance against closed models. "I think we are maybe playing a strategic long game."

    That long game involves a different competitive set: primarily open-source models from Chinese companies like DeepSeek and Alibaba's Qwen series, which have made remarkable strides in recent months.

    Mistral differentiates itself through multilingual capabilities that extend far beyond English or Chinese, multimodal integration handling both text and images in a unified model, and what the company characterizes as superior customization through easier fine-tuning.

    "One key difference with the models themselves is that we focused much more on multilinguality," Lample said. "If you look at all the top models from [Chinese competitors], they're all text-only. They have visual models as well, but as separate systems. We wanted to integrate everything into a single model."

    The multilingual emphasis aligns with Mistral's broader positioning as a European AI champion focused on digital sovereignty — the principle that organizations and nations should maintain control over their AI infrastructure and data.

    Building beyond models: Mistral's full-stack enterprise AI platform strategy

    Mistral 3's release builds on an increasingly comprehensive enterprise AI platform that extends well beyond model development. The company has assembled a full-stack offering that differentiates it from pure model providers.

    Recent product launches include Mistral Agents API, which combines language models with built-in connectors for code execution, web search, image generation, and persistent memory across conversations; Magistral, the company's reasoning model designed for domain-specific, transparent, and multilingual reasoning; and Mistral Code, an AI-powered coding assistant bundling models, an in-IDE assistant, and local deployment options with enterprise tooling.

    The consumer-facing Le Chat assistant has been enhanced with Deep Research mode for structured research reports, voice capabilities, and Projects for organizing conversations into context-rich folders. More recently, Le Chat gained a connector directory with 20+ enterprise integrations powered by the Model Context Protocol (MCP), spanning tools like Databricks, Snowflake, GitHub, Atlassian, Asana, and Stripe.

    In October, Mistral unveiled AI Studio, a production AI platform providing observability, agent runtime, and AI registry capabilities to help enterprises track output changes, monitor usage, run evaluations, and fine-tune models using proprietary data.

    Mistral now positions itself as a full-stack, global enterprise AI company, offering not just models but an application-building layer through AI Studio, compute infrastructure, and forward-deployed engineers to help businesses realize return on investment.

    Why open source AI matters for customization, transparency and sovereignty

    Mistral's commitment to open-source development under permissive licenses is both an ideological stance and a competitive strategy in an AI landscape increasingly dominated by closed systems.

    Lample elaborated on the practical benefits: "I think something that people don't realize — but our customers know this very well — is how much better any model can actually improve if you fine tune it on the task of interest. There's a huge gap between a base model and one that's fine-tuned for a specific task, and in many cases, it outperforms the closed-source model."

    The approach enables capabilities impossible with closed systems: organizations can fine-tune models on proprietary data that never leaves their infrastructure, customize architectures for specific workflows, and maintain complete transparency into how AI systems make decisions — critical for regulated industries like finance, healthcare, and defense.

    This positioning has attracted government and public sector partnerships. The company launched "AI for Citizens" in July 2025, an initiative to "help States and public institutions strategically harness AI for their people by transforming public services" and has secured strategic partnerships with France's army and job agency, Luxembourg's government, and various European public sector organizations.

    Mistral's transatlantic AI collaboration goes beyond European borders

    While Mistral is frequently characterized as Europe's answer to OpenAI, the company views itself as a transatlantic collaboration rather than a purely European venture. The company has teams across both continents, with co-founders spending significant time with customers and partners in the United States, and these models are being trained in partnerships with U.S.-based teams and infrastructure providers.

    This transatlantic positioning may prove strategically important as geopolitical tensions around AI development intensify. The recent ASML investment, a €1.7 billion ($1.5 billion) funding round led by the Dutch semiconductor equipment manufacturer, signals deepening collaboration across the Western semiconductor and AI value chain at a moment when both Europe and the United States are seeking to reduce dependence on Chinese technology.

    Mistral's investor base reflects this dynamic: the Series C round included participation from U.S. firms Andreessen Horowitz, General Catalyst, Lightspeed, and Index Ventures alongside European investors like France's state-backed Bpifrance and global players like DST Global and Nvidia.

    Founded in May 2023 by former Google DeepMind and Meta researchers, Mistral has raised roughly $1.05 billion (€1 billion) in funding. The company was valued at $6 billion in a June 2024 Series B, then more than doubled its valuation in a September Series C.

    Can customization and efficiency beat raw performance in enterprise AI?

    The Mistral 3 release crystallizes a fundamental question facing the AI industry: Will enterprises ultimately prioritize the absolute cutting-edge capabilities of proprietary systems, or will they choose open, customizable alternatives that offer greater control, lower costs, and independence from big tech platforms?

    Mistral's answer is unambiguous. The company is betting that as AI moves from prototype to production, the factors that matter most shift dramatically. Raw benchmark scores matter less than total cost of ownership. Slight performance edges matter less than the ability to fine-tune for specific workflows. Cloud-based convenience matters less than data sovereignty and edge deployment.

    It's a wager with significant risks. Despite Lample's optimism about closing the performance gap, Mistral's models still trail the absolute frontier. The company's revenue, while growing, reportedly remains modest relative to its nearly $14 billion valuation. And competition intensifies from both well-funded Chinese rivals making remarkable open-source progress and U.S. tech giants increasingly offering their own smaller, more efficient models.

    But if Mistral is right — if the future of AI looks less like a handful of cloud-based oracles and more like millions of specialized systems running everywhere from factory floors to smartphones — then the company has positioned itself at the center of that transformation.

    The release of Mistral 3 is the most comprehensive expression yet of that vision: 10 models, spanning every size category, optimized for every deployment scenario, available to anyone who wants to build with them.

    Whether "distributed intelligence" becomes the industry's dominant paradigm or remains a compelling alternative serving a narrower market will determine not just Mistral's fate, but the broader question of who controls the AI future — and whether that future will be open.

    For now, the race is on. And Mistral is betting it can win not by building the biggest model, but by building everywhere else.

  • Amazon Web Services on Tuesday announced a new class of artificial intelligence systems called "frontier agents" that can work autonomously for hours or even days without human intervention, representing one of the most ambitious attempts yet to automate the full software development lifecycle.

    The announcement, made during AWS CEO Matt Garman's keynote address at the company's annual re:Invent conference, introduces three specialized AI agents designed to act as virtual team members: Kiro autonomous agent for software development, AWS Security Agent for application security, and AWS DevOps Agent for IT operations.

    The move signals Amazon's intent to leap ahead in the intensifying competition to build AI systems capable of performing complex, multi-step tasks that currently require teams of skilled engineers.

    "We see frontier agents as a completely new class of agents," said Deepak Singh, vice president of developer agents and experiences at Amazon, in an interview ahead of the announcement. "They're fundamentally designed to work for hours and days. You're not giving them a problem that you want finished in the next five minutes. You're giving them complex challenges that they may have to think about, try different solutions, and get to the right conclusion — and they should do that without intervention."

    Why Amazon believes its new agents leave existing AI coding tools behind

    The frontier agents differ from existing AI coding assistants like GitHub Copilot or Amazon's own CodeWhisperer in several fundamental ways.

    Current AI coding tools, while powerful, require engineers to drive every interaction. Developers must write prompts, provide context, and manually coordinate work across different code repositories. When switching between tasks, the AI loses context and must start fresh.

    The new frontier agents, by contrast, maintain persistent memory across sessions and continuously learn from an organization's codebase, documentation, and team communications. They can independently determine which code repositories require changes, work on multiple files simultaneously, and coordinate complex transformations spanning dozens of microservices.

    "With a current agent, you would go microservice by microservice, making changes one at a time, and each change would be a different session with no shared context," Singh explained. "With a frontier agent, you say, 'I need to solve this broad problem.' You point it to the right application, and it decides which repos need changes."

    The agents exhibit three defining characteristics that AWS believes set them apart: autonomy in decision-making, the ability to scale by spawning multiple agents to work on different aspects of a problem simultaneously, and the capacity to operate independently for extended periods.

    "A frontier agent can decide to spin up 10 versions of itself, all working on different parts of the problem at once," Singh said.

    How each of the three frontier agents tackles a different phase of development

    Kiro autonomous agent serves as a virtual developer that maintains context across coding sessions and learns from an organization's pull requests, code reviews, and technical discussions. Teams can connect it to GitHub, Jira, Slack, and internal documentation systems. The agent then acts like a teammate, accepting task assignments and working independently until it either completes the work or requires human guidance.

    AWS Security Agent embeds security expertise throughout the development process, automatically reviewing design documents and scanning pull requests against organizational security requirements. Perhaps most significantly, it transforms penetration testing from a weeks-long manual process into an on-demand capability that completes in hours.

    SmugMug, a photo hosting platform, has already deployed the security agent. "AWS Security Agent helped catch a business logic bug that no existing tools would have caught, exposing information improperly," said Andres Ruiz, staff software engineer at the company. "To any other tool, this would have been invisible. But the ability for Security Agent to contextualize the information, parse the API response, and find the unexpected information there represents a leap forward in automated security testing."

    AWS DevOps Agent functions as an always-on operations team member, responding instantly to incidents and using its accumulated knowledge to identify root causes. It connects to observability tools including Amazon CloudWatch, Datadog, Dynatrace, New Relic, and Splunk, along with runbooks and deployment pipelines.

    Commonwealth Bank of Australia tested the DevOps agent by replicating a complex network and identity management issue that typically requires hours for experienced engineers to diagnose. The agent identified the root cause in under 15 minutes.

    "AWS DevOps Agent thinks and acts like a seasoned DevOps engineer, helping our engineers build a banking infrastructure that's faster, more resilient, and designed to deliver better experiences for our customers," said Jason Sandry, head of cloud services at Commonwealth Bank.

    Amazon makes its case against Google and Microsoft in the AI coding wars

    The announcement arrives amid a fierce battle among technology giants to dominate the emerging market for AI-powered development tools. Google has made significant noise in recent weeks with its own AI coding capabilities, while Microsoft continues to advance GitHub Copilot and its broader AI development toolkit.

    Singh argued that AWS holds distinct advantages rooted in the company's 20-year history operating cloud infrastructure and Amazon's own massive software engineering organization.

    "AWS has been the cloud of choice for 20 years, so we have two decades of knowledge building and running it, and working with customers who've been building and running applications on it," Singh said. "The learnings from operating AWS, the knowledge our customers have, the experience we've built using these tools ourselves every day to build real-world applications—all of that is embodied in these frontier agents."

    He drew a distinction between tools suitable for prototypes versus production systems. "There's a lot of things out there that you can use to build your prototype or your toy application. But if you want to build production applications, there's a lot of knowledge that we bring in as AWS that apply here."

    The safeguards Amazon built to keep autonomous agents from going rogue

    The prospect of AI systems operating autonomously for days raises immediate questions about what happens when they go off track. Singh described multiple safeguards built into the system.

    All learnings accumulated by the agents are logged and visible, allowing engineers to understand what knowledge influences the agent's decisions. Teams can even remove specific learnings if they discover the agent has absorbed incorrect information from team communications.

    "You can go in and even redact that from its knowledge like, 'No, we don't want you to ever use this knowledge,'" Singh said. "You can look at the knowledge like it's almost—it's like looking at your neurons inside your brain. You can disconnect some."

    Engineers can also monitor agent activity in real-time and intervene when necessary, either redirecting the agent or taking over entirely. Most critically, the agents never commit code directly to production systems. That responsibility remains with human engineers.

    "These agents are never going to check the code into production. That is still the human's responsibility," Singh emphasized. "You are still, as an engineer, responsible for the code you're checking in, whether it's generated by you or by an agent working autonomously."

    What frontier agents mean for the future of software engineering jobs

    The announcement inevitably raises concerns about the impact on software engineering jobs. Singh pushed back against the notion that frontier agents will replace developers, framing them instead as tools that amplify human capabilities.

    "Software engineering is craft. What's changing is not, 'Hey, agents are doing all the work.' The craft of software engineering is changing—how you use agents, how do you set up your code base, how do you set up your prompts, how do you set up your rules, how do you set up your knowledge bases so that agents can be effective," he said.

    Singh noted that senior engineers who had drifted away from hands-on coding are now writing more code than ever. "It's actually easier for them to become software engineers," he said.

    He pointed to an internal example where a team completed a project in 78 days that would have taken 18 months using traditional practices. "Because they were able to use AI. And the thing that made it work was not just the fact that they were using AI, but how they organized and set up their practices of how they built that software were maximized around that."

    How Amazon plans to make AI-generated code more trustworthy over time

    Singh outlined several areas where frontier agents will evolve over the coming years. Multi-agent architectures, where systems of specialized agents coordinate to solve complex problems, represent a major frontier. So does the integration of formal verification techniques to increase confidence in AI-generated code.

    AWS recently introduced property-based testing in Kiro, which uses automated reasoning to extract testable properties from specifications and generate thousands of test scenarios automatically.

    "If you have a shopping cart application, every way an order can be canceled, and how it might be canceled, and the way refunds are handled in Germany versus the US—if you're writing a unit test, maybe two, Germany and US, but now, because you have this property-based testing approach, your agent can create a scenario for every country you operate in and test all of them automatically for you," Singh explained.

    Building trust in autonomous systems remains the central challenge. "Right now you still require tons of human guardrails at every step to make sure that the right thing happens. And as we get better at these techniques, you will use less and less, and you'll be able to trust the agents a lot more," he said.

    Amazon's bigger bet on autonomous AI stretches far beyond writing code

    The frontier agents announcement arrived alongside a cascade of other news at re:Invent 2025. AWS kicked off the conference with major announcements on agentic AI capabilities, customer service innovations, and multicloud networking.

    Amazon expanded its Nova portfolio with four new models delivering industry-leading price-performance across reasoning, multimodal processing, conversational AI, code generation, and agentic tasks. Nova Forge pioneers "open training," giving organizations access to pre-trained model checkpoints and the ability to blend proprietary data with Amazon Nova-curated datasets.

    AWS also added 18 new open weight models to Amazon Bedrock, reinforcing its commitment to offering a broad selection of fully managed models from leading AI providers. The launch includes new models from Mistral AI, Google's Gemma 3, MiniMax's M2, NVIDIA's Nemotron, and OpenAI's GPT OSS Safeguard.

    On the infrastructure side, Amazon EC2 Trn3 UltraServers, powered by AWS's first 3nm AI chip, pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4x more compute performance and 4x greater energy efficiency than the previous generation. AWS AI Factories provides enterprises and government organizations with dedicated AWS AI infrastructure deployed in their own data centers, combining NVIDIA GPUs, Trainium chips, AWS networking, and AI services like Amazon Bedrock and SageMaker AI.

    All three frontier agents launched in preview on Tuesday. Pricing will be announced when the services reach general availability.

    Singh made clear the company sees applications far beyond coding. "These are the first frontier agents we are releasing, and they're in the software development lifecycle," he said. "The problems and use cases for frontier agents—these agents that are long running, capable of autonomy, thinking, always learning and improving—can be applied to many, many domains."

    Amazon, after all, operates satellite networks, runs robotics warehouses, and manages one of the world's largest e-commerce platforms. If autonomous agents can learn to write code on their own, the company is betting they can eventually learn to do just about anything else.

  • While artificial intelligence has stormed into law firms and accounting practices with billion-dollar startups like Harvey leading the charge, the global consulting industry—a $250 billion behemoth—has remained stubbornly analog. A London-based startup founded by former McKinsey consultants is betting $2 million that it can crack open this resistant market, one Excel spreadsheet at a time.

    Ascentra Labs announced Tuesday that it has closed a $2 million seed round led by NAP, a Berlin-based venture capital firm formerly known as Cavalry Ventures. The funding comes with participation from notable founder-angels including Alan Chang, chief executive of Fuse and former chief revenue officer at Revolut, and Fredrik Hjelm, chief executive of European e-scooter company Voi.

    The investment is modest by the standards of enterprise AI — a sector that has seen funding rounds routinely reach into the hundreds of millions. But Ascentra's founders argue that their focused approach to a narrow but painful problem could give them an edge in a market where broad AI solutions have repeatedly failed to gain traction.

    Consultants spend countless hours on Excel survey analysis that even top firms haven't automated

    Paritosh Devbhandari, Ascentra's co-founder and chief executive, spent years at McKinsey & Company, including a stint at QuantumBlack, the firm's AI and advanced analytics division. He knows intimately the late nights consultants spend wrestling with survey data—the kind of quantitative research that forms the backbone of private equity due diligence.

    "Before starting the company, I was working at McKinsey, specifically on the private equity team," Devbhandari explained in an exclusive interview with VentureBeat. The work, he said, involves analyzing encoded survey responses from customers, suppliers, and market participants during potential acquisitions.

    "Consultants typically spend a lot of time doing this in Excel," he said. "One of the things that surprised me, having worked at a couple of different places, is that the workflow — even at the best firms — really isn't that different from some of the boutiques. I always expected there would be some smarter way of doing things, and often there just isn't."

    That gap between expectation and reality became the foundation for Ascentra. The company's platform ingests raw survey data files and outputs formatted Excel workbooks complete with traceable formulas — the kind of deliverable a junior associate would spend hours constructing manually.

    AI has transformed legal work but consulting presents unique technical challenges that have blocked adoption

    The disparity between AI adoption in law versus consulting raises an obvious question: if the consulting market is so large and the workflows so manual, why hasn't venture capital flooded the space the way it has legal tech?

    Devbhandari offered a frank assessment. "It's not like people haven't tried," he said. "The top of the funnel in our space is crowded. When we speak to our consulting clients, the partners say they get another pitch deck in their LinkedIn inbox or email every week—sometimes several. There are plenty of people trying."

    The barriers, he argued, are structural. Professional services firms move slowly on technology adoption, demanding extensive security credentials and customer references before granting even a pilot opportunity. "I think that's where 90% of startups in professional services, writ large, fall down," he said.

    But consulting presents unique technical challenges beyond the sales cycle. Unlike legal work, which largely involves text documents that modern large language models handle well, consulting spans multiple data modalities — PowerPoint presentations, Excel spreadsheets, Word documents — with information that can be tabular, graphical, or textual.

    "You can have multiple formats of Excel in itself," Devbhandari noted. "And that's a big contrast to the legal space, where you could have a multi-purpose AI agent, or collection of agents, which can actually do a lot of the tasks that lawyers do day to day. Consulting is the opposite of that."

    Ascentra's private equity focus reflects a calculated bet on repeatable workflows

    Ascentra's strategy hinges on extreme specificity. Rather than attempting to automate the full spectrum of consulting work, the company focuses exclusively on survey analysis within private equity due diligence — a niche within a niche.

    The logic is both technical and commercial. Private equity work tends to be more standardized than other consulting engagements, with similar analyses recurring across deals. That repeatability makes automation feasible. It also positions Ascentra against a less formidable competitive set: even the largest consulting firms, Devbhandari claimed, lack dedicated internal tools for this particular workflow.

    "Survey analysis automation is so specific that even the biggest and best firms haven't developed anything in-house for it," he said.

    The company claims that three of the world's top five consulting firms now use its platform, with early adopters reporting time savings of 60 to 80 percent on active due diligence projects. But there's a notable caveat: Ascentra cannot publicly name any of these clients.

    "It's a very private industry, so at the moment, we can't announce any clients publicly," Devbhandari acknowledged. "What I can say is that we're working with three of the top five consulting firms. We've passed pilots at multiple organizations and have submitted business cases for enterprise rollouts."

    Eliminating AI hallucinations becomes critical when billion-dollar deals hang in the balance

    For an AI company selling into quantitative workflows, accuracy is existential. Consultants delivering analysis to private equity clients face enormous pressure to be precise—a single error in a financial model can undermine credibility and, potentially, billion-dollar investment decisions.

    Devbhandari described this as Ascentra's central design challenge. "Consultants require a very, very high degree of fidelity when they're doing their analysis," he said. "So with quantitative data, even if it's 95% accurate, they will revert to Excel because they know it, they trust it, and they don't want there to be any margin for error."

    Ascentra's technical approach attempts to address this by limiting where AI models operate within the workflow. The company uses GPT-based models from OpenAI to interpret and ingest incoming data, but the actual analysis relies on deterministic Python scripts that produce consistent, verifiable outputs.

    "What's different is the steps that follow are deterministic," Devbhandari explained. "There's no room for error. There's no hallucinations, and the Excel writer that we've connected to the product on the back end converts this analysis into Excel formula, which are live and traceable, so consultants can get that assurance that they can follow along with the maths."

    Whether this hybrid approach delivers on its promise of eliminating hallucinations while maintaining useful AI capabilities will be tested as the platform scales across more complex use cases and client environments.

    Enterprise security certifications give Ascentra an edge over less prepared competitors

    Selling software to major consulting firms requires clearing an unusually high security bar. These organizations handle sensitive client data across industries, and their vendor security assessments can take months to complete.

    Ascentra invested early in obtaining enterprise-grade certifications, a strategic choice that Devbhandari framed as essential table stakes. The company has achieved SOC 2 Type II and ISO 27001 certifications and claims to be under audit for ISO 42001, an emerging standard for AI management systems.

    Data handling policies also reflect the sensitivity of the target market. Client data is deleted within 30 to 45 days, depending on contractual terms, and Ascentra does not use customer data to train its models.

    There's also an argument that survey data carries somewhat lower sensitivity than other consulting materials. "Survey data is unique in consulting data because it's collected during the course of a project, and it is market data," Devbhandari noted. "You interview people in the market, and you collect a bunch of data in an Excel, as opposed to—you look at Rogo or some of the other finance AI startups—they use client data, so financials, which is confidential and strictly non-public."

    Per-project pricing aligns with how consulting firms actually spend money

    Ascentra's pricing model departs from the subscription-based approach that dominates enterprise software. The company charges on a per-project basis, a structure Devbhandari said aligns with how consulting firms allocate budgets.

    "Project budgets are in consulting set on a per project basis," he explained. "You'll have central budgets which are for things like Microsoft, right, very central things that every team will use all of the time. And then you have project budgets which are for the teams that are using specific resources, teams or products nowadays."

    This approach may ease initial adoption by avoiding the need for central IT procurement approval, but it also introduces revenue unpredictability. The company's success will depend on converting project-level usage into broader enterprise relationships—a path Devbhandari suggested is already underway through submitted business cases for enterprise rollouts.

    AI may not eliminate consulting jobs, but it will fundamentally transform what consultants do

    Perhaps the most interesting tension in Devbhandari's vision concerns what AI ultimately means for consulting employment. He pushed back on predictions that AI will eliminate consulting jobs while simultaneously describing an industry on the cusp of fundamental transformation.

    "People love to talk about how AI is going to remove the need for consultants, and I disagree," he said. "Yes, the role will change, but I don't think the industry goes away. I think the best solutions will come from people within the industry building products around the work they know."

    Yet he also painted a picture of dramatic change. "At the moment, you have a big intake of graduates who just do—for the most part, you know, they have the strategic work as part of what they do, but they also have a lot of work in Excel and PowerPoint. I think in a few years' time, we'll look back at these times and think, you know, very, very different."

    The honest answer, he acknowledged, is that no one truly knows how this plays out. "I don't think even AI leaders truly know what that looks like yet," he said of whether productivity gains will translate to more work or fewer workers.

    Ascentra plans to use seed funding to expand its U.S. presence and go-to-market team

    The $2 million will primarily fund Ascentra's expansion into the United States, where more than 80 percent of its customers are already based. Devbhandari plans to relocate there personally as the company builds out go-to-market capabilities.

    "One of the things that we've really noticed is that with consulting being an American industry, and I think America being a great place for innovation and trying new things, we've definitely drawn ourselves to the U.S.," he said. "American hires are very expensive, and I'm sure that a lot of the raise will go towards that."

    The seed round represents a bet by NAP on what its co-founder Stefan Walter called an overdue disruption. "While most knowledge work has been reshaped by new technology, consulting has remained stubbornly manual," Walter said. "AI won't replace consultants, but consultants using Ascentra might."

    The startup now faces the hard work of converting pilot wins into lasting enterprise contracts

    Ascentra enters 2026 with momentum but no guarantee of success. The company must transform pilot programs at elite firms into sticky enterprise contracts — all while fending off the inevitable well-funded competitors who will flood into the space once the opportunity becomes undeniable. Its deliberately narrow focus on survey analysis provides a defensible beachhead, but expanding into adjacent workflows will require building entirely new products without sacrificing the domain expertise that Devbhandari argues is the company's core advantage.

    Oliver Thurston, Ascentra's co-founder and chief technology officer, who previously led machine learning at Mathison AI, offered a clear-eyed assessment of the challenge. "Consulting workflows are uniquely complex and difficult to build products around," he said in a statement. "It's not surprising the space hasn't changed yet. This will change though, and there's no doubt that the industry is going to look completely different in five years' time."

    For now, Ascentra is placing a focused wager: that the consultants who once spent their nights formatting spreadsheets will be the ones who finally bring AI into an industry that has long resisted it. The irony is hard to miss. After years of advising Fortune 500 companies on digital transformation, consulting may finally have to take its own medicine.

  • A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.

    OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

    That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

    "Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

    The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

    Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

    Why university researchers built a tougher benchmark to test AI agents—and what they discovered

    The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

    Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

    The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

    When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

    "It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."

    The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.

    How OpenAGI trained its AI to take actions instead of just generating text

    OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.

    Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.

    Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.

    "The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."

    This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.

    OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

    Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

    A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

    Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.

    OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.

    The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

    "We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

    The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

    What happens when you ask an AI agent to copy your bank details

    Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.

    OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

    In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.

    Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

    The MIT researcher who built two of GitHub's most downloaded AI models

    Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

    He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

    Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

    His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

    Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.

    Inside the billion-dollar race to build AI that controls your computer

    The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

    OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

    Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

    OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.

    Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

    But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures—that a small team with the right ideas can outmaneuver the giants.

    The technology industry has seen that story before. It rarely stays true for long.