Amazon Web Services on Tuesday announced a new class of artificial intelligence systems called "frontier agents" that can work autonomously for hours or even days without human intervention, representing one of the most ambitious attempts yet to automate the full software development lifecycle.
The announcement, made during AWS CEO Matt Garman's keynote address at the company's annual re:Invent conference, introduces three specialized AI agents designed to act as virtual team members: Kiro autonomous agent for software development, AWS Security Agent for application security, and AWS DevOps Agent for IT operations.
The move signals Amazon's intent to leap ahead in the intensifying competition to build AI systems capable of performing complex, multi-step tasks that currently require teams of skilled engineers.
"We see frontier agents as a completely new class of agents," said Deepak Singh, vice president of developer agents and experiences at Amazon, in an interview ahead of the announcement. "They're fundamentally designed to work for hours and days. You're not giving them a problem that you want finished in the next five minutes. You're giving them complex challenges that they may have to think about, try different solutions, and get to the right conclusion — and they should do that without intervention."
Why Amazon believes its new agents leave existing AI coding tools behind
The frontier agents differ from existing AI coding assistants like GitHub Copilot or Amazon's own CodeWhisperer in several fundamental ways.
Current AI coding tools, while powerful, require engineers to drive every interaction. Developers must write prompts, provide context, and manually coordinate work across different code repositories. When switching between tasks, the AI loses context and must start fresh.
The new frontier agents, by contrast, maintain persistent memory across sessions and continuously learn from an organization's codebase, documentation, and team communications. They can independently determine which code repositories require changes, work on multiple files simultaneously, and coordinate complex transformations spanning dozens of microservices.
"With a current agent, you would go microservice by microservice, making changes one at a time, and each change would be a different session with no shared context," Singh explained. "With a frontier agent, you say, 'I need to solve this broad problem.' You point it to the right application, and it decides which repos need changes."
The agents exhibit three defining characteristics that AWS believes set them apart: autonomy in decision-making, the ability to scale by spawning multiple agents to work on different aspects of a problem simultaneously, and the capacity to operate independently for extended periods.
"A frontier agent can decide to spin up 10 versions of itself, all working on different parts of the problem at once," Singh said.
How each of the three frontier agents tackles a different phase of development
Kiro autonomous agent serves as a virtual developer that maintains context across coding sessions and learns from an organization's pull requests, code reviews, and technical discussions. Teams can connect it to GitHub, Jira, Slack, and internal documentation systems. The agent then acts like a teammate, accepting task assignments and working independently until it either completes the work or requires human guidance.
AWS Security Agent embeds security expertise throughout the development process, automatically reviewing design documents and scanning pull requests against organizational security requirements. Perhaps most significantly, it transforms penetration testing from a weeks-long manual process into an on-demand capability that completes in hours.
SmugMug, a photo hosting platform, has already deployed the security agent. "AWS Security Agent helped catch a business logic bug that no existing tools would have caught, exposing information improperly," said Andres Ruiz, staff software engineer at the company. "To any other tool, this would have been invisible. But the ability for Security Agent to contextualize the information, parse the API response, and find the unexpected information there represents a leap forward in automated security testing."
AWS DevOps Agent functions as an always-on operations team member, responding instantly to incidents and using its accumulated knowledge to identify root causes. It connects to observability tools including Amazon CloudWatch, Datadog, Dynatrace, New Relic, and Splunk, along with runbooks and deployment pipelines.
Commonwealth Bank of Australia tested the DevOps agent by replicating a complex network and identity management issue that typically requires hours for experienced engineers to diagnose. The agent identified the root cause in under 15 minutes.
"AWS DevOps Agent thinks and acts like a seasoned DevOps engineer, helping our engineers build a banking infrastructure that's faster, more resilient, and designed to deliver better experiences for our customers," said Jason Sandry, head of cloud services at Commonwealth Bank.
Amazon makes its case against Google and Microsoft in the AI coding wars
The announcement arrives amid a fierce battle among technology giants to dominate the emerging market for AI-powered development tools. Google has made significant noise in recent weeks with its own AI coding capabilities, while Microsoft continues to advance GitHub Copilot and its broader AI development toolkit.
Singh argued that AWS holds distinct advantages rooted in the company's 20-year history operating cloud infrastructure and Amazon's own massive software engineering organization.
"AWS has been the cloud of choice for 20 years, so we have two decades of knowledge building and running it, and working with customers who've been building and running applications on it," Singh said. "The learnings from operating AWS, the knowledge our customers have, the experience we've built using these tools ourselves every day to build real-world applications—all of that is embodied in these frontier agents."
He drew a distinction between tools suitable for prototypes versus production systems. "There's a lot of things out there that you can use to build your prototype or your toy application. But if you want to build production applications, there's a lot of knowledge that we bring in as AWS that applies here."
The safeguards Amazon built to keep autonomous agents from going rogue
The prospect of AI systems operating autonomously for days raises immediate questions about what happens when they go off track. Singh described multiple safeguards built into the system.
All learnings accumulated by the agents are logged and visible, allowing engineers to understand what knowledge influences the agent's decisions. Teams can even remove specific learnings if they discover the agent has absorbed incorrect information from team communications.
"You can go in and even redact that from its knowledge like, 'No, we don't want you to ever use this knowledge,'" Singh said. "You can look at the knowledge like it's almost—it's like looking at your neurons inside your brain. You can disconnect some."
Engineers can also monitor agent activity in real-time and intervene when necessary, either redirecting the agent or taking over entirely. Most critically, the agents never commit code directly to production systems. That responsibility remains with human engineers.
"These agents are never going to check the code into production. That is still the human's responsibility," Singh emphasized. "You are still, as an engineer, responsible for the code you're checking in, whether it's generated by you or by an agent working autonomously."
What frontier agents mean for the future of software engineering jobs
The announcement inevitably raises concerns about the impact on software engineering jobs. Singh pushed back against the notion that frontier agents will replace developers, framing them instead as tools that amplify human capabilities.
"Software engineering is craft. What's changing is not, 'Hey, agents are doing all the work.' The craft of software engineering is changing—how you use agents, how do you set up your code base, how do you set up your prompts, how do you set up your rules, how do you set up your knowledge bases so that agents can be effective," he said.
Singh noted that senior engineers who had drifted away from hands-on coding are now writing more code than ever. "It's actually easier for them to become software engineers," he said.
He pointed to an internal example where a team completed a project in 78 days that would have taken 18 months using traditional practices. "Because they were able to use AI. And the thing that made it work was not just the fact that they were using AI, but how they organized and set up their practices of how they built that software were maximized around that."
How Amazon plans to make AI-generated code more trustworthy over time
Singh outlined several areas where frontier agents will evolve over the coming years. Multi-agent architectures, where systems of specialized agents coordinate to solve complex problems, represent a major frontier. So does the integration of formal verification techniques to increase confidence in AI-generated code.
AWS recently introduced property-based testing in Kiro, which uses automated reasoning to extract testable properties from specifications and generate thousands of test scenarios automatically.
"If you have a shopping cart application, every way an order can be canceled, and how it might be canceled, and the way refunds are handled in Germany versus the US—if you're writing a unit test, maybe two, Germany and US, but now, because you have this property-based testing approach, your agent can create a scenario for every country you operate in and test all of them automatically for you," Singh explained.
Building trust in autonomous systems remains the central challenge. "Right now you still require tons of human guardrails at every step to make sure that the right thing happens. And as we get better at these techniques, you will use less and less, and you'll be able to trust the agents a lot more," he said.
Amazon's bigger bet on autonomous AI stretches far beyond writing code
The frontier agents announcement arrived alongside a cascade of other news at re:Invent 2025. AWS kicked off the conference with major announcements on agentic AI capabilities, customer service innovations, and multicloud networking.
Amazon expanded its Nova portfolio with four new models delivering industry-leading price-performance across reasoning, multimodal processing, conversational AI, code generation, and agentic tasks. Nova Forge pioneers "open training," giving organizations access to pre-trained model checkpoints and the ability to blend proprietary data with Amazon Nova-curated datasets.
AWS also added 18 new open weight models to Amazon Bedrock, reinforcing its commitment to offering a broad selection of fully managed models from leading AI providers. The launch includes new models from Mistral AI, Google's Gemma 3, MiniMax's M2, NVIDIA's Nemotron, and OpenAI's GPT OSS Safeguard.
On the infrastructure side, Amazon EC2 Trn3 UltraServers, powered by AWS's first 3nm AI chip, pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4x more compute performance and 4x greater energy efficiency than the previous generation. AWS AI Factories provides enterprises and government organizations with dedicated AWS AI infrastructure deployed in their own data centers, combining NVIDIA GPUs, Trainium chips, AWS networking, and AI services like Amazon Bedrock and SageMaker AI.
All three frontier agents launched in preview on Tuesday. Pricing will be announced when the services reach general availability.
Singh made clear the company sees applications far beyond coding. "These are the first frontier agents we are releasing, and they're in the software development lifecycle," he said. "The problems and use cases for frontier agents—these agents that are long running, capable of autonomy, thinking, always learning and improving—can be applied to many, many domains."
Amazon, after all, operates satellite networks, runs robotics warehouses, and manages one of the world's largest e-commerce platforms. If autonomous agents can learn to write code on their own, the company is betting they can eventually learn to do just about anything else.
Mistral AI, Europe's most prominent artificial intelligence startup, is releasing its most ambitious product suite to date: a family of 10 open-source models designed to run everywhere from smartphones and autonomous drones to enterprise cloud systems, marking a major escalation in the company's challenge to both U.S. tech giants and surging Chinese competitors.
The Mistral 3 family, launching today, includes a new flagship model called Mistral Large 3 and a suite of smaller "Ministral 3" models optimized for edge computing applications. All models will be released under the permissive Apache 2.0 license, allowing unrestricted commercial use — a sharp contrast to the closed systems offered by OpenAI, Google, and Anthropic.
The release is a pointed bet by Mistral that the future of artificial intelligence lies not in building ever-larger proprietary systems, but in offering businesses maximum flexibility to customize and deploy AI tailored to their specific needs, often using smaller models that can run without cloud connectivity.
"The gap between closed and open source is getting smaller, because more and more people are contributing to open source, which is great," Guillaume Lample, Mistral's chief scientist and co-founder, said in an exclusive interview with VentureBeat. "We are catching up fast."
Why Mistral is choosing flexibility over frontier performance in the AI race
The strategic calculus behind Mistral 3 diverges sharply from recent model releases by industry leaders. While OpenAI, Google, and Anthropic have focused recent launches on increasingly capable "agentic" systems — AI that can autonomously execute complex multi-step tasks — Mistral is prioritizing breadth, efficiency, and what Lample calls "distributed intelligence."
Mistral Large 3, the flagship model, employs a Mixture of Experts architecture with 41 billion active parameters drawn from a total pool of 675 billion parameters. The model can process both text and images, handles context windows up to 256,000 tokens, and was trained with particular emphasis on non-English languages — a rarity among frontier AI systems.
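Those Mixture of Experts figures imply that only a small slice of the model runs for any given token. A back-of-envelope sketch, with a toy top-k router that illustrates the gating mechanism in general, not Mistral's actual architecture:

```python
# Why an MoE model's per-token compute tracks its *active* parameters
# rather than its total pool. Parameter counts are from the article;
# the 8-expert router is a toy illustration.
TOTAL_PARAMS, ACTIVE_PARAMS = 675e9, 41e9
print(f"{ACTIVE_PARAMS / TOTAL_PARAMS:.1%} of weights touched per token")

def route_top_k(scores: list[float], k: int = 2) -> list[int]:
    """Top-k gating: pick the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# One token's router scores over 8 toy experts; only experts 1 and 3 run.
print(route_top_k([0.1, 0.9, 0.2, 0.7, 0.05, 0.3, 0.6, 0.4]))  # [1, 3]
```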
"Most AI labs focus on their native language, but Mistral Large 3 was trained on a wide variety of languages, making advanced AI useful for billions who speak different native languages," the company said in a statement reviewed ahead of the announcement.
But the more significant departure lies in the Ministral 3 lineup: nine compact models spanning three sizes (14 billion, 8 billion, and 3 billion parameters), each offered in three variants. Each variant serves a distinct purpose: base models for extensive customization, instruction-tuned models for general chat and task completion, and reasoning-optimized models for complex logic requiring step-by-step deliberation.
The smallest Ministral 3 models can run on devices with as little as 4 gigabytes of video memory using 4-bit quantization — making frontier AI capabilities accessible on standard laptops, smartphones, and embedded systems without requiring expensive cloud infrastructure or even internet connectivity. This approach reflects Mistral's belief that AI's next evolution will be defined not by sheer scale, but by ubiquity: models small enough to run on drones, in vehicles, in robots, and on consumer devices.
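The 4-gigabyte figure follows from simple arithmetic: at 4-bit quantization, each weight costs half a byte. A quick sketch of the weight-memory math (KV cache, activations, and runtime overhead are not counted here):

```python
# Rough VRAM math behind running a small model in 4 GB: at 4-bit
# quantization each weight costs half a byte, so a 3B-parameter
# model's weights fit in ~1.5 GB, leaving headroom for the KV cache.
def weight_memory_gb(params: float, bits_per_weight: int = 4) -> float:
    return params * bits_per_weight / 8 / 1e9

for params in (3e9, 8e9, 14e9):
    print(f"{params / 1e9:.0f}B params @ 4-bit -> "
          f"{weight_memory_gb(params):.1f} GB of weights")
```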
How fine-tuned small models beat expensive large models for enterprise customers
Lample's comments reveal a business model fundamentally different from that of closed-source competitors. Rather than competing primarily on benchmark performance, Mistral is targeting enterprise customers frustrated by the cost and inflexibility of proprietary systems.
"Sometimes customers say, 'Is there a use case where the best closed-source model isn't working?' If that's the case, then they're essentially stuck," Lample explained. "There's nothing they can do. It's the best model available, and it's not working out of the box."
This is where Mistral's approach diverges. When a generic model fails, the company deploys engineering teams to work directly with customers, analyzing specific problems, creating synthetic training data, and fine-tuning smaller models to outperform larger general-purpose systems on narrow tasks.
"In more than 90% of cases, a small model can do the job, especially if it's fine-tuned. It doesn't have to be a model with hundreds of billions of parameters, just a 14-billion or 24-billion parameter model," Lample said. "So it's not only much cheaper, but also faster, plus you have all the benefits: you don't need to worry about privacy, latency, reliability, and so on."
The economic argument is compelling. Multiple enterprise customers have approached Mistral after building prototypes with expensive closed-source models, only to find deployment costs prohibitive at scale, according to Lample.
"They come back to us a couple of months later because they realize, 'We built this prototype, but it's way too slow and way too expensive,'" he said.
Where Mistral 3 fits in the increasingly crowded open-source AI market
Mistral's release comes amid fierce competition on multiple fronts. OpenAI recently released GPT-5.1 with enhanced agentic capabilities. Google launched Gemini 3 with improved multimodal understanding. Anthropic released Opus 4.5 on the same day as this interview, with similar agent-focused features.
But Lample argues those comparisons miss the point. "It's a little bit behind. But I think what matters is that we are catching up fast," he acknowledged of open-source performance against closed models. "I think we are maybe playing a strategic long game."
That long game involves a different competitive set: primarily open-source models from Chinese companies like DeepSeek and Alibaba's Qwen series, which have made remarkable strides in recent months.
Mistral differentiates itself through multilingual capabilities that extend far beyond English or Chinese, multimodal integration handling both text and images in a unified model, and what the company characterizes as superior customization through easier fine-tuning.
"One key difference with the models themselves is that we focused much more on multilinguality," Lample said. "If you look at all the top models from [Chinese competitors], they're all text-only. They have visual models as well, but as separate systems. We wanted to integrate everything into a single model."
The multilingual emphasis aligns with Mistral's broader positioning as a European AI champion focused on digital sovereignty — the principle that organizations and nations should maintain control over their AI infrastructure and data.
Building beyond models: Mistral's full-stack enterprise AI platform strategy
Mistral 3's release builds on an increasingly comprehensive enterprise AI platform that extends well beyond model development. The company has assembled a full-stack offering that differentiates it from pure model providers.
Recent product launches include Mistral Agents API, which combines language models with built-in connectors for code execution, web search, image generation, and persistent memory across conversations; Magistral, the company's reasoning model designed for domain-specific, transparent, and multilingual reasoning; and Mistral Code, an AI-powered coding assistant bundling models, an in-IDE assistant, and local deployment options with enterprise tooling.
The consumer-facing Le Chat assistant has been enhanced with Deep Research mode for structured research reports, voice capabilities, and Projects for organizing conversations into context-rich folders. More recently, Le Chat gained a connector directory with 20+ enterprise integrations powered by the Model Context Protocol (MCP), spanning tools like Databricks, Snowflake, GitHub, Atlassian, Asana, and Stripe.
In October, Mistral unveiled AI Studio, a production AI platform providing observability, agent runtime, and AI registry capabilities to help enterprises track output changes, monitor usage, run evaluations, and fine-tune models using proprietary data.
Mistral now positions itself as a full-stack, global enterprise AI company, offering not just models but an application-building layer through AI Studio, compute infrastructure, and forward-deployed engineers to help businesses realize return on investment.
Why open source AI matters for customization, transparency and sovereignty
Mistral's commitment to open-source development under permissive licenses is both an ideological stance and a competitive strategy in an AI landscape increasingly dominated by closed systems.
Lample elaborated on the practical benefits: "I think something that people don't realize — but our customers know this very well — is how much better any model can actually improve if you fine tune it on the task of interest. There's a huge gap between a base model and one that's fine-tuned for a specific task, and in many cases, it outperforms the closed-source model."
The approach enables capabilities impossible with closed systems: organizations can fine-tune models on proprietary data that never leaves their infrastructure, customize architectures for specific workflows, and maintain complete transparency into how AI systems make decisions — critical for regulated industries like finance, healthcare, and defense.
This positioning has attracted government and public sector partnerships. The company launched "AI for Citizens" in July 2025, an initiative to "help States and public institutions strategically harness AI for their people by transforming public services" and has secured strategic partnerships with France's army and job agency, Luxembourg's government, and various European public sector organizations.
Mistral's transatlantic AI collaboration goes beyond European borders
While Mistral is frequently characterized as Europe's answer to OpenAI, the company views itself as a transatlantic collaboration rather than a purely European venture. It has teams on both continents, its co-founders spend significant time with customers and partners in the United States, and its models are trained in partnership with U.S.-based teams and infrastructure providers.
This transatlantic positioning may prove strategically important as geopolitical tensions around AI development intensify. The recent ASML investment, a €1.7 billion (roughly $2 billion) funding round led by the Dutch semiconductor equipment manufacturer, signals deepening collaboration across the Western semiconductor and AI value chain at a moment when both Europe and the United States are seeking to reduce dependence on Chinese technology.
Mistral's investor base reflects this dynamic: the Series C round included participation from U.S. firms Andreessen Horowitz, General Catalyst, Lightspeed, and Index Ventures alongside European investors like France's state-backed Bpifrance and global players like DST Global and Nvidia.
Founded in May 2023 by former Google DeepMind and Meta researchers, Mistral had raised roughly €1 billion ($1.05 billion) before its latest round. The company was valued at $6 billion in a June 2024 Series B, then more than doubled its valuation in its September 2025 Series C.
Can customization and efficiency beat raw performance in enterprise AI?
The Mistral 3 release crystallizes a fundamental question facing the AI industry: Will enterprises ultimately prioritize the absolute cutting-edge capabilities of proprietary systems, or will they choose open, customizable alternatives that offer greater control, lower costs, and independence from big tech platforms?
Mistral's answer is unambiguous. The company is betting that as AI moves from prototype to production, the factors that matter most shift dramatically. Raw benchmark scores matter less than total cost of ownership. Slight performance edges matter less than the ability to fine-tune for specific workflows. Cloud-based convenience matters less than data sovereignty and edge deployment.
It's a wager with significant risks. Despite Lample's optimism about closing the performance gap, Mistral's models still trail the absolute frontier. The company's revenue, while growing, reportedly remains modest relative to its nearly $14 billion valuation. And competition intensifies from both well-funded Chinese rivals making remarkable open-source progress and U.S. tech giants increasingly offering their own smaller, more efficient models.
But if Mistral is right — if the future of AI looks less like a handful of cloud-based oracles and more like millions of specialized systems running everywhere from factory floors to smartphones — then the company has positioned itself at the center of that transformation.
The release of Mistral 3 is the most comprehensive expression yet of that vision: 10 models, spanning every size category, optimized for every deployment scenario, available to anyone who wants to build with them.
Whether "distributed intelligence" becomes the industry's dominant paradigm or remains a compelling alternative serving a narrower market will determine not just Mistral's fate, but the broader question of who controls the AI future — and whether that future will be open.
For now, the race is on. And Mistral is betting it can win not by building the biggest model, but by building everywhere else.
While artificial intelligence has stormed into law firms and accounting practices with billion-dollar startups like Harvey leading the charge, the global consulting industry—a $250 billion behemoth—has remained stubbornly analog. A London-based startup founded by former McKinsey consultants is betting $2 million that it can crack open this resistant market, one Excel spreadsheet at a time.
Ascentra Labs announced Tuesday that it has closed a $2 million seed round led by NAP, a Berlin-based venture capital firm formerly known as Cavalry Ventures. The funding comes with participation from notable founder-angels including Alan Chang, chief executive of Fuse and former chief revenue officer at Revolut, and Fredrik Hjelm, chief executive of European e-scooter company Voi.
The investment is modest by the standards of enterprise AI — a sector that has seen funding rounds routinely reach into the hundreds of millions. But Ascentra's founders argue that their focused approach to a narrow but painful problem could give them an edge in a market where broad AI solutions have repeatedly failed to gain traction.
Consultants spend countless hours on Excel survey analysis that even top firms haven't automated
Paritosh Devbhandari, Ascentra's co-founder and chief executive, spent years at McKinsey & Company, including a stint at QuantumBlack, the firm's AI and advanced analytics division. He knows intimately the late nights consultants spend wrestling with survey data—the kind of quantitative research that forms the backbone of private equity due diligence.
"Before starting the company, I was working at McKinsey, specifically on the private equity team," Devbhandari explained in an exclusive interview with VentureBeat. The work, he said, involves analyzing encoded survey responses from customers, suppliers, and market participants during potential acquisitions.
"Consultants typically spend a lot of time doing this in Excel," he said. "One of the things that surprised me, having worked at a couple of different places, is that the workflow — even at the best firms — really isn't that different from some of the boutiques. I always expected there would be some smarter way of doing things, and often there just isn't."
That gap between expectation and reality became the foundation for Ascentra. The company's platform ingests raw survey data files and outputs formatted Excel workbooks complete with traceable formulas — the kind of deliverable a junior associate would spend hours constructing manually.
AI has transformed legal work but consulting presents unique technical challenges that have blocked adoption
The disparity between AI adoption in law versus consulting raises an obvious question: if the consulting market is so large and the workflows so manual, why hasn't venture capital flooded the space the way it has legal tech?
Devbhandari offered a frank assessment. "It's not like people haven't tried," he said. "The top of the funnel in our space is crowded. When we speak to our consulting clients, the partners say they get another pitch deck in their LinkedIn inbox or email every week—sometimes several. There are plenty of people trying."
The barriers, he argued, are structural. Professional services firms move slowly on technology adoption, demanding extensive security credentials and customer references before granting even a pilot opportunity. "I think that's where 90% of startups in professional services, writ large, fall down," he said.
But consulting presents unique technical challenges beyond the sales cycle. Unlike legal work, which largely involves text documents that modern large language models handle well, consulting spans multiple data modalities — PowerPoint presentations, Excel spreadsheets, Word documents — with information that can be tabular, graphical, or textual.
"You can have multiple formats of Excel in itself," Devbhandari noted. "And that's a big contrast to the legal space, where you could have a multi-purpose AI agent, or collection of agents, which can actually do a lot of the tasks that lawyers do day to day. Consulting is the opposite of that."
Ascentra's private equity focus reflects a calculated bet on repeatable workflows
Ascentra's strategy hinges on extreme specificity. Rather than attempting to automate the full spectrum of consulting work, the company focuses exclusively on survey analysis within private equity due diligence — a niche within a niche.
The logic is both technical and commercial. Private equity work tends to be more standardized than other consulting engagements, with similar analyses recurring across deals. That repeatability makes automation feasible. It also positions Ascentra against a less formidable competitive set: even the largest consulting firms, Devbhandari claimed, lack dedicated internal tools for this particular workflow.
"Survey analysis automation is so specific that even the biggest and best firms haven't developed anything in-house for it," he said.
The company claims that three of the world's top five consulting firms now use its platform, with early adopters reporting time savings of 60 to 80 percent on active due diligence projects. But there's a notable caveat: Ascentra cannot publicly name any of these clients.
"It's a very private industry, so at the moment, we can't announce any clients publicly," Devbhandari acknowledged. "What I can say is that we're working with three of the top five consulting firms. We've passed pilots at multiple organizations and have submitted business cases for enterprise rollouts."
Eliminating AI hallucinations becomes critical when billion-dollar deals hang in the balance
For an AI company selling into quantitative workflows, accuracy is existential. Consultants delivering analysis to private equity clients face enormous pressure to be precise—a single error in a financial model can undermine credibility and, potentially, billion-dollar investment decisions.
Devbhandari described this as Ascentra's central design challenge. "Consultants require a very, very high degree of fidelity when they're doing their analysis," he said. "So with quantitative data, even if it's 95% accurate, they will revert to Excel because they know it, they trust it, and they don't want there to be any margin for error."
Ascentra's technical approach attempts to address this by limiting where AI models operate within the workflow. The company uses GPT-based models from OpenAI to interpret and ingest incoming data, but the actual analysis relies on deterministic Python scripts that produce consistent, verifiable outputs.
"What's different is the steps that follow are deterministic," Devbhandari explained. "There's no room for error. There's no hallucinations, and the Excel writer that we've connected to the product on the back end converts this analysis into Excel formula, which are live and traceable, so consultants can get that assurance that they can follow along with the maths."
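The split Devbhandari describes, with a model handling messy ingestion and deterministic code producing every number, can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Ascentra's actual code; the schema lookup stands in for the GPT-based ingestion step, and the metric shown (NPS) is a common survey calculation chosen for the example.

```python
# Hypothetical sketch of the hybrid pattern described above: a model
# interprets messy input once, then every number comes from deterministic
# code that can be re-derived and audited.

def llm_map_columns(raw_headers):
    """Stand-in for the GPT-based ingestion step: map messy survey headers
    onto a known schema. In production this would call a model; here it is
    a fixed lookup so the example stays self-contained."""
    schema = {"How likely are you to recommend us? (0-10)": "nps_score",
              "Respondent segment": "segment"}
    return {h: schema.get(h, h) for h in raw_headers}

def net_promoter_score(scores):
    """Deterministic NPS calculation: promoters (9-10) minus detractors
    (0-6), as a percentage of all respondents."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

responses = [10, 9, 8, 7, 6, 10, 3, 9]
print(net_promoter_score(responses))  # 4 promoters, 2 detractors -> 25.0
```

Because the analysis step is plain arithmetic rather than a model call, the same inputs always yield the same outputs, which is the "no margin for error" property consultants are said to demand.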
Whether this hybrid approach delivers on its promise of eliminating hallucinations while maintaining useful AI capabilities will be tested as the platform scales across more complex use cases and client environments.
Enterprise security certifications give Ascentra an edge over less prepared competitors
Selling software to major consulting firms requires clearing an unusually high security bar. These organizations handle sensitive client data across industries, and their vendor security assessments can take months to complete.
Ascentra invested early in obtaining enterprise-grade certifications, a strategic choice that Devbhandari framed as essential table stakes. The company has achieved SOC 2 Type II and ISO 27001 certifications and claims to be under audit for ISO 42001, an emerging standard for AI management systems.
Data handling policies also reflect the sensitivity of the target market. Client data is deleted within 30 to 45 days, depending on contractual terms, and Ascentra does not use customer data to train its models.
There's also an argument that survey data carries somewhat lower sensitivity than other consulting materials. "Survey data is unique in consulting data because it's collected during the course of a project, and it is market data," Devbhandari noted. "You interview people in the market, and you collect a bunch of data in an Excel, as opposed to—you look at Rogo or some of the other finance AI startups—they use client data, so financials, which is confidential and strictly non-public."
Per-project pricing aligns with how consulting firms actually spend money
Ascentra's pricing model departs from the subscription-based approach that dominates enterprise software. The company charges on a per-project basis, a structure Devbhandari said aligns with how consulting firms allocate budgets.
"Project budgets are in consulting set on a per project basis," he explained. "You'll have central budgets which are for things like Microsoft, right, very central things that every team will use all of the time. And then you have project budgets which are for the teams that are using specific resources, teams or products nowadays."
This approach may ease initial adoption by avoiding the need for central IT procurement approval, but it also introduces revenue unpredictability. The company's success will depend on converting project-level usage into broader enterprise relationships—a path Devbhandari suggested is already underway through submitted business cases for enterprise rollouts.
AI may not eliminate consulting jobs, but it will fundamentally transform what consultants do
Perhaps the most interesting tension in Devbhandari's vision concerns what AI ultimately means for consulting employment. He pushed back on predictions that AI will eliminate consulting jobs while simultaneously describing an industry on the cusp of fundamental transformation.
"People love to talk about how AI is going to remove the need for consultants, and I disagree," he said. "Yes, the role will change, but I don't think the industry goes away. I think the best solutions will come from people within the industry building products around the work they know."
Yet he also painted a picture of dramatic change. "At the moment, you have a big intake of graduates who just do—for the most part, you know, they have the strategic work as part of what they do, but they also have a lot of work in Excel and PowerPoint. I think in a few years' time, we'll look back at these times and think, you know, very, very different."
The honest answer, he acknowledged, is that no one truly knows how this plays out. "I don't think even AI leaders truly know what that looks like yet," he said of whether productivity gains will translate to more work or fewer workers.
Ascentra plans to use seed funding to expand its U.S. presence and go-to-market team
The $2 million will primarily fund Ascentra's expansion into the United States, where more than 80 percent of its customers are already based. Devbhandari plans to relocate there personally as the company builds out go-to-market capabilities.
"One of the things that we've really noticed is that with consulting being an American industry, and I think America being a great place for innovation and trying new things, we've definitely drawn ourselves to the U.S.," he said. "American hires are very expensive, and I'm sure that a lot of the raise will go towards that."
The seed round represents a bet by NAP on what its co-founder Stefan Walter called an overdue disruption. "While most knowledge work has been reshaped by new technology, consulting has remained stubbornly manual," Walter said. "AI won't replace consultants, but consultants using Ascentra might."
The startup now faces the hard work of converting pilot wins into lasting enterprise contracts
Ascentra enters 2026 with momentum but no guarantee of success. The company must transform pilot programs at elite firms into sticky enterprise contracts — all while fending off the inevitable well-funded competitors who will flood into the space once the opportunity becomes undeniable. Its deliberately narrow focus on survey analysis provides a defensible beachhead, but expanding into adjacent workflows will require building entirely new products without sacrificing the domain expertise that Devbhandari argues is the company's core advantage.
Oliver Thurston, Ascentra's co-founder and chief technology officer, who previously led machine learning at Mathison AI, offered a clear-eyed assessment of the challenge. "Consulting workflows are uniquely complex and difficult to build products around," he said in a statement. "It's not surprising the space hasn't changed yet. This will change though, and there's no doubt that the industry is going to look completely different in five years' time."
For now, Ascentra is placing a focused wager: that the consultants who once spent their nights formatting spreadsheets will be the ones who finally bring AI into an industry that has long resisted it. The irony is hard to miss. After years of advising Fortune 500 companies on digital transformation, consulting may finally have to take its own medicine.
Chinese artificial intelligence startup DeepSeek released two powerful new AI models on Sunday that the company claims match or exceed the capabilities of OpenAI's GPT-5 and Google's Gemini-3.0-Pro — a development that could reshape the competitive landscape between American tech giants and their Chinese challengers.
The Hangzhou-based company launched DeepSeek-V3.2, designed as an everyday reasoning assistant, alongside DeepSeek-V3.2-Speciale, a high-powered variant that achieved gold-medal performance in four elite international competitions: the 2025 International Mathematical Olympiad, the International Olympiad in Informatics, the ICPC World Finals, and the China Mathematical Olympiad.
The release carries profound implications for American technology leadership. DeepSeek has once again demonstrated that it can produce frontier AI systems despite U.S. export controls that restrict China's access to advanced Nvidia chips — and it has done so while making its models freely available under an open-source MIT license.
"People thought DeepSeek gave a one-time breakthrough but we came back much bigger," wrote Chen Fang, who identified himself as a contributor to the project, on X (formerly Twitter). The release drew swift reactions online, with one user declaring: "Rest in peace, ChatGPT."
How DeepSeek's sparse attention breakthrough slashes computing costs
At the heart of the new release lies DeepSeek Sparse Attention, or DSA — a novel architectural innovation that dramatically reduces the computational burden of running AI models on long documents and complex tasks.
Traditional AI attention mechanisms, the core technology allowing language models to understand context, scale poorly as input length increases. Processing a document twice as long typically requires four times the computation. DeepSeek's approach breaks this constraint using what the company calls a "lightning indexer" that identifies only the most relevant portions of context for each query, ignoring the rest.
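The general idea behind an indexer-then-attend design can be shown in a toy form: a cheap scoring pass ranks all past positions, and full attention is computed only over the top-k. The function below is an illustrative reading of that idea, not DeepSeek's actual DSA implementation, and all names are made up for the example.

```python
import numpy as np

def topk_sparse_attention(q, keys, values, k):
    """Toy sparse attention: a cheap 'indexer' scores every past position,
    then softmax attention runs only over the k highest-scoring ones."""
    # Cheap relevance score for every past position (the indexer step).
    scores = keys @ q
    # Keep only the k most relevant positions; ignore the rest.
    idx = np.argsort(scores)[-k:]
    sel = scores[idx]
    # Softmax over the selected positions only.
    w = np.exp(sel - sel.max())
    w /= w.sum()
    # Weighted sum of the selected values.
    return w @ values[idx]

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
keys = rng.standard_normal((1024, 8))
values = rng.standard_normal((1024, 8))
out = topk_sparse_attention(q, keys, values, k=64)  # attends to 64 of 1024 tokens
print(out.shape)  # (8,)
```

The savings come from the second step: the expensive attention computation touches k positions instead of all of them, so doubling the context no longer quadruples the work done per query.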
According to DeepSeek's technical report, DSA reduces inference costs by roughly half compared to previous models when processing long sequences. The architecture "substantially reduces computational complexity while preserving model performance," the report states.
Processing 128,000 tokens — roughly equivalent to a 300-page book — now costs approximately $0.70 per million tokens for decoding, compared to $2.40 for the previous V3.1-Terminus model. That represents a 70% reduction in inference costs.
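The decoding prices quoted above can be checked directly; they work out to a reduction of roughly 71 percent, consistent with the article's rounded figure.

```python
# Quick check of the per-million-token decoding prices quoted above.
old_cost, new_cost = 2.40, 0.70
reduction = (old_cost - new_cost) / old_cost
print(f"{reduction:.0%}")  # 71%
```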
The 685-billion-parameter models support context windows of 128,000 tokens, making them suitable for analyzing lengthy documents, codebases, and research papers. DeepSeek's technical report notes that independent evaluations on long-context benchmarks show V3.2 performing on par with or better than its predecessor "despite incorporating a sparse attention mechanism."
The benchmark results that put DeepSeek in the same league as GPT-5
DeepSeek's claims of parity with America's leading AI systems rest on extensive testing across mathematics, coding, and reasoning tasks — and the numbers are striking.
On AIME 2025, a prestigious American mathematics competition, DeepSeek-V3.2-Speciale achieved a 96.0% pass rate, compared to 94.6% for GPT-5-High and 95.0% for Gemini-3.0-Pro. On the Harvard-MIT Mathematics Tournament, the Speciale variant scored 99.2%, surpassing Gemini's 97.5%.
The standard V3.2 model, optimized for everyday use, scored 93.1% on AIME and 92.5% on HMMT — marginally below frontier models but achieved with substantially fewer computational resources.
Most striking are the competition results. DeepSeek-V3.2-Speciale scored 35 out of 42 points on the 2025 International Mathematical Olympiad, earning gold-medal status. At the International Olympiad in Informatics, it scored 492 out of 600 points — also gold, ranking 10th overall. The model solved 10 of 12 problems at the ICPC World Finals, placing second.
These results came without internet access or tools during testing. DeepSeek's report states that "testing strictly adheres to the contest's time and attempt limits."
On coding benchmarks, DeepSeek-V3.2 resolved 73.1% of real-world software bugs on SWE-Verified, competitive with GPT-5-High at 74.9%. On Terminal Bench 2.0, measuring complex coding workflows, DeepSeek scored 46.4%—well above GPT-5-High's 35.2%.
The company acknowledges limitations. "Token efficiency remains a challenge," the technical report states, noting that DeepSeek "typically requires longer generation trajectories" to match Gemini-3.0-Pro's output quality.
Why teaching AI to think while using tools changes everything
Beyond raw reasoning, DeepSeek-V3.2 introduces "thinking in tool-use" — the ability to reason through problems while simultaneously executing code, searching the web, and manipulating files.
Previous AI models faced a frustrating limitation: each time they called an external tool, they lost their train of thought and had to restart reasoning from scratch. DeepSeek's architecture preserves the reasoning trace across multiple tool calls, enabling fluid multi-step problem solving.
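The difference can be made concrete with a minimal agent loop in which the running transcript, including the model's reasoning steps, persists across tool calls rather than being reset after each one. Everything here is a hypothetical sketch of the pattern; none of these names correspond to DeepSeek's actual API.

```python
# Illustrative sketch of "thinking in tool-use": the agent's reasoning
# trace lives in one running transcript that survives each tool call,
# instead of being discarded after every tool result.

def run_agent(task, model_step, tools, max_steps=10):
    transcript = [("task", task)]           # persistent context
    for _ in range(max_steps):
        kind, content = model_step(transcript)
        transcript.append((kind, content))  # reasoning is appended, not reset
        if kind == "answer":
            return content, transcript
        if kind == "tool_call":
            name, arg = content
            transcript.append(("tool_result", tools[name](arg)))
    return None, transcript

# Minimal fake model and tool to show the flow.
def fake_model(transcript):
    if not any(kind == "tool_result" for kind, _ in transcript):
        return "tool_call", ("search", "population of Hangzhou")
    return "answer", "About 12 million."

answer, trace = run_agent("How many people live in Hangzhou?",
                          fake_model, {"search": lambda q: "~12M (2023)"})
print(answer)      # About 12 million.
print(len(trace))  # task + tool_call + tool_result + answer = 4
```

Because `model_step` always receives the full transcript, the second call can build on the reasoning recorded before the tool ran, which is the behavior the architecture is said to preserve.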
To train this capability, the company built a massive synthetic data pipeline generating over 1,800 distinct task environments and 85,000 complex instructions. These included challenges like multi-day trip planning with budget constraints, software bug fixes across eight programming languages, and web-based research requiring dozens of searches.
The technical report describes one example: planning a three-day trip from Hangzhou with constraints on hotel prices, restaurant ratings, and attraction costs that vary based on accommodation choices. Such tasks are "hard to solve but easy to verify," making them ideal for training AI agents.
DeepSeek employed real-world tools during training — actual web search APIs, coding environments, and Jupyter notebooks — while generating synthetic prompts to ensure diversity. The result is a model that generalizes to unseen tools and environments, a critical capability for real-world deployment.
DeepSeek's open-source gambit could upend the AI industry's business model
Unlike OpenAI and Anthropic, which guard their most powerful models as proprietary assets, DeepSeek has released both V3.2 and V3.2-Speciale under the MIT license — one of the most permissive open-source frameworks available.
Any developer, researcher, or company can download, modify, and deploy the 685-billion-parameter models without restriction. Full model weights, training code, and documentation are available on Hugging Face, the leading platform for AI model sharing.
The strategic implications are significant. By making frontier-capable models freely available, DeepSeek undermines competitors charging premium API prices. The Hugging Face model card notes that DeepSeek has provided Python scripts and test cases "demonstrating how to encode messages in OpenAI-compatible format" — making migration from competing services straightforward.
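For readers unfamiliar with it, the OpenAI-compatible format mentioned in the model card is a simple JSON body of role/content messages; a client built for that shape can typically be repointed at another provider by changing the base URL and model name. The endpoint path and model identifier below are placeholders, not DeepSeek's actual values.

```python
import json

# The widely adopted OpenAI-style chat request: a model name plus a list
# of role/content messages. Model id here is a placeholder.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize sparse attention in one line."},
]

request_body = {
    "model": "deepseek-v3.2",   # placeholder model identifier
    "messages": messages,
    "temperature": 0.2,
}

# A real client would POST this JSON to a /v1/chat/completions endpoint.
print(json.dumps(request_body, indent=2))
```

This compatibility is why migration from competing services is straightforward: the request and response shapes stay the same even when the provider changes.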
For enterprise customers, the value proposition is compelling: frontier performance at dramatically lower cost, with deployment flexibility. But data residency concerns and regulatory uncertainty may limit adoption in sensitive applications — particularly given DeepSeek's Chinese origins.
Regulatory walls are rising against DeepSeek in Europe and America
DeepSeek's global expansion faces mounting resistance. In June, Berlin's data protection commissioner Meike Kamp declared that DeepSeek's transfer of German user data to China is "unlawful" under EU rules, asking Apple and Google to consider blocking the app.
The German authority expressed concern that "Chinese authorities have extensive access rights to personal data within the sphere of influence of Chinese companies." Italy ordered DeepSeek to block its app in February. U.S. lawmakers have moved to ban the service from government devices, citing national security concerns.
Questions also persist about U.S. export controls designed to limit China's AI capabilities. In August, DeepSeek hinted that China would soon have "next generation" domestically built chips to support its models. The company indicated its systems work with Chinese-made chips from Huawei and Cambricon without additional setup.
DeepSeek's original V3 model was reportedly trained on roughly 2,000 older Nvidia H800 chips — hardware since restricted for China export. The company has not disclosed what powered V3.2 training, but its continued advancement suggests export controls alone cannot halt Chinese AI progress.
What DeepSeek's release means for the future of AI competition
The release arrives at a pivotal moment. After years of massive investment, some analysts question whether an AI bubble is forming. DeepSeek's ability to match American frontier models at a fraction of the cost challenges assumptions that AI leadership requires enormous capital expenditure.
The company's technical report reveals that post-training investment now exceeds 10% of pre-training costs — a substantial allocation credited for reasoning improvements. But DeepSeek acknowledges gaps: "The breadth of world knowledge in DeepSeek-V3.2 still lags behind leading proprietary models," the report states. The company plans to address this by scaling pre-training compute.
DeepSeek-V3.2-Speciale remains available through a temporary API until December 15, when its capabilities will merge into the standard release. The Speciale variant is designed exclusively for deep reasoning and does not support tool calling — a limitation the standard model addresses.
For now, the AI race between the United States and China has entered a new phase. DeepSeek's release demonstrates that open-source models can achieve frontier performance, that efficiency innovations can slash costs dramatically, and that the most powerful AI systems may soon be freely available to anyone with an internet connection.
As one commenter on X observed: "Deepseek just casually breaking those historic benchmarks set by Gemini is bonkers."
The question is no longer whether Chinese AI can compete with Silicon Valley. It's whether American companies can maintain their lead when their Chinese rival gives comparable technology away for free.
When Liquid AI, a startup founded by MIT computer scientists back in 2023, introduced its Liquid Foundation Models series 2 (LFM2) in July 2025, the pitch was straightforward: deliver the fastest on-device foundation models on the market using the new "liquid" architecture, with training and inference efficiency that made small models a serious alternative to cloud-only large language models (LLMs) such as OpenAI's GPT series and Google's Gemini.
The initial release shipped dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture heavily weighted toward gated short convolutions, and benchmark numbers that placed LFM2 ahead of similarly sized competitors like Qwen3, Llama 3.2, and Gemma 3 on both quality and CPU throughput. The message to enterprises was clear: real-time, privacy-preserving AI on phones, laptops, and vehicles no longer required sacrificing capability for latency.
In the months since that launch, Liquid has expanded LFM2 into a broader product line — adding task-and-domain-specialized variants, a small video ingestion and analysis model, and an edge-focused deployment stack called LEAP — and positioned the models as the control layer for on-device and on-prem agentic systems.
Now, with the publication of the detailed, 51-page LFM2 technical report on arXiv, the company is going a step further: making public the architecture search process, training data mixture, distillation objective, curriculum strategy, and post-training pipeline behind those models.
And unlike earlier open models, LFM2 is built around a repeatable recipe: a hardware-in-the-loop search process, a training curriculum that compensates for smaller parameter budgets, and a post-training pipeline tuned for instruction following and tool use.
Rather than just offering weights and an API, Liquid is effectively publishing a detailed blueprint that other organizations can use as a reference for training their own small, efficient models from scratch, tuned to their own hardware and deployment constraints.
A model family designed around real constraints, not GPU labs
The technical report begins with a premise enterprises are intimately familiar with: real AI systems hit limits long before benchmarks do. Latency budgets, peak memory ceilings, and thermal throttling define what can actually run in production—especially on laptops, tablets, commodity servers, and mobile devices.
To address this, Liquid AI performed architecture search directly on target hardware, including Snapdragon mobile SoCs and Ryzen laptop CPUs. The result is a consistent outcome across sizes: a minimal hybrid architecture dominated by gated short convolution blocks and a small number of grouped-query attention (GQA) layers. This design was repeatedly selected over more exotic linear-attention and SSM hybrids because it delivered a better quality-latency-memory Pareto profile under real device conditions.
This matters for enterprise teams in three ways:
- Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350M to 2.6B.
- Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.
- On-device feasibility. Prefill and decode throughput on CPUs surpasses that of comparable open models by roughly 2× in many cases, reducing the need to offload routine tasks to cloud inference endpoints.
Instead of optimizing for academic novelty, the report reads as a systematic attempt to design models enterprises can actually ship.
This practicality is notable in a field where many open models quietly assume access to multi-H100 clusters, even during inference.
A training pipeline tuned for enterprise-relevant behavior
LFM2 adopts a training approach that compensates for the smaller scale of its models with structure rather than brute force. Key elements include:
- 10–12T token pre-training and an additional 32K-context mid-training phase, which extends the model’s useful context window without exploding compute costs.
- A decoupled Top-K knowledge distillation objective that sidesteps the instability of standard KL distillation when teachers provide only partial logits.
- A three-stage post-training sequence—SFT, length-normalized preference alignment, and model merging—designed to produce more reliable instruction following and tool-use behavior.
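The distillation idea in that list can be illustrated concretely: when a teacher reveals only its top-k logits, the student can be trained to match the teacher's distribution renormalized over just those k tokens, rather than attempting a full-vocabulary KL against incomplete information. The function below is one plausible reading of "decoupled Top-K distillation," not Liquid AI's exact loss.

```python
import numpy as np

def topk_distill_loss(student_logits, teacher_topk_idx, teacher_topk_logits):
    """Cross-entropy of the student against the teacher, restricted to the
    k tokens the teacher actually revealed (both renormalized over them)."""
    # Renormalize the teacher's partial distribution over its top-k tokens.
    t = np.exp(teacher_topk_logits - teacher_topk_logits.max())
    t /= t.sum()
    # Student distribution restricted to the same k tokens, renormalized.
    s_logits = student_logits[teacher_topk_idx]
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()
    # Cross-entropy on the shared support.
    return float(-(t * np.log(s + 1e-12)).sum())

vocab = 100
rng = np.random.default_rng(2)
student = rng.standard_normal(vocab)
teacher_idx = np.array([3, 17, 42, 64, 99])           # teacher reveals 5 tokens
teacher_logits = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
loss = topk_distill_loss(student, teacher_idx, teacher_logits)
print(loss > 0)  # True: a positive scalar loss
```

Restricting the loss to the revealed support avoids penalizing the student for probability mass on tokens the teacher never scored, which is the instability the report attributes to naive KL distillation against partial logits.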
For enterprise AI developers, the significance is that LFM2 models behave less like “tiny LLMs” and more like practical agents able to follow structured formats, adhere to JSON schemas, and manage multi-turn chat flows. Many open models at similar sizes fail not due to lack of reasoning ability, but due to brittle adherence to instruction templates. The LFM2 post-training recipe directly targets these rough edges.
In other words: Liquid AI optimized small models for operational reliability, not just scoreboards.
Multimodality designed for device constraints, not lab demos
The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built around token efficiency.
Rather than embedding a massive vision transformer directly into an LLM, LFM2-VL attaches a SigLIP2 encoder through a connector that aggressively reduces visual token count via PixelUnshuffle. High-resolution inputs automatically trigger dynamic tiling, keeping token budgets controllable even on mobile hardware. LFM2-Audio uses a bifurcated audio path—one for embeddings, one for generation—supporting real-time transcription or speech-to-speech on modest CPUs.
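The token arithmetic behind a PixelUnshuffle-style reduction is simple to demonstrate: an r×r patch of visual tokens is folded into a single token with r² times the channels, so a factor-2 unshuffle quarters the token count. The shapes below are illustrative, not LFM2-VL's actual dimensions.

```python
import numpy as np

def pixel_unshuffle(tokens, h, w, r):
    """tokens: (h*w, c) grid of visual tokens -> (h*w // r**2, c * r**2).
    Each r x r block of neighboring tokens becomes one wider token."""
    c = tokens.shape[1]
    grid = tokens.reshape(h, w, c)
    grid = grid.reshape(h // r, r, w // r, r, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # group each r x r block together
    return grid.reshape((h // r) * (w // r), c * r * r)

tokens = np.arange(16 * 16 * 8, dtype=float).reshape(256, 8)  # 256 tokens, 8 channels
reduced = pixel_unshuffle(tokens, h=16, w=16, r=2)
print(tokens.shape[0], "->", reduced.shape[0])  # 256 -> 64 tokens
```

Since language-model cost scales with token count, not channel width, trading spatial tokens for channels is what keeps high-resolution inputs affordable on mobile hardware.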
For enterprise platform architects, this design points toward a practical future where:
- document understanding happens directly on endpoints such as field devices;
- audio transcription and speech agents run locally for privacy compliance;
- multimodal agents operate within fixed latency envelopes without streaming data off-device.
The through-line is the same: multimodal capability without requiring a GPU farm.
Retrieval models built for agent systems, not legacy search
LFM2-ColBERT extends late-interaction retrieval into a footprint small enough for enterprise deployments that need multilingual RAG without the overhead of specialized vector DB accelerators.
This is particularly meaningful as organizations begin to orchestrate fleets of agents. Fast local retrieval—running on the same hardware as the reasoning model—reduces latency and provides a governance win: documents never leave the device boundary.
Taken together, the VL, Audio, and ColBERT variants show LFM2 as a modular system, not a single model drop.
The emerging blueprint for hybrid enterprise AI architectures
Across all variants, the LFM2 report implicitly sketches what tomorrow’s enterprise AI stack will look like: hybrid local-cloud orchestration, where small, fast models operating on devices handle time-critical perception, formatting, tool invocation, and judgment tasks, while larger models in the cloud offer heavyweight reasoning when needed.
Several trends converge here:
- Cost control. Running routine inference locally avoids unpredictable cloud billing.
- Latency determinism. Time-to-first-token (TTFT) and decode stability matter in agent workflows; on-device execution eliminates network jitter.
- Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.
- Resilience. Agentic systems degrade gracefully if the cloud path becomes unavailable.
Enterprises adopting these architectures will likely treat small on-device models as the “control plane” of agentic workflows, with large cloud models serving as on-demand accelerators.
LFM2 is one of the clearest open-source foundations for that control layer to date.
The strategic takeaway: on-device AI is now a design choice, not a compromise
For years, organizations building AI features have accepted that “real AI” requires cloud inference. LFM2 challenges that assumption. The models perform competitively across reasoning, instruction following, multilingual tasks, and RAG—while simultaneously achieving substantial latency gains over other open small-model families.
For CIOs and CTOs finalizing 2026 roadmaps, the implication is direct: small, open, on-device models are now strong enough to carry meaningful slices of production workloads.
LFM2 will not replace frontier cloud models for frontier-scale reasoning. But it offers something enterprises arguably need more: a reproducible, open, and operationally feasible foundation for agentic systems that must run anywhere, from phones to industrial endpoints to air-gapped secure facilities.
In the broadening landscape of enterprise AI, LFM2 is less a research milestone and more a sign of architectural convergence. The future is not cloud or edge—it’s both, operating in concert. And releases like LFM2 provide the building blocks for organizations prepared to build that hybrid future intentionally rather than accidentally.
A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.
OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.
That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.
"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."
The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.
Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.
Why university researchers built a tougher benchmark to test AI agents—and what they discovered
The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.
Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.
The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."
When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.
"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."
The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.
How OpenAGI trained its AI to take actions instead of just generating text
OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.
Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.
Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.
"The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."
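The loop Qin describes (explore, keep what worked, explore better) can be sketched schematically. The toy environment and every name below are stand-ins invented for the example; OpenAGI has not published its training code, and real agentic pre-training operates on screenshots and GUI actions rather than a corridor.

```python
import random

def explore(policy, env, max_steps=5):
    """Roll out the current policy; return the trajectory and success flag."""
    state, traj = env["start"], []
    for _ in range(max_steps):
        action = policy(state)
        traj.append((state, action))
        state = env["step"](state, action)
        if state == env["goal"]:
            return traj, True
    return traj, False

def self_evolve(env, rounds=500):
    table = {}  # learned state -> action mapping (the "model")
    def policy(s):
        return table[s] if s in table else random.choice(env["actions"])
    for _ in range(rounds):
        traj, ok = explore(policy, env)
        visited = [s for s, _ in traj]
        # Successful, loop-free exploration becomes new training data.
        if ok and len(set(visited)) == len(visited):
            for s, a in traj:
                table[s] = a
    return table

random.seed(0)
corridor = {
    "start": 0, "goal": 3, "actions": ["left", "right"],
    "step": lambda s, a: min(3, s + 1) if a == "right" else max(0, s - 1),
}
learned = self_evolve(corridor)
print(learned)  # typically {0: 'right', 1: 'right', 2: 'right'}
```

The self-reinforcing quality is visible even at this scale: once one successful exploration is absorbed, later rollouts follow the learned actions and succeed immediately, generating yet more confirming data.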
This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.
OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.
Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications
A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.
Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.
OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a software development kit (SDK) alongside the model, allowing third parties to build applications on top of Lux.
The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.
"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.
The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details
Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.
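The refusal pattern described above can be illustrated as a pre-execution policy gate. This is a minimal sketch only: Lux reportedly uses model-level reasoning against a safety policy, and the keyword list and matching logic here are invented for illustration, not OpenAGI's actual safeguard.

```python
# Minimal sketch of a pre-execution safety gate for an action-taking
# agent. The policy terms and string matching are illustrative only;
# a production system would rely on model reasoning, not keywords.
SENSITIVE_TERMS = ("bank details", "password", "social security")


def check_action(request):
    """Refuse requests that touch sensitive data; otherwise allow."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return {
                "allowed": False,
                "reason": f"Request involves sensitive information ({term!r}); "
                          "refusing and warning the user.",
            }
    return {"allowed": True, "reason": "No policy violation detected."}
```

A gate like this runs before any click or keystroke is issued, which is why the agent can surface a warning instead of executing the request.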
Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
The MIT researcher who built two of GitHub's most downloaded AI models
Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.
Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.
His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.
Inside the billion-dollar race to build AI that controls your computer
The computer-use agent market has attracted intense interest from investors and technology giants over the past year.
OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.
Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.
Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures — that a small team with the right ideas can outmaneuver the giants.
The technology industry has seen that story before. It rarely stays true for long.