Salesforce on Tuesday launched an entirely rebuilt version of Slackbot, the company's workplace assistant, transforming it from a simple notification tool into what executives describe as a fully powered AI agent capable of searching enterprise data, drafting documents, and taking action on behalf of employees.
The new Slackbot, now generally available to Business+ and Enterprise+ customers, is Salesforce's most aggressive move yet to position Slack at the center of the emerging "agentic AI" movement — where software agents work alongside humans to complete complex tasks. The launch comes as Salesforce attempts to convince investors that artificial intelligence will bolster its products rather than render them obsolete.
"Slackbot isn't just another copilot or AI assistant," said Parker Harris, Salesforce co-founder and Slack's chief technology officer, in an exclusive interview with Salesforce. "It's the front door to the agentic enterprise, powered by Salesforce."
From tricycle to Porsche: Salesforce rebuilt Slackbot from the ground up
Harris was blunt about what distinguishes the new Slackbot from its predecessor: "The old Slackbot was, you know, a little tricycle, and the new Slackbot is like, you know, a Porsche."
The original Slackbot, which has existed since Slack's early days, performed basic algorithmic tasks — reminding users to add colleagues to documents, suggesting channel archives, and delivering simple notifications. The new version runs on an entirely different architecture built around a large language model and sophisticated search capabilities that can access Salesforce records, Google Drive files, calendar data, and years of Slack conversations.
"It's two different things," Harris explained. "The old Slackbot was algorithmic and fairly simple. The new Slackbot is brand new — it's based around an LLM and a very robust search engine, and connections to third-party search engines, third-party enterprise data."
Salesforce chose to retain the Slackbot brand despite the fundamental technical overhaul. "People know what Slackbot is, and so we wanted to carry that forward," Harris said.
Why Anthropic's Claude powers the new Slackbot — and which AI models could come next
The new Slackbot runs on Claude, Anthropic's large language model, a choice driven partly by compliance requirements. Slack's commercial service operates under FedRAMP Moderate certification to serve U.S. federal government customers, and Harris said Anthropic was "the only provider that could give us a compliant LLM" when Slack began building the new system.
But that exclusivity won't last. "We are, this year, going to support additional providers," Harris said. "We have a great relationship with Google. Gemini is incredible — performance is great, cost is great. So we're going to use Gemini for some things." He added that OpenAI remains a possibility as well.
Harris echoed Salesforce CEO Marc Benioff's view that large language models are becoming commoditized: "You've heard Marc talk about LLMs are commodities, that they're democratized. I call them CPUs."
On the sensitive question of training data, Harris was unequivocal: Salesforce does not train any models on customer data. "Models don't have any sort of security," he explained. "If we trained it on some confidential conversation that you and I have, I don't want Carolyn to know — if I train it into the LLM, there is no way for me to say you get to see the answer, but Carolyn doesn't."
Inside Salesforce's internal experiment: 80,000 employees tested Slackbot with striking results
Salesforce has been testing the new Slackbot internally for months, rolling it out to all 80,000 employees. According to Ryan Gavin, Slack's chief marketing officer, the results have been striking: "It's the fastest adopted product in Salesforce history."
Internal data shows that two-thirds of Salesforce employees have tried the new Slackbot, with 80% of those users continuing to use it regularly. Internal satisfaction rates reached 96% — the highest for any AI feature Slack has shipped. Employees report saving between two and 20 hours per week.
The adoption happened largely organically. "I think it was about five days, and a Canvas was developed by our employees called 'The Most Stealable Slackbot Prompts,'" Gavin said. "People just started adding to it organically. I think it's up to 250-plus prompts that are in this Canvas right now."
Kate Crotty, a principal UX researcher at Salesforce, found that 73% of internal adoption was driven by social sharing rather than top-down mandates. "Everybody is there to help each other learn and communicate hacks," she said.
How Slackbot transforms scattered enterprise data into executive-ready insights
During a product demonstration, Amy Bauer, Slack's product experience designer, showed how Slackbot can synthesize information across multiple sources. In one example, she asked Slackbot to analyze customer feedback from a pilot program, uploaded an image of a usage dashboard, and had Slackbot correlate the qualitative and quantitative data.
"This is where Slackbot really earns its keep for me," Bauer explained. "What it's doing is not just simply reading the image — it's actually looking at the image and comparing it to the insight it just generated for me."
Slackbot can then query Salesforce to find enterprise accounts with open deals that might be good candidates for early access, creating what Bauer called "a really great justification and plan to move forward." Finally, it can synthesize all that information into a Canvas — Slack's collaborative document format — and find calendar availability among stakeholders to schedule a review meeting.
"Up until this point, we have been working in a one-to-one capacity with Slackbot," Bauer said. "But one of the benefits that I can do now is take this insight and have it generate this into a Canvas, a shared workspace where I can iterate on it, refine it with Slackbot, or share it out with my team."
Rob Seaman, Slack's chief product officer, said the Canvas creation demonstrates where the product is heading: "This is making a tool call internally to Slack Canvas to actually write, effectively, a shared document. But it signals where we're going with Slackbot — we're eventually going to be adding in additional third-party tool calls."
MrBeast's company became a Slackbot guinea pig—and employees say they're saving 90 minutes a day
Among Salesforce's pilot customers is Beast Industries, the parent company of YouTube star MrBeast. Luis Madrigal, the company's chief information officer, joined the launch announcement to describe his experience.
"As somebody who has rolled out enterprise technologies for over two decades now, this was practically one of the easiest," Madrigal said. "The plumbing is there. Slack as an implementation, Enterprise Tools — being able to turn on the Slackbot and the Slack AI functionality was as simple as having my team go in, review, do a quick security review."
Madrigal said his security team signed off "rather quickly" — unusual for enterprise AI deployments — because Slackbot accesses only the information each individual user already has permission to view. "Given all the guardrails you guys have put into place for Slackbot to be unique and customized to only the information that each individual user has, only the conversations and the Slack rooms and Slack channels that they're part of—that made my security team sign off rather quickly."
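That permission model is the crux of the security story. As a rough illustration, the sketch below shows how permission-scoped retrieval can work in principle: every search result is filtered against the channels the requesting user already belongs to. The code is a hypothetical simplification, not Slack's implementation.

```python
# Hypothetical sketch of permission-scoped retrieval: results are filtered
# against the requesting user's channel memberships, so the assistant can
# never surface content its user couldn't already read.
from dataclasses import dataclass

@dataclass
class Message:
    channel_id: str
    text: str

def search_as_user(query: str, user_channels: set, index: list) -> list:
    """Return only matches from channels the user already belongs to."""
    return [
        msg for msg in index
        if msg.channel_id in user_channels and query.lower() in msg.text.lower()
    ]

# Usage: two users issuing the same query see different results.
index = [Message("C-exec", "Q3 pipeline review"), Message("C-all", "Q3 kickoff notes")]
print(search_as_user("Q3", {"C-all"}, index))            # public channel only
print(search_as_user("Q3", {"C-all", "C-exec"}, index))  # both channels
```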
One Beast Industries employee, Sinan, the head of Beast Games marketing, reported saving "at bare minimum, 90 minutes a day." Another employee, Spencer, a creative supervisor, described it as "an assistant who's paying attention when I'm not."
Other pilot customers include Slalom, reMarkable, Xero, Mercari, and Engine. Mollie Bodensteiner, SVP of Operations at Engine, called Slackbot "an absolute 'chaos tamer' for our team," estimating it saves her about 30 minutes daily "just by eliminating context switching."
Slackbot vs. Microsoft Copilot vs. Google Gemini: The fight for enterprise AI dominance
The launch puts Salesforce in direct competition with Microsoft's Copilot, which is integrated into Teams and the broader Microsoft 365 suite, as well as Google's Gemini integrations across Workspace. When asked what distinguishes Slackbot from these alternatives, Seaman pointed to context and convenience.
"The thing that makes it most powerful for our customers and users is the proximity — it's just right there in your Slack," Seaman said. "There's a tremendous convenience affordance that's naturally built into it."
The deeper advantage, executives argue, is that Slackbot already understands users' work without requiring setup or training. "Most AI tools sound the same no matter who is using them," the company's announcement stated. "They lack context, miss nuance, and force you to jump between tools to get anything done."
Harris put it more directly: "If you've ever had that magic experience with AI — I think ChatGPT is a great example, it's a great experience from a consumer perspective — Slackbot is really what we're doing in the enterprise, to be this employee super agent that is loved, just like people love using Slack."
Amy Bauer emphasized the frictionless nature of the experience. "Slackbot is inherently grounded in the context, in the data that you have in Slack," she said. "So as you continue working in Slack, Slackbot gets better because it's grounded in the work that you're doing there. There is no setup. There is no configuration for those end users."
Salesforce's ambitious plan to make Slackbot the one 'super agent' that controls all the others
Salesforce positions Slackbot as what Harris calls a "super agent" — a central hub that can eventually coordinate with other AI agents across an organization.
"Every corporation is going to have an employee super agent," Harris said. "Slackbot is essentially taking the magic of what Slack does. We think that Slackbot, and we're really excited about it, is going to be that."
The vision extends to third-party agents already launching in Slack. Last month, Anthropic released a preview of Claude Code for Slack, allowing developers to interact with Claude's coding capabilities directly in chat threads. OpenAI, Google, Vercel, and others have also built agents for the platform.
"Most of the net-new apps that are being deployed to Slack are agents," Seaman noted during the press conference. "This is proof of the promise of humans and agents coexisting and working together in Slack to solve problems."
Harris described a future where Slackbot becomes an MCP (Model Context Protocol) client, able to leverage tools from across the software ecosystem — similar to how the developer tool Cursor works. "Slack can be an MCP client, and Slackbot will be the hub of that, leveraging all these tools out in the world, some of which will be these amazing agents," he said.
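For readers who want to picture what an MCP client actually does, here is a minimal sketch using the open-source MCP Python SDK: connect to a server, discover its tools, then call one. The server command ("calendar-mcp-server") and the tool name ("find_availability") are invented for illustration; nothing here reflects Slackbot's actual internals.

```python
# Sketch of an MCP client in the role Harris describes: discover a server's
# tools, then invoke one. Uses the open-source MCP Python SDK; the server
# command and tool name are invented for illustration.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Hypothetical third-party MCP server exposing calendar tools.
    server = StdioServerParameters(command="calendar-mcp-server", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()            # discovery step
            print([tool.name for tool in tools.tools])
            result = await session.call_tool(             # invocation step
                "find_availability", {"attendees": ["amy", "rob"]}
            )
            print(result)

asyncio.run(main())
```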
But Harris also cautioned against over-promising on multi-agent coordination. "I still think we're in the single agent world," he said. "FY26 is going to be the year where we started to see more coordination. But we're going to do it with customer success in mind, and not demonstrate and talk about, like, 'I've got 1,000 agents working together,' because I think that's unrealistic."
Slackbot costs nothing extra, but Salesforce's data access fees could squeeze some customers
Slackbot is included at no additional cost for customers on Business+ and Enterprise+ plans. "There's no additional fees customers have to do," Gavin confirmed. "If they're on one of those plans, they're going to get Slackbot."
However, some enterprise customers may face other cost pressures tied to Salesforce's broader data strategy. CIOs may see price increases for third-party applications that work with Salesforce data as higher charges for API access ripple through the software supply chain.
Fivetran CEO George Fraser has warned that Salesforce's shift in pricing policy for API access could have tangible consequences for enterprises relying on Salesforce as a system of record. "They might not be able to use Fivetran to replicate their data to Snowflake and instead have to use Salesforce Data Cloud. Or they might find that they are not able to interact with their data via ChatGPT, and instead have to use Agentforce," Fraser said in a recent CIO report.
Salesforce has framed the pricing change as standard industry practice.
What Slackbot can do today, what's coming in weeks, and what's still on the roadmap
The new Slackbot begins rolling out today and will reach all eligible customers by the end of February. Mobile availability will complete by March 3, Bauer confirmed during her interview with VentureBeat.
Some capabilities remain works in progress. Calendar reading and availability checking are available at launch, but the ability to actually book meetings is "coming a few weeks after," according to Seaman. Image generation is not currently supported, though Bauer said it's "something that we are looking at in the future."
When asked about integration with competing CRM systems like HubSpot and Microsoft Dynamics, Salesforce representatives declined to provide specifics during the interview, though they acknowledged the question touched on key competitive differentiators.
Salesforce is betting the future of work looks like a chat window—and it's not alone
The Slackbot launch is Salesforce's bet that the future of enterprise work is conversational — that employees will increasingly prefer to interact with AI through natural language rather than navigating traditional software interfaces.
Harris described Slack's product philosophy using principles like "don't make me think" and "be a great host." The goal, he said, is for Slackbot to surface information proactively rather than requiring users to hunt for it.
"One of the revelations for me is LLMs applied to unstructured information are incredible," Harris said. "And the amount of value you have if you're a Slack user, if your corporation uses Slack — the amount of value in Slack is unbelievable. Because you're talking about work, you're sharing documents, you're making decisions, but you can't as a human go through that and really get the same value that an LLM can do."
Looking ahead, Harris expects the interfaces themselves to evolve beyond pure conversation. "We're kind of saturating what we can do with purely conversational UIs," he said. "I think we'll start to see agents building an interface that best suits your intent, as opposed to trying to surface something within a conversational interface that matches your intent."
Microsoft, Google, and a growing roster of AI startups are placing similar bets — that the winning enterprise AI will be the one embedded in the tools workers already use, not another application to learn. The race to become that invisible layer of workplace intelligence is now fully underway.
For Salesforce, the stakes extend beyond a single product launch. After a bruising year on Wall Street and persistent questions about whether AI threatens its core business, the company is wagering that Slackbot can prove the opposite — that having tens of millions of people already chatting in Slack every day is not a vulnerability, but an unassailable advantage.
Haley Gault, a Salesforce account executive in Pittsburgh who stumbled upon the new Slackbot on a snowy morning, captured the shift in a single sentence: "I honestly can't imagine working for another company not having access to these types of tools. This is just how I work now."
That's precisely what Salesforce is counting on.
Anthropic released Cowork on Monday, a new AI agent capability that extends the power of its wildly successful Claude Code tool to non-technical users — and according to company insiders, the team built the entire feature in approximately a week and a half, largely using Claude Code itself.
The launch marks a major inflection point in the race to deliver practical AI agents to mainstream users, positioning Anthropic to compete not just with OpenAI and Google in conversational AI, but with Microsoft's Copilot in the burgeoning market for AI-powered productivity tools.
"Cowork lets you complete non-technical tasks much like how developers use Claude Code," the company announced via its official Claude account on X. The feature arrives as a research preview available exclusively to Claude Max subscribers — Anthropic's power-user tier priced between $100 and $200 per month — through the macOS desktop application.
For the past year, the industry narrative has focused on large language models that can write poetry or debug code. With Cowork, Anthropic is betting that the real enterprise value lies in an AI that can open a folder, read a messy pile of receipts, and generate a structured expense report without human hand-holding.
How developers using a coding tool for vacation research inspired Anthropic's latest product
The genesis of Cowork lies in Anthropic's recent success with the developer community. In early 2025, the company released Claude Code, a terminal-based tool that allowed software engineers to automate rote programming tasks. The tool was a hit, but Anthropic noticed a peculiar trend: users were forcing the coding tool to perform non-coding labor.
According to Boris Cherny, an engineer at Anthropic, the company observed users deploying the developer tool for an unexpectedly diverse array of tasks.
"Since we launched Claude Code, we saw people using it for all sorts of non-coding work: doing vacation research, building slide decks, cleaning up your email, cancelling subscriptions, recovering wedding photos from a hard drive, monitoring plant growth, controlling your oven," Cherny wrote on X. "These use cases are diverse and surprising — the reason is that the underlying Claude Agent is the best agent, and Opus 4.5 is the best model."
Recognizing this shadow usage, Anthropic effectively stripped the command-line complexity from their developer tool to create a consumer-friendly interface. In its blog post announcing the feature, Anthropic explained that developers "quickly began using it for almost everything else," which "prompted us to build Cowork: a simpler way for anyone — not just developers — to work with Claude in the very same way."
Inside the folder-based architecture that lets Claude read, edit, and create files on your computer
Unlike a standard chat interface where a user pastes text for analysis, Cowork requires a different level of trust and access. Users designate a specific folder on their local machine that Claude can access. Within that sandbox, the AI agent can read existing files, modify them, or create entirely new ones.
Anthropic offers several illustrative examples: reorganizing a cluttered downloads folder by sorting and intelligently renaming each file, generating a spreadsheet of expenses from a collection of receipt screenshots, or drafting a report from scattered notes across multiple documents.
"In Cowork, you give Claude access to a folder on your computer. Claude can then read, edit, or create files in that folder," the company explained on X. "Try it to create a spreadsheet from a pile of screenshots, or produce a first draft from scattered notes."
The architecture relies on what is known as an "agentic loop." When a user assigns a task, the AI does not merely generate a text response. Instead, it formulates a plan, executes steps in parallel, checks its own work, and asks for clarification if it hits a roadblock. Users can queue multiple tasks and let Claude process them simultaneously — a workflow Anthropic describes as feeling "much less like a back-and-forth and much more like leaving messages for a coworker."
The system is built on Anthropic's Claude Agent SDK, meaning it shares the same underlying architecture as Claude Code. Anthropic notes that Cowork "can take on many of the same tasks that Claude Code can handle, but in a more approachable form for non-coding tasks."
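To make the agentic loop concrete, here is a toy sketch of the plan-execute-verify pattern described above. Everything in it is hypothetical: `call_model` stands in for a real LLM API and returns canned output so the example runs, and the step schema is invented rather than Anthropic's actual design.

```python
# Minimal sketch of the plan-execute-verify "agentic loop" described above.
# `call_model` is a stand-in for any LLM API; the step schema is invented.
import json

def call_model(prompt: str) -> str:
    # Returns a canned plan so the sketch runs without a real model.
    if prompt.startswith("Plan"):
        return json.dumps({"steps": [{"action": "read receipts folder"},
                                     {"action": "build expense table"}]})
    return f"done: {prompt[:40]}"

def run_task(task: str, workspace: str) -> str:
    plan = json.loads(call_model(f"Plan steps as JSON for: {task}"))
    results = []
    for step in plan["steps"]:
        # Execute each step, keeping intermediate results for the final check.
        results.append(call_model(f"Execute in {workspace}: {step['action']}"))
    # Self-check pass before reporting back, as the loop above describes.
    return call_model(f"Verify these results: {json.dumps(results)}")

print(run_task("turn receipt screenshots into a spreadsheet", "~/Expenses"))
```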
The recursive loop where AI builds AI: Claude Code reportedly wrote much of Claude Cowork
Perhaps the most remarkable detail surrounding Cowork's launch is the speed at which the tool was reportedly built — highlighting a recursive feedback loop where AI tools are being used to build better AI tools.
During a livestream hosted by Dan Shipper, Felix Rieseberg, an Anthropic employee, confirmed that the team built Cowork in approximately a week and a half.
Alex Volkov, who covers AI developments, expressed surprise at the timeline: "Holy shit Anthropic built 'Cowork' in the last... week and a half?!"
This prompted immediate speculation about how much of Cowork was itself built by Claude Code. Simon Smith, EVP of Generative AI at Klick Health, put it bluntly on X: "Claude Code wrote all of Claude Cowork. Can we all agree that we're in at least somewhat of a recursive improvement loop here?"
The implication is profound: Anthropic's AI coding agent may have substantially contributed to building its own non-technical sibling product. If true, this is one of the most visible examples yet of AI systems being used to accelerate their own development and expansion — a strategy that could widen the gap between AI labs that successfully deploy their own agents internally and those that do not.
Connectors, browser automation, and skills extend Cowork's reach beyond the local file system
Cowork doesn't operate in isolation. The feature integrates with Anthropic's existing ecosystem of connectors — tools that link Claude to external information sources and services such as Asana, Notion, PayPal, and other supported partners. Users who have configured these connections in the standard Claude interface can leverage them within Cowork sessions.
Additionally, Cowork can pair with Claude in Chrome, Anthropic's browser extension, to execute tasks requiring web access. This combination allows the agent to navigate websites, click buttons, fill forms, and extract information from the internet — all while operating from the desktop application.
"Cowork includes a number of novel UX and safety features that we think make the product really special," Cherny explained, highlighting "a built-in VM [virtual machine] for isolation, out of the box support for browser automation, support for all your claude.ai data connectors, asking you for clarification when it's unsure."
Anthropic has also introduced an initial set of "skills" specifically designed for Cowork that enhance Claude's ability to create documents, presentations, and other files. These build on the Skills for Claude framework the company announced in October, which provides specialized instruction sets Claude can load for particular types of tasks.
Why Anthropic is warning users that its own AI agent could delete their files
The transition from a chatbot that suggests edits to an agent that makes edits introduces significant risk. An AI that can organize files can, theoretically, delete them.
In a notable display of transparency, Anthropic devoted considerable space in its announcement to warning users about Cowork's potential dangers — an unusual approach for a product launch.
The company explicitly acknowledges that Claude "can take potentially destructive actions (such as deleting local files) if it's instructed to." Because Claude might occasionally misinterpret instructions, Anthropic urges users to provide "very clear guidance" about sensitive operations.
More concerning is the risk of prompt injection attacks — a technique where malicious actors embed hidden instructions in content Claude might encounter online, potentially causing the agent to bypass safeguards or take harmful actions.
"We've built sophisticated defenses against prompt injections," Anthropic wrote, "but agent safety — that is, the task of securing Claude's real-world actions — is still an active area of development in the industry."
The company characterized these risks as inherent to the current state of AI agent technology rather than unique to Cowork. "These risks aren't new with Cowork, but it might be the first time you're using a more advanced tool that moves beyond a simple conversation," the announcement notes.
Anthropic's desktop agent strategy sets up a direct challenge to Microsoft Copilot
The launch of Cowork places Anthropic in direct competition with Microsoft, which has spent years attempting to integrate its Copilot AI into the fabric of the Windows operating system with mixed adoption results.
However, Anthropic's approach differs in its emphasis on isolation. By confining the agent to specific folders and requiring explicit connectors, the company is attempting to strike a balance between the utility of an OS-level agent and the security of a sandboxed application.
What distinguishes Anthropic's approach is its bottom-up evolution. Rather than designing an AI assistant and retrofitting agent capabilities, Anthropic built a powerful coding agent first — Claude Code — and is now abstracting its capabilities for broader audiences. This technical lineage may give Cowork more robust agentic behavior from the start.
Claude Code has generated significant enthusiasm among developers since its initial launch as a command-line tool in early 2025. The company expanded access with a web interface in October 2025, followed by a Slack integration in December. Cowork is the next logical step: bringing the same agentic architecture to users who may never touch a terminal.
Who can access Cowork now, and what's coming next for Windows and other platforms
For now, Cowork remains exclusive to Claude Max subscribers using the macOS desktop application. Users on other subscription tiers — Free, Pro, Team, or Enterprise — can join a waitlist for future access.
Anthropic has signaled clear intentions to expand the feature's reach. The blog post explicitly mentions plans to add cross-device sync and bring Cowork to Windows as the company learns from the research preview.
Cherny set expectations appropriately, describing the product as "early and raw, similar to what Claude Code felt like when it first launched."
To access Cowork, Max subscribers can download or update the Claude macOS app and click on "Cowork" in the sidebar.
The real question facing enterprise AI adoption
For technical decision-makers, the implications of Cowork extend beyond any single product launch. The bottleneck for AI adoption is shifting — no longer is model intelligence the limiting factor, but rather workflow integration and user trust.
Anthropic's goal, as the company puts it, is to make working with Claude feel less like operating a tool and more like delegating to a colleague. Whether mainstream users are ready to hand over folder access to an AI that might misinterpret their instructions remains an open question.
But the speed of Cowork's development — a major feature built in ten days, possibly by the company's own AI — previews a future where the capabilities of these systems compound faster than organizations can evaluate them.
The chatbot has learned to use a file manager. What it learns to use next is anyone's guess.
Microsoft is fundamentally restructuring its Windows operating system to become what executives call the first "agentic OS," embedding the infrastructure needed for autonomous AI agents to operate securely at enterprise scale — a watershed moment in the evolution of personal computing that positions the 40-year-old platform as the foundation for a new era of human-machine collaboration.
The company announced Tuesday at its Ignite conference that it is introducing native agent infrastructure directly into Windows 11, allowing AI agents — autonomous software programs that can perform complex, multi-step tasks on behalf of users — to discover tools, execute workflows, and interact with applications through standardized protocols while operating in secure, policy-controlled environments separate from user sessions.
The shift is Microsoft's most significant architectural evolution of Windows since the introduction of the modern security model, transforming the operating system from a platform where users manually orchestrate applications into one where they can "simply express your desired outcome, and agents handle the complexity," according to Pavan Davuluri, President of Windows & Devices at Microsoft.
"Windows 11 starts with this notion of secure by design, secure by default," Davuluri said in an exclusive interview with VentureBeat. "And a lot of the work that we're doing today, when we think about the engagement we have with our customers, the expectations they have with us is making sure we are building upon the fact that Windows is the most secure platform for them and is the most resilient platform as well."
The announcements arrive as enterprises are experimenting with AI agents but struggling with fragmented tooling, security concerns, and lack of centralized management — challenges that Microsoft believes only operating system-level integration can solve. The stakes are enormous: with Windows running on an estimated 1.4 billion devices globally, Microsoft's architectural choices will likely shape how organizations deploy autonomous AI systems for years to come.
New platform primitives create foundation for agent computing
At the core of Microsoft's vision are three new platform capabilities entering preview that fundamentally change how agents operate on Windows. Agent Connectors provide native support for the Model Context Protocol (MCP), an open standard introduced by Anthropic that allows AI agents to connect with external tools and data sources. Microsoft has built what it calls an "on-device registry" — a secure, manageable repository where developers can register their applications' capabilities as agent connectors, making them discoverable to any compatible agent on the system.
"These are platform capabilities that then become available to all of our customers," Davuluri explained, describing how the Windows file system, for example, becomes an agent connector that any MCP-compatible agent can access with user consent. "We're able to do this in a fashion that can scale for one but it also allows others to participate in the Windows registry for MCP."
The architecture introduces an MCP proxy layer that handles authentication, authorization, and auditing for all communication between agents and connectors. Microsoft is launching with two built-in agent connectors for File Explorer and System Settings, allowing agents to manage files or adjust system configurations like switching between light and dark mode — all with explicit user permission.
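Microsoft has not published the registry's schema, but the hedged sketch below suggests what a connector registration might declare, based on the requirements described in this article: trusted signing, minimum declared capabilities, mandatory user consent, and per-invocation auditing. All field names are invented.

```python
# Invented illustration of an agent-connector registration; Microsoft has
# not published the on-device registry's actual schema.
connector_manifest = {
    "id": "com.example.files-connector",
    "protocol": "mcp",                      # exposed via Model Context Protocol
    "signed_by": "Example Corp (trusted certificate)",
    "declared_capabilities": [              # minimum capabilities only,
        "files.read", "files.move",         # per the secure-by-default rules
    ],
    "requires_user_consent": True,          # explicit approval on first access
    "audit": "per-invocation",              # the MCP proxy records every call
}
```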
Agent Workspace, entering private preview, represents perhaps the most significant security innovation. It creates what Microsoft describes as "a contained, policy-controlled, and auditable environment where agents can interact with software" — essentially a parallel desktop session where agents operate with their own distinct identity, completely separate from the user's primary session.
"We want to be able to have clarity in the identity of the agent that is operating in the local operating system," Davuluri said, addressing security concerns about agents accessing sensitive data. "We want that session to be a session that is secure, that is policy control, that is manageable, that has transparency and auditability."
Each agent workspace runs with minimal privileges by default, accessing only explicitly granted resources. The system maintains detailed audit logs distinguishing agent actions from user actions — critical for enterprises that need to prove compliance and track all changes to systems and data.
Windows 365 for Agents extends this infrastructure to the cloud, turning Microsoft's Cloud PC offering into execution environments for agents. Instead of running on local devices, agents can operate in secure, policy-controlled virtual machines in Azure, enabling what Microsoft calls "computer-using agents" to interact with legacy applications and perform automation tasks at scale without consuming local compute resources.
Taskbar becomes command center for monitoring AI agents at work
The infrastructure enables significant user interface changes designed to make agents as commonplace as applications. Microsoft is introducing "Ask Copilot on the taskbar," a unified entry point in preview that combines Microsoft 365 Copilot, agent invocation, and traditional search in a single interface.
Users will be able to invoke agents using "@" mentions directly from the taskbar, then monitor their progress through familiar UI patterns like hover cards, progress badges, and notifications — all while continuing other work. When an agent completes a task or needs input, it surfaces updates through the taskbar without disrupting the user's primary workflow.
"We've evolved and created new UX in the taskbar to reflect the unique needs of agents performing background tasks on your behalf," said Navjot Virk, Corporate Vice President of Windows Experiences, describing features like progress bars and status badges that indicate when agents are working, need approval, or have completed tasks.
The design philosophy, Virk emphasized, centers on user control. "These experiences are designed to be opt in. We want to give customers full control over when and how they engage with copilots and agents."
For commercial Microsoft 365 Copilot users, the integration goes deeper. Microsoft is embedding Copilot directly into File Explorer, allowing users to ask questions, generate summaries, or draft emails based on document contents without leaving the file management interface. On Copilot+ PCs — devices with neural processing units capable of 40 trillion operations per second — new capabilities include converting any on-screen table into an Excel spreadsheet through the Click to Do feature.
Microsoft bets on open standards against Apple and Google's proprietary approaches
Microsoft's embrace of the open Model Context Protocol, created by Anthropic, marks a strategic bet on openness as enterprises evaluate competing AI platforms from Apple and Google that use proprietary frameworks.
"Windows is an open platform, and by virtue [of being] an open platform, we certainly have the ability to take existing technologies, evolve, harden, adapt those, but we also allow customers to bring their own capabilities to the platform as well," Davuluri said when asked about competing with Apple Intelligence and Google's Android AI for Enterprise.
The company demonstrated this openness with Claude, Anthropic's AI assistant, accessing the Windows file system through agent connectors with user consent — one of numerous partnerships Microsoft has secured. Dynamics 365 is using the File Explorer connector to streamline expense reporting, reducing what was previously a 30-minute, dozen-step process to "one sentence with high accuracy," according to Microsoft's blog post. Other early partners include Manus AI, Dropbox Dash, Roboflow, and Infosys.
"Windows is the platform in which they build upon," Davuluri said of enterprise customers. "And so our ability to take those existing bodies of work they have, and extend them is the, I think, the least friction way for them to go, learn, adopt, experiment and find ways to [scale]."
Security model enforces strict containment and mandatory user consent
Microsoft's security model for agents adheres to what it calls "secure by default" policies aligned with the company's broader Secure Future Initiative. All agent connectors registered in the on-device registry must meet strict requirements around packaging and identity, with applications properly packaged and signed by trusted sources. Developers must explicitly declare the minimum capabilities their agent connectors require, and agents and connectors run in isolated environments with dedicated agent user accounts, separate from human user accounts. Windows requires explicit user approval when agents first access sensitive resources like files or system settings.
"We give Windows the ability to go deliver on the security expectations, and then it is auditable at the end of the day," Davuluri said. "You still want an auditability log that looks similar to perhaps what you use in the cloud. And so all three pieces are built into the design and architecture of Agent Workspace."
For IT administrators, Microsoft is introducing management policies through Intune and Group Policy that allow organizations to enable or disable agent features at device and account levels, set minimum security policy levels, and access event logs enumerating all agent connector invocations and errors. The company emphasized that agents operate with restricted privileges, with minimal permissions by default and access granted only to explicitly approved resources that users can revoke at any time.
Post-quantum cryptography and recovery tools address emerging and persistent threats
Beyond agent infrastructure, Microsoft announced significant security and resilience updates addressing both emerging and persistent enterprise challenges. Post-Quantum Cryptography APIs are now generally available in Windows, allowing organizations to begin migrating to encryption algorithms designed to withstand future quantum computing attacks that could break today's cryptographic standards. Microsoft worked closely with the National Institute of Standards and Technology to implement these algorithms.
"We are introducing post quantum cryptography APIs in Windows," Davuluri said. "For customers who want to be able to do cryptographic encryption in their workloads, they can start taking advantage of these APIs in Windows for the first time. That is a huge step forward for us when we think about the future of windows."
Hardware-accelerated BitLocker will arrive on new devices starting spring 2026, offloading disk encryption to dedicated silicon for faster performance while providing hardware-level key protection. Sysmon functionality is becoming generally available as part of Windows in early 2026, bringing advanced forensics and threat detection capabilities previously available only as a separate download directly into the operating system's event logging system.
The company also detailed progress on its Windows Resiliency Initiative, launched a year ago following the CrowdStrike incident that disrupted 8.5 million Windows devices globally. New recovery capabilities include Quick Machine Recovery with expanded networking support and Autopatch management, allowing IT to remotely fix devices stuck in Windows Recovery Environment. Point-in-time restore entering preview rolls back devices to earlier states to resolve update conflicts or configuration errors, while Cloud rebuild in preview allows IT to remotely rebuild malfunctioning devices by downloading fresh installation media and using Autopilot for zero-touch provisioning.
Microsoft is also raising security requirements for third-party drivers across the Windows ecosystem. Following updated requirements for antivirus drivers effective April 1, 2025, the company is expanding this approach to other driver classes including networking, cameras, USB, printers, and storage — requiring higher certification standards, adding compiler safeguards, and providing more Windows in-box drivers to reduce reliance on third-party kernel-mode code.
Measured rollout reflects enterprise caution around autonomous software
Microsoft is positioning these updates as essential infrastructure for what it calls "Frontier Firms" — organizations that "blend human ingenuity with intelligent systems to deliver real outcomes." However, the company emphasized a cautious, opt-in approach that reflects enterprise concerns about autonomous software agents.
"The principles we're using in designing these new platform capabilities accounts for the reality that we have a very, very broad user base," Davuluri said. "A lot of the features and capabilities we're building are opt in capabilities. And so it is our goal to be able to have users find value in the workflow and meet them."
Virk emphasized the measured approach: "This is more about meeting customers where they are and then taking them on this journey when they are ready. So there's the optionality, but also having support for it. And really important thing is that they should feel comfortable. They should feel secure."
Microsoft's bet is that only operating system-level integration can provide the security, governance, and user experience required for mainstream AI agent adoption. Whether that vision materializes will depend on developer adoption, enterprise comfort with autonomous software, and Microsoft's ability to balance innovation with the stability that 40 years of Windows customers expect. After four decades of putting users in control of their computers, Windows is now asking them to share that control with machines.
Writer, a San Francisco-based artificial intelligence startup, is launching a unified AI agent platform designed to let any employee automate complex business workflows without writing code — a capability the company says distinguishes it from consumer-oriented tools like Microsoft Copilot and ChatGPT.
The platform, called Writer Agent, combines chat-based assistance with autonomous task execution in a single interface. Starting Tuesday, enterprise customers can use natural language to instruct the AI to create presentations, analyze financial data, generate marketing campaigns, or coordinate across multiple business systems like Salesforce, Slack, and Google Workspace—then save those workflows as reusable "Playbooks" that run automatically on schedules.
The announcement comes as enterprises struggle to move AI initiatives beyond pilot programs into production at scale. Writer CEO May Habib has been outspoken about this challenge, recently revealing that 42% of Fortune 500 executives surveyed by her company said AI is "tearing their company apart" due to coordination failures between departments.
"We're delivering an agent interface that is both incredibly powerful and radically simple to transform individual productivity into organizational impact," Habib said in a statement. "Writer Agent is the difference between a single sales rep asking a chatbot to write an outreach email and an enterprise ensuring that 1,000 reps are all sending on-brand, compliant, and contextually-aware messages to target accounts."
How Writer is putting workflow automation in the hands of non-technical workers
The platform's core innovation centers on making workflow automation accessible to non-technical employees—what Writer executives call "democratizing who gets to be a builder."
In an exclusive interview with VentureBeat, Doris Jwo, Writer's director of product management, demonstrated how the system works: A user types a request in plain English — for example, "Create a two-page partnership proposal between [Company A] and [Company B], make it a branded deck, include impact metrics and partnership tiers."
The AI agent then breaks down that request into discrete steps, conducts web research, generates graphics and charts on the fly, creates individual slides with sourced information, and assembles a complete presentation. The entire process, which might take an employee hours or days, can be completed in 10-12 minutes.
"The agent basically looks at the request, breaks it down, does research, understands what pieces it needs, creates a detailed plan at a step-by-step level," Jwo explained during a product demonstration. "It might say, 'I need to do web research,' or 'This user needs information from Gong or Slack,' and it reaches out to those connectors, grabs the data, and executes the plan."
Crucially, users can save these multi-step processes as Playbooks—reusable templates that colleagues can deploy with a single click. Routines allow those Playbooks to run automatically at scheduled intervals, essentially putting knowledge work "on autopilot."
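Writer has not published the internal format of Playbooks or Routines, but conceptually they amount to a saved multi-step plan plus a schedule. The sketch below is an invented illustration of that idea, with a cron-style schedule standing in for whatever mechanism Writer actually uses.

```python
# Invented illustration of a saved Playbook plus a scheduled Routine;
# Writer's internal schema is not public, so every field name is hypothetical.
playbook = {
    "name": "Partnership proposal deck",
    "inputs": ["company_a", "company_b"],        # filled in by whoever runs it
    "steps": [
        {"action": "web_research", "topic": "{{company_b}} partnerships"},
        {"action": "generate_charts", "source": "impact metrics"},
        {"action": "assemble_deck", "template": "branded", "pages": 2},
    ],
}

routine = {
    "playbook": playbook["name"],
    "schedule": "0 10 * * *",                    # cron syntax: daily at 10:00 a.m.
    "deliver_to": "slack:#partnerships",
}
```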
Security and compliance controls: Writer's answer to enterprise IT concerns
Writer positions these enterprise-focused controls as a key differentiator from competitors. While Microsoft, OpenAI, and Anthropic offer powerful AI capabilities, Writer's executives argue those tools weren't designed from the ground up for the security, compliance, and governance requirements of large regulated organizations.
"All of the products you mentioned are great products, but even Copilot is very much focused on personal productivity—summarizing email, for example, which is important, but that's not the component we're focusing on," said Matan-Paul Shetrit, Writer's director of product management, in an exclusive interview with VentureBeat.
Shetrit emphasized Writer's "trust, security, and interoperability" approach. IT administrators can granularly control what the AI can access — for instance, preventing market research agents from mentioning competitors, or restricting which employees can use web search capabilities. All activity is logged with detailed audit trails showing exactly what data the agent touched and what actions it took.
"These fine-grained controls are what make products enterprise-ready," Shetrit said. "We can deploy to tens of thousands or hundreds of thousands of employees while maintaining the security and guardrails you need for that scale."
This architecture reflects Writer's origin story. Unlike OpenAI or Anthropic, which started as research labs and later added enterprise offerings, Writer has targeted Fortune 500 companies since its 2020 founding. "We're not a research lab that went to consumer and is dabbling in enterprise," Shetrit said. "We are first and foremost targeting the Global 2000 and Fortune 500, and our research is in service of these customers' needs."
Inside Writer's strategy to connect AI agents across enterprise software systems
A critical technical component is Writer's approach to system integrations. The platform includes pre-built connectors to more than a dozen enterprise applications—Google Workspace, Microsoft 365, Snowflake, Asana, Slack, Gong, HubSpot, Atlassian, Databricks, PitchBook, and FactSet—allowing the AI to retrieve information and take actions across those systems.
Writer built these connectors using the Model Context Protocol (MCP), an emerging standard for AI system integrations, but added what Shetrit described as an "enterprise-ready" layer on top.
"We took a first-principle approach of: You have this MCP connector infrastructure—how do you build it in a way that's enterprise-ready?" Shetrit explained. "What we have today in the industry is definitely not it."
The system can write and execute code on the fly to handle unexpected scenarios. If a user uploads an unfamiliar file format, for instance, the agent will generate code to extract and process the text without requiring a human to intervene.
Jwo demonstrated this capability with a daily workflow she runs: Every morning at 10 a.m., a Routine automatically summarizes her Google Calendar meetings, identifies external participants, finds their LinkedIn profiles, and sends the summary to her via Slack — all without her involvement.
"This was pretty simple, but you can imagine for a salesperson it might say, 'At the end of the day, wrap up a summary of all the calls I had, send me action items, post it to the account-specific Slack channel, and tag these folks so they can accomplish those workflows,'" Jwo said. "That can run continuously each day, each week, or on demand."
From mortgage lenders to CPG brands: Real-world AI agent use cases across industries
The platform is attracting customers across multiple industries. New American Funding, a mortgage lender, uses Writer Agent to automate marketing workflows. Senior Content Marketing Manager Karen Rodriguez uploads Asana project tickets with creative briefs, and the AI executes tasks like updating email campaigns or transforming articles into social media carousels, video scripts, and captions.
Other use cases span financial services teams creating investment dashboards with PitchBook and FactSet data, consumer packaged goods companies brainstorming new product lines based on social media trends, and marketing teams generating partnership presentations with branded assets.
Writer has added customers including TikTok, Comcast, Keurig Dr Pepper, CAA, and Aptitude Health, joining an existing base that includes Accenture, Qualcomm, Uber, Vanguard, and Marriott. The company now serves more than 300 enterprises and has secured over $50 million in signed contracts, with projections to double that to $100 million this year.
The startup's net retention rate — a measure of how much existing customers expand their usage — stands at 160%, meaning customers on average increase their spending by 60% after initial contracts. Twenty customers who started with $200,000-$300,000 contracts now spend about $1 million annually, according to company data.
'Vibe working': Writer's vision for AI-powered productivity beyond coding
Writer executives frame the platform as enabling what they call "vibe working" — a playful reference to the popular term "vibe coding," which describes AI tools like Cursor that dramatically accelerate software development.
"We used to call it transformation when we took 12 steps and made them nine. That's optimizing the world as it is," Habib said at Writer's AI Leaders Forum earlier this month, according to Forbes. "We can now create a new world. That is the greenfield mindset."
Shetrit echoed this framing: "Vibe coding is the theme of 2025. Our view is that ‘vibe working’ is the theme of 2026. How do you bring the same productivity gains you've seen with coding agents into the workspace in a way that non-technical users can maximize them?"
The platform is powered by Palmyra X5, Writer's proprietary large language model featuring a one-million-token context window — among the largest commercially available. Writer trained the model for approximately $700,000, a fraction of the estimated $100 million OpenAI spent on GPT-4, by using synthetic data and techniques that halt training when returns diminish.
The model can process one million tokens in about 22 seconds and costs 60 cents per million input tokens and $6 per million output tokens — significantly cheaper than comparable offerings, according to company specifications.
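Those per-token prices make back-of-envelope cost estimates straightforward. The snippet below uses only the figures quoted above to price a single request that fills the full one-million-token context window and returns a 5,000-token answer.

```python
# Worked example using the per-token prices quoted above: $0.60 per million
# input tokens and $6 per million output tokens.
INPUT_PER_M, OUTPUT_PER_M = 0.60, 6.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# A full one-million-token context window plus a 5,000-token answer:
print(f"${request_cost(1_000_000, 5_000):.2f}")  # $0.60 + $0.03 = $0.63
```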
Making AI decisions visible: Writer's approach to trust and transparency
A distinctive aspect of Writer's approach is transparency into the AI's decision-making process. The interface displays the agent's step-by-step reasoning, showing which data sources it accessed, what code it generated, and how it arrived at outputs.
"There's a very clear exhibition of how the agent is thinking, what it's doing, what it's touching," Shetrit said. "This is important for the end user to trust it, but also important for the IT person or security professional to see what's going on."
This "supervision" model goes beyond simple observability of API calls to encompass what Shetrit described as "a superset of observability" — giving organizations the ability to not just monitor but control AI behavior through policies and permissions.
Session logs capture all agent activity when enabled by administrators, and users can submit feedback on every output to help improve system performance. The platform also emphasizes providing sources and citations for generated content, allowing users to verify information.
"With any sort of chat assistant, agentic or not, trust but verify is really important," Jwo said. "That's part of the pillars of us building this and making it enterprise-grade."
What Writer Agent costs—and why it's included in the base platform
Writer is including all the new capabilities—Playbooks, Routines, Connectors, and Personality customization—as part of its core platform without additional charges, according to Jwo.
"This is fully included as part of the Writer platform," she said. "We're not charging additional for using Writer Agent."
The "Personality" feature allows individual users, teams, or entire organizations to customize the AI's communication style, ensuring generated content matches brand voice and tone guidelines. This works alongside company-level controls that enforce terminology and style requirements.
For highly structured, repetitive tasks, Writer also offers a library of more than 100 pre-built agents and an AI Studio for building custom multi-agent systems aligned with specific business use cases.
The race to define enterprise AI: Can purpose-built platforms beat tech giants?
The launch crystallizes a fundamental tension in how enterprises will adopt AI at scale. While consumer-facing AI tools emphasize individual productivity gains, companies need systems that work reliably across thousands of employees, integrate with existing software infrastructure, maintain regulatory compliance, and deliver measurable business impact.
Writer's wager is that these requirements demand purpose-built enterprise platforms rather than consumer tools adapted for business use. The company's $1.9 billion valuation — achieved in a November 2024 funding round that raised $200 million — suggests investors see merit in this thesis. Backers include Premji Invest, Radical Ventures, ICONIQ Growth, Salesforce Ventures, and Adobe Ventures.
Yet the competitive landscape remains formidable. Microsoft and Google command enormous distribution advantages through their existing enterprise software relationships. OpenAI and Anthropic possess research capabilities that have produced breakthrough models. Whether Writer can maintain its differentiation as these giants expand their enterprise offerings will test the startup's core premise: that serving Fortune 500 companies from day one creates advantages that research labs turned enterprise vendors cannot easily replicate.
"We're entering an era where if you can describe a better way to work, you can build it," Jwo said. "The new Writer Agent democratizes who gets to be a builder, empowering the operational experts and creative problem-solvers in every department to become the architects of their own transformation. That's how you unlock innovation that competitors can't replicate."
The promise is alluring — AI capabilities powerful enough to transform how work gets done, accessible enough for any employee to use, and controlled enough for enterprises to deploy safely at scale. Whether Writer can deliver on that promise at the speed and scale required will determine if its vision of "vibe working" becomes the 2026 theme Shetrit predicts, or just another ambitious attempt to solve enterprise AI's execution problem.
But one thing is certain: In a market where 85% of AI initiatives fail to escape pilot purgatory, Writer is betting that the winners won't be the companies with the most powerful models—they'll be the ones that make those models actually work inside the enterprise.
Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking research released Thursday by Upwork, the largest online work marketplace.
But the same study reveals a more promising path forward: When AI agents collaborate with human experts, project completion rates surge by up to 70%, suggesting the future of work may not pit humans against machines but rather pair them together in powerful new ways.
The findings, drawn from more than 300 real client projects posted to Upwork's platform, mark the first systematic evaluation of how human expertise amplifies AI agent performance in actual professional work — not synthetic tests or academic simulations. The research challenges both the hype around fully autonomous AI agents and fears that such technology will imminently replace knowledge workers.
"AI agents aren't that agentic, meaning they aren't that good," Andrew Rabinovich, Upwork's chief technology officer and head of AI and machine learning, said in an exclusive interview with VentureBeat. "However, when paired with expert human professionals, project completion rates improve dramatically, supporting our firm belief that the future of work will be defined by humans and AI collaborating to get more work done, with human intuition and domain expertise playing a critical role."
How AI agents performed on 300+ real freelance jobs—and why they struggled
Upwork's Human+Agent Productivity Index (HAPI) evaluated how three leading AI systems — Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Sonnet 4 — performed on actual jobs posted by paying clients across categories including writing, data science, web development, engineering, sales, and translation.
Critically, Upwork deliberately selected simple, well-defined projects where AI agents stood a reasonable chance of success. These jobs, priced under $500, represent less than 6% of Upwork's total gross services volume — a tiny fraction of the platform's overall business and an acknowledgment of current AI limitations.
"The reality is that although we study AI, and I've been doing this for 25 years, and we see significant breakthroughs, the reality is that these agents aren't that agentic," Rabinovich told VentureBeat. "So if we go up the value chain, the problems become so much more difficult, then we don't think they can solve them at all, even to scratch the surface. So we specifically chose simpler tasks that would give an agent some kind of traction."
Even on these deliberately simplified tasks, AI agents working independently struggled. But when expert freelancers provided feedback — spending an average of just 20 minutes per review cycle — the agents' performance improved substantially with each iteration.
20 minutes of human feedback boosted AI completion rates up to 70%
The research reveals stark differences in how AI agents perform with and without human guidance across different types of work. For data science and analytics projects, Claude Sonnet 4 achieved a 64% completion rate working alone but jumped to 93% after receiving feedback from a human expert. In sales and marketing work, Gemini 2.5 Pro's completion rate rose from 17% independently to 31% with human input. OpenAI's GPT-5 showed similarly dramatic improvements in engineering and architecture tasks, climbing from 30% to 50% completion.
The pattern held across virtually all categories, with agents responding particularly well to human feedback on qualitative, creative work requiring editorial judgment — areas like writing, translation, and marketing — where completion rates increased by up to 17 percentage points per feedback cycle.
The finding challenges a fundamental assumption in the AI industry: that agent benchmarks conducted in isolation accurately predict real-world performance.
"While we show that in the tasks that we have selected for agents to perform in isolation, they perform similarly to the previous results that we've seen published openly, what we've shown is that in collaboration with humans, the performance of these agents improves surprisingly well," Rabinovich said. "It's not just a one-turn back and forth, but the more feedback the human provides, the better the agent gets at performing."
Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'
The research arrives as the AI industry grapples with a measurement crisis. Traditional benchmarks — standardized tests that AI models can master, sometimes scoring perfectly on SAT exams or mathematics olympiads — have proven poor predictors of real-world capability.
"With advances of large language models, what we're now seeing is that these static, academic datasets are completely saturated," Rabinovich said. "So you could get a perfect score in the SAT test or LSAT or any of the math olympiads, and then you would ask ChatGPT how many R's there are in the word strawberry, and it would get it wrong."
This phenomenon — where AI systems ace formal tests but stumble on trivial real-world questions — has led to growing skepticism about AI capabilities, even as companies race to deploy autonomous agents. Several recent benchmarks from other firms have tested AI agents on Upwork jobs, but those evaluations measured only isolated performance, not the collaborative potential that Upwork's research reveals.
"We wanted to evaluate the quality of these agents on actual real work with economic value associated with it, and not only see how well these agents do, but also see how these agents do in collaboration with humans, because we sort of knew already that in isolation, they're not that advanced," Rabinovich explained.
For Upwork, which connects roughly 800,000 active clients posting more than 3 million jobs annually to a global pool of freelancers, the research serves a strategic business purpose: establishing quality standards for AI agents before allowing them to compete or collaborate with human workers on its platform.
The economics of human-AI teamwork: Why paying for expert feedback still saves money
Despite requiring multiple rounds of human feedback — each lasting about 20 minutes — the time investment remains "orders of magnitude different between a human doing the work alone, versus a human doing the work with an AI agent," Rabinovich said. Where a project might take a freelancer days to complete independently, the agent-plus-human approach can deliver results in hours through iterative cycles of automated work and expert refinement.
The economic implications extend beyond simple time savings. Upwork recently reported that gross services volume from AI-related work grew 53% year-over-year in the third quarter of 2025, one of the strongest growth drivers for the company. But executives have been careful to frame AI not as a replacement for freelancers but as an enhancement to their capabilities.
"AI was a huge overhang for our valuation," Erica Gessert, Upwork's CFO, told CFO Brew in October. "There was this belief that all work was going to go away. AI was going to take it, and especially work that's done by people like freelancers, because they are impermanent. Actually, the opposite is true."
The company's strategy centers on enabling freelancers to handle more complex, higher-value work by offloading routine tasks to AI. "Freelancers actually prefer to have tools that automate the manual labor and repetitive part of their work, and really focus on the creative and conceptual part of the process," Rabinovich said.
Rather than replacing jobs, he argues, AI will transform them: "Simpler tasks will be automated by agents, but the jobs will become much more complex in the number of tasks, so the amount of work and therefore earnings for freelancers will actually only go up."
AI coding agents excel, but creative writing and translation still need humans
The research reveals a clear pattern in agent capabilities. AI systems perform best on "deterministic and verifiable" tasks with objectively correct answers, like solving math problems or writing basic code. "Most coding tasks are very similar to each other," Rabinovich noted. "That's why coding agents are becoming so good."
In Upwork's tests, web development, mobile app development, and data science projects — especially those involving structured, computational work — saw the highest standalone agent completion rates. Claude Sonnet 4 completed 68% of web development jobs and 64% of data science projects without human help, while Gemini 2.5 Pro achieved 74% on certain technical tasks.
But qualitative work proved far more challenging. When asked to create website layouts, write marketing copy, or translate content with appropriate cultural nuance, agents floundered without expert guidance. "When you ask it to write you a poem, the quality of the poem is extremely subjective," Rabinovich said. "Since the rubrics for evaluation were provided by humans, there's some level of variability in representation."
Writing, translation, and sales and marketing projects showed the most dramatic improvements from human feedback. For writing work, completion rates increased by up to 17 percentage points after expert review. Engineering and architecture projects requiring creative problem-solving — like civil engineering or architectural design — improved by as much as 23 percentage points with human oversight.
This pattern suggests AI agents excel at pattern matching and replication but struggle with creativity, judgment, and context — precisely the skills that define higher-value professional work.
Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods
Upwork partnered with elite freelancers on its platform to evaluate every deliverable produced by AI agents, both independently and after each cycle of human feedback. These evaluators created detailed rubrics defining whether projects met core requirements specified in job descriptions, then scored outputs across multiple iterations.
Importantly, evaluators focused only on objective completion criteria, excluding subjective factors like stylistic preferences or quality judgments that might emerge in actual client relationships. "Rubric-based completion rates should not be viewed as a measure of whether an agent would be paid in a real marketplace setting," the research notes, "but as an indicator of its ability to fulfill explicitly defined requests."
This distinction matters: An AI agent might technically complete all specified requirements yet still produce work a client rejects as inadequate. Conversely, subjective client satisfaction — the true measure of marketplace success — remains beyond current measurement capabilities.
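To make that rubric methodology concrete, here is a minimal sketch in Python of how a rubric-based completion rate of the kind the research describes might be computed. The criteria and the example job are invented for illustration; this is not HAPI's actual rubric or scoring code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One explicitly defined requirement from the job description."""
    description: str
    satisfied: bool

def completion_rate(rubric: list[Criterion]) -> float:
    """Fraction of explicit requirements the deliverable fulfills.

    Mirrors the paper's framing: this measures whether an agent met
    explicitly defined requests, not whether a real client would
    accept, or pay for, the work.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    return sum(c.satisfied for c in rubric) / len(rubric)

# Invented example: a writing job with four explicit requirements.
rubric = [
    Criterion("800 to 1,000 words", True),
    Criterion("includes three customer quotes", True),
    Criterion("follows the provided outline", False),
    Criterion("delivered as a .docx file", True),
]
print(f"Completion: {completion_rate(rubric):.0%}")  # Completion: 75%
```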
The research underwent double-blind peer review and was accepted to NeurIPS, the premier academic conference for AI research, where Upwork will present full results in early December. The company plans to publish a complete methodology and make the benchmark available to the research community, updating the task pool regularly to prevent overfitting as agents improve.
"The idea is for this benchmark to be a living and breathing platform where agents can come in and evaluate themselves on all categories of work, and the tasks that will be offered on the platform will always update, so that these agents don't overfit and basically memorize the tasks at hand," Rabinovich said.
Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers
The research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients.
"Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."
In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker.
"We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."
The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.
OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hype
Upwork's findings arrive amid escalating competition in the AI agent space. OpenAI, Anthropic, Google, and numerous startups are racing to develop autonomous agents capable of complex multi-step tasks, from booking travel to analyzing financial data to writing software.
But recent high-profile stumbles have tempered initial enthusiasm. AI agents frequently misunderstand instructions, make logical errors, or produce confidently wrong results — the last a phenomenon researchers call "hallucination." The gap between controlled demonstration videos and reliable real-world performance remains vast.
"There have been some evaluations that came from OpenAI and other platforms where real Upwork tasks were considered for completion by agents, and across the board, the reported results were not very optimistic, in the sense that they showed that agents—even the best ones, meaning powered by most advanced LLMs — can't really compete with humans that well, because the completion rates are pretty low," Rabinovich said.
Rather than waiting for AI to fully mature — a timeline that remains uncertain—Upwork is betting on a hybrid approach that leverages AI's strengths (speed, scalability, pattern recognition) while retaining human strengths (judgment, creativity, contextual understanding).
This philosophy extends to learning and improvement. Current AI models train primarily on static datasets scraped from the internet, supplemented by human preference feedback. But most professional work is qualitative, making it difficult for AI systems to know whether their outputs are actually good without expert evaluation.
"Unless you have this collaboration between the human and the machine, where the human is kind of the teacher and the machine is the student trying to discover new solutions, none of this will be possible," Rabinovich said. "Upwork is very uniquely positioned to create such an environment because if you try to do this with, say, self-driving cars, and you tell Waymo cars to explore new ways of getting to the airport, like avoiding traffic signs, then a bunch of bad things will happen. In doing work on Upwork, if it creates a wrong website, it doesn't cost very much, and there's no negative side effects. But the opportunity to learn is absolutely tremendous."
Will AI take your job? The evidence suggests a more complicated answer
While much public discourse around AI focuses on job displacement, Rabinovich argues the historical pattern suggests otherwise — though the transition may prove disruptive.
"The narrative in the public is that AI is eliminating jobs, whether it's writing, translation, coding or other digital work, but no one really talks about the exponential amount of new types of work that it will create," he said. "When we invented electricity and steam engines and things like that, they certainly replaced certain jobs, but the amount of new jobs that were introduced is exponentially more, and we think the same is going to happen here."
The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards. These skills—prompt engineering, agent supervision, output verification—barely existed two years ago but now command premium rates on platforms like Upwork.
"New types of skills from humans are becoming necessary in the form of how to design the interaction between humans and machines, how to guide agents to make them better, and ultimately, how to verify that whatever agentic proposals are being made are actually correct, because that's what's necessary in order to advance the state of AI," Rabinovich said.
The question remains whether this transition — from doing tasks to overseeing them — will create opportunities as quickly as it disrupts existing roles. For freelancers on Upwork, the answer may already be emerging in their bank accounts: The platform saw AI-related work grow 53% year-over-year, even as fears of AI-driven unemployment dominated headlines.
As software systems grow more complex and AI tools generate code faster than ever, a fundamental problem is getting worse: Engineers are drowning in debugging work, spending up to half their time hunting down the causes of software failures instead of building new products. The challenge has become so acute that it's creating a new category of tooling — AI agents that can diagnose production failures in minutes instead of hours.
Deductive AI, a startup emerging from stealth mode Wednesday, believes it has found a solution by applying reinforcement learning — the same technology that powers game-playing AI systems — to the messy, high-stakes world of production software incidents. The company announced it has raised $7.5 million in seed funding led by CRV, with participation from Databricks Ventures, Thomvest Ventures, and PrimeSet, to commercialize what it calls "AI SRE agents" that can diagnose and help fix software failures at machine speed.
The pitch resonates with a growing frustration inside engineering organizations: Modern observability tools can show that something broke, but they rarely explain why. When a production system fails at 3 a.m., engineers still face hours of manual detective work, cross-referencing logs, metrics, deployment histories, and code changes across dozens of interconnected services to identify the root cause.
"The complexities and inter-dependencies of modern infrastructure means that investigating the root cause of an outage or incident can feel like searching for a needle in a haystack, except the haystack is the size of a football field, it's made of a million other needles, it's constantly reshuffling itself, and is on fire — and every second you don't find it equals lost revenue," said Sameer Agarwal, Deductive's co-founder and chief technology officer, in an exclusive interview with VentureBeat.
Deductive's system builds what the company calls a "knowledge graph" that maps relationships across codebases, telemetry data, engineering discussions, and internal documentation. When an incident occurs, multiple AI agents work together to form hypotheses, test them against live system evidence, and converge on a root cause — mimicking the investigative workflow of experienced site reliability engineers, but completing the process in minutes rather than hours.
The technology has already shown measurable impact in some of the world's most demanding production environments. DoorDash's advertising platform, which runs real-time auctions that must complete in under 100 milliseconds, has integrated Deductive into its incident response workflow. DoorDash has set an ambitious 2026 goal of resolving production incidents within 10 minutes.
"Our Ads Platform operates at a pace where manual, slow-moving investigations are no longer viable. Every minute of downtime directly affects company revenue," said Shahrooz Ansari, Senior Director of Engineering at DoorDash, in an interview with VentureBeat. "Deductive has become a critical extension of our team, rapidly synthesizing signals across dozens of services and surfacing the insights that matter—within minutes."
DoorDash estimates that Deductive has root-caused approximately 100 production incidents over the past few months, translating to more than 1,000 hours of annual engineering productivity and a revenue impact "in millions of dollars," according to Ansari. At location intelligence company Foursquare, Deductive reduced the time to diagnose Apache Spark job failures by 90% — turning a process that previously took hours or days into one that completes in under 10 minutes — while generating over $275,000 in annual savings.
Why AI-generated code is creating a debugging crisis
The timing of Deductive's launch reflects a brewing tension in software development: AI coding assistants are enabling engineers to generate code faster than ever, but the resulting software is often harder to understand and maintain.
"Vibe coding," a term popularized by AI researcher Andrej Karpathy, refers to using natural-language prompts to generate code through AI assistants. While these tools accelerate development, they can introduce what Agarwal describes as "redundancies, breaks in architectural boundaries, assumptions, or ignored design patterns" that accumulate over time.
"Most AI-generated code still introduces redundancies, breaks architectural boundaries, makes assumptions, or ignores established design patterns," Agarwal told Venturebeat. "In many ways, we now need AI to help clean up the mess that AI itself is creating."
The claim that engineers spend roughly half their time on debugging isn't hyperbole. The Association for Computing Machinery reports that developers spend 35% to 50% of their time validating and debugging software. More recently, Harness's State of Software Delivery 2025 report found that 67% of developers are spending more time debugging AI-generated code.
"We've seen world-class engineers spending half of their time debugging instead of building," said Rakesh Kothari, Deductive's co-founder and CEO. "And as vibe coding generates new code at a rate we've never seen, this problem is only going to get worse."
How Deductive's AI agents actually investigate production failures
Deductive's technical approach differs substantially from the AI features being added to existing observability platforms like Datadog or New Relic. Most of those systems use large language models to summarize data or identify correlations, but they lack what Agarwal calls "code-aware reasoning"—the ability to understand not just that something broke, but why the code behaves the way it does.
"Most enterprises use multiple observability tools across different teams and services, so no vendor has a single holistic view of how their systems behave, fail, and recover—nor are they able to pair that with an understanding of the code that defines system behavior," Agarwal explained. "These are key ingredients to resolving software incidents and it is exactly the gap Deductive fills."
The system connects to existing infrastructure using read-only API access to observability platforms, code repositories, incident management tools, and chat systems. It then continuously builds and updates its knowledge graph, mapping dependencies between services and tracking deployment histories.
When an alert fires, Deductive launches what the company describes as a multi-agent investigation. Different agents specialize in different aspects of the problem: one might analyze recent code changes, another examines trace data, while a third correlates the timing of the incident with recent deployments. The agents share findings and iteratively refine their hypotheses.
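Deductive has not released this code, so the Python below is only a schematic of the loop described above: specialist agents repeatedly score a shared pool of hypotheses against the evidence, unsupported hypotheses are pruned, and the best-supported root cause wins. Agent, hypothesis, and evidence names are all invented.

```python
def investigate(hypotheses, agents, evidence, rounds=3):
    """Toy multi-agent investigation: iterative scoring and pruning."""
    scores = {h: 0.0 for h in hypotheses}
    for _ in range(rounds):
        for agent in agents:  # e.g. code-diff, trace, deployment specialists
            for h in hypotheses:
                scores[h] += agent(h, evidence)  # each agent adds its belief
        # drop hypotheses no agent has found any support for
        hypotheses = [h for h in hypotheses if scores[h] > 0]
    return max(hypotheses, key=scores.get)

def deploy_agent(hypothesis, evidence):
    """Invented specialist: supports hypotheses matching a recent deploy."""
    return 1.0 if hypothesis in evidence.get("recent_deploys", []) else 0.0

root_cause = investigate(
    hypotheses=["payment-service deploy", "database failover"],
    agents=[deploy_agent],
    evidence={"recent_deploys": ["payment-service deploy"]},
)
print(root_cause)  # payment-service deploy
```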
The critical difference from rule-based automation is Deductive's use of reinforcement learning. The system learns from every incident which investigative steps led to correct diagnoses and which were dead ends. When engineers provide feedback, the system incorporates that signal into its learning model.
"Each time it observes an investigation, it learns which steps, data sources, and decisions led to the right outcome," Agarwal said. "It learns how to think through problems, not just point them out."
At DoorDash, a recent latency spike in an API initially appeared to be an isolated service issue. Deductive's investigation revealed that the root cause was actually timeout errors from a downstream machine learning platform undergoing a deployment. The system connected these dots by analyzing log volumes, traces, and deployment metadata across multiple services.
"Without Deductive, our team would have had to manually correlate the latency spike across all logs, traces, and deployment histories," Ansari said. "Deductive was able to explain not just what changed, but how and why it impacted production behavior."
The company keeps humans in the loop—for now
While Deductive's technology could theoretically push fixes directly to production systems, the company has deliberately chosen to keep humans in the loop—at least for now.
"While our system is capable of deeper automation and could push fixes to production, currently, we recommend precise fixes and mitigations that engineers can review, validate, and apply," Agarwal said. "We believe maintaining a human in the loop is essential for trust, transparency and operational safety."
However, he acknowledged that "over time, we do think that deeper automation will come and how humans operate in the loop will evolve."
Databricks and ThoughtSpot veterans bet on reasoning over observability
The founding team brings deep expertise from building some of Silicon Valley's most successful data infrastructure platforms. Agarwal earned his Ph.D. at UC Berkeley, where he created BlinkDB, an influential system for approximate query processing. He was among the first engineers at Databricks, where he helped build Apache Spark. Kothari was an early engineer at ThoughtSpot, where he led teams focused on distributed query processing and large-scale system optimization.
The investor syndicate reflects both the startup's technical credibility and the market opportunity. Beyond CRV's Max Gazor, the round included participation from Ion Stoica, founder of Databricks and Anyscale; Ajeet Singh, founder of Nutanix and ThoughtSpot; and Ben Sigelman, founder of Lightstep.
Rather than competing with platforms like Datadog or PagerDuty, Deductive positions itself as a complementary layer that sits on top of existing tools. The pricing model reflects this: Instead of charging based on data volume, Deductive charges based on the number of incidents investigated, plus a base platform fee.
The company offers both cloud-hosted and self-hosted deployment options and emphasizes that it doesn't store customer data on its servers or use it to train models for other customers — a critical assurance given the proprietary nature of both code and production system behavior.
With fresh capital and early customer traction at companies like DoorDash, Foursquare, and Kumo AI, Deductive plans to expand its team and deepen the system's reasoning capabilities from reactive incident analysis to proactive prevention. The near-term vision: helping teams predict problems before they occur.
DoorDash's Ansari offers a pragmatic endorsement of where the technology stands today: "Investigations that were previously manual and time-consuming are now automated, allowing engineers to shift their energy toward prevention, business impact, and innovation."
In an industry where every second of downtime translates to lost revenue, that shift from firefighting to building increasingly looks less like a luxury and more like table stakes.
An international team of researchers has released an artificial intelligence system capable of autonomously conducting scientific research across multiple disciplines — generating papers from initial concept to publication-ready manuscript in approximately 30 minutes for about $4 each.
The system, called Denario, can formulate research ideas, review existing literature, develop methodologies, write and execute code, create visualizations, and draft complete academic papers. In a demonstration of its versatility, the team used Denario to generate papers spanning astrophysics, biology, chemistry, medicine, neuroscience, and other fields, with one AI-generated paper already accepted for publication at an academic conference.
"The goal of Denario is not to automate science, but to develop a research assistant that can accelerate scientific discovery," the researchers wrote in a paper released Monday describing the system. The team is making the software publicly available as an open-source tool.
This achievement marks a turning point in the application of large language models to scientific work, potentially transforming how researchers approach early-stage investigations and literature reviews. However, the research also highlights substantial limitations and raises pressing questions about validation, authorship, and the changing nature of scientific labor.
From data to draft: how AI agents collaborate to conduct research
At its core, Denario operates not as a single AI brain but as a digital research department where specialized AI agents collaborate to push a project from conception to completion. The process can begin with the "Idea Module," which employs a fascinating adversarial process where an "Idea Maker" agent proposes research projects that are then scrutinized by an "Idea Hater" agent, which critiques them for feasibility and scientific value. This iterative loop refines raw concepts into robust research directions.
Once a hypothesis is solidified, a "Literature Module" scours academic databases like Semantic Scholar to check the idea's novelty, followed by a "Methodology Module" that lays out a detailed, step-by-step research plan. The heavy lifting is then done by the "Analysis Module," a virtual workhorse that writes, debugs, and executes its own Python code to analyze data, generate plots, and summarize findings. Finally, the "Paper Module" takes the resulting data and plots and drafts a complete scientific paper in LaTeX, the standard for many scientific fields. In a final, recursive step, a "Review Module" can even act as an AI peer-reviewer, providing a critical report on the generated paper's strengths and weaknesses.
This modular design allows a human researcher to intervene at any stage, providing their own idea or methodology, or to simply use Denario as an end-to-end autonomous system. "The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis," the paper explains.
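The paper's module names map naturally onto a pipeline with optional human overrides at each hand-off. The Python below is a stub-level schematic of that flow, not Denario's actual API; every function body is a placeholder.

```python
def idea_module(data: str) -> str:
    # Stand-in for the Idea Maker / Idea Hater adversarial loop.
    return f"hypothesis derived from {data}"

def literature_module(idea: str) -> bool:
    # Stand-in for the Semantic Scholar novelty check.
    return True

def methodology_module(idea: str) -> list[str]:
    return [f"load data relevant to: {idea}", "fit model", "plot results"]

def analysis_module(plan: list[str]) -> dict:
    # In Denario this module writes, debugs, and executes real Python.
    return {"figures": ["fig1.png"], "summary": "toy result"}

def paper_module(idea: str, plan: list[str], results: dict) -> str:
    return f"\\section{{Results}} {results['summary']}"  # LaTeX draft

def run_pipeline(data: str, idea: str | None = None,
                 plan: list[str] | None = None) -> str:
    """End to end by default; a human can inject an idea or plan."""
    idea = idea or idea_module(data)
    if not literature_module(idea):
        raise ValueError("idea is not novel")
    plan = plan or methodology_module(idea)
    return paper_module(idea, plan, analysis_module(plan))

print(run_pipeline("dark matter halo merger tree catalog"))
```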
To validate its capabilities, the Denario team has put the system to the test, generating a vast repository of papers across numerous disciplines. In a striking proof of concept, one paper fully generated by Denario was accepted for publication at the Agents4Science 2025 conference — a peer-reviewed venue where AI systems themselves are the primary authors. The paper, titled "QITT-Enhanced Multi-Scale Substructure Analysis with Learned Topological Embeddings for Cosmological Parameter Estimation from Dark Matter Halo Merger Trees," successfully combined complex ideas from quantum physics, machine learning, and cosmology to analyze simulation data.
The ghost in the machine: AI's 'vacuous' results and ethical alarms
While the successes are notable, the research paper is refreshingly candid about Denario's significant limitations and failure modes. The authors stress that the system currently "behaves more like a good undergraduate or early graduate student rather than a full professor in terms of big picture, connecting results...etc." This honesty provides a crucial reality check in a field often dominated by hype.
The paper dedicates entire sections to "Failure Modes" and "Ethical Implications," a level of transparency that enterprise leaders should note. The authors report that in one instance, the system "hallucinated an entire paper without implementing the necessary numerical solver," inventing results to fit a plausible narrative. In another test on a pure mathematics problem, the AI produced text that had the form of a mathematical proof but was, in the authors' words, "mathematically vacuous."
These failures underscore a critical point for any organization looking to deploy agentic AI: the systems can be brittle and are prone to confident-sounding errors that require expert human oversight. The Denario paper serves as a vital case study in the importance of keeping a human in the loop for validation and critical assessment.
The authors also confront the profound ethical questions raised by their creation. They warn that "AI agents could be used to quickly flood the scientific literature with claims driven by a particular political agenda or specific commercial or economic interests." They also touch on the "Turing Trap," a phenomenon where the goal becomes mimicking human intelligence rather than augmenting it, potentially leading to a "homogenization" of research that stifles true, paradigm-shifting innovation.
An open-source co-pilot for the world's labs
Denario is not just a theoretical exercise locked away in an academic lab. The entire system is open-source under a GPL-3.0 license and is accessible to the broader community. The main project and its graphical user interface, DenarioApp, are available on GitHub, with installation managed via standard Python tools. For enterprise environments focused on reproducibility and scalability, the project also provides official Docker images. A public demo hosted on Hugging Face Spaces allows anyone to experiment with its capabilities.
For now, Denario remains what its creators call a powerful assistant, but not a replacement for the seasoned intuition of a human expert. This framing is deliberate. The Denario project is less about creating an automated scientist and more about building the ultimate co-pilot, one designed to handle the tedious and time-consuming aspects of modern research.
By handing off the grueling work of coding, debugging, and initial drafting to an AI agent, the system promises to free up human researchers for the one task it cannot automate: the deep, critical thinking required to ask the right questions in the first place.
Microsoft is launching a significant expansion of its Copilot AI assistant on Tuesday, introducing tools that let employees build applications, automate workflows, and create specialized AI agents using only conversational prompts — no coding required.
The new capabilities, called App Builder and Workflows, mark Microsoft's most aggressive attempt yet to merge artificial intelligence with software development, enabling the estimated 100 million Microsoft 365 users to create business tools as easily as they currently draft emails or build spreadsheets.
"We really believe that a main part of an AI-forward employee, not just developers, will be to create agents, workflows and apps," Charles Lamanna, Microsoft's president of business and industry Copilot, said in an interview with VentureBeat. "Part of the job will be to build and create these things."
The announcement comes as Microsoft deepens its commitment to AI-powered productivity tools while navigating a complex partnership with OpenAI, the creator of the underlying technology that powers Copilot. On the same day, OpenAI completed its restructuring into a for-profit entity, with Microsoft receiving a 27% ownership stake valued at approximately $135 billion.
How natural language prompts now create fully functional business applications
The new features transform Copilot from a conversational assistant into what Microsoft envisions as a comprehensive development environment accessible to non-technical workers. Users can now describe an application they need — such as a project tracker with dashboards and task assignments — and Copilot will generate a working app complete with a database backend, user interface, and security controls.
"If you're right inside of Copilot, you can now have a conversation to build an application complete with a backing database and a security model," Lamanna explained. "You can make edit requests and update requests and change requests so you can tune the app to get exactly the experience you want before you share it with other users."
The App Builder stores data in Microsoft Lists, the company's lightweight database system, and allows users to share finished applications via a simple link—similar to sharing a document. The Workflows agent, meanwhile, automates routine tasks across Microsoft's ecosystem of products, including Outlook, Teams, SharePoint, and Planner, by converting natural language descriptions into automated processes.
A third component, a simplified version of Microsoft's Copilot Studio agent-building platform, lets users create specialized AI assistants tailored to specific tasks or knowledge domains, drawing from SharePoint documents, meeting transcripts, emails, and external systems.
All three capabilities are included in the existing $30-per-month Microsoft 365 Copilot subscription at no additional cost — a pricing decision Lamanna characterized as consistent with Microsoft's historical approach of bundling significant value into its productivity suite.
"That's what Microsoft always does. We try to do a huge amount of value at a low price," he said. "If you go look at Office, you think about Excel, Word, PowerPoint, Exchange, all that for like eight bucks a month. That's a pretty good deal."
Why Microsoft's nine-year bet on low-code development is finally paying off
The new tools represent the culmination of a nine-year effort by Microsoft to democratize software development through its Power Platform — a collection of low-code and no-code development tools that has grown to 56 million monthly active users, according to figures the company disclosed in recent earnings reports.
Lamanna, who has led the Power Platform initiative since its inception, said the integration into Copilot marks a fundamental shift in how these capabilities reach users. Rather than requiring workers to visit a separate website or learn a specialized interface, the development tools now exist within the same conversational window they already use for AI-assisted tasks.
"One of the big things that we're excited about is Copilot — that's a tool for literally every office worker," Lamanna said. "Every office worker, just like they research data, they analyze data, they reason over topics, they also will be creating apps, agents and workflows."
The integration offers significant technical advantages, he argued. Because Copilot already indexes a user's Microsoft 365 content — emails, documents, meetings, and organizational data — it can incorporate that context into the applications and workflows it builds. If a user asks for "an app for Project Spartan," Copilot can draw from existing communications to understand what that project entails and suggest relevant features.
"If you go to those other tools, they have no idea what the heck Project Spartan is," Lamanna said, referencing competing low-code platforms from companies like Google, Salesforce, and ServiceNow. "But if you do it inside of Copilot and inside of the App Builder, it's able to draw from all that information and context."
Microsoft claims the apps created through these tools are "full-stack applications" with proper databases secured through the same identity systems used across its enterprise products — distinguishing them from simpler front-end tools offered by competitors. The company also emphasized that its existing governance, security, and data loss prevention policies automatically apply to apps and workflows created through Copilot.
Where professional developers still matter in an AI-powered workplace
While Microsoft positions the new capabilities as accessible to all office workers, Lamanna was careful to delineate where professional developers remain essential. His dividing line centers on whether a system interacts with parties outside the organization.
"Anything that leaves the boundaries of your company warrants developer involvement," he said. "If you want to build an agent and put it on your website, you should have developers involved. Or if you want to build an automation which interfaces directly with your customers, or an app or a website which interfaces directly with your customers, you want professionals involved."
The reasoning is risk-based: external-facing systems carry greater potential for data breaches, security vulnerabilities, or business errors. "You don't want people getting refunds they shouldn't," Lamanna noted.
For internal use cases — approval workflows, project tracking, team dashboards — Microsoft believes the new tools can handle the majority of needs without IT department involvement. But the company has built "no cliffs," in Lamanna's terminology, allowing users to migrate simple apps to more sophisticated platforms as needs grow.
Apps created in the conversational App Builder can be opened in Power Apps, Microsoft's full development environment, where they can be connected to Dataverse, the company's enterprise database, or extended with custom code. Similarly, simple workflows can graduate to the full Power Automate platform, and basic agents can be enhanced in the complete Copilot Studio.
"We have this mantra called no cliffs," Lamanna said. "If your app gets too complicated for the App Builder, you can always edit and open it in Power Apps. You can jump over to the richer experience, and if you're really sophisticated, you can even go from those experiences into Azure."
This architecture addresses a problem that has plagued previous generations of easy-to-use development tools: users who outgrow the simplified environment often must rebuild from scratch on professional platforms. "People really do not like easy-to-use development tools if I have to throw everything away and start over," Lamanna said.
What happens when every employee can build apps without IT approval
The democratization of software development raises questions about governance, maintenance, and organizational complexity — issues Microsoft has worked to address through administrative controls.
IT administrators can view all applications, workflows, and agents created within their organization through a centralized inventory in the Microsoft 365 admin center. They can reassign ownership, disable access at the group level, or "promote" particularly useful employee-created apps to officially supported status.
"We have a bunch of customers who have this approach where it's like, let 1,000 apps bloom, and then the best ones, I go upgrade and make them IT-governed or central," Lamanna said.
The system also includes provisions for when employees leave. Apps and workflows remain accessible for 60 days, during which managers can claim ownership — similar to how OneDrive files are handled when someone departs.
Lamanna argued that most employee-created apps don't warrant significant IT oversight. "It's just not worth inspecting an app that John, Susie, and Bob use to do their job," he said. "It should concern itself with the app that ends up being used by 2,000 people, and that will pop up in that dashboard."
Still, the proliferation of employee-created applications could create challenges. Users have expressed frustration with Microsoft's increasing emphasis on AI features across its products, with some giving the Microsoft 365 mobile app one-star ratings after a recent update prioritized Copilot over traditional file access.
The tools also arrive as enterprises grapple with "shadow IT" — unsanctioned software and systems that employees adopt without official approval. While Microsoft's governance controls aim to provide visibility, the ease of creating new applications could accelerate the pace at which these systems multiply.
The ambitious plan to turn 500 million workers into software builders
Microsoft's ambitions for the technology extend far beyond incremental productivity gains. Lamanna envisions a fundamental transformation of what it means to be an office worker — one where building software becomes as routine as creating spreadsheets.
"Just like how 20 years ago you put on your resume that you could use pivot tables in Excel, people are going to start saying that they can use App Builder and workflow agents, even if they're just in the finance department or the sales department," he said.
The numbers he's targeting are staggering. With 56 million people already using Power Platform, Lamanna believes the integration into Copilot could eventually reach 500 million builders. "Early days still, but I think it's certainly encouraging," he said.
The features are currently available only to customers in Microsoft's Frontier Program — an early access initiative for Microsoft 365 Copilot subscribers. The company has not disclosed how many organizations participate in the program or when the tools will reach general availability.
The announcement fits within Microsoft's larger strategy of embedding AI capabilities throughout its product portfolio, driven by its partnership with OpenAI. Under the restructured agreement announced Tuesday, Microsoft will have access to OpenAI's technology through 2032, including models that achieve artificial general intelligence (AGI) — though such systems do not yet exist. Microsoft has also begun integrating Copilot into its new companion apps for Windows 11, which provide quick access to contacts, files, and calendar information.
The aggressive integration of AI features across Microsoft's ecosystem has drawn mixed reactions. While enterprise customers have shown interest in productivity gains, the rapid pace of change and ubiquity of AI prompts have frustrated some users who prefer traditional workflows.
For Microsoft, however, the calculation is clear: if even a fraction of its user base begins creating applications and automations, it would represent a massive expansion of the effective software development workforce — and further entrench customers in Microsoft's ecosystem. The company is betting that the same natural language interface that made ChatGPT accessible to millions can finally unlock the decades-old promise of empowering everyday workers to build their own tools.
The App Builder and Workflows agents are available starting today through the Microsoft 365 Copilot Agent Store for Frontier Program participants.
Whether that future arrives depends not just on the technology's capabilities, but on a more fundamental question: Do millions of office workers actually want to become part-time software developers? Microsoft is about to find out if the answer is yes — or if some jobs are better left to the professionals.
Anthropic launched a new capability on Thursday that allows its Claude AI assistant to tap into specialized expertise on demand, marking the company's latest effort to make artificial intelligence more practical for enterprise workflows as it chases rival OpenAI in the intensifying competition over AI-powered software development.
The feature, called Skills, enables users to create folders containing instructions, code scripts, and reference materials that Claude can automatically load when relevant to a task. The system marks a fundamental shift in how organizations can customize AI assistants, moving beyond one-off prompts to reusable packages of domain expertise that work consistently across an entire company.
"Skills are based on our belief and vision that as model intelligence continues to improve, we'll continue moving towards general-purpose agents that often have access to their own filesystem and computing environment," said Mahesh Murag, a member of Anthropic's technical staff, in an exclusive interview with VentureBeat. "The agent is initially made aware only of the names and descriptions of each available skill and can choose to load more information about a particular skill when relevant to the task at hand."
The launch comes as Anthropic, valued at $183 billion after a recent $13 billion funding round, projects its annual revenue could nearly triple to as much as $26 billion in 2026, according to a recent Reuters report. The company is currently approaching a $7 billion annual revenue run rate, up from $5 billion in August, fueled largely by enterprise adoption of its AI coding tools — a market where it faces fierce competition from OpenAI's recently upgraded Codex platform.
How 'progressive disclosure' solves the context window problem
Skills differ fundamentally from existing approaches to customizing AI assistants, such as prompt engineering or retrieval-augmented generation (RAG), Murag explained. The architecture relies on what Anthropic calls "progressive disclosure" — Claude initially sees only skill names and brief descriptions, then autonomously decides which skills to load based on the task at hand, accessing only the specific files and information needed at that moment.
"Unlike RAG, this relies on simple tools that let Claude manage and read files from a filesystem," Murag told VentureBeat. "Skills can contain an unbounded amount of context to teach Claude how to complete a task or series of tasks. This is because Skills are based on the premise of an agent being able to autonomously and intelligently navigate a filesystem and execute code."
This approach allows organizations to bundle far more information than traditional context windows permit, while maintaining the speed and efficiency that enterprise users demand. A single skill can include step-by-step procedures, code templates, reference documents, brand guidelines, compliance checklists, and executable scripts — all organized in a folder structure that Claude navigates intelligently.
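Anthropic describes each skill as a folder fronted by a SKILL.md file whose name and description are all the model sees at first. The Python sketch below illustrates that progressive-disclosure pattern; the one-line-description convention and directory layout here are simplifying assumptions rather than Anthropic's exact format.

```python
import os

def list_skills(skills_dir: str) -> dict[str, str]:
    """Cheap first pass: load only each skill's one-line description."""
    catalog = {}
    for name in os.listdir(skills_dir):
        meta = os.path.join(skills_dir, name, "SKILL.md")
        if os.path.exists(meta):
            with open(meta) as f:
                catalog[name] = f.readline().strip()  # description only
    return catalog

def load_skill(skills_dir: str, name: str) -> str:
    """Second pass: pull full instructions once the skill is relevant."""
    with open(os.path.join(skills_dir, name, "SKILL.md")) as f:
        return f.read()

# The agent scans the catalog first, then loads on demand, e.g.:
#   catalog = list_skills("./skills")
#   body = load_skill("./skills", "brand-guidelines")
```

The design choice matters because it keeps dozens of large skills available without paying their token cost up front; only the chosen skill's files ever enter the context.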
The system's composability provides another technical advantage. Multiple skills automatically stack together when needed for complex workflows. For instance, Claude might simultaneously invoke a company's brand guidelines skill, a financial reporting skill, and a presentation formatting skill to generate a quarterly investor deck — coordinating between all three without manual intervention.
What makes Skills different from OpenAI's Custom GPTs and Microsoft's Copilot
Anthropic is positioning Skills as distinct from competing offerings like OpenAI's Custom GPTs and Microsoft's Copilot Studio, though the features address similar enterprise needs around AI customization and consistency.
"Skills' combination of progressive disclosure, composability, and executable code bundling is unique in the market," Murag said. "While other platforms require developers to build custom scaffolding, Skills let anyone — technical or not — create specialized agents by organizing procedural knowledge into files."
The cross-platform portability also sets Skills apart. The same skill works identically across Claude.ai, Claude Code (Anthropic's AI coding environment), the company's API, and the Claude Agent SDK for building custom AI agents. Organizations can develop a skill once and deploy it everywhere their teams use Claude, a significant advantage for enterprises seeking consistency.
The feature supports any programming language compatible with the underlying container environment, and Anthropic provides sandboxing for security — though the company acknowledges that allowing AI to execute code requires users to carefully vet which skills they trust.
Early customers report 8x productivity gains on finance workflows
Early customer implementations reveal how organizations are applying Skills to automate complex knowledge work. At Japanese e-commerce giant Rakuten, the AI team is using Skills to transform finance operations that previously required manual coordination across multiple departments.
"Skills streamline our management accounting and finance workflows," said Yusuke Kaji, general manager of AI at Rakuten in a statement. "Claude processes multiple spreadsheets, catches critical anomalies, and generates reports using our procedures. What once took a day, we can now accomplish in an hour."
That's an 8x improvement in productivity for specific workflows — the kind of measurable return on investment that enterprises increasingly demand from AI implementations. Mike Krieger, Anthropic's chief product officer and Instagram co-founder, recently noted that companies have moved past "AI FOMO" to requiring concrete success metrics.
Design platform Canva plans to integrate Skills into its own AI agent workflows. "Canva plans to leverage Skills to customize agents and expand what they can do," said Anwar Haneef, general manager and head of ecosystem at Canva, in a statement. "This unlocks new ways to bring Canva deeper into agentic workflows—helping teams capture their unique context and create stunning, high-quality designs effortlessly."
Cloud storage provider Box sees Skills as a way to make corporate content repositories more actionable. "Skills teaches Claude how to work with Box content," said Yashodha Bhavnani, head of AI at Box. "Users can transform stored files into PowerPoint presentations, Excel spreadsheets, and Word documents that follow their organization's standards—saving hours of effort."
The enterprise security question: Who controls which AI skills employees can use?
For enterprise IT departments, Skills raise important questions about governance and control—particularly since the feature allows AI to execute arbitrary code in sandboxed environments. Anthropic has built administrative controls that allow enterprise customers to manage access at the organizational level.
"Enterprise admins control access to the Skills capability via admin settings, where they can enable or disable access and monitor usage patterns," Murag said. "Once enabled at the organizational level, individual users still need to opt in."
That two-layer consent model — organizational enablement plus individual opt-in — reflects lessons learned from previous enterprise AI deployments where blanket rollouts created compliance concerns. However, Anthropic's governance tools appear more limited than some enterprise customers might expect. The company doesn't currently offer granular controls over which specific skills employees can use, or detailed audit trails of custom skill content.
Organizations concerned about data security should note that Skills require Claude's code execution environment, which runs in isolated containers. Anthropic advises users to "stick to trusted sources" when installing skills and provides security documentation, but the company acknowledges this is an inherently higher-risk capability than traditional AI interactions.
From API to no-code: How Anthropic is making Skills accessible to everyone
Anthropic is taking several approaches to make Skills accessible to users with varying technical sophistication. For non-technical users on Claude.ai, the company provides a "skill-creator" skill that interactively guides users through building new skills by asking questions about their workflow, then automatically generating the folder structure and documentation.
Developers working with Anthropic's API get programmatic control through a new /skills endpoint and can manage skill versions through the Claude Console web interface. The feature requires enabling the Code Execution Tool beta in API requests. For Claude Code users, skills can be installed via plugins from the anthropics/skills GitHub marketplace, and teams can share skills through version control systems.
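For orientation only, here is a rough Python sketch of listing skills through that endpoint. The path comes from the article and the two headers follow Anthropic's general API conventions, but the response shape is an assumption, so treat this as illustrative and consult the official documentation for the real schema.

```python
import requests

API_KEY = "sk-ant-..."  # placeholder credential

resp = requests.get(
    "https://api.anthropic.com/v1/skills",  # endpoint name per the article
    headers={
        "x-api-key": API_KEY,
        "anthropic-version": "2023-06-01",  # standard versioning header
    },
    timeout=30,
)
resp.raise_for_status()
# Assumed response shape; note the article says using skills also
# requires enabling the Code Execution Tool beta in API requests.
for skill in resp.json().get("data", []):
    print(skill)
```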
"Skills are included in Max, Pro, Teams, and Enterprise plans at no additional cost," Murag confirmed. "API usage follows standard API pricing," meaning organizations pay only for the tokens consumed during skill execution, not for the skills themselves.
Anthropic provides several pre-built skills for common business tasks, including professional generation of Excel spreadsheets with formulas, PowerPoint presentations, Word documents, and fillable PDFs. These Anthropic-created skills will remain free.
Why the Skills launch matters in the AI coding wars with OpenAI
The Skills announcement arrives during a pivotal moment in Anthropic's competition with OpenAI, particularly around AI-assisted software development. Just one day before releasing Skills, Anthropic launched Claude Haiku 4.5, a smaller and cheaper model that nonetheless matches the coding performance of Claude Sonnet 4 — which was state-of-the-art when released just five months ago.
That rapid improvement curve reflects the breakneck pace of AI development, where today's frontier capabilities become tomorrow's commodity offerings. OpenAI has been pushing hard on coding tools as well, recently upgrading its Codex platform with GPT-5 and expanding GitHub Copilot's capabilities.
Anthropic's revenue trajectory — potentially reaching $26 billion in 2026 from an estimated $9 billion by year-end 2025 — suggests the company is successfully converting enterprise interest into paying customers. The timing also follows Salesforce's announcement this week that it's deepening AI partnerships with both OpenAI and Anthropic to power its Agentforce platform, signaling that enterprises are adopting a multi-vendor approach rather than standardizing on a single provider.
Skills addresses a real pain point: the "prompt engineering" problem where effective AI usage depends on individual employees crafting elaborate instructions for routine tasks, with no way to share that expertise across teams. Skills transforms implicit knowledge into explicit, shareable assets. For startups and developers, the feature could accelerate product development significantly — adding sophisticated document generation capabilities that previously required dedicated engineering teams and weeks of development.
The composability aspect hints at a future where organizations build libraries of specialized skills that can be mixed and matched for increasingly complex workflows. A pharmaceutical company might develop skills for regulatory compliance, clinical trial analysis, molecular modeling, and patient data privacy that work together seamlessly — creating a customized AI assistant with deep domain expertise across multiple specialties.
Anthropic indicates it's working on simplified skill creation workflows and enterprise-wide deployment capabilities to make it easier for organizations to distribute skills across large teams. As the feature rolls out to Anthropic's more than 300,000 business customers, the true test will be whether organizations find Skills substantively more useful than existing customization approaches.
For now, Skills offers Anthropic's clearest articulation yet of its vision for AI agents: not generalists that try to do everything reasonably well, but intelligent systems that know when to access specialized expertise and can coordinate multiple domains of knowledge to accomplish complex tasks. If that vision catches on, the question won't be whether your company uses AI — it will be whether your AI knows how your company actually works.
One year after emerging from stealth, Strella has raised $14 million in Series A funding to expand its AI-powered customer research platform, the company announced Thursday. The round, led by Bessemer Venture Partners with participation from Decibel Partners, Bain Future Back Ventures, MVP Ventures and 645 Ventures, comes as enterprises increasingly turn to artificial intelligence to understand customers faster and more deeply than traditional methods allow.
The investment marks a sharp acceleration for the startup founded by Lydia Hylton and Priya Krishnan, two former consultants and product managers who watched companies struggle with a customer research process that could take eight weeks from start to finish. Since October, Strella has grown revenue tenfold, quadrupled its customer base to more than 40 paying enterprises, and tripled its average contract values by moving upmarket to serve Fortune 500 companies.
"Research tends to be bookended by two very strategic steps: first, we have a problem—what research should we do? And second, we've done the research—now what are we going to do with it?" said Hylton, Strella's CEO, in an exclusive interview with VentureBeat. "All the stuff in the middle tends to be execution and lower-skill work. We view Strella as doing that middle 90% of the work."
The platform now serves Amazon, Duolingo, Apollo GraphQL, and Chobani, which have collectively conducted thousands of AI-moderated interviews, delivering what the company claims is a 90% average time savings on manual research work. The company is approaching $1 million in revenue after beginning monetization only in January, with month-over-month growth of 50% and zero customer churn to date.
How AI-powered interviews compress eight-week research projects into days
Strella's technology addresses a workflow that has frustrated product teams, marketers, and designers for decades. Traditional customer research requires writing interview guides, recruiting participants, scheduling calls, conducting interviews, taking notes, synthesizing findings, and creating presentations — a process that consumes weeks of highly skilled labor and often delays critical product decisions.
The platform compresses that timeline to days by using AI to moderate voice-based interviews that run like Zoom calls, but with an artificial intelligence agent asking questions, following up on interesting responses, and detecting when participants are being evasive or fraudulent. The system then synthesizes findings automatically, creating highlight reels and charts from unstructured qualitative data.
"It used to take eight weeks. Now you can do it in the span of a couple days," Hylton told VentureBeat. "The primary technology is through an AI-moderated interview. It's like being in a Zoom call with an AI instead of a human — it's completely free form and voice based."
Critically, the platform also supports human moderators joining the same calls, reflecting the founders' belief that humans won't disappear from the research process. "Human moderation won't go away, which is why we've supported human moderation from our genesis," Hylton said, returning to her point that Strella's role is the execution-heavy "middle 90%" between the strategic bookends.
Why customers tell AI moderators the truth they won't share with humans
One of Strella's most surprising findings challenges assumptions about AI in qualitative research: participants appear more honest with AI moderators than with humans. The founders discovered this pattern repeatedly as customers ran head-to-head comparisons between traditional human-moderated studies and Strella's AI approach.
"If you're a designer and you get on a Zoom call with a customer and you say, 'Do you like my design?' they're always gonna say yes. They don't want to hurt your feelings," Hylton explained. "But it's not a problem at all for Strella. They would tell you exactly what they think about it, which is really valuable. It's very hard to get honest feedback."
Krishnan, Strella's COO, said companies initially worried about using AI and "eroding quality," but the platform has "actually found the opposite to be true. People are much more open and honest with an AI moderator, and so the level of insight that you get is much richer because people are giving their unfiltered feedback."
This dynamic has practical business implications. Brian Santiago, Senior Product Design Manager at Apollo GraphQL, said in a statement: "Before Strella, studies took weeks. Now we get insights in a day — sometimes in just a few hours. And because participants open up more with the AI moderator, the feedback is deeper and more honest."
The platform also addresses endemic fraud in online surveys, particularly when participants are compensated. Because Strella interviews happen on camera in real time, the AI moderator can detect when someone pauses suspiciously long — perhaps to consult ChatGPT — and flags them as potentially fraudulent. "We are fraud resistant," Hylton said, contrasting this with traditional surveys where fraud rates can be substantial.
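Hylton's pause heuristic is simple to picture: flag any turn where the gap between the end of a question and the start of an answer exceeds a threshold. The sketch below is illustrative only, with an arbitrary 20-second threshold; Strella's actual detector is not public.

```python
from dataclasses import dataclass

# Illustrative only, not Strella's detector: flag interview turns where
# the pause before answering is long enough to suggest the participant
# consulted an outside tool. The threshold is an arbitrary assumption.
SUSPICIOUS_PAUSE_SECONDS = 20.0

@dataclass
class Turn:
    question_ended_at: float   # seconds from call start
    answer_started_at: float

def flag_suspicious_turns(turns: list[Turn]) -> list[int]:
    """Return indices of turns with an unusually long pre-answer pause."""
    return [
        i for i, turn in enumerate(turns)
        if turn.answer_started_at - turn.question_ended_at > SUSPICIOUS_PAUSE_SECONDS
    ]

# Example: the second turn (index 1) gets flagged.
turns = [Turn(12.0, 14.5), Turn(60.0, 95.0)]
print(flag_suspicious_turns(turns))  # -> [1]
```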
Solving mobile app research with persistent screen sharing technology
A major focus of the Series A funding will be expanding Strella's recently launched mobile application, which Krishnan identified as critical competitive differentiation. The mobile app enables persistent screen sharing during interviews — allowing researchers to watch users navigate mobile applications in real time while the AI moderator asks about their experience.
"We are the only player in the market that supports screen sharing on mobile," Hylton said. "You know, I want to understand what are the pain points with my app? Why do people not seem to be able to find the checkout flow? Well, in order to do that effectively, you'd like to see the user screen while they're doing an interview."
For consumer-facing companies where mobile represents the primary customer interface, this capability opens entirely new use cases. The founders noted that "several of our customers didn't do research before" but have now built research practices around Strella because the platform finally made mobile research accessible at scale.
The platform also supports embedding traditional survey question types directly into the conversational interview, approaching what Hylton called "feature parity with a survey" while maintaining the engagement advantages of a natural conversation. Strella interviews regularly run 60 to 90 minutes with nearly 100% completion rates—a duration that would see 60-70% drop-off in a traditional survey format.
How Strella differentiated in a market crowded with AI research startups
Strella enters a market that appears crowded at first glance, with established players like Qualtrics and a wave of AI-powered startups promising to transform customer research. The founders themselves initially pursued a different approach — synthetic respondents, or "digital twins" that simulate customer perspectives using large language models.
"We actually pivoted from that. That was our initial idea," Hylton revealed, referring to synthetic respondents. "People are very intrigued by that concept, but found in practice, no willingness to pay right now."
Recent research suggesting companies could use language models as digital twins for customer feedback has reignited interest in that approach. But Hylton remains skeptical: "The capabilities of the LLMs as they are today are not good enough, in my opinion, to justify a standalone company. Right now you could just ask ChatGPT, 'What would new users of Duolingo think about this ad copy?' You can do that. Adding the standalone idea of a synthetic panel is sort of just putting a wrapper on that."
Instead, Strella's bet is that the real value lies in collecting proprietary qualitative data at scale — building what could become "the system of truth for all qualitative insights" within enterprises, as Lindsey Li, Vice President at Bessemer Venture Partners, described it.
Li, who led the investment just one year after Strella emerged from stealth, said the firm was convinced by both the technology and the team. "Strella has built highly differentiated technology that enables a continuous interview rather than a survey," Li said. "We heard time and time again that customers loved this product experience relative to other offerings."
On the defensibility question that concerns many AI investors, Li emphasized product execution over patents: "We think the long game here will be won with a million small product decisions, all of which must be driven by deep empathy for customer pain and an understanding of how best to address their needs. Lydia and Priya exhibit that in spades."
The founders point to technical depth that's difficult to replicate. Most competitors started with adaptive surveys — text-based interfaces where users type responses and wait for the next question. Some have added voice, but typically as uploaded audio clips rather than free-flowing conversation.
"Our approach is fundamentally better, which is the fact that it is a free form conversation," Hylton said. "You never have to control anything. You're never typing, there's no buttons, there's no upload and wait for the next question. It's completely free form, and that has been an extraordinarily hard product to build. There's a tremendous amount of IP in the way that we prompt our moderator, the way that we run analysis."
The platform also improves with use, learning from each customer's research patterns to fine-tune future interview guides and questions. "Our product gets better for our customers as they continue to use us," Hylton said. All research accumulates in a central repository where teams can generate new insights by chatting with the data or creating visualizations from previously unstructured qualitative feedback.
Creating new research budgets instead of just automating existing ones
Perhaps more important than displacing existing research is expanding the total market. Krishnan said growth has been "fundamentally related to our product" creating new research that wouldn't have happened otherwise.
"We have expanded the use cases in which people would conduct research," Krishnan explained. "Several of our customers didn't do research before, have always wanted to do research, but didn't have a dedicated researcher or team at their company that was devoted to it, and have purchased Strella to kick off and enable their research practice. That's been really cool where we've seen this market just opening up."
This expansion comes as enterprises face mounting pressure to improve customer experience amid declining satisfaction scores. According to Forrester Research's 2024 Customer Experience Index, customer experience quality has declined for three consecutive years — an unprecedented trend. The report found that 39% of brands saw CX quality deteriorate, with declines across effectiveness, ease, and emotional connection.
Meanwhile, Deloitte's 2025 Technology, Media & Telecommunications Predictions report forecasts that 25% of enterprises using generative AI will deploy AI agents by 2025, growing to 50% by 2027. The report specifically highlighted AI's potential to enhance customer satisfaction by 15-20% while reducing cost to serve by 20-30% when properly implemented.
Gartner identified conversational user interfaces — the category Strella inhabits — as one of three technologies poised to transform customer service by 2028, noting that "customers increasingly expect to be able to interact with the applications they use in a natural way."
Against this backdrop, Li sees substantial room for growth. "UX Research is a sub-sector of the $140B+ global market-research industry," Li said. "This includes both the software layer historically (~$430M) and professional services spend on UX research, design, product strategy, etc. which is conservatively estimated to be ~$6.4B+ annually. As software in this vertical, led by Strella, becomes more powerful, we believe the TAM will continue to expand meaningfully."
Making customer feedback accessible across the enterprise, not just research teams
The founders describe their mission as "democratizing access to the customer" — making it possible for anyone in an organization to understand customer perspectives without waiting for dedicated research teams to complete months-long studies.
"Many, many, many positions in the organization would like to get customer feedback, but it's so hard right now," Hylton said. With Strella, she explained, someone can "log into Strella and through a chat, create any highlight reel that you want and actually see customers in their own words answering the question that you have based on the research that's already been done."
This video-first approach to research repositories changes organizational dynamics around customer feedback. "Then you can say, 'Okay, engineering team, we need to build this feature. And here's the customer actually saying it,'" Hylton continued. "'This is not me. This isn't politics. Here are seven customers saying they can't find the Checkout button.' The fact that we are a very video-based platform really allows us to do that quickly and painlessly."
The company has moved decisively upmarket, with contract values now typically in the five-figure range and "several six figure contracts" signed, according to Krishnan. The pricing strategy reflects a premium positioning: "Our product is very good, it's very premium. We're charging based on the value it provides to customers," Krishnan said, rather than competing on cost alone.
This approach appears to be working: every pilot program to date has converted to a paid contract across the company's 40-45 customers.
The roadmap: Computer vision, agentic AI, and human-machine collaboration
The Series A funding will primarily support scaling product and go-to-market teams. "We're really confident that we have product-market fit," Hylton said. "And now the question is execution, and we want to hire a lot of really talented people to help us execute."
On the product roadmap, Hylton emphasized continued focus on the participant experience as the key to winning the market. "Everything else is downstream of a joyful participant experience," she said, including "the quality of insights, the amount you have to pay people to do the interviews, and the way that your customers feel about a company."
Near-term priorities include adding visual capabilities so the AI moderator can respond to facial expressions and other nonverbal cues, and building more sophisticated collaboration features between human researchers and AI moderators. "Maybe you want to listen while an AI moderator is running a call and you might want to be able to jump in with specific questions," Hylton said. "Or you want to run an interview yourself, but you want the moderator to be there as backup or to help you."
These features move toward what the industry calls "agentic AI" — systems that can act more autonomously while still collaborating with humans. The founders see this human-AI collaboration, rather than full automation, as the sustainable path forward.
"We believe that a lot of the really strategic work that companies do will continue to be human moderated," Hylton said. "And you can still do that through Strella and just use us for synthesis in those cases."
For Li and Bessemer, the bet is on founders who understand this nuance. "Lydia and Priya exhibit the exact archetype of founders we are excited to partner with for the long term — customer-obsessed, transparent, thoughtful, and singularly driven towards the home-run scenario," she said.
The company declined to disclose specific revenue figures or valuation. With the new funding, Strella has now raised $18 million total, including a $4 million seed round led by Decibel Partners announced in October.
As Strella scales, the founders remain focused on a vision where technology enhances rather than eliminates human judgment—where an engineering team doesn't just read a research report, but watches seven customers struggle to find the same button. Where a product manager can query months of accumulated interviews in seconds. Where companies don't choose between speed and depth, but get both.
"The interesting part of the business is actually collecting that proprietary dataset, collecting qualitative research at scale," Hylton said, describing what she sees as Strella's long-term moat. Not replacing the researcher, but making everyone in the company one.