• Alfred Wahlforss was running out of options. His startup, Listen Labs, needed to hire over 100 engineers, but competing against Mark Zuckerberg's $100 million offers seemed impossible. So he spent $5,000 — a fifth of his marketing budget — on a billboard in San Francisco displaying what looked like gibberish: five strings of random numbers.

    The numbers were actually AI tokens. Decoded, they led to a coding challenge: build an algorithm to act as a digital bouncer at Berghain, the Berlin nightclub famous for rejecting nearly everyone at the door. Within days, thousands had attempted the puzzle; 430 cracked it. Some got hired. The winner flew to Berlin, all expenses paid.

    That unconventional approach has now attracted $69 million in Series B funding, led by Ribbit Capital with participation from Evantic and existing investors Sequoia Capital, Conviction, and Pear VC. The round values Listen Labs at $500 million and brings its total capital raised to $100 million. In the nine months since launch, the company has grown annualized revenue 15x to eight figures and conducted over one million AI-powered interviews.

    "When you obsess over customers, everything else follows," Wahlforss said in an interview with VentureBeat. "Teams that use Listen bring the customer into every decision, from marketing to product, and when the customer is delighted, everyone is."

    Why traditional market research is broken, and what Listen Labs is building to fix it

    Listen's AI researcher finds participants, conducts in-depth interviews, and delivers actionable insights in hours, not weeks. The platform replaces the traditional choice between quantitative surveys — which provide statistical precision but miss nuance — and qualitative interviews, which deliver depth but cannot scale.

    Wahlforss explained the limitation of existing approaches: "Essentially surveys give you false precision because people end up answering the same question... You can't get the outliers. People are actually not honest on surveys." The alternative, one-on-one human interviews, "gives you a lot of depth. You can ask follow up questions. You can kind of double check if they actually know what they're talking about. And the problem is you can't scale that."

    The platform works in four steps: users create a study with AI assistance, Listen recruits participants from its global network of 30 million people, an AI moderator conducts in-depth interviews with follow-up questions, and results are packaged into executive-ready reports including key themes, highlight reels, and slide decks.

    What distinguishes Listen's approach is its use of open-ended video conversations rather than multiple-choice forms. "In a survey, you can kind of guess what you should answer, and you have four options," Wahlforss said. "Oh, they probably want me to be high income. Let me click on that button versus an open ended response. It just generates much more honesty."

    The dirty secret of the $140 billion market research industry: rampant fraud

    Listen finds and qualifies the right participants in its global network of 30 million people. But building that panel required confronting what Wahlforss called "one of the most shocking things that we've learned when we entered this industry" — rampant fraud.

    "Essentially, there's a financial transaction involved, which means there will be bad players," he explained. "We actually had some of the largest companies, some of them have billions in revenue, send us people who claim to be kind of enterprise buyers to our platform and our system immediately detected, like, fraud, fraud, fraud, fraud, fraud."

    The company built what it calls a "quality guard" that cross-references LinkedIn profiles with video responses to verify identity, checks consistency across how participants answer questions, and flags suspicious patterns. The result, according to Wahlforss: "People talk three times more. They're much more honest when they talk about sensitive topics like politics and mental health."

    Emeritus, an online education company that uses Listen, reported that approximately 20% of survey responses previously fell into the fraudulent or low-quality category. With Listen, they reduced this to almost zero. "We did not have to replace any responses because of fraud or gibberish information," said Gabrielli Tiburi, Assistant Manager of Customer Insights at Emeritus.

    How Microsoft, Sweetgreen, and Chubbies are using AI interviews to build better products

    The speed advantage has proven central to Listen's pitch. Traditional customer research at Microsoft could take four to six weeks to generate insights. "By the time we get to them, either the decision has been made or we lose out on the opportunity to actually influence it," said Romani Patel, Senior Research Manager at Microsoft.

    With Listen, Microsoft can now get insights in days, and in many cases, within hours.

    The platform has already powered several high-profile initiatives. Microsoft used Listen Labs to collect global customer stories for its 50th anniversary celebration. "We wanted users to share how Copilot is empowering them to bring their best self forward," Patel said, "and we were able to collect those user video stories within a day." Traditionally, that kind of work would have taken six to eight weeks.

    Simple Modern, an Oklahoma-based drinkware company, used Listen to test a new product concept. The process took about an hour to write questions, an hour to launch the study, and 2.5 hours to receive feedback from 120 people across the country. "We went from 'Should we even have this product?' to 'How should we launch it?'" said Chris Hoyle, the company's Chief Marketing Officer.

    Chubbies, the shorts brand, achieved a 24x increase in youth research participation — growing from 5 to 120 participants — by using Listen to overcome the scheduling challenges of traditional focus groups with children. "There's school, sports, dinner, and homework," explained Lauren Neville, Director of Insights and Innovation. "I had to find a way to hear from them that fit into their schedules."

    The company also discovered product issues through AI interviews that might have gone undetected otherwise. Wahlforss described how the AI "through conversations, realized there were like issues with the kids short line, and decided to, like, interview hundreds of kids. And I understand that there were issues in the liner of the shorts and that they were, like, scratchy, quote, unquote, according to the people interviewed." The redesigned product became "a blockbuster hit."

    The Jevons paradox explains why cheaper research creates more demand, not less

    Listen Labs is entering a massive but fragmented market. Wahlforss cited research from Andreessen Horowitz estimating the market research industry at roughly $140 billion annually, populated by legacy players — some with more than a billion dollars in revenue — that he believes are vulnerable to disruption.

    "There are very much existing budget lines that we are replacing," Wahlforss said. "Why we're replacing them is that one, they're super costly. Two, they're kind of stuck in this old paradigm of choosing between a survey or interview, and they also take months to work with."

    But the more intriguing dynamic may be that AI-powered research doesn't just replace existing spending — it creates new demand. Wahlforss invoked the Jevons paradox, the economic principle that when technological advances make a resource cheaper and more efficient to use, overall consumption of that resource rises rather than falls.

    "What I've noticed is that as something gets cheaper, you don't need less of it. You want more of it," Wahlforss explained. "There's infinite demand for customer understanding. So the researchers on the team can do an order of magnitude more research, and also other people who weren't researchers before can now do that as part of their job."

    Inside the elite engineering team that built Listen Labs before they had a working toilet

    Listen Labs traces its origins to a consumer app that Wahlforss and his co-founder built after meeting at Harvard. "We built this consumer app that got 20,000 downloads in one day," Wahlforss recalled. "We had all these users, and we were thinking like, okay, what can we do to get to know them better? And we built this prototype of what Listen is today."

    The founding team brings an unusual pedigree. Wahlforss's co-founder "was the national champion in competitive programming in Germany, and he worked at Tesla Autopilot." The company claims that 30% of its engineering team are medalists from the International Olympiad in Informatics — the same competition that produced the founders of Cognition, the AI coding startup.

    The Berghain billboard stunt generated approximately 5 million views across social media, according to Wahlforss. It reflected the intensity of the talent war in the Bay Area.

    "We had to do these things because some of our, like early employees, joined the company before we had a working toilet," he said. "But now we fixed that situation."

    The company grew from 5 to 40 employees in 2024 and plans to reach 150 this year. It hires engineers for non-engineering roles across marketing, growth, and operations — a bet that in the AI era, technical fluency matters everywhere.

    Synthetic customers and automated decisions: what Listen Labs is building next

    Wahlforss outlined an ambitious product roadmap that pushes into more speculative territory. The company is building "the ability to simulate your customers, so you can take all of those interviews we've done, and then extrapolate based on that and create synthetic users or simulated user voices."

    Beyond simulation, Listen aims to enable automated action based on research findings. "Can you not just make recommendations, but also spawn agents to either change things in code or some customer churns? Can you give them a discount and try to bring them back?"

    Wahlforss acknowledged the ethical implications. "Obviously, as you said, there's kind of ethical concerns there. Of like, automated decision making overall can be bad, but we will have considerable guardrails to make sure that the companies are always in the loop."

    The company already handles sensitive data with care. "We don't train on any of the data," Wahlforss said. "We will also scrub any sensitive PII automatically so the model can detect that. And there are times when, for example, you work with investors, where if you accidentally mention something that could be material, non public information, the AI can actually detect that and remove any information like that."

    How AI could reshape the future of product development

    Perhaps the most provocative implication of Listen's model is how it could reshape product development itself. Wahlforss described a customer — an Australian startup — that has adopted what amounts to a continuous feedback loop.

    "They're based in Australia, so they're coding during the day, and then in their night, they're releasing a Listen study with an American audience. Listen validates whatever they built during the day, and they get feedback on that. They can then plug that feedback directly into coding tools like Claude Code and iterate."

    The vision extends Y Combinator's famous dictum — "write code, talk to users" — into an automated cycle. "Write code is now getting automated. And I think like talk to users will be as well, and you'll have this kind of infinite loop where you can start to ship this truly amazing product, almost kind of autonomously."

    Whether that vision materializes depends on factors beyond Listen's control — the continued improvement of AI models, enterprise willingness to trust automated research, and whether speed truly correlates with better products. A 2024 MIT study found that 95% of AI pilots fail to move into production, a statistic Wahlforss cited as the reason he emphasizes quality over demos.

    "I'm constantly have to emphasize like, let's make sure the quality is there and the details are right," he said.

    But the company's growth suggests appetite for the experiment. Microsoft's Patel said Listen has "removed the drudgery of research and brought the fun and joy back into my work." Chubbies is now pushing its founder to give everyone in the company a login. Sling Money, a stablecoin payments startup, can create a survey in ten minutes and receive results the same day.

    "It's a total game changer," said Ali Romero, Sling Money's marketing manager.

    Wahlforss has a different phrase for what he's building. When asked about the tension between speed and rigor — the long-held belief that moving fast means cutting corners — he cited Nat Friedman, the former GitHub CEO and Listen investor, who keeps a list of one-liners on his website.

    One of them: "Slow is fake."

    It's an aggressive claim for an industry built on methodological caution. But Listen Labs is betting that in the AI era, the companies that listen fastest will be the ones that win. The only question is whether customers will talk back.

  • Salesforce on Tuesday launched an entirely rebuilt version of Slackbot, the company's workplace assistant, transforming it from a simple notification tool into what executives describe as a fully powered AI agent capable of searching enterprise data, drafting documents, and taking action on behalf of employees.

    The new Slackbot, now generally available to Business+ and Enterprise+ customers, is Salesforce's most aggressive move yet to position Slack at the center of the emerging "agentic AI" movement — where software agents work alongside humans to complete complex tasks. The launch comes as Salesforce attempts to convince investors that artificial intelligence will bolster its products rather than render them obsolete.

    "Slackbot isn't just another copilot or AI assistant," said Parker Harris, Salesforce co-founder and Slack's chief technology officer, in an exclusive interview with Salesforce. "It's the front door to the agentic enterprise, powered by Salesforce."

    From tricycle to Porsche: Salesforce rebuilt Slackbot from the ground up

    Harris was blunt about what distinguishes the new Slackbot from its predecessor: "The old Slackbot was, you know, a little tricycle, and the new Slackbot is like, you know, a Porsche."

    The original Slackbot, which has existed since Slack's early days, performed basic algorithmic tasks — reminding users to add colleagues to documents, suggesting channel archives, and delivering simple notifications. The new version runs on an entirely different architecture built around a large language model and sophisticated search capabilities that can access Salesforce records, Google Drive files, calendar data, and years of Slack conversations.

    "It's two different things," Harris explained. "The old Slackbot was algorithmic and fairly simple. The new Slackbot is brand new — it's based around an LLM and a very robust search engine, and connections to third-party search engines, third-party enterprise data."

    Salesforce chose to retain the Slackbot brand despite the fundamental technical overhaul. "People know what Slackbot is, and so we wanted to carry that forward," Harris said.

    Why Anthropic's Claude powers the new Slackbot — and which AI models could come next

    The new Slackbot runs on Claude, Anthropic's large language model, a choice driven partly by compliance requirements. Slack's commercial service operates under FedRAMP Moderate certification to serve U.S. federal government customers, and Harris said Anthropic was "the only provider that could give us a compliant LLM" when Slack began building the new system.

    But that exclusivity won't last. "We are, this year, going to support additional providers," Harris said. "We have a great relationship with Google. Gemini is incredible — performance is great, cost is great. So we're going to use Gemini for some things." He added that OpenAI remains a possibility as well.

    Harris echoed Salesforce CEO Marc Benioff's view that large language models are becoming commoditized: "You've heard Marc talk about LLMs are commodities, that they're democratized. I call them CPUs."

    On the sensitive question of training data, Harris was unequivocal: Salesforce does not train any models on customer data. "Models don't have any sort of security," he explained. "If we trained it on some confidential conversation that you and I have, I don't want Carolyn to know — if I train it into the LLM, there is no way for me to say you get to see the answer, but Carolyn doesn't."

    Inside Salesforce's internal experiment: 80,000 employees tested Slackbot with striking results

    Salesforce has been testing the new Slackbot internally for months, rolling it out to all 80,000 employees. According to Ryan Gavin, Slack's chief marketing officer, the results have been striking: "It's the fastest adopted product in Salesforce history."

    Internal data shows that two-thirds of Salesforce employees have tried the new Slackbot, with 80% of those users continuing to use it regularly. Internal satisfaction rates reached 96% — the highest for any AI feature Slack has shipped. Employees report saving between two and 20 hours per week.

    The adoption happened largely organically. "I think it was about five days, and a Canvas was developed by our employees called 'The Most Stealable Slackbot Prompts,'" Gavin said. "People just started adding to it organically. I think it's up to 250-plus prompts that are in this Canvas right now."

    Kate Crotty, a principal UX researcher at Salesforce, found that 73% of internal adoption was driven by social sharing rather than top-down mandates. "Everybody is there to help each other learn and communicate hacks," she said.

    How Slackbot transforms scattered enterprise data into executive-ready insights

    During a product demonstration, Amy Bauer, Slack's product experience designer, showed how Slackbot can synthesize information across multiple sources. In one example, she asked Slackbot to analyze customer feedback from a pilot program, uploaded an image of a usage dashboard, and had Slackbot correlate the qualitative and quantitative data.

    "This is where Slackbot really earns its keep for me," Bauer explained. "What it's doing is not just simply reading the image — it's actually looking at the image and comparing it to the insight it just generated for me."

    Slackbot can then query Salesforce to find enterprise accounts with open deals that might be good candidates for early access, creating what Bauer called "a really great justification and plan to move forward." Finally, it can synthesize all that information into a Canvas — Slack's collaborative document format — and find calendar availability among stakeholders to schedule a review meeting.

    "Up until this point, we have been working in a one-to-one capacity with Slackbot," Bauer said. "But one of the benefits that I can do now is take this insight and have it generate this into a Canvas, a shared workspace where I can iterate on it, refine it with Slackbot, or share it out with my team."

    Rob Seaman, Slack's chief product officer, said the Canvas creation demonstrates where the product is heading: "This is making a tool call internally to Slack Canvas to actually write, effectively, a shared document. But it signals where we're going with Slackbot — we're eventually going to be adding in additional third-party tool calls."

    MrBeast's company became a Slackbot guinea pig — and employees say they're saving 90 minutes a day

    Among Salesforce's pilot customers is Beast Industries, the parent company of YouTube star MrBeast. Luis Madrigal, the company's chief information officer, joined the launch announcement to describe his experience.

    "As somebody who has rolled out enterprise technologies for over two decades now, this was practically one of the easiest," Madrigal said. "The plumbing is there. Slack as an implementation, Enterprise Tools — being able to turn on the Slackbot and the Slack AI functionality was as simple as having my team go in, review, do a quick security review."

    Madrigal said his security team signed off "rather quickly" — unusual for enterprise AI deployments — because Slackbot accesses only the information each individual user already has permission to view. "Given all the guardrails you guys have put into place for Slackbot to be unique and customized to only the information that each individual user has, only the conversations and the Slack rooms and Slack channels that they're part of — that made my security team sign off rather quickly."

    One Beast Industries employee, Sinan, the head of Beast Games marketing, reported saving "at bare minimum, 90 minutes a day." Another employee, Spencer, a creative supervisor, described it as "an assistant who's paying attention when I'm not."

    Other pilot customers include Slalom, reMarkable, Xero, Mercari, and Engine. Mollie Bodensteiner, SVP of Operations at Engine, called Slackbot "an absolute 'chaos tamer' for our team," estimating it saves her about 30 minutes daily "just by eliminating context switching."

    Slackbot vs. Microsoft Copilot vs. Google Gemini: The fight for enterprise AI dominance

    The launch puts Salesforce in direct competition with Microsoft's Copilot, which is integrated into Teams and the broader Microsoft 365 suite, as well as Google's Gemini integrations across Workspace. When asked what distinguishes Slackbot from these alternatives, Seaman pointed to context and convenience.

    "The thing that makes it most powerful for our customers and users is the proximity — it's just right there in your Slack," Seaman said. "There's a tremendous convenience affordance that's naturally built into it."

    The deeper advantage, executives argue, is that Slackbot already understands users' work without requiring setup or training. "Most AI tools sound the same no matter who is using them," the company's announcement stated. "They lack context, miss nuance, and force you to jump between tools to get anything done."

    Harris put it more directly: "If you've ever had that magic experience with AI — I think ChatGPT is a great example, it's a great experience from a consumer perspective — Slackbot is really what we're doing in the enterprise, to be this employee super agent that is loved, just like people love using Slack."

    Amy Bauer emphasized the frictionless nature of the experience. "Slackbot is inherently grounded in the context, in the data that you have in Slack," she said. "So as you continue working in Slack, Slackbot gets better because it's grounded in the work that you're doing there. There is no setup. There is no configuration for those end users."

    Salesforce's ambitious plan to make Slackbot the one 'super agent' that controls all the others

    Salesforce positions Slackbot as what Harris calls a "super agent" — a central hub that can eventually coordinate with other AI agents across an organization.

    "Every corporation is going to have an employee super agent," Harris said. "Slackbot is essentially taking the magic of what Slack does. We think that Slackbot, and we're really excited about it, is going to be that."

    The vision extends to third-party agents already launching in Slack. Last month, Anthropic released a preview of Claude Code for Slack, allowing developers to interact with Claude's coding capabilities directly in chat threads. OpenAI, Google, Vercel, and others have also built agents for the platform.

    "Most of the net-new apps that are being deployed to Slack are agents," Seaman noted during the press conference. "This is proof of the promise of humans and agents coexisting and working together in Slack to solve problems."

    Harris described a future where Slackbot becomes an MCP (Model Context Protocol) client, able to leverage tools from across the software ecosystem — similar to how the developer tool Cursor works. "Slack can be an MCP client, and Slackbot will be the hub of that, leveraging all these tools out in the world, some of which will be these amazing agents," he said.

    But Harris also cautioned against over-promising on multi-agent coordination. "I still think we're in the single agent world," he said. "FY26 is going to be the year where we started to see more coordination. But we're going to do it with customer success in mind, and not demonstrate and talk about, like, 'I've got 1,000 agents working together,' because I think that's unrealistic."

    Slackbot costs nothing extra, but Salesforce's data access fees could squeeze some customers

    Slackbot is included at no additional cost for customers on Business+ and Enterprise+ plans. "There's no additional fees customers have to do," Gavin confirmed. "If they're on one of those plans, they're going to get Slackbot."

    However, some enterprise customers may face other cost pressures related to Salesforce's broader data strategy. CIOs may see price increases for third-party applications that work with Salesforce data, as effects of higher charges for API access ripple through the software supply chain.

    Fivetran CEO George Fraser has warned that Salesforce's shift in pricing policy for API access could have tangible consequences for enterprises relying on Salesforce as a system of record. "They might not be able to use Fivetran to replicate their data to Snowflake and instead have to use Salesforce Data Cloud. Or they might find that they are not able to interact with their data via ChatGPT, and instead have to use Agentforce," Fraser said in a recent CIO report.

    Salesforce has framed the pricing change as standard industry practice.

    What Slackbot can do today, what's coming in weeks, and what's still on the roadmap

    The new Slackbot begins rolling out today and will reach all eligible customers by the end of February. Mobile availability will be complete by March 3, Bauer confirmed during her interview with VentureBeat.

    Some capabilities remain works in progress. Calendar reading and availability checking are available at launch, but the ability to actually book meetings is "coming a few weeks after," according to Seaman. Image generation is not currently supported, though Bauer said it's "something that we are looking at in the future."

    When asked about integration with competing CRM systems like HubSpot and Microsoft Dynamics, Salesforce representatives declined to provide specifics during the interview, though they acknowledged the question touched on key competitive differentiators.

    Salesforce is betting the future of work looks like a chat window — and it's not alone

    The Slackbot launch is Salesforce's bet that the future of enterprise work is conversational — that employees will increasingly prefer to interact with AI through natural language rather than navigating traditional software interfaces.

    Harris described Slack's product philosophy using principles like "don't make me think" and "be a great host." The goal, he said, is for Slackbot to surface information proactively rather than requiring users to hunt for it.

    "One of the revelations for me is LLMs applied to unstructured information are incredible," Harris said. "And the amount of value you have if you're a Slack user, if your corporation uses Slack — the amount of value in Slack is unbelievable. Because you're talking about work, you're sharing documents, you're making decisions, but you can't as a human go through that and really get the same value that an LLM can do."

    Looking ahead, Harris expects the interfaces themselves to evolve beyond pure conversation. "We're kind of saturating what we can do with purely conversational UIs," he said. "I think we'll start to see agents building an interface that best suits your intent, as opposed to trying to surface something within a conversational interface that matches your intent."

    Microsoft, Google, and a growing roster of AI startups are placing similar bets — that the winning enterprise AI will be the one embedded in the tools workers already use, not another application to learn. The race to become that invisible layer of workplace intelligence is now fully underway.

    For Salesforce, the stakes extend beyond a single product launch. After a bruising year on Wall Street and persistent questions about whether AI threatens its core business, the company is wagering that Slackbot can prove the opposite — that having tens of millions of people already chatting in Slack every day is not a vulnerability, but an unassailable advantage.

    Haley Gault, a Salesforce account executive in Pittsburgh who stumbled upon the new Slackbot on a snowy morning, captured the shift in a single sentence: "I honestly can't imagine working for another company not having access to these types of tools. This is just how I work now."

    That's precisely what Salesforce is counting on.

  • Anthropic released Cowork on Monday, a new AI agent capability that extends the power of its wildly successful Claude Code tool to non-technical users — and according to company insiders, the team built the entire feature in approximately a week and a half, largely using Claude Code itself.

    The launch marks a major inflection point in the race to deliver practical AI agents to mainstream users, positioning Anthropic to compete not just with OpenAI and Google in conversational AI, but with Microsoft's Copilot in the burgeoning market for AI-powered productivity tools.

    "Cowork lets you complete non-technical tasks much like how developers use Claude Code," the company announced via its official Claude account on X. The feature arrives as a research preview available exclusively to Claude Max subscribers — Anthropic's power-user tier priced between $100 and $200 per month — through the macOS desktop application.

    For the past year, the industry narrative has focused on large language models that can write poetry or debug code. With Cowork, Anthropic is betting that the real enterprise value lies in an AI that can open a folder, read a messy pile of receipts, and generate a structured expense report without human hand-holding.

    How developers using a coding tool for vacation research inspired Anthropic's latest product

    The genesis of Cowork lies in Anthropic's recent success with the developer community. In early 2025, the company released Claude Code, a terminal-based tool that allowed software engineers to automate rote programming tasks. The tool was a hit, but Anthropic noticed a peculiar trend: users were forcing the coding tool to perform non-coding labor.

    According to Boris Cherny, an engineer at Anthropic, the company observed users deploying the developer tool for an unexpectedly diverse array of tasks.

    "Since we launched Claude Code, we saw people using it for all sorts of non-coding work: doing vacation research, building slide decks, cleaning up your email, cancelling subscriptions, recovering wedding photos from a hard drive, monitoring plant growth, controlling your oven," Cherny wrote on X. "These use cases are diverse and surprising — the reason is that the underlying Claude Agent is the best agent, and Opus 4.5 is the best model."

    Recognizing this shadow usage, Anthropic effectively stripped the command-line complexity from their developer tool to create a consumer-friendly interface. In its blog post announcing the feature, Anthropic explained that developers "quickly began using it for almost everything else," which "prompted us to build Cowork: a simpler way for anyone — not just developers — to work with Claude in the very same way."

    Inside the folder-based architecture that lets Claude read, edit, and create files on your computer

    Unlike a standard chat interface where a user pastes text for analysis, Cowork requires a different level of trust and access. Users designate a specific folder on their local machine that Claude can access. Within that sandbox, the AI agent can read existing files, modify them, or create entirely new ones.

    Anthropic offers several illustrative examples: reorganizing a cluttered downloads folder by sorting and intelligently renaming each file, generating a spreadsheet of expenses from a collection of receipt screenshots, or drafting a report from scattered notes across multiple documents.

    "In Cowork, you give Claude access to a folder on your computer. Claude can then read, edit, or create files in that folder," the company explained on X. "Try it to create a spreadsheet from a pile of screenshots, or produce a first draft from scattered notes."

    The architecture relies on what is known as an "agentic loop." When a user assigns a task, the AI does not merely generate a text response. Instead, it formulates a plan, executes steps in parallel, checks its own work, and asks for clarification if it hits a roadblock. Users can queue multiple tasks and let Claude process them simultaneously — a workflow Anthropic describes as feeling "much less like a back-and-forth and much more like leaving messages for a coworker."
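
    That loop is simpler than it sounds. The minimal Python sketch below illustrates the pattern (plan, execute, verify, ask for clarification) with stubbed-out model and tool calls; every function here is hypothetical and stands in for Anthropic's unpublished implementation.

    ```python
    from pathlib import Path

    def plan_steps(task: str, files: list[str]) -> list[str]:
        # Stub for an LLM call that turns a task into concrete steps.
        return [f"summarize {name}" for name in files if name.endswith(".txt")]

    def execute(step: str, folder: Path) -> str:
        # Stub for a tool call that reads, edits, or creates files in the
        # user-designated folder (the sandbox described above).
        return (folder / step.split()[-1]).read_text()[:200]

    def check(result: str) -> bool:
        # Stub for the self-verification pass in the agentic loop.
        return bool(result.strip())

    def run_task(task: str, folder: Path) -> list[str]:
        results = []
        for step in plan_steps(task, [p.name for p in folder.iterdir()]):
            output = execute(step, folder)
            if not check(output):
                # Cowork pauses and asks the user when it hits a roadblock.
                output = input(f"Step {step!r} needs clarification: ")
            results.append(output)
        return results
    ```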

    The system is built on Anthropic's Claude Agent SDK, meaning it shares the same underlying architecture as Claude Code. Anthropic notes that Cowork "can take on many of the same tasks that Claude Code can handle, but in a more approachable form for non-coding tasks."

    The recursive loop where AI builds AI: Claude Code reportedly wrote much of Claude Cowork

    Perhaps the most remarkable detail surrounding Cowork's launch is the speed at which the tool was reportedly built — highlighting a recursive feedback loop where AI tools are being used to build better AI tools.

    During a livestream hosted by Dan Shipper, Felix Rieseberg, an Anthropic employee, confirmed that the team built Cowork in approximately a week and a half.

    Alex Volkov, who covers AI developments, expressed surprise at the timeline: "Holy shit Anthropic built 'Cowork' in the last... week and a half?!"

    This prompted immediate speculation about how much of Cowork was itself built by Claude Code. Simon Smith, EVP of Generative AI at Klick Health, put it bluntly on X: "Claude Code wrote all of Claude Cowork. Can we all agree that we're in at least somewhat of a recursive improvement loop here?"

    The implication is profound: Anthropic's AI coding agent may have substantially contributed to building its own non-technical sibling product. If true, this is one of the most visible examples yet of AI systems being used to accelerate their own development and expansion — a strategy that could widen the gap between AI labs that successfully deploy their own agents internally and those that do not.

    Connectors, browser automation, and skills extend Cowork's reach beyond the local file system

    Cowork doesn't operate in isolation. The feature integrates with Anthropic's existing ecosystem of connectors — tools that link Claude to external information sources and services such as Asana, Notion, PayPal, and other supported partners. Users who have configured these connections in the standard Claude interface can leverage them within Cowork sessions.

    Additionally, Cowork can pair with Claude in Chrome, Anthropic's browser extension, to execute tasks requiring web access. This combination allows the agent to navigate websites, click buttons, fill forms, and extract information from the internet — all while operating from the desktop application.

    "Cowork includes a number of novel UX and safety features that we think make the product really special," Cherny explained, highlighting "a built-in VM [virtual machine] for isolation, out of the box support for browser automation, support for all your claude.ai data connectors, asking you for clarification when it's unsure."

    Anthropic has also introduced an initial set of "skills" specifically designed for Cowork that enhance Claude's ability to create documents, presentations, and other files. These build on the Skills for Claude framework the company announced in October, which provides specialized instruction sets Claude can load for particular types of tasks.

    Why Anthropic is warning users that its own AI agent could delete their files

    The transition from a chatbot that suggests edits to an agent that makes edits introduces significant risk. An AI that can organize files can, theoretically, delete them.

    In a notable display of transparency, Anthropic devoted considerable space in its announcement to warning users about Cowork's potential dangers — an unusual approach for a product launch.

    The company explicitly acknowledges that Claude "can take potentially destructive actions (such as deleting local files) if it's instructed to." Because Claude might occasionally misinterpret instructions, Anthropic urges users to provide "very clear guidance" about sensitive operations.

    More concerning is the risk of prompt injection attacks — a technique where malicious actors embed hidden instructions in content Claude might encounter online, potentially causing the agent to bypass safeguards or take harmful actions.

    "We've built sophisticated defenses against prompt injections," Anthropic wrote, "but agent safety — that is, the task of securing Claude's real-world actions — is still an active area of development in the industry."

    The company characterized these risks as inherent to the current state of AI agent technology rather than unique to Cowork. "These risks aren't new with Cowork, but it might be the first time you're using a more advanced tool that moves beyond a simple conversation," the announcement notes.

    Anthropic's desktop agent strategy sets up a direct challenge to Microsoft Copilot

    The launch of Cowork places Anthropic in direct competition with Microsoft, which has spent years attempting to integrate its Copilot AI into the fabric of the Windows operating system with mixed adoption results.

    However, Anthropic's approach differs in its isolation. By confining the agent to specific folders and requiring explicit connectors, the company is attempting to strike a balance between the utility of an OS-level agent and the security of a sandboxed application.

    What distinguishes Anthropic's approach is its bottom-up evolution. Rather than designing an AI assistant and retrofitting agent capabilities, Anthropic built a powerful coding agent first — Claude Code — and is now abstracting its capabilities for broader audiences. This technical lineage may give Cowork more robust agentic behavior from the start.

    Claude Code has generated significant enthusiasm among developers since its initial launch as a command-line tool in early 2025. The company expanded access with a web interface in October 2025, followed by a Slack integration in December. Cowork is the next logical step: bringing the same agentic architecture to users who may never touch a terminal.

    Who can access Cowork now, and what's coming next for Windows and other platforms

    For now, Cowork remains exclusive to Claude Max subscribers using the macOS desktop application. Users on other subscription tiers — Free, Pro, Team, or Enterprise — can join a waitlist for future access.

    Anthropic has signaled clear intentions to expand the feature's reach. The blog post explicitly mentions plans to add cross-device sync and bring Cowork to Windows as the company learns from the research preview.

    Cherny set expectations appropriately, describing the product as "early and raw, similar to what Claude Code felt like when it first launched."

    To access Cowork, Max subscribers can download or update the Claude macOS app and click on "Cowork" in the sidebar.

    The real question facing enterprise AI adoption

    For technical decision-makers, the implications of Cowork extend beyond any single product launch. The bottleneck for AI adoption is shifting: the limiting factor is no longer model intelligence but workflow integration and user trust.

    Anthropic's goal, as the company puts it, is to make working with Claude feel less like operating a tool and more like delegating to a colleague. Whether mainstream users are ready to hand over folder access to an AI that might misinterpret their instructions remains an open question.

    But the speed of Cowork's development — a major feature built in ten days, possibly by the company's own AI — previews a future where the capabilities of these systems compound faster than organizations can evaluate them.

    The chatbot has learned to use a file manager. What it learns to use next is anyone's guess.

  • Nous Research, the open-source artificial intelligence startup backed by crypto venture firm Paradigm, released a new competitive programming model on Monday that it says matches or exceeds several larger proprietary systems — trained in just four days using 48 of Nvidia's latest B200 graphics processors.

    The model, called NousCoder-14B, is another entry in a crowded field of AI coding assistants, but arrives at a particularly charged moment: Claude Code, the agentic programming tool from rival Anthropic, has dominated social media discussion since New Year's Day, with developers posting breathless testimonials about its capabilities. The simultaneous developments underscore how quickly AI-assisted software development is evolving — and how fiercely companies large and small are competing to capture what many believe will become a foundational technology for how software gets written.

    NousCoder-14B achieves a 67.87 percent accuracy rate on LiveCodeBench v6, a standardized evaluation that tests models on competitive programming problems published between August 2024 and May 2025. That figure represents a 7.08 percentage point improvement over the base model it was trained from, Alibaba's Qwen3-14B, according to Nous Research's technical report published alongside the release.

    "I gave Claude Code a description of the problem, it generated what we built last year in an hour," wrote Jaana Dogan, a principal engineer at Google responsible for the Gemini API, in a viral post on X last week that captured the prevailing mood around AI coding tools. Dogan was describing a distributed agent orchestration system her team had spent a year developing — a system Claude Code approximated from a three-paragraph prompt.

    The juxtaposition is instructive: while Anthropic's Claude Code has captured imaginations with demonstrations of end-to-end software development, Nous Research is betting that open-source alternatives trained on verifiable problems can close the gap — and that transparency in how these models are built matters as much as raw capability.


    How Nous Research built an AI coding model that anyone can replicate

    What distinguishes the NousCoder-14B release from many competitor announcements is its radical openness. Nous Research published not just the model weights but the complete reinforcement learning environment, benchmark suite, and training harness — built on the company's Atropos framework — enabling any researcher with sufficient compute to reproduce or extend the work.

    "Open-sourcing the Atropos stack provides the necessary infrastructure for reproducible olympiad-level reasoning research," noted one observer on X, summarizing the significance for the academic and open-source communities.

    The model was trained by Joe Li, a researcher in residence at Nous Research and a former competitive programmer himself. Li's technical report reveals an unexpectedly personal dimension: he compared the model's improvement trajectory to his own journey on Codeforces, the competitive programming platform where participants earn ratings based on contest performance.

    Based on rough estimates mapping LiveCodeBench scores to Codeforces ratings, Li calculated that NousCoder-14B's improvement — from approximately the 1600-1750 rating range to 2100-2200 — mirrors a leap that took him nearly two years of sustained practice between ages 14 and 16. The model accomplished the equivalent in four days.

    "Watching that final training run unfold was quite a surreal experience," Li wrote in the technical report.

    But Li was quick to note an important caveat that speaks to broader questions about AI efficiency: he solved roughly 1,000 problems during those two years, while the model required 24,000. Humans, at least for now, remain dramatically more sample-efficient learners.


    Inside the reinforcement learning system that trains on 24,000 competitive programming problems

    NousCoder-14B's training process offers a window into the increasingly sophisticated techniques researchers use to improve AI reasoning capabilities through reinforcement learning.

    The approach relies on what researchers call "verifiable rewards" — a system where the model generates code solutions, those solutions are executed against test cases, and the model receives a simple binary signal: correct or incorrect. This feedback loop, while conceptually straightforward, requires significant infrastructure to execute at scale.

    Nous Research used Modal, a cloud computing platform, to run sandboxed code execution in parallel. Each of the 24,000 training problems contains hundreds of test cases on average, and the system must verify that generated code produces correct outputs within time and memory constraints — 15 seconds and 4 gigabytes, respectively.
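
    The heart of that verification fits in a few lines. The sketch below is a simplification, assuming Python solutions judged by exact stdout matching; the production system runs sandboxed on Modal, massively in parallel, and also enforces the 4-gigabyte memory cap omitted here.

    ```python
    import subprocess

    def binary_reward(solution_path: str, tests: list[tuple[str, str]],
                      time_limit: float = 15.0) -> int:
        """Return 1 if the candidate program passes every test case, else 0."""
        for stdin_text, expected in tests:
            try:
                proc = subprocess.run(
                    ["python", solution_path],
                    input=stdin_text, capture_output=True,
                    text=True, timeout=time_limit,
                )
            except subprocess.TimeoutExpired:
                return 0  # exceeding the time limit counts as failure
            if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
                return 0  # a crash or wrong output fails the whole problem
        return 1
    ```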

    The training employed a technique called DAPO (decoupled clip and dynamic sampling policy optimization), which the researchers found performed slightly better than alternatives in their experiments. A key innovation involves "dynamic sampling" — discarding training examples where the model either solves all attempts or fails all attempts, since these provide no useful gradient signal for learning.
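
    The dynamic sampling step reduces to a one-line filter. In the illustrative sketch below (assuming a group of k sampled solutions per problem, each scored with a binary reward), all-correct and all-incorrect groups are dropped because they yield zero advantage under group-relative policy optimization.

    ```python
    def dynamic_sampling_filter(groups: dict[str, list[int]]) -> dict[str, list[int]]:
        # Keep only problems whose sampled solutions mix passes and failures;
        # uniform groups carry no gradient signal for learning.
        return {pid: rewards for pid, rewards in groups.items()
                if 0 < sum(rewards) < len(rewards)}

    # Example: only "p2" survives, since its eight samples are mixed.
    batch = {"p1": [1] * 8, "p2": [1, 0, 0, 1, 0, 1, 1, 0], "p3": [0] * 8}
    print(dynamic_sampling_filter(batch))  # {'p2': [1, 0, 0, 1, 0, 1, 1, 0]}
    ```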

    The researchers also adopted "iterative context extension," first training the model with a 32,000-token context window before expanding to 40,000 tokens. During evaluation, extending the context further to approximately 80,000 tokens produced the best results, with accuracy reaching 67.87 percent.

    Perhaps most significantly, the training pipeline overlaps inference and verification — as soon as the model generates a solution, it begins work on the next problem while the previous solution is being checked. This pipelining, combined with asynchronous training where multiple model instances work in parallel, maximizes hardware utilization on expensive GPU clusters.
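
    In miniature, that overlap looks like the asyncio sketch below, with sleeps standing in for model inference and sandboxed verification. The point is structural: verification of one solution is launched as a background task so generation for the next problem can begin immediately.

    ```python
    import asyncio

    async def generate(problem: str) -> str:
        await asyncio.sleep(0.1)  # stand-in for model inference
        return f"solution to {problem}"

    async def verify(solution: str) -> int:
        await asyncio.sleep(0.2)  # stand-in for sandboxed test execution
        return 1                  # binary reward

    async def pipelined_rollouts(problems: list[str]) -> list[int]:
        pending = []
        for problem in problems:
            solution = await generate(problem)
            # Launch verification without awaiting it, so the next
            # generation overlaps with the previous solution's checking.
            pending.append(asyncio.create_task(verify(solution)))
        return await asyncio.gather(*pending)

    rewards = asyncio.run(pipelined_rollouts([f"problem-{i}" for i in range(4)]))
    ```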


    The looming data shortage that could slow AI coding model progress

    Buried in Li's technical report is a finding with significant implications for the future of AI development: the training dataset for NousCoder-14B encompasses "a significant portion of all readily available, verifiable competitive programming problems in a standardized dataset format."

    In other words, for this particular domain, the researchers are approaching the limits of high-quality training data.

    "The total number of competitive programming problems on the Internet is roughly the same order of magnitude," Li wrote, referring to the 24,000 problems used for training. "This suggests that within the competitive programming domain, we have approached the limits of high-quality data."

    This observation echoes growing concern across the AI industry about data constraints. While compute continues to scale according to well-understood economic and engineering principles, training data is "increasingly finite," as Li put it.

    "It appears that some of the most important research that needs to be done in the future will be in the areas of synthetic data generation and data efficient algorithms and architectures," he concluded.

    The challenge is particularly acute for competitive programming because the domain requires problems with known correct solutions that can be verified automatically. Unlike natural language tasks where human evaluation or proxy metrics suffice, code either works or it doesn't — making synthetic data generation considerably more difficult.

    Li identified one potential avenue: training models not just to solve problems but to generate solvable problems, enabling a form of self-play similar to techniques that proved successful in game-playing AI systems. "Once synthetic problem generation is solved, self-play becomes a very interesting direction," he wrote.


    A $65 million bet that open-source AI can compete with Big Tech

    Nous Research has carved out a distinctive position in the AI landscape: a company committed to open-source releases that compete with — and sometimes exceed — proprietary alternatives.

    The company raised $50 million in April 2025 in a round led by Paradigm, the cryptocurrency-focused venture firm founded by Coinbase co-founder Fred Ehrsam. Total funding reached $65 million, according to some reports. The investment reflected growing interest in decentralized approaches to AI training, an area where Nous Research has developed its Psyche platform.

    Previous releases include Hermes 4, a family of models that we reported "outperform ChatGPT without content restrictions," and DeepHermes-3, which the company described as the first "toggle-on reasoning model" — allowing users to activate extended thinking capabilities on demand.

    The company has cultivated a distinctive aesthetic and community, prompting some skepticism about whether style might overshadow substance. "Ofc i'm gonna believe an anime pfp company. stop benchmarkmaxxing ffs," wrote one critic on X, referring to Nous Research's anime-style branding and the industry practice of optimizing for benchmark performance.

    Others raised technical questions. "Based on the benchmark, Nemotron is better," noted one commenter, referring to Nvidia's family of language models. Another asked whether NousCoder-14B is "agentic focused or just 'one shot' coding" — a distinction that matters for practical software development, where iterating on feedback typically produces better results than single attempts.


    What researchers say must happen next for AI coding tools to keep improving

    The release includes several directions for future work that hint at where AI coding research may be heading.

    Multi-turn reinforcement learning tops the list. Currently, the model receives only a final binary reward — pass or fail — after generating a solution. But competitive programming problems typically include public test cases that provide intermediate feedback: compilation errors, incorrect outputs, time limit violations. Training models to incorporate this feedback across multiple attempts could significantly improve performance.
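
    A hypothetical version of that loop might look like the sketch below: the model sees the verifier's intermediate feedback and retries, with reward granted only on eventual success. Both the model callable and the stub verifier are placeholders, since the report describes this as a future direction rather than something already implemented.

    ```python
    def run_public_tests(code: str, tests: list) -> tuple[bool, str]:
        # Stub verifier; a real one would execute the code in a sandbox.
        return (False, "wrong answer on public test 1")

    def multi_turn_attempt(model, problem: str, tests: list,
                           max_turns: int = 3) -> int:
        feedback = ""
        for _ in range(max_turns):
            code = model(problem, feedback)            # hypothetical model call
            passed, detail = run_public_tests(code, tests)
            if passed:
                return 1                               # reward only on success
            feedback = detail  # compile error, wrong output, time limit, etc.
        return 0
    ```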

    Controlling response length also remains a challenge. The researchers found that incorrect solutions tended to be longer than correct ones, and response lengths quickly saturated available context windows during training — a pattern that various algorithmic modifications failed to resolve.

    Perhaps most ambitiously, Li proposed "problem generation and self-play" — training models to both solve and create programming problems. This would address the data scarcity problem directly by enabling models to generate their own training curricula.

    "Humans are great at generating interesting and useful problems for other competitive programmers, but it appears that there still exists a significant gap in LLM capabilities in creative problem generation," Li wrote.

    The model is available now on Hugging Face under an Apache 2.0 license. For researchers and developers who want to build on the work, Nous Research has published the complete Atropos training stack alongside it.

    What took Li two years of adolescent dedication to achieve — climbing from a 1600-level novice to a 2100-rated competitor on Codeforces — an AI replicated in 96 hours. He needed 1,000 problems. The model needed 24,000. But soon enough, these systems may learn to write their own problems, teach themselves, and leave human benchmarks behind entirely.

    The question is no longer whether machines can learn to code. It's whether they'll soon be better teachers than we ever were.

  • When the creator of the world's most advanced coding agent speaks, Silicon Valley doesn't just listen — it takes notes.

    For the past week, the engineering community has been dissecting a thread on X from Boris Cherny, the creator and head of Claude Code at Anthropic. What began as a casual sharing of his personal terminal setup has spiraled into a viral manifesto on the future of software development, with industry insiders calling it a watershed moment for the startup.

    "If you're not reading the Claude Code best practices straight from its creator, you're behind as a programmer," wrote Jeff Tang, a prominent voice in the developer community. Kyle McNease, another industry observer, went further, declaring that with Cherny's "game-changing updates," Anthropic is "on fire," potentially facing "their ChatGPT moment."

    The excitement stems from a paradox: Cherny's workflow is surprisingly simple, yet it allows a single human to operate with the output capacity of a small engineering department. As one user noted on X after implementing Cherny's setup, the experience "feels more like Starcraft" than traditional coding — a shift from typing syntax to commanding autonomous units.

    Here is an analysis of the workflow that is reshaping how software gets built, straight from the architect himself.

    How running five AI agents at once turns coding into a real-time strategy game

    The most striking revelation from Cherny's disclosure is that he does not code in a linear fashion. In the traditional "inner loop" of development, a programmer writes a function, tests it, and moves to the next. Cherny, however, acts as a fleet commander.

    "I run 5 Claudes in parallel in my terminal," Cherny wrote. "I number my tabs 1-5, and use system notifications to know when a Claude needs input."

    By utilizing iTerm2 system notifications, Cherny effectively manages five simultaneous work streams: one agent runs a test suite while another refactors a legacy module and a third drafts documentation. He also runs "5-10 Claudes on claude.ai" in his browser, using a "teleport" command to hand off sessions between the web and his local machine.

    This validates the "do more with less" strategy articulated by Anthropic President Daniela Amodei earlier this week. While competitors like OpenAI pursue trillion-dollar infrastructure build-outs, Anthropic is proving that superior orchestration of existing models can yield exponential productivity gains.

    The counterintuitive case for choosing the slowest, smartest model

    In a surprising move for an industry obsessed with latency, Cherny revealed that he exclusively uses Anthropic's heaviest, slowest model: Opus 4.5.

    "I use Opus 4.5 with thinking for everything," Cherny explained. "It's the best coding model I've ever used, and even though it's bigger & slower than Sonnet, since you have to steer it less and it's better at tool use, it is almost always faster than using a smaller model in the end."

    For enterprise technology leaders, this is a critical insight. The bottleneck in modern AI development isn't the generation speed of the token; it is the human time spent correcting the AI's mistakes. Cherny's workflow suggests that paying the "compute tax" for a smarter model upfront eliminates the "correction tax" later.

    One shared file turns every AI mistake into a permanent lesson

    Cherny also detailed how his team solves the problem of AI amnesia. Standard large language models do not "remember" a company's specific coding style or architectural decisions from one session to the next.

    To address this, Cherny's team maintains a single file named CLAUDE.md in their git repository. "Anytime we see Claude do something incorrectly we add it to the CLAUDE.md, so Claude knows not to do it next time," he wrote.

    This practice transforms the codebase into a self-correcting organism. When a human developer reviews a pull request and spots an error, they don't just fix the code; they tag the AI to update its own instructions. "Every mistake becomes a rule," noted Aakash Gupta, a product leader analyzing the thread. The longer the team works together, the smarter the agent becomes.
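
    Cherny did not publish his team's file, but a hypothetical CLAUDE.md might accumulate entries like these, each one born from a mistake the agent once made:

    ```markdown
    # CLAUDE.md (hypothetical example entries)

    - Use the shared `logger` module; never print directly to stdout.
    - All database access goes through `db/client.ts`; do not open raw connections.
    - Run the full test suite before proposing a commit; fix failures, don't skip tests.
    - User-facing strings live in `locales/en.json`; never hard-code them.
    ```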

    Slash commands and subagents automate the most tedious parts of development

    The "vanilla" workflow one observer praised is powered by rigorous automation of repetitive tasks. Cherny uses slash commands — custom shortcuts checked into the project's repository — to handle complex operations with a single keystroke.

    He highlighted a command called /commit-push-pr, which he invokes dozens of times daily. Instead of manually typing git commands, writing a commit message, and opening a pull request, the agent handles the bureaucracy of version control autonomously.
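
    In Claude Code, custom slash commands are Markdown prompt files checked into the repository under .claude/commands/, where the filename becomes the command name. Cherny did not publish his version, so the following is a hypothetical reconstruction of what a /commit-push-pr command might contain:

    ```markdown
    <!-- .claude/commands/commit-push-pr.md (hypothetical reconstruction) -->
    Review the staged and unstaged changes in this repository, then:
    1. Stage the files that belong in one logical commit.
    2. Write a concise commit message explaining what changed and why.
    3. Push the current branch.
    4. Open a pull request with `gh pr create`, summarizing the change and how it was tested.
    ```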

    Cherny also deploys subagents — specialized AI personas — to handle specific phases of the development lifecycle. He uses a code-simplifier to clean up architecture after the main work is done and a verify-app agent to run end-to-end tests before anything ships.
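
    Subagents in Claude Code are similarly defined as Markdown files with YAML frontmatter, typically under .claude/agents/. Below is a sketch of what a code-simplifier definition could look like; the fields and wording are illustrative, not Cherny's configuration:

    ```markdown
    ---
    name: code-simplifier
    description: Refactors and simplifies code after the main change lands, without altering behavior. Use once feature work is complete.
    ---
    You are a refactoring specialist. Remove duplication, flatten needless
    abstraction, and improve naming, but never change observable behavior.
    Run the existing test suite after every edit to prove it still passes.
    ```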

    Why verification loops are the real unlock for AI-generated code

    If there is a single reason Claude Code has reportedly hit $1 billion in annual recurring revenue so quickly, it is likely the verification loop. The AI is not just a text generator; it is a tester.

    "Claude tests every single change I land to claude.ai/code using the Claude Chrome extension," Cherny wrote. "It opens a browser, tests the UI, and iterates until the code works and the UX feels good."

    He argues that giving the AI a way to verify its own work — whether through browser automation, running bash commands, or executing test suites — improves the quality of the final result by "2-3x." The agent doesn't just write code; it proves the code works.
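
    The mechanics of such a loop are simple to sketch. The Python below is a minimal illustration of the generate-verify pattern, not Anthropic's implementation; `generate_patch` and `apply_patch` are hypothetical stand-ins for the agent's edit step:

    ```python
    import subprocess

    MAX_ATTEMPTS = 5

    def run_tests() -> tuple[bool, str]:
        """Run the project's test suite; return (passed, combined output)."""
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    def verify_loop(task: str, generate_patch, apply_patch) -> bool:
        """Ask the model for a patch, apply it, and feed test failures back
        until the suite passes or the attempt budget runs out."""
        feedback = ""
        for _ in range(MAX_ATTEMPTS):
            patch = generate_patch(task, feedback)  # call out to the model
            apply_patch(patch)                      # write changes to disk
            passed, output = run_tests()
            if passed:
                return True
            feedback = output  # the failure log becomes the next prompt
        return False
    ```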

    What Cherny's workflow signals about the future of software engineering

    The reaction to Cherny's thread suggests a pivotal shift in how developers think about their craft. For years, "AI coding" meant an autocomplete function in a text editor — a faster way to type. Cherny has demonstrated that it can now function as an operating system for labor itself.

    "Read this if you're already an engineer... and want more power," Jeff Tang summarized on X.

    The tools to multiply human output by a factor of five are already here. They require only a willingness to stop thinking of AI as an assistant and start treating it as a workforce. The programmers who make that mental leap first won't just be more productive. They'll be playing an entirely different game — and everyone else will still be typing.

  • Tony Stoyanov is CTO and co-founder of EliseAI

    In the 2010s, tech companies chased staff-level specialists: backend engineers, data scientists, system architects. That model worked when technology evolved slowly. Specialists knew their craft, could deliver quickly and built careers on predictable foundations like cloud infrastructure or the latest JS framework.

    Then AI went mainstream.

    The pace of change has exploded. New technologies appear and mature in less than a year. You can’t hire someone who has been building AI agents for five years, as the technology hasn’t existed for that long. The people thriving today aren’t those with the longest résumés; they’re the ones who learn fast, adapt fast and act without waiting for direction. Nowhere is this transformation more evident than in software engineering, which has shifted more dramatically than almost any other field of work.

    How AI is rewriting the rules

    AI has lowered the barrier to doing complex technical work, and it has also raised expectations for what counts as real expertise. McKinsey estimates that by 2030, up to 30% of U.S. work hours could be automated and 12 million workers may need to shift roles entirely. Technical depth still matters, but AI favors people who can figure things out as they go.

    At my company, I see this every day. Engineers who never touched front-end code are now building UIs, while front-end developers are moving into back-end work. The technology keeps getting easier to use but the problems are harder because they span more disciplines.

    In that kind of environment, being great at one thing isn’t enough. What matters is the ability to bridge engineering, product and operations to make good decisions quickly, even with imperfect information.

    Despite all the excitement, only 1% of companies consider themselves truly mature in how they use AI. Many still rely on structures built for a slower era — layers of approval, rigid roles and an overreliance on specialists who can’t move outside their lane.

    The traits of a strong generalist 

    A strong generalist has breadth without losing depth. They go deep in one or two domains but stay fluent across many. As David Epstein puts it in Range, “You have people walking around with all the knowledge of humanity on their phone, but they have no idea how to integrate it. We don’t train people in thinking or reasoning.” True expertise comes from connecting the dots, not just collecting information.

    The best generalists share these traits:

    • Ownership: End-to-end accountability for outcomes, not just tasks.

    • First-principles thinking: Question assumptions, focus on the goal, and rebuild when needed.

    • Adaptability: Learn new domains quickly and move between them smoothly.

    • Agency: Act without waiting for approval and adjust as new information comes in.

    • Soft skills: Communicate clearly, align teams and keep customers’ needs in focus.

    • Range: Solve different kinds of problems and draw lessons across contexts.

    I try to make accountability a priority for my teams. Everyone knows what they own, what success looks like and how it connects to the mission. Perfection isn’t the goal; forward movement is.

    Embracing the shift

    For us, focusing on adaptable builders changed everything. These are the people with the range and curiosity to use AI tools to learn quickly and execute confidently.

    If you’re a builder who thrives in ambiguity, this is your time. The AI era rewards curiosity and initiative more than credentials. If you’re hiring, look ahead. The people who’ll move your company forward might not be the ones with the perfect résumé for the job. They’re the ones who can grow into what the company will need as it evolves.

    The future belongs to generalists and to the companies that trust them.


  • Anthropic said on Wednesday it would release its Agent Skills technology as an open standard, a strategic bet that sharing its approach to making AI assistants more capable will cement the company's position in the fast-evolving enterprise software market.

    The San Francisco-based artificial intelligence company also unveiled organization-wide management tools for enterprise customers and a directory of partner-built skills from companies including Atlassian, Figma, Canva, Stripe, Notion, and Zapier.

    The moves mark a significant expansion of a technology Anthropic first introduced in October, transforming what began as a niche developer feature into infrastructure that now appears poised to become an industry standard.

    "We're launching Agent Skills as an independent open standard with a specification and reference SDK available at https://agentskills.io," Mahesh Murag, a product manager at Anthropic, said in an interview with VentureBeat. "Microsoft has already adopted Agent Skills within VS Code and GitHub; so have popular coding agents like Cursor, Goose, Amp, OpenCode, and more. We're in active conversations with others across the ecosystem."

    Inside the technology that teaches AI assistants to do specialized work

    Skills are, at their core, folders containing instructions, scripts, and resources that tell AI systems how to perform specific tasks consistently. Rather than requiring users to craft elaborate prompts each time they want an AI assistant to complete a specialized task, skills package that procedural knowledge into reusable modules.

    The concept addresses a fundamental limitation of large language models: while they possess broad general knowledge, they often lack the specific procedural expertise needed for specialized professional work. A skill for creating PowerPoint presentations, for instance, might include preferred formatting conventions, slide structure guidelines, and quality standards — information the AI loads only when working on presentations.

    Anthropic designed the system around what it calls "progressive disclosure." Each skill takes only a few dozen tokens when summarized in the AI's context window, with full details loading only when the task requires them. This architectural choice allows organizations to deploy extensive skill libraries without overwhelming the AI's working memory.
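
    Concretely, a skill is a directory whose SKILL.md opens with a short metadata header (the summary loaded into context up front), followed by the full instructions that load only when invoked. An illustrative example follows; the skill name, rules, and file paths are invented for this sketch:

    ```markdown
    <!-- slide-decks/SKILL.md (illustrative) -->
    ---
    name: slide-decks
    description: House style for building presentation decks. Use when asked to create or edit slides.
    ---
    # Building slide decks
    1. Start from the corporate template in assets/template.pptx.
    2. One idea per slide; slide titles are full sentences.
    3. Run scripts/check_deck.py on the finished file before delivering it.
    ```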

    Fortune 500 companies are already using skills in legal, finance, and accounting

    The new enterprise management features allow administrators on Anthropic's Team and Enterprise plans to provision skills centrally, controlling which workflows are available across their organizations while letting individual employees customize their experience.

    "Enterprise customers are using skills in production across both coding workflows and business functions like legal, finance, accounting, and data science," Murag said. "The feedback has been positive because skills let them personalize Claude to how they actually work and get to high-quality output faster."

    The community response has exceeded expectations, according to Murag: "Our skills repository already crossed 20k stars on GitHub, with tens of thousands of community-created and shared skills."

    Atlassian, Figma, Stripe, and Zapier join Anthropic's skills directory at launch

    Anthropic is launching with skills from ten partners, a roster that reads like a who's who of modern enterprise software. The presence of Atlassian, which makes Jira and Confluence, alongside design tools Figma and Canva, payment infrastructure company Stripe, and automation platform Zapier suggests Anthropic is positioning Skills as connective tissue between Claude and the applications businesses already use.

    The business arrangements with these partners focus on ecosystem development rather than immediate revenue generation.

    "Partners who build skills for the directory do so to enhance how Claude works with their platforms. It's a mutually beneficial ecosystem relationship similar to MCP connector partnerships," Murag explained. "There are no revenue-sharing arrangements at this time."

    For vetting new partners, Anthropic is taking a measured approach. "We began with established partners and are developing more formal criteria as we expand," Murag said. "We want to create a valuable supply of skills for enterprises while helping partner products shine."

    Notably, Anthropic is not charging extra for the capability. "Skills work across all Claude surfaces: Claude.ai, Claude Code, the Claude Agent SDK, and the API. They're included in Max, Pro, Team, and Enterprise plans at no additional cost. API usage follows standard API pricing," Murag said.

    Why Anthropic is giving away its competitive advantage to OpenAI and Google

    The decision to release Skills as an open standard is a calculated strategic choice. By making skills portable across AI platforms, Anthropic is betting that ecosystem growth will benefit the company more than proprietary lock-in would.

    The strategy appears to be working. OpenAI has quietly adopted structurally identical architecture in both ChatGPT and its Codex CLI tool. Developer Elias Judin discovered the implementation earlier this month, finding directories containing skill files that mirror Anthropic's specification—the same file naming conventions, the same metadata format, the same directory organization.

    This convergence suggests the industry has found a common answer to a vexing question: how do you make AI assistants consistently good at specialized work without expensive model fine-tuning?

    The timing aligns with broader standardization efforts in the AI industry. Anthropic donated its Model Context Protocol to the Linux Foundation on December 9, and both Anthropic and OpenAI co-founded the Agentic AI Foundation alongside Block. Google, Microsoft, and Amazon Web Services joined as members. The foundation will steward multiple open specifications, and Skills fit naturally into this standardization push.

    "We've also seen how complementary skills and MCP servers are," Murag noted. "MCP provides secure connectivity to external software and data, while skills provide the procedural knowledge for using those tools effectively. Partners who've invested in strong MCP integrations were a natural starting point."

    The AI industry abandons specialized agents in favor of one assistant that learns everything

    The Skills approach is a philosophical shift in how the AI industry thinks about making AI assistants more capable. The traditional approach involved building specialized agents for different use cases — a customer service agent, a coding agent, a research agent. Skills suggest a different model: one general-purpose agent equipped with a library of specialized capabilities.

    "We used to think agents in different domains will look very different," Barry Zhang, an Anthropic researcher, said at an industry conference last month, according to a Business Insider report. "The agent underneath is actually more universal than we thought."

    This insight has significant implications for enterprise software development. Rather than building and maintaining multiple specialized AI systems, organizations can invest in creating and curating skills that encode their institutional knowledge and best practices.

    Anthropic's own internal research supports this approach. A study the company published in early December found that its engineers used Claude in 60% of their work, achieving a 50% self-reported productivity boost—a two to threefold increase from the prior year. Notably, 27% of Claude-assisted work consisted of tasks that would not have been done otherwise, including building internal tools, creating documentation, and addressing what employees called "papercuts" — small quality-of-life improvements that had been perpetually deprioritized.

    Security risks and skill atrophy emerge as concerns for enterprise AI deployments

    The Skills framework is not without potential complications. As AI systems become more capable through skills, questions arise about maintaining human expertise. Anthropic's internal research found that while skills enabled engineers to work across more domains—backend developers building user interfaces, researchers creating data visualizations—some employees worried about skill atrophy.

    "When producing output is so easy and fast, it gets harder and harder to actually take the time to learn something," one Anthropic engineer said in the company's internal survey.

    There are also security considerations. Skills provide Claude with new capabilities through instructions and code, which means malicious skills could theoretically introduce vulnerabilities. Anthropic recommends installing skills only from trusted sources and thoroughly auditing those from less-trusted origins.

    The open standard approach introduces governance questions as well. While Anthropic has published the specification and launched a reference SDK, the long-term stewardship of the standard remains undefined. Whether it will fall under the Agentic AI Foundation or require its own governance structure is an open question.

    Anthropic's real product may not be Claude—it may be the infrastructure everyone else builds on

    The trajectory of Skills reveals something important about Anthropic's ambitions. Two months ago, the company introduced a feature that looked like a developer tool. Today, that feature has become a specification that Microsoft builds into VS Code, that OpenAI replicates in ChatGPT, and that enterprise software giants race to support.

    The pattern echoes strategies that have reshaped the technology industry before. Companies from Red Hat to Google have discovered that open standards can be more valuable than proprietary technology — that the company defining how an industry works often captures more value than the company trying to own it outright.

    For enterprise technology leaders evaluating AI investments, the message is straightforward: skills are becoming infrastructure. The expertise organizations encode into skills today will determine how effectively their AI assistants perform tomorrow, regardless of which model powers them.

    The competitive battles between Anthropic, OpenAI, and Google will continue. But on the question of how to make AI assistants reliably good at specialized work, the industry has quietly converged on an answer — and it came from the company that gave it away.

    Enterprises can now harness a large language model whose performance approaches that of Google's state-of-the-art Gemini 3 Pro, but at a fraction of the cost and with greater speed, thanks to the newly released Gemini 3 Flash.

    The model joins the flagship Gemini 3 Pro, Gemini 3 Deep Think, and Gemini Agent, all of which were announced and released last month.

    Gemini 3 Flash, now available on Gemini Enterprise, Google Antigravity, Gemini CLI, AI Studio, and on preview in Vertex AI, processes information in near real-time and helps build quick, responsive agentic applications. 

    The company said in a blog post that Gemini 3 Flash “builds on the model series that developers and enterprises already love, optimized for high-frequency workflows that demand speed, without sacrificing quality.”

    The model is also the default for AI Mode on Google Search and the Gemini application. 

    Tulsee Doshi, senior director, product management on the Gemini team, said in a separate blog post that the model “demonstrates that speed and scale don’t have to come at the cost of intelligence.”

    “Gemini 3 Flash is made for iterative development, offering Gemini 3’s Pro-grade coding performance with low latency — it’s able to reason and solve tasks quickly in high-frequency workflows,” Doshi said. “It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.”

    Early adoption by specialized firms points to the model's reliability in high-stakes fields. Harvey, an AI platform for law firms, reported a 7% jump in reasoning on its internal 'BigLaw Bench,' while Resemble AI found that Gemini 3 Flash could process complex forensic data for deepfake detection 4x faster than Gemini 2.5 Pro. These aren't just speed gains; they enable 'near real-time' workflows that were previously impossible.

    More efficient at a lower cost

    Enterprise AI builders have become more aware of the cost of running AI models, especially as they try to convince stakeholders to put more budget into agentic workflows that run on expensive models. Organizations have turned to smaller or distilled models, open models, and prompting techniques to help manage bloated AI costs.

    For enterprises, the biggest value proposition for Gemini 3 Flash is that it offers the same level of advanced multimodal capabilities, such as complex video analysis and data extraction, as its larger Gemini counterparts, but is far faster and cheaper. 

    While Google’s internal materials highlight a 3x speed increase over the 2.5 Pro series, data from independent benchmarking firm Artificial Analysis adds a layer of crucial nuance.

    In the latter organization's pre-release testing, Gemini 3 Flash Preview recorded a raw throughput of 218 output tokens per second. This makes it 22% slower than the previous 'non-reasoning' Gemini 2.5 Flash, but it is still significantly faster than rivals including OpenAI's GPT-5.1 high (125 t/s) and DeepSeek V3.2 reasoning (30 t/s).

    Most notably, Artificial Analysis crowned Gemini 3 Flash as the new leader in their AA-Omniscience knowledge benchmark, where it achieved the highest knowledge accuracy of any model tested to date. However, this intelligence comes with a 'reasoning tax': the model more than doubles its token usage compared to the 2.5 Flash series when tackling complex tasks.

    This high token density is offset by Google's aggressive pricing: when accessed through the Gemini API, Gemini 3 Flash costs $0.50 per 1 million input tokens and $3 per 1 million output tokens, compared to $1.25/1M input and $10/1M output for Gemini 2.5 Pro. This allows Gemini 3 Flash to claim the title of the most cost-efficient model for its intelligence tier, despite being one of the most 'talkative' models in terms of raw token volume. Here's how it stacks up against rival LLM offerings:

    | Model | Input ($/1M tokens) | Output ($/1M tokens) | Total (1M in + 1M out) | Source |
    | --- | --- | --- | --- | --- |
    | Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
    | Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
    | deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
    | Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
    | Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
    | ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
    | Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
    | Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud |
    | Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
    | GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
    | Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
    | Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
    | Claude Opus 4.5 | $5.00 | $25.00 | $30.00 | Anthropic |
    | GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

    More ways to save

    But enterprise developers and users can cut costs further by curbing the overthinking that larger models are prone to, which racks up token usage. Google said the model “is able to modulate how much it thinks,” using more thinking, and therefore more tokens, for complex tasks than for quick prompts. The company noted Gemini 3 Flash uses 30% fewer tokens than Gemini 2.5 Pro.

    To balance this new reasoning power with strict corporate latency requirements, Google has introduced a 'Thinking Level' parameter. Developers can toggle between 'Low'—to minimize cost and latency for simple chat tasks—and 'High'—to maximize reasoning depth for complex data extraction. This granular control allows teams to build 'variable-speed' applications that only consume expensive 'thinking tokens' when a problem actually demands PhD-level logic.
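
    As a sketch of how that control surfaces to developers, assuming the google-genai Python SDK, a ThinkingConfig field named thinking_level, and a preview model ID that may differ from the final name:

    ```python
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment

    def ask(prompt: str, level: str) -> str:
        """Send a prompt at the requested thinking level ('low' or 'high')."""
        response = client.models.generate_content(
            model="gemini-3-flash-preview",  # assumed model ID
            contents=prompt,
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_level=level),
            ),
        )
        return response.text

    # Cheap, low-latency path for simple chat-style requests:
    summary = ask("Summarize this support ticket in one sentence: ...", "low")

    # Deeper (and costlier) reasoning for complex extraction:
    clauses = ask("List every indemnification clause in this contract: ...", "high")
    ```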

    The economic story extends beyond simple token prices. With the standard inclusion of Context Caching, enterprises processing massive, static datasets—such as entire legal libraries or codebase repositories—can see a 90% reduction in costs for repeated queries. When combined with the Batch API’s 50% discount, the total cost of ownership for a Gemini-powered agent drops significantly below the threshold of competing frontier models.
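
    The arithmetic is easy to check. The sketch below applies the list prices and discounts cited above; whether the batch discount stacks with cached-token pricing is an assumption made for illustration, and real invoices will differ:

    ```python
    # Illustrative cost model for Gemini 3 Flash at the article's list prices.
    INPUT_PER_M = 0.50     # $ per 1M input tokens
    OUTPUT_PER_M = 3.00    # $ per 1M output tokens
    CACHE_DISCOUNT = 0.90  # cited reduction for cached context
    BATCH_DISCOUNT = 0.50  # cited Batch API discount

    def query_cost(input_toks: int, output_toks: int,
                   cached_toks: int = 0, batched: bool = False) -> float:
        """Dollar cost of one call; cached_toks of the input are assumed
        to be served from context cache at a 90% discount."""
        fresh = input_toks - cached_toks
        cost = (fresh * INPUT_PER_M
                + cached_toks * INPUT_PER_M * (1 - CACHE_DISCOUNT)
                + output_toks * OUTPUT_PER_M) / 1_000_000
        return cost * (1 - BATCH_DISCOUNT) if batched else cost

    # 500K tokens of legal library in cache, 2K fresh input, 1K output, batched:
    print(f"${query_cost(502_000, 1_000, cached_toks=500_000, batched=True):.4f}")
    # -> $0.0145, versus $0.2540 with no caching or batching
    ```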

    “Gemini 3 Flash delivers exceptional performance on coding and agentic tasks combined with a lower price point, allowing teams to deploy sophisticated reasoning across high-volume processes without hitting barriers,” Google said.

    By offering a model that delivers strong multimodal performance at a more affordable price, Google is making the case that enterprises concerned with controlling their AI spend should choose its models, especially Gemini 3 Flash. 

    Strong benchmark performance 

    But how does Gemini 3 Flash stack up against other models in terms of its performance? 

    Doshi said the model achieved a score of 78% on SWE-Bench Verified, a benchmark testing coding agents, outperforming both the preceding Gemini 2.5 family and the newer Gemini 3 Pro itself.

    For enterprises, this means high-volume software maintenance and bug-fixing tasks can now be offloaded to a model that is both faster and cheaper than previous flagship models, without a degradation in code quality.

    The model also performed strongly on other benchmarks, scoring 81.2% on the MMMU Pro benchmark, comparable to Gemini 3 Pro. 

    While most Flash type models are explicitly optimized for short, quick tasks like generating code, Google claims Gemini 3 Flash’s performance “in reasoning, tool use and multimodal capabilities is ideal for developers looking to do more complex video analysis, data extraction and visual Q&A, which means it can enable more intelligent applications — like in-game assistants or A/B test experiments — that demand both quick answers and deep reasoning.”

    First impressions from early users

    So far, early users have been largely impressed with the model, particularly its benchmark performance. 

    What it means for enterprise AI usage

    With Gemini 3 Flash now serving as the default engine across Google Search and the Gemini app, we are witnessing the "Flash-ification" of frontier intelligence. By making Pro-level reasoning the new baseline, Google is setting a trap for slower incumbents.

    The integration into platforms like Google Antigravity suggests that Google isn't just selling a model; it's selling the infrastructure for the autonomous enterprise.

    As developers hit the ground running with 3x faster speeds and a 90% discount on context caching, the "Gemini-first" strategy becomes a compelling financial argument. In the high-velocity race for AI dominance, Gemini 3 Flash may be the model that finally turns "vibe coding" from an experimental hobby into a production-ready reality.

  • Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of artificial intelligence's most demanding tests — a claim that sent ripples of surprise, skepticism, and genuine curiosity through the technology industry.

    The San Jose-based company said its AI system scored 48.1 percent on Humanity's Last Exam, a benchmark designed by subject-matter experts worldwide to stump even the most advanced AI models. That result edges out Google's Gemini 3 Pro, which held the previous record at 45.8 percent.

    "Zoom has achieved a new state-of-the-art result on the challenging Humanity's Last Exam full-set benchmark, scoring 48.1%, which represents a substantial 2.3% improvement over the previous SOTA result," wrote Xuedong Huang, Zoom's chief technology officer, in a blog post.

    The announcement raises a provocative question that has consumed AI watchers for days: How did a video conferencing company — one with no public history of training large language models — suddenly vault past Google, OpenAI, and Anthropic on a benchmark built to measure the frontiers of machine intelligence?

    The answer reveals as much about where AI is headed as it does about Zoom's own technical ambitions. And depending on whom you ask, it's either an ingenious demonstration of practical engineering or a hollow claim that appropriates credit for others' work.

    How Zoom built an AI traffic controller instead of training its own model

    Zoom did not train its own large language model. Instead, the company developed what it calls a "federated AI approach" — a system that routes queries to multiple existing models from OpenAI, Google, and Anthropic, then uses proprietary software to select, combine, and refine their outputs.

    At the heart of this system sits what Zoom calls its "Z-scorer," a mechanism that evaluates responses from different models and chooses the best one for any given task. The company pairs this with what it describes as an "explore-verify-federate strategy," an agentic workflow that balances exploratory reasoning with verification across multiple AI systems.

    "Our federated approach combines Zoom's own small language models with advanced open-source and closed-source models," Huang wrote. The framework "orchestrates diverse models to generate, challenge, and refine reasoning through dialectical collaboration."

    In simpler terms: Zoom built a sophisticated traffic controller for AI, not the AI itself.
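
    A toy version of the pattern shows the division of labor. The sketch below fans one question out to several model backends and returns the answer most of them agree on, one simple stand-in for a learned response scorer; the backends and scoring rule are illustrative and imply nothing about Zoom's proprietary Z-scorer:

    ```python
    from collections import Counter
    from typing import Callable

    Backend = Callable[[str], str]  # prompt in, answer out (wraps a provider API)

    def federate(question: str, backends: dict[str, Backend]) -> str:
        """Query every backend, then pick the answer most backends agree on,
        a crude proxy for a trained scorer or verifier."""
        answers = {name: call(question).strip() for name, call in backends.items()}
        votes = Counter(answers.values())
        return votes.most_common(1)[0][0]

    if __name__ == "__main__":
        demo = {
            "model_a": lambda q: "408",
            "model_b": lambda q: "408",
            "model_c": lambda q: "398",  # one model disagrees
        }
        print(federate("What is 17 * 24?", demo))  # -> "408"
    ```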

    This distinction matters enormously in an industry where bragging rights — and billions in valuation — often hinge on who can claim the most capable model. The major AI laboratories spend hundreds of millions of dollars training frontier systems on vast computing clusters. Zoom's achievement, by contrast, appears to rest on clever integration of those existing systems.

    Why AI researchers are divided over what counts as real innovation

    The response from the AI community was swift and sharply divided.

    Max Rumpf, an AI engineer who says he has trained state-of-the-art language models, posted a pointed critique on social media. "Zoom strung together API calls to Gemini, GPT, Claude et al. and slightly improved on a benchmark that delivers no value for their customers," he wrote. "They then claim SOTA."

    Rumpf did not dismiss the technical approach itself. Using multiple models for different tasks, he noted, is "actually quite smart and most applications should do this." He pointed to Sierra, an AI customer service company, as an example of this multi-model strategy executed effectively.

    His objection was more specific: "They did not train the model, but obfuscate this fact in the tweet. The injustice of taking credit for the work of others sits deeply with people."

    But other observers saw the achievement differently. Hongcheng Zhu, a developer, offered a more measured assessment: "To top an AI eval, you will most likely need model federation, like what Zoom did. An analogy is that every Kaggle competitor knows you have to ensemble models to win a contest."

    The comparison to Kaggle — the competitive data science platform where combining multiple models is standard practice among winning teams — reframes Zoom's approach as industry best practice rather than sleight of hand. Academic research has long established that ensemble methods routinely outperform individual models.

    Still, the debate exposed a fault line in how the industry understands progress. Ryan Pream, founder of Exoria AI, was dismissive: "Zoom are just creating a harness around another LLM and reporting that. It is just noise." Another commenter captured the sheer unexpectedness of the news: "That the video conferencing app ZOOM developed a SOTA model that achieved 48% HLE was not on my bingo card."

    Perhaps the most pointed critique concerned priorities. Rumpf argued that Zoom could have directed its resources toward problems its customers actually face. "Retrieval over call transcripts is not 'solved' by SOTA LLMs," he wrote. "I figure Zoom's users would care about this much more than HLE."

    The Microsoft veteran betting his reputation on a different kind of AI

    If Zoom's benchmark result seemed to come from nowhere, its chief technology officer did not.

    Xuedong Huang joined Zoom from Microsoft, where he spent decades building the company's AI capabilities. He founded Microsoft's speech technology group in 1993 and led teams that achieved what the company described as human parity in speech recognition, machine translation, natural language understanding, and computer vision.

    Huang holds a Ph.D. in electrical engineering from the University of Edinburgh. He is an elected member of the National Academy of Engineering and the American Academy of Arts and Sciences, as well as a fellow of both the IEEE and the ACM. His credentials place him among the most accomplished AI executives in the industry.

    His presence at Zoom signals that the company's AI ambitions are serious, even if its methods differ from the research laboratories that dominate headlines. In his tweet celebrating the benchmark result, Huang framed the achievement as validation of Zoom's strategy: "We have unlocked stronger capabilities in exploration, reasoning, and multi-model collaboration, surpassing the performance limits of any single model."

    That final clause — "surpassing the performance limits of any single model" — may be the most significant. Huang is not claiming Zoom built a better model. He is claiming Zoom built a better system for using models.

    Inside the test designed to stump the world's smartest machines

    The benchmark at the center of this controversy, Humanity's Last Exam, was designed to be exceptionally difficult. Unlike earlier tests that AI systems learned to game through pattern matching, HLE presents problems that require genuine understanding, multi-step reasoning, and the synthesis of information across complex domains.

    The exam draws on questions from experts around the world, spanning fields from advanced mathematics to philosophy to specialized scientific knowledge. A score of 48.1 percent might sound unimpressive to anyone accustomed to school grading curves, but in the context of HLE, it represents the current ceiling of machine performance.

    "This benchmark was developed by subject-matter experts globally and has become a crucial metric for measuring AI's progress toward human-level performance on challenging intellectual tasks," Zoom’s announcement noted.

    The company's improvement of 2.3 percentage points over Google's previous best may appear modest in isolation. But in competitive benchmarking, where gains often come in fractions of a percent, such a jump commands attention.

    What Zoom's approach reveals about the future of enterprise AI

    Zoom's approach carries implications that extend well beyond benchmark leaderboards. The company is signaling a vision for enterprise AI that differs fundamentally from the model-centric strategies pursued by OpenAI, Anthropic, and Google.

    Rather than betting everything on building the single most capable model, Zoom is positioning itself as an orchestration layer — a company that can integrate the best capabilities from multiple providers and deliver them through products that businesses already use every day.

    This strategy hedges against a critical uncertainty in the AI market: no one knows which model will be best next month, let alone next year. By building infrastructure that can swap between providers, Zoom avoids vendor lock-in while theoretically offering customers the best available AI for any given task.

    The announcement of OpenAI's GPT-5.2 the following day underscored this dynamic. OpenAI's own communications named Zoom as a partner that had evaluated the new model's performance "across their AI workloads and saw measurable gains across the board." Zoom, in other words, is both a customer of the frontier labs and now a competitor on their benchmarks — using their own technology.

    This arrangement may prove sustainable. The major model providers have every incentive to sell API access widely, even to companies that might aggregate their outputs. The more interesting question is whether Zoom's orchestration capabilities constitute genuine intellectual property or merely sophisticated prompt engineering that others could replicate.

    The real test arrives when Zoom's 300 million users start asking questions

    Zoom titled its announcement section on industry relations "A Collaborative Future," and Huang struck notes of gratitude throughout. "The future of AI is collaborative, not competitive," he wrote. "By combining the best innovations from across the industry with our own research breakthroughs, we create solutions that are greater than the sum of their parts."

    This framing positions Zoom as a beneficent integrator, bringing together the industry's best work for the benefit of enterprise customers. Critics see something else: a company claiming the prestige of an AI laboratory without doing the foundational research that earns it.

    The debate will likely be settled not by leaderboards but by products. When AI Companion 3.0 reaches Zoom's hundreds of millions of users in the coming months, they will render their own verdict — not on benchmarks they have never heard of, but on whether the meeting summary actually captured what mattered, whether the action items made sense, whether the AI saved them time or wasted it.

    In the end, Zoom's most provocative claim may not be that it topped a benchmark. It may be the implicit argument that in the age of AI, the best model is not the one you build — it's the one you know how to use.

    We've heard (and written, here at VentureBeat) a lot about the generative AI race between the U.S. and China, the two countries home to the groups most active in fielding new models (with a shoutout to Cohere in Canada and Mistral in France).

    But now a Korean startup is making waves: last week, Motif Technologies released Motif-2-12.7B-Reasoning, a small-parameter, open-weight model whose impressive benchmark scores quickly made it the most performant model from that country, according to independent benchmarking lab Artificial Analysis (beating even regular GPT-5.1 from U.S. leader OpenAI).

    But more importantly for enterprise AI teams, the company has published a white paper on arxiv.org with a concrete, reproducible training recipe that exposes where reasoning performance actually comes from — and where common internal LLM efforts tend to fail.

    For organizations building or fine-tuning their own models behind the firewall, the paper offers a set of practical lessons about data alignment, long-context infrastructure, and reinforcement learning stability that are directly applicable to enterprise environments. Here they are:

    1. Reasoning gains come from data distribution, not model size

    One of Motif’s most relevant findings for enterprise teams is that synthetic reasoning data only helps when its structure matches the target model’s reasoning style.

    The paper shows measurable differences in downstream coding performance depending on which “teacher” model generated the reasoning traces used during supervised fine-tuning.

    For enterprises, this undermines a common shortcut: generating large volumes of synthetic chain-of-thought data from a frontier model and assuming it will transfer cleanly. Motif’s results suggest that misaligned reasoning traces can actively hurt performance, even if they look high quality.

    The takeaway is operational, not academic: teams should validate that their synthetic data reflects the format, verbosity, and step granularity they want at inference time. Internal evaluation loops matter more than copying external datasets.
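
    One cheap way to operationalize that validation is to profile candidate traces against the style wanted at inference time before any fine-tuning run. The thresholds and step-marker convention below are illustrative, not Motif's published values:

    ```python
    def trace_matches_style(trace: str, max_tokens: int = 2048,
                            min_steps: int = 2, max_steps: int = 12,
                            step_marker: str = "\n- ") -> bool:
        """Keep a synthetic reasoning trace only if its length and step
        granularity match the target style (tune thresholds on your evals)."""
        n_tokens = len(trace.split())           # crude whitespace token proxy
        n_steps = trace.count(step_marker) + 1  # steps delimited by the marker
        return n_tokens <= max_tokens and min_steps <= n_steps <= max_steps

    candidate_traces: list[str] = []  # fill with traces sampled from the teacher
    dataset = [t for t in candidate_traces if trace_matches_style(t)]
    ```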

    2. Long-context training is an infrastructure problem first

    Motif trains at 64K context, but the paper makes clear that this is not simply a tokenizer or checkpointing tweak.

    The model relies on hybrid parallelism, careful sharding strategies, and aggressive activation checkpointing to make long-context training feasible on Nvidia H100-class hardware.

    For enterprise builders, the message is sobering but useful: long-context capability cannot be bolted on late.

    If retrieval-heavy or agentic workflows are core to the business use case, context length has to be designed into the training stack from the start. Otherwise, teams risk expensive retraining cycles or unstable fine-tunes.

    3. RL fine-tuning fails without data filtering and reuse

    Motif’s reinforcement learning fine-tuning (RLFT) pipeline emphasizes difficulty-aware filtering — keeping tasks whose pass rates fall within a defined band — rather than indiscriminately scaling reward training.

    This directly addresses a pain point many enterprise teams encounter when experimenting with RL: performance regressions, mode collapse, or brittle gains that vanish outside benchmarks. Motif also reuses trajectories across policies and expands clipping ranges, trading theoretical purity for training stability.
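
    A minimal sketch of that difficulty-aware filter follows, with the band thresholds and the `policy.attempt` interface invented for illustration (the paper's exact band is not reproduced here):

    ```python
    def in_difficulty_band(task, policy, n_samples: int = 8,
                           low: float = 0.1, high: float = 0.9) -> bool:
        """Keep a task for RL fine-tuning only if the current policy solves
        it sometimes but not always: always-failed tasks yield no reward
        signal, while always-solved tasks teach nothing new."""
        passes = sum(policy.attempt(task) for _ in range(n_samples))  # 1 if solved
        return low <= passes / n_samples <= high

    # training_tasks = [t for t in task_pool if in_difficulty_band(t, policy)]
    ```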

    The enterprise lesson is clear: RL is a systems problem, not just a reward model problem. Without careful filtering, reuse, and multi-task balancing, RL can destabilize models that are otherwise production-ready.

    4. Memory optimization determines what is even possible

    Motif’s use of kernel-level optimizations to reduce RL memory pressure highlights an often-overlooked constraint in enterprise settings: memory, not compute, is frequently the bottleneck. Techniques like loss-function-level optimization determine whether advanced training stages are viable at all.

    For organizations running shared clusters or regulated environments, this reinforces the need for low-level engineering investment, not just model architecture experimentation.

    Why this matters for enterprise AI teams

    Motif-2-12.7B-Reasoning is positioned as competitive with much larger models, but its real value lies in the transparency of how those results were achieved. The paper argues — implicitly but persuasively — that reasoning performance is earned through disciplined training design, not model scale alone.

    For enterprises building proprietary LLMs, the lesson is pragmatic: invest early in data alignment, infrastructure, and training stability, or risk spending millions fine-tuning models that never reliably reason in production.