Picture this: You're sitting in a conference room, halfway through a vendor pitch. The demo looks solid, and pricing fits nicely under budget. The timeline seems reasonable too. Everyone’s nodding along.
You’re literally minutes away from saying yes.
Then someone from your finance team walks in. They see the deck and frown. A few minutes later, they shoot you a message on Slack: “Actually, I threw together a version of this last week. Took me 2 hours in Cursor. Wanna take a look?”
Wait… what?
This person doesn't code. You know for a fact they've never written a line of JavaScript in their entire life. But here they are, showing you a working prototype on their laptop that does... pretty much exactly what the vendor pitched. Sure, it's got some rough edges, but it works. And it didn’t cost six figures. Just two hours of their time.
Suddenly, the assumptions you walked in with — about how software is developed, who makes it and how decisions are made around it — all start coming apart at the seams.
The old framework
For decades, every growing company asked the same question: Should we build this ourselves, or should we buy it?
And, for decades, the answer was pretty straightforward: Build if it's core to your business; buy if it isn’t.
The logic made sense, because building was expensive and meant borrowing time from overworked engineers, writing specs, planning sprints, managing infrastructure and bracing yourself for a long tail of maintenance. Buying was faster. Safer. You paid for the support and the peace of mind.
But something fundamental has changed: AI has made building accessible to everyone. What used to take weeks now takes hours, and what used to require fluency in a programming language now requires fluency in plain English.
When the cost and complexity of building collapse this dramatically, the old framework goes down with them. It’s not build versus buy anymore. It’s something stranger that we haven't quite found the right words for.
When the market doesn’t know what you need (yet)
My company never planned to build so many of the tools we use. We just had to build because the things we needed didn’t exist. And, through that process, we developed this visceral understanding of what we actually wanted, what was useful and what it could or couldn't do. Not what vendor decks told us we needed or what analyst reports said we should want, but what actually moved the needle in our business.
We figured out which problems were worth solving, which ones weren’t, where AI created real leverage and where it was just noise. And only then, once we had that hard-earned clarity, did we start buying.
By that point, we knew exactly what we were looking for and could tell the difference between substance and marketing in about five minutes. We asked questions that made vendors nervous because we'd already built some rudimentary version of what they were selling.
When anyone can build in minutes
Last week, someone on our CX team noticed customer feedback in Slack about a bug. Just a minor complaint, nothing major. In another company, this would’ve kicked off a support ticket and they’d have waited for someone else to handle it, but that’s not what happened here. They opened Cursor, described the change and let AI write the fix. Then they submitted a pull request that engineering reviewed and merged.
Just 15 minutes after that complaint popped up in Slack, the fix was live in production.
The person who did this isn’t technical in the slightest. I doubt they could tell you the difference between Python and JavaScript, but they solved the problem anyway.
And that’s the point.
AI has gotten so good at cranking out relatively simple code that it handles 80% of the problems that used to require a sprint planning meeting and two weeks of engineering time. It’s erasing the boundary between technical and non-technical. Work that used to be bottlenecked by engineering is now being done by the people closest to the problem.
This is happening right now in companies that are actually paying attention.
The inversion that’s happening
Here's where it gets fascinating for finance leaders, because AI has actually flipped the entire strategic logic of the build versus buy decision on its head.
The old model went something like:
- Define the need.
- Decide whether to build or buy.
But defining the need took forever and required deep technical expertise, or you'd burn through money through trial-and-error vendor implementations. You'd sit through countless demos, trying to picture whether this actually solved your problem. Then you’d negotiate, implement, move all your data and workflows to the new tool and six months and six figures later discover whether (or not) you were actually right.
Now, the whole sequence gets turned around:
- Build something lightweight with AI.
- Use it to understand what you actually need.
- Then decide whether to buy (and you'll know exactly why).
This approach lets you run controlled experiments. You figure out whether the problem even matters. You discover which features deliver value and which just look good in demos. Then you go shopping. Instead of letting some external vendor sell you on what the need is, you get to figure out whether you even have that need in the first place.
Think about how many software purchases you've made that, in hindsight, solved problems you didn't actually have. How many times have you been three months into an implementation and thought, “Hang on, is this actually helping us, or are we just trying to justify what we spent?”
Now, when you do buy, the question becomes “Does this solve the problem better than what we already proved we can build?”
That one reframe changes the entire conversation. Now you show up to vendor calls informed. You ask sharper questions, and negotiate from a place of strength. Most importantly, you avoid the most expensive mistake in enterprise software, which is solving a problem you never really had.
The trap you need to avoid
As this new capability emerges, I’m watching companies sprint in the wrong direction. They know they need to be AI native, so they go on a shopping spree. They look for AI-powered tools, filling their stack with products that have GPT integrations, chatbot UIs or “AI” slapped onto the marketing site. They think they’re transforming, but they’re not.
Remember what physicist Richard Feynman called cargo cult science? After World War II, islanders in the South Pacific built fake airstrips and control towers, mimicking what they'd seen during the war, hoping planes full of cargo would return. They had all the outward forms of an airport: Towers, headsets, even people miming flight controllers. But no planes landed, because the form wasn’t the function.
That’s exactly what’s happening with AI transformation in boardrooms everywhere. Leaders are buying AI tools without asking if they meaningfully change how work gets done, who they empower or what processes they unlock.
They’ve built the airstrip, but the planes aren’t showing up.
And the whole market's basically set up to make you fall into this trap. Everything gets branded as AI now, but nobody seems to care what these products actually do. Every SaaS product has bolted on a chatbot or an auto-complete feature and slapped an AI label on it, and the label has lost all meaning. It’s just a checkbox vendors figure they need to tick, regardless of whether it creates actual value for customers.
The finance team’s new superpower
This is the part that gets me excited about what finance teams can do now. You don’t have to guess anymore. You don’t have to bet six figures on a sales deck. You can test things, and you can actually learn something before you spend.
Here's what I mean: If you’re evaluating vendor management software, prototype the core workflow with AI tools. Figure out whether you’re solving a tooling problem or a process problem. Figure out whether you need software at all.
This doesn’t mean you’ll build everything internally — of course not. Most of the time, you’ll still end up buying, and that's totally fine, because enterprise tools exist for good reasons (scale, support, security, and maintenance). But now you’ll buy with your eyes wide open.
You’ll know what “good” looks like. You’ll show up to demos already understanding the edge cases, and know in about 5 minutes whether they actually get your specific problem. You’ll implement faster. You'll negotiate better because you're not completely dependent on the vendor's solution. And you’ll choose it because it's genuinely better than what you could build yourself.
You'll have already mapped out the shape of what you need, and you'll just be looking for the best version of it.
The new paradigm
For years, the mantra was: Build or buy.
Now, it’s more elegant and way smarter: Build to learn what to buy.
And it's not some future state. This is already happening. Right now, somewhere, a customer rep is using AI to fix a product issue they spotted minutes ago. Somewhere else, a finance team is prototyping their own analytical tools because they've realized they can iterate faster than they can write up requirements for engineering. Somewhere, a team is realizing that the boundary between technical and non-technical was always more cultural than fundamental.
The companies that embrace this shift will move faster and spend smarter. They’ll know their operations more deeply than any vendor ever could. They'll make fewer expensive mistakes, and buy better tools because they actually understand what makes tools good.
The companies that stick to the old playbook will keep sitting through vendor pitches, nodding along at budget-friendly proposals. They’ll debate timelines, and keep mistaking professional decks for actual solutions.
Until someone on their own team pops open their laptop, says, “I built a version of this last night. Want to check it out?” and shows them something they built in two hours that does 80% of what they’re about to pay six figures for.
And, just like that, the rules change for good.
Siqi Chen is co-founder and CEO of Runway.
Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
Gen AI in software engineering has moved well beyond autocomplete. The emerging frontier is agentic coding: AI systems capable of planning changes, executing them across multiple steps and iterating based on feedback. Yet despite the excitement around “AI agents that code,” most enterprise deployments underperform. The limiting factor is no longer the model. It’s context: The structure, history and intent surrounding the code being changed. In other words, enterprises are now facing a systems design problem: They have not yet engineered the environment these agents operate in.
The shift from assistance to agency
The past year has seen a rapid evolution from assistive coding tools to agentic workflows. Research has begun to formalize what agentic behavior means in practice: The ability to reason across design, testing, execution and validation rather than generate isolated snippets. Work such as dynamic action re-sampling shows that allowing agents to branch, reconsider and revise their own decisions significantly improves outcomes in large, interdependent codebases. At the platform level, providers like GitHub are now building dedicated agent orchestration environments, such as Copilot Agent and Agent HQ, to support multi-agent collaboration inside real enterprise pipelines.
But early field results tell a cautionary story. When organizations introduce agentic tools without addressing workflow and environment, productivity can decline. A randomized control study this year showed that developers who used AI assistance in unchanged workflows completed tasks more slowly, largely due to verification, rework and confusion around intent. The lesson is straightforward: Autonomy without orchestration rarely yields efficiency.
Why context engineering is the real unlock
In every unsuccessful deployment I’ve observed, the failure stemmed from context. When agents lack a structured understanding of a codebase (specifically its relevant modules, dependency graph, test harness, architectural conventions and change history), they often generate output that appears correct but is disconnected from reality. Too much information overwhelms the agent; too little forces it to guess. The goal is not to feed the model more tokens. The goal is to determine what should be visible to the agent, when and in what form.
The teams seeing meaningful gains treat context as an engineering surface. They create tooling to snapshot, compact and version the agent’s working memory: What is persisted across turns, what is discarded, what is summarized and what is linked instead of inlined. They design deliberation steps rather than prompting sessions. They make the specification a first-class artifact, something reviewable, testable and owned, not a transient chat history. This shift aligns with a broader trend some researchers describe as “specs becoming the new source of truth.”
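To make "context as an engineering surface" concrete, here is a minimal sketch of what snapshotting, compacting and versioning an agent's working memory could look like. All names and fields are illustrative assumptions, not the API of any specific tool:

```python
from dataclasses import dataclass, field

@dataclass
class ContextSnapshot:
    """One versioned view of an agent's working memory for a change task."""
    version: int
    persisted: dict = field(default_factory=dict)   # carried across turns verbatim
    summarized: dict = field(default_factory=dict)  # compacted into short digests
    linked: list = field(default_factory=list)      # referenced by path, not inlined

def compact(snapshot: ContextSnapshot, max_items: int = 5) -> ContextSnapshot:
    """Demote the oldest persisted entries to truncated one-line summaries,
    keeping only the most recent `max_items` inline, and bump the version."""
    items = list(snapshot.persisted.items())
    keep, demote = items[-max_items:], items[:-max_items]
    return ContextSnapshot(
        version=snapshot.version + 1,
        persisted=dict(keep),
        summarized={**snapshot.summarized,
                    **{k: str(v)[:80] for k, v in demote}},
        linked=list(snapshot.linked),
    )
```

The point of the sketch is the policy decision it encodes: what stays inline, what becomes a summary and what is merely linked is an explicit, reviewable choice rather than an accident of chat history.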
Workflow must change alongside tooling
But context alone isn’t enough. Enterprises must re-architect the workflows around these agents. As McKinsey’s 2025 report “One Year of Agentic AI” noted, productivity gains arise not from layering AI onto existing processes but from rethinking the process itself. When teams simply drop an agent into an unaltered workflow, they invite friction: Engineers spend more time verifying AI-written code than they would have spent writing it themselves. The agents can only amplify what’s already structured: Well-tested, modular codebases with clear ownership and documentation. Without those foundations, autonomy becomes chaos.
Security and governance, too, demand a shift in mindset. AI-generated code introduces new forms of risk: Unvetted dependencies, subtle license violations and undocumented modules that escape peer review. Mature teams are beginning to integrate agentic activity directly into their CI/CD pipelines, treating agents as autonomous contributors whose work must pass the same static analysis, audit logging and approval gates as any human developer. GitHub’s own documentation highlights this trajectory, positioning Copilot Agents not as replacements for engineers but as orchestrated participants in secure, reviewable workflows. The goal isn’t to let an AI “write everything,” but to ensure that when it acts, it does so inside defined guardrails.
What enterprise decision-makers should focus on now
For technical leaders, the path forward starts with readiness rather than hype. Monoliths with sparse tests rarely yield net gains; agents thrive where tests are authoritative and can drive iterative refinement. This is exactly the loop Anthropic calls out for coding agents. Pilot in tightly scoped domains (test generation, legacy modernization, isolated refactors), and treat each deployment as an experiment with explicit metrics (defect escape rate, PR cycle time, change failure rate, security findings burned down). As your usage grows, treat agents as data infrastructure: Every plan, context snapshot, action log and test run is data that composes into a searchable memory of engineering intent, and a durable competitive advantage.
Under the hood, agentic coding is less a tooling problem than a data problem. Every context snapshot, test iteration and code revision becomes a form of structured data that must be stored, indexed and reused. As these agents proliferate, enterprises will find themselves managing an entirely new data layer: One that captures not just what was built, but how it was reasoned about. This shift turns engineering logs into a knowledge graph of intent, decision-making and validation. In time, the organizations that can search and replay this contextual memory will outpace those who still treat code as static text.
The coming year will likely determine whether agentic coding becomes a cornerstone of enterprise development or another inflated promise. The difference will hinge on context engineering: How intelligently teams design the informational substrate their agents rely on. The winners will be those who see autonomy not as magic, but as an extension of disciplined systems design: Clear workflows, measurable feedback, and rigorous governance.
Bottom line
Platforms are converging on orchestration and guardrails, and research keeps improving context control at inference time. The winners over the next 12 to 24 months won’t be the teams with the flashiest model; they’ll be the ones that engineer context as an asset and treat workflow as the product. Do that, and autonomy compounds. Skip it, and the review queue does.
Context + agent = leverage. Skip the first half, and the rest collapses.
Dhyey Mavani is accelerating generative AI at LinkedIn.
The Allen Institute for AI (Ai2) recently released what it calls its most powerful family of models yet, Olmo 3. But the company kept iterating on the models, expanding its reinforcement learning (RL) runs, to create Olmo 3.1.
The new Olmo 3.1 models focus on efficiency, transparency, and control for enterprises.
Ai2 updated two of the three versions of Olmo 3: Olmo 3.1 Think 32B, the flagship model optimized for advanced research, and Olmo 3.1 Instruct 32B, designed for instruction-following, multi-turn dialogue, and tool use.
Olmo 3 has a third version, Olmo 3-Base, for programming, comprehension, and math. It also works well for continued fine-tuning.
Ai2 said that to upgrade Olmo 3 Think 32B to Olmo 3.1, its researchers extended its best RL run with a longer training schedule.
“After the original Olmo 3 launch, we resumed our RL training run for Olmo 3 32B Think, training for an additional 21 days on 224 GPUs with extra epochs over our Dolci-Think-RL dataset,” Ai2 said in a blog post. “This yielded Olmo 3.1 32B Think, which brings substantial gains across math, reasoning, and instruction-following benchmarks: improvements of 5+ points on AIME, 4+ points on ZebraLogic, 4+ points on IFEval, and 20+ points on IFBench, alongside stronger performance on coding and complex multi-step tasks.”
To get to Olmo 3.1 Instruct, Ai2 said its researchers applied the recipe behind the smaller Instruct size, 7B, to the larger model.
Olmo 3.1 Instruct 32B is "optimized for chat, tool use, & multi-turn dialogue—making it a much more performant sibling of Olmo 3 Instruct 7B and ready for real-world applications,” Ai2 said in a post on X.
For now, the new checkpoints are available on the Ai2 Playground or Hugging Face, with API access coming soon.
Better performance on benchmarks
The Olmo 3.1 models performed well on benchmark tests, predictably beating the Olmo 3 models.
Olmo 3.1 Think outperformed Qwen 3 32B models in the AIME 2025 benchmark and performed close to Gemma 27B.
Olmo 3.1 Instruct performed strongly against its open-source peers, even beating models like Gemma 3 on the Math benchmark.
“As for Olmo 3.1 32B Instruct, it’s a larger-scale instruction-tuned model built for chat, tool use, and multi-turn dialogue. Olmo 3.1 32B Instruct is our most capable fully open chat model to date and — in our evaluations — the strongest fully open 32B-scale instruct model,” the company said.
Ai2 also upgraded its RL-Zero 7B models for math and coding. The company said on X that both models benefited from longer and more stable training runs.
Commitment to transparency and open source
Ai2 previously told VentureBeat that it designed the Olmo 3 family of models to offer enterprises and research labs more control and understanding of the data and training that went into the model.
Organizations could add to the model’s data mix and retrain it to also learn from what’s been added.
This has long been a commitment for Ai2, which also offers a tool called OlmoTrace that tracks how LLM outputs match its training data.
“Together, Olmo 3.1 Think 32B and Olmo 3.1 Instruct 32B show that openness and performance can advance together. By extending the same model flow, we continue to improve capabilities while retaining end-to-end transparency over data, code, and training decisions,” Ai2 said.
In a new paper that studies tool-use in large language model (LLM) agents, researchers at Google and UC Santa Barbara have developed a framework that enables agents to make more efficient use of tool and compute budgets. The researchers introduce two new techniques: a simple "Budget Tracker" and a more comprehensive framework called "Budget Aware Test-time Scaling." These techniques make agents explicitly aware of their remaining reasoning and tool-use allowance.
As AI agents rely on tool calls to work in the real world, test-time scaling has become less about smarter models and more about controlling cost and latency.
For enterprise leaders and developers, budget-aware scaling techniques offer a practical path to deploying effective AI agents without facing unpredictable costs or diminishing returns on compute spend.
The challenge of scaling tool use
Traditional test-time scaling focuses on letting models "think" longer. However, for agentic tasks like web browsing, the number of tool calls directly determines the depth and breadth of exploration.
This introduces significant operational overhead for businesses. "Tool calls such as webpage browsing results in more token consumption, increases the context length and introduces additional time latency," Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. "Tool calls themselves introduce additional API costs."
The researchers found that simply granting agents more test-time resources does not guarantee better performance. "In a deep research task, if the agent has no sense of budget, it often goes down blindly," Wang and Liu explained. "It finds one somewhat related lead, then spends 10 or 20 tool calls digging into it, only to realize that the entire path was a dead end."
Optimizing resources with Budget Tracker
To evaluate how they can optimize tool-use budgets, the researchers first tried a lightweight approach called "Budget Tracker." This module acts as a plug-in that provides the agent with a continuous signal of resource availability, enabling budget-aware tool use.
The team hypothesized that "providing explicit budget signals enables the model to internalize resource constraints and adapt its strategy without requiring additional training."
Budget Tracker operates purely at the prompt level, which makes it easy to implement; the paper provides full details of the prompts used.
In Google's implementation, the tracker provides a brief policy guideline describing the budget regimes and corresponding recommendations for using tools. At each step of the response process, Budget Tracker makes the agent explicitly aware of its resource consumption and remaining budget, enabling it to condition subsequent reasoning steps on the updated resource state.
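As a rough illustration of that per-step budget signal, here is a small sketch of how such a message might be rendered and injected into the prompt. The thresholds, wording and function name are assumptions for illustration; the paper's actual prompts differ:

```python
def budget_tracker_message(used_search: int, used_browse: int,
                           budget: int) -> str:
    """Render a budget signal to prepend to the agent's next reasoning step.
    Regimes and wording are illustrative, not the paper's exact prompts."""
    used = used_search + used_browse
    remaining = budget - used
    if remaining <= 0:
        guidance = "Budget exhausted: stop searching and answer now."
    elif remaining <= budget * 0.2:
        guidance = "Budget low: only follow the most promising lead."
    else:
        guidance = "Budget healthy: explore broadly before committing."
    return (f"[Budget Tracker] {used}/{budget} tool calls used "
            f"({used_search} search, {used_browse} browse). "
            f"{remaining} remaining. {guidance}")
```

Because the signal is just text appended at each turn, a tracker like this can wrap any existing agent loop without retraining the model.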
To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively refines its output, and parallel scaling, where multiple independent runs are conducted and aggregated. They ran experiments on search agents equipped with search and browse tools following a ReAct-style loop. ReAct (Reasoning + Acting) is a popular method where the model alternates between internal thinking and external actions. To trace a true cost-performance scaling trend, they developed a unified cost metric that jointly accounts for the costs of both internal token consumption and external tool interactions.
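The article does not reproduce the paper's exact cost formula, but a unified metric of this kind plausibly takes a linear form over token and tool-call counts. A sketch, with all prices as illustrative placeholders:

```python
def unified_cost(prompt_tokens: int, completion_tokens: int,
                 search_calls: int, browse_calls: int,
                 price_in: float = 1.25e-6, price_out: float = 10e-6,
                 price_search: float = 0.01, price_browse: float = 0.005) -> float:
    """Dollar cost of one agent run: internal token consumption plus
    external tool interactions. All per-unit prices are placeholders."""
    token_cost = prompt_tokens * price_in + completion_tokens * price_out
    tool_cost = search_calls * price_search + browse_calls * price_browse
    return token_cost + tool_cost
```

A single scalar like this is what lets the researchers plot a true cost-performance scaling curve instead of counting tool calls and tokens separately.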
They tested Budget Tracker on three information-seeking QA datasets requiring external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. The experiments show that this simple plug-in improves performance across various budget constraints.
"Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% fewer browse calls, and reducing overall cost … by 31.3%," the authors told VentureBeat. Finally, Budget Tracker continued to scale as the budget increased, whereas plain ReAct plateaued after a certain threshold.
BATS: A comprehensive framework for budget-aware scaling
To further improve tool-use resource optimization, the researchers introduced Budget Aware Test-time Scaling (BATS), a framework designed to maximize agent performance under any given budget. BATS maintains a continuous signal of remaining resources and uses this information to dynamically adapt the agent's behavior as it formulates its response.
BATS uses multiple modules to orchestrate the agent's actions. A planning module adjusts stepwise effort to match the current budget, while a verification module decides whether to "dig deeper" into a promising lead or "pivot" to alternative paths based on resource availability.
Given an information-seeking question and a tool-call budget, BATS begins by using the planning module to formulate a structured action plan and decide which tools to invoke. When tools are invoked, their responses are appended to the reasoning sequence to provide the context with new evidence. When the agent proposes a candidate answer, the verification module verifies it and decides whether to continue the current sequence or initiate a new attempt with the remaining budget.
The iterative process ends when budgeted resources are exhausted, at which point an LLM-as-a-judge selects the best answer across all verified answers. Throughout the execution, the Budget Tracker continuously updates both resource usage and remaining budget at every iteration.
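The control flow described above can be sketched as a short loop. This is an illustrative skeleton under assumptions (the interfaces of the planning, verification and judge modules are invented for the example, and it assumes a fresh attempt also draws one unit of budget so the loop terminates):

```python
def bats_loop(question, budget, plan, act, verify, judge):
    """Budget-aware agent loop (illustrative skeleton, not the paper's code).

    plan(question, remaining) -> dict with either an "action" or an "answer"
    act(action)               -> tool result (consumes one unit of budget)
    verify(answer)            -> True to accept the candidate, False otherwise
    judge(candidates)         -> best answer among all verified candidates
    """
    remaining, candidates, context = budget, [], []
    while remaining > 0:
        step = plan(question, remaining)        # stepwise effort scaled to budget
        if step.get("answer") is not None:
            if verify(step["answer"]):
                candidates.append(step["answer"])
            remaining -= 1                      # assumption: a retry costs budget
            context = []                        # pivot: start a fresh attempt
            continue
        context.append(act(step["action"]))     # append new evidence to context
        remaining -= 1                          # one tool call spent
    return judge(candidates) if candidates else None
```

The key structural feature is that planning sees the remaining budget on every step, so it can decide between digging deeper and pivoting, exactly the dig-or-pivot choice the verification module is described as making.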
The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various training-based agents. Their experiments show that BATS achieves higher performance while using fewer tool calls and incurring lower overall cost than competing methods. Using Gemini 2.5 Pro as the backbone, BATS achieved 24.6% accuracy on BrowseComp compared to 12.6% for standard ReAct, and 27.0% on HLE-Search compared to 20.5% for ReAct.
BATS not only improves effectiveness under budget constraints but also yields better cost–performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of approximately 23 cents compared to a parallel scaling baseline that required over 50 cents to achieve a similar result.
According to the authors, this efficiency makes previously expensive workflows viable. "This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due-diligence investigations, competitive landscape research, compliance audits, and multi-step document analysis," they said.
As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy with cost will become a critical design requirement.
"We believe the relationship between reasoning and economics will become inseparable," Wang and Liu said. "In the future, [models] must reason about value."
OpenAI has officially released GPT-5.2, and the reactions from early testers (OpenAI seeded the model with some of them days, and in some cases weeks, before the public release) paint a two-toned picture: it is a monumental leap forward for deep, autonomous reasoning and coding, yet a potentially underwhelming "incremental" update for casual conversationalists.
Following early access periods and today's broader rollout, executives, developers, and analysts have taken to X (formerly Twitter) and company blogs to share their first testing results.
Here is a roundup of the first reactions to OpenAI’s latest flagship model.
"AI as a serious analyst"
The strongest praise for GPT-5.2 centers on its ability to handle "hard problems" that require extended thinking time.
Matt Shumer, CEO of HyperWriteAI, did not mince words in his review, calling GPT-5.2 Pro "the best model in the world."
Shumer highlighted the model's tenacity, noting that "it thinks for over an hour on hard problems. And it nails tasks no other model can touch."
This sentiment was echoed by Allie K. Miller, an AI entrepreneur and former AWS executive. Miller described the model as a step toward "AI as a serious analyst" rather than a "friendly companion."
"The thinking and problem-solving feel noticeably stronger," Miller wrote on X. "It gives much deeper explanations than I’m used to seeing. At one point it literally wrote code to improve its own OCR in the middle of a task."
Enterprise gains: Box reports distinct performance jumps
For the enterprise sector, the update appears to be even more significant.
Aaron Levie, CEO of Box, revealed on X that his company has been testing GPT-5.2 in early access. Levie reported that the model performs "7 points better than GPT-5.1" on their expanded reasoning tests, which approximate real-world knowledge work in financial services and life sciences.
"The model performed the majority of the tasks far faster than GPT-5.1 and GPT-5 as well," Levie noted, confirming that Box AI will be rolling out GPT-5.2 integration shortly.
Rutuja Rajwade, a Senior Product Marketing Manager at Box, expanded on this in a company blog post, citing specific latency improvements.
"Complex extraction" tasks dropped from 46 seconds on GPT-5 to just 12 seconds with GPT-5.2.
Rajwade also noted a jump in reasoning capabilities for the Media and Entertainment vertical, rising from 76% accuracy in GPT-5.1 to 81% in the new model.
A "serious leap" for coding and simulation
Developers are finding GPT-5.2 particularly potent for "one-shot" generation of complex code structures.
Pietro Schirano, CEO of magicpathai, shared a video of the model building a full 3D graphics engine in a single file with interactive controls. "It’s a serious leap forward in complex reasoning, math, coding, and simulations," Schirano posted. "The pace of progress is unreal."
Similarly, Ethan Mollick, a professor at the Wharton School of Business at the University of Pennsylvania and longtime LLM and AI power user and writer, demonstrated the model's ability to create a visually complex shader—an infinite neo-gothic city in a stormy ocean—via a single prompt.
The Agentic Era: Long-running autonomy
Perhaps the most functional shift is the model's ability to stay on task for hours without losing the thread.
Dan Shipper, CEO of Every, a newsletter known for its thoughtful AI testing, reported that the model successfully performed a profit and loss (P&L) analysis that required it to work autonomously for two hours. "It did a P&L analysis where it worked for 2 hours and gave me great results," Shipper wrote.
However, Shipper also noted that for day-to-day tasks, the update feels "mostly incremental."
In an article for Every, Katie Parrott wrote that while GPT-5.2 excels at instruction following, it is "less resourceful" than competitors like Claude Opus 4.5 in certain contexts, such as deducing a user's location from email data.
The downsides: Speed and Rigidity
Despite the reasoning capabilities, the "feel" of the model has drawn critique.
Shumer highlighted a significant "speed penalty" when using the model's Thinking mode. "In my experience the Thinking mode is very slow for most questions," Shumer wrote in his deep-dive review. "I almost never use Instant."
Allie Miller also pointed out issues with the model's default behavior. "The downside is tone and format," she noted. "The default voice felt a bit more rigid, and the length/markdown behavior is extreme: a simple question turned into 58 bullets and numbered points."
The Verdict
The early reaction suggests that GPT-5.2 is a tool optimized for power users, developers, and enterprise agents rather than casual chat. As Shumer summarized in his review: "For deep research, complex reasoning, and tasks that benefit from careful thought, GPT-5.2 Pro is the best option available right now."
However, for users seeking creative writing or quick, fluid answers, models like Claude Opus 4.5 remain strong competitors. "My favorite model remains Claude Opus 4.5," Miller admitted, "but my complex ChatGPT work will get a nice incremental boost."
The rumors were true: OpenAI on Thursday announced the release of its new frontier large language model (LLM) family, GPT-5.2.
It comes at a pivotal moment for the AI pioneer, which has faced intensifying pressure since rival Google's Gemini 3 LLM seized the top spot on major third-party performance leaderboards and many key benchmarks last month. OpenAI leaders stressed in a press briefing, however, that the timing of this release had been discussed and worked on well in advance of Gemini 3's debut.
OpenAI describes GPT-5.2 as its "most capable model series yet for professional knowledge work," aiming to reclaim the performance crown with significant gains in reasoning, coding, and agentic workflows.
"It’s our most advanced frontier model and the strongest yet in the market for professional use," Fidji Simo, OpenAI’s CEO of Applications, said during a press briefing today. "We designed 5.2 to unlock even more economic value for people. It's better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long context, using tools, and handling complex, multi-step projects."
GPT-5.2 features a massive 400,000-token context window — allowing it to ingest hundreds of documents or large code repositories at once — and a 128,000 max output token limit, enabling it to generate extensive reports or full applications in a single go.
The model also features a knowledge cutoff of August 31, 2025, ensuring it is up-to-date with relatively recent world events and technical documentation. It explicitly includes "Reasoning token support," confirming the underlying architecture uses the chain-of-thought processing popularized by the "o1" series.
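As a rough illustration of those limits, a request can be sanity-checked before it is sent. The sketch below assumes, as with earlier OpenAI models, that prompt and generated tokens share the context window; that assumption, and the function name, are illustrative rather than taken from OpenAI documentation.

```python
# Token-budget check for the limits quoted in the article.
# Assumption (not from OpenAI docs): prompt tokens and generated tokens
# share the 400K context window, as with earlier OpenAI models.
CONTEXT_WINDOW = 400_000     # max tokens the model can ingest per request
MAX_OUTPUT_TOKENS = 128_000  # max tokens the model can generate per request

def fits_in_context(prompt_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request stays within both quoted limits."""
    if requested_output_tokens > MAX_OUTPUT_TOKENS:
        return False
    return prompt_tokens + requested_output_tokens <= CONTEXT_WINDOW
```

Under this assumption, a 300,000-token prompt leaves room for at most 100,000 tokens of output, even though the standalone output cap is 128,000.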
The 'Code Red' Reality Check
The release arrives following The Information's report of an emergency "Code Red" directive to OpenAI staff from CEO Sam Altman to improve ChatGPT, a move reportedly designed to mobilize resources following the "quality gap" exposed by Gemini 3. The Verge similarly reported on the timing of GPT-5.2's release ahead of the official announcement.
During the briefing, OpenAI executives acknowledged the directive but pushed back on the narrative that the model was rushed solely to answer Google.
"It is important to note this has been in the works for many, many months," Simo told reporters. She clarified that while the "Code Red" helped focus the company, it wasn't the sole driver of the timeline.
"We announced this Code Red to really signal to the company that we want to marshal resources in one particular area... but that's not the reason it's coming out this week in particular."
Max Schwarzer, lead of OpenAI's post-training team, echoed this sentiment to dispel the idea of a panic launch. "We've been planning for this release since a very long time ago... this specific week we talked about many months ago."
A spokesperson from OpenAI further clarified that the "Code Red" call applied to ChatGPT as a product, not solely underlying model development or the release of new models.
Under the Hood: Instant, Thinking, and Pro
OpenAI is segmenting the GPT-5.2 release into three distinct tiers within ChatGPT, a strategy likely designed to balance the massive compute costs of "reasoning" models with user demand for speed:
- GPT-5.2 Instant: Optimized for speed and daily tasks like writing, translation, and information seeking.
- GPT-5.2 Thinking: Designed for "complex, structured work" and long-running agents, this model leverages deeper reasoning chains to handle coding, math, and multi-step projects.
- GPT-5.2 Pro: The new heavyweight champion. OpenAI describes this as its "smartest and most trustworthy option," delivering the highest accuracy for difficult questions where quality outweighs latency.
For developers, the models are available immediately in the application programming interface (API) as gpt-5.2, gpt-5.2-chat-latest (Instant), and gpt-5.2-pro.

The Numbers: Beating the Benchmarks
The GPT-5.2 release includes leading metrics across most domains — specifically those that target the "professional knowledge work" gap where competitors have recently gained ground.
OpenAI highlighted a new benchmark called GDPval, which measures performance on "well-specified knowledge work tasks" across 44 occupations.
"GPT-5.2 Thinking is now state-of-the-art on that benchmark... and beats or ties top industry professionals on 70.9% of well-specified professional tasks like spreadsheets, presentations, and document creation, according to expert human judges," Simo said.
In the critical arena of coding, OpenAI is claiming a decisive lead. Schwarzer noted that on SWE-bench Pro, a rigorous evaluation of real-world software engineering, GPT-5.2 Thinking sets a new state-of-the-art score of 55.6%.
He emphasized that this benchmark is "more contamination resistant, challenging, diverse, and industrially relevant than previous benchmarks like SWE-bench Verified."

Other key benchmark results include:
- GPQA Diamond (Science): GPT-5.2 Pro scored 93.2%, edging out GPT-5.2 Thinking (92.4%) and surpassing GPT-5.1 Thinking (88.1%).
- FrontierMath: On Tier 1-3 problems, GPT-5.2 Thinking solved 40.3%, a significant jump from the 31.0% achieved by its predecessor.
- ARC-AGI-1: GPT-5.2 Pro is reportedly the first model to cross the 90% threshold on this general reasoning benchmark, scoring 90.5%.
The Price of Intelligence
Performance comes at a premium. While ChatGPT subscription pricing remains unchanged for now, the API costs for the new flagship models are steep compared to previous generations, reflecting the high compute demands of "thinking" mode. They're also on the upper-end of API costs for the industry.
- GPT-5.2 Thinking: Priced at $1.75 per 1 million input tokens and $14 per 1 million output tokens.
- GPT-5.2 Pro: The costs jump significantly to $21 per 1 million input tokens and $168 per 1 million output tokens.
GPT-5.2 Thinking is priced 40% higher in the API than the standard GPT-5.1 ($1.25/$10), signaling that OpenAI views the new reasoning capabilities as a tangible value-add rather than a mere efficiency update.
The high-end GPT-5.2 Pro follows the same pattern, costing 40% more than the previous GPT-5 Pro ($15/$120). While expensive, it still undercuts OpenAI’s most specialized reasoning model, o1-pro, which remains the most costly offering on the menu at a staggering $150 per million input tokens and $600 per million output tokens.
OpenAI argues that despite the higher per-token cost, the model’s "greater token efficiency" and ability to solve tasks in fewer turns make it economically viable for high-value enterprise workflows.
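Those per-token rates translate into per-request dollar costs in a straightforward way. The sketch below uses only the prices quoted in this article; the helper function is illustrative, not an official OpenAI rate card or API.

```python
# USD per 1 million tokens, as quoted in the article; illustrative only.
PRICES = {
    "gpt-5.1": {"input": 1.25, "output": 10.00},
    "gpt-5.2": {"input": 1.75, "output": 14.00},   # Thinking tier
    "gpt-5-pro": {"input": 15.00, "output": 120.00},
    "gpt-5.2-pro": {"input": 21.00, "output": 168.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-1M-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The 40% generation-over-generation increase the article cites:
# 1.75 / 1.25 = 1.4 on input, and 14 / 10 = 1.4 on output.
```

For example, a request consuming 1 million input tokens and 1 million output tokens on GPT-5.2 Thinking works out to $15.75, versus $189.00 on GPT-5.2 Pro.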
Here's how it compares to the current API costs for other competing models across the LLM field:
| Model | Input (/1M) | Output (/1M) | Total Cost |
|---|---|---|---|
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| Qwen 3 Plus | $0.40 | $1.20 | $1.60 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 |
| Qwen-Max | $1.60 | $6.40 | $8.00 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| GPT-5.2 | $1.75 | $14.00 | $15.75 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 |
Image Generation: Nothing New Yet...But 'More to Come'
During the briefing, VentureBeat asked the OpenAI participants if the new release included any boost to image generation capabilities, noting the excitement around similar features in recent competitor launches like Google's Gemini 3 Image aka Nano Banana Pro.
Unfortunately for those hoping to recreate the kind of text- and information-heavy graphics and image editing capabilities competitors have showcased, OpenAI executives clarified that GPT-5.2 brings no image generation improvements over the prior GPT-5.1 and OpenAI's existing DALL-E 3 and gpt-4o native image generation models.
"On image Gen, nothing to announce today, but more to come," Simo said. She acknowledged the popularity of the feature, adding, "We know this is a very important use case that people love, that we introduced [to] the market, and so definitely more to come there."
Aidan Clark, OpenAI's lead of training, also declined to comment on visual generation specifics, stating simply, "I can't really speak to image Gen myself."
The 'Mega-Agent' Era
Beyond raw scores, OpenAI is positioning GPT-5.2 as the engine for a new generation of "long-running agents" capable of executing multi-step workflows without human hand-holding.

"Box found that 5.2 can extract information from long, complex documents about 40% faster, and also saw a 40% boost in reasoning accuracy for Life Sciences and healthcare," Simo said.

She also noted that Notion reported the model "outperforms 5.1 across every dimension... and it excels at the kind of really ambiguous, longer rising tasks that define real knowledge work."

Schwarzer added that coding startups like Augment Code found the model "delivered substantially stronger deep code capabilities than any prior model," which is why it was selected to power their new code review agent.

Visual capabilities have also seen an upgrade.
OpenAI's release blog post shows an example where "a traveler reports a delayed flight, a missed connection, an overnight stay in New York, and a medical seating requirement."
The outcome? "GPT‑5.2 manages the entire chain of tasks—rebooking, special-assistance seating, and compensation—delivering a more complete outcome than GPT‑5.1."
A new evaluation called ScreenSpot-Pro, which tests a model's ability to understand GUI screenshots, shows GPT-5.2 Thinking achieving 86.3% accuracy, compared to just 64.2% for GPT-5.1.
Science and Reliability
OpenAI leaders also stressed the model's utility for scientific research, attempting to move the conversation beyond simple chatbots to research assistants.
Aidan Clark, lead of the training team, shared an example of a senior immunology researcher testing the model.
"They tested it by asking it to generate the most important unanswered questions about the immune system," Clark said. "That immunology researcher reported that GPT-5.2 produced sharper questions and stronger explanations for why those questions... matter compared to any previous pro model."

Reliability was another key focus. Schwarzer claimed the new model "hallucinates substantially less than GPT-5.1," noting that on a set of de-identified queries, "responses contained errors 38% less often."
The 'Vibe' Shift
Interestingly, OpenAI acknowledged that not every user might immediately prefer the new models.
When asked why legacy models like GPT-5.1 would remain available, Schwarzer admitted that "models change a little bit every time."

"Some users may find that they prefer the vibes of the previous model, even though we think the latest one is across the board generally much better," Schwarzer said. He also noted that for some enterprise customers who have "really fine-tuned a prompt for a specific model," there might be "small regressions," necessitating access to the older versions.
Safety, 'Adult Mode,' and Future Roadmap
Addressing safety concerns, Simo confirmed that the company is preparing to roll out an "Adult Mode" in the first quarter of next year, following the implementation of a new age prediction system.
"We're in the process of improving that," Simo said regarding the age prediction technology.
"We want to do that ahead of launching adult mode."

Looking further ahead, industry reports suggest OpenAI is working on a more fundamental architectural shift under the codename "Project Garlic," targeting a flagship release in early 2026.
While executives did not comment on specific future roadmaps during the briefing, Simo remained optimistic about the economics of their current trajectory.
"If you look at historical trends, compute has increased about 3x every year for the last three years," she explained. "Revenue has also increased at the same pace... creating this virtuous cycle."
Clark added that efficiency is improving rapidly: "The model we're releasing today achieves an even better score [on ARC-AGI] with almost 400 times less cost and less compute associated with it" compared to models from a year ago.
GPT-5.2 Instant, Thinking, and Pro begin rolling out in ChatGPT today to paid users (Plus, Pro, Team, and Enterprise). The company notes the rollout will be gradual to maintain stability.