When I first wrote “Vector databases: Shiny object syndrome and the case of a missing unicorn” in March 2024, the industry was awash in hype. Vector databases were positioned as the next big thing — a must-have infrastructure layer for the gen AI era. Billions of venture dollars flowed, developers rushed to integrate embeddings into their pipelines and analysts breathlessly tracked funding rounds for Pinecone, Weaviate, Chroma, Milvus and a dozen others.
The promise was intoxicating: Finally, a way to search by meaning rather than by brittle keywords. Just dump your enterprise knowledge into a vector store, connect an LLM and watch magic happen.
Except the magic never fully materialized.
Two years on, the reality check has arrived: 95% of organizations invested in gen AI initiatives are seeing zero measurable returns. And many of the warnings I raised back then — about the limits of vectors, the crowded vendor landscape and the risks of treating vector databases as silver bullets — have played out almost exactly as predicted.
Prediction 1: The missing unicorn
Back then, I questioned whether Pinecone — the poster child of the category — would achieve unicorn status or whether it would become the “missing unicorn” of the database world. Today, that question has been answered in the most telling way possible: Pinecone is reportedly exploring a sale, struggling to break out amid fierce competition and customer churn.
Yes, Pinecone raised big rounds and signed marquee logos. But in practice, differentiation was thin. Open-source players like Milvus, Qdrant and Chroma undercut them on cost. Incumbents like Postgres (with pgVector) and Elasticsearch simply added vector support as a feature. And customers increasingly asked: “Why introduce a whole new database when my existing stack already does vectors well enough?”
The result: Pinecone, once valued near a billion dollars, is now looking for a home. The missing unicorn indeed. In September 2025, Pinecone appointed Ash Ashutosh as CEO, with founder Edo Liberty moving to a chief scientist role. The timing is telling: The leadership change comes amid increasing pressure and questions over its long-term independence.
Prediction 2: Vectors alone won’t cut it
I also argued that vector databases by themselves were not an end solution. If your use case required exactness — like searching for “Error 221” in a manual — a pure vector search would gleefully serve up “Error 222” as “close enough.” Cute in a demo, catastrophic in production.
That tension between similarity and relevance has proven fatal to the myth of vector databases as all-purpose engines.
“Enterprises discovered the hard way that semantic ≠ correct.”
Developers who gleefully swapped out lexical search for vectors quickly reintroduced… lexical search in conjunction with vectors. Teams that expected vectors to “just work” ended up bolting on metadata filtering, rerankers and hand-tuned rules. By 2025, the consensus is clear: Vectors are powerful, but only as part of a hybrid stack.
Prediction 3: A crowded field becomes commoditized
The explosion of vector database startups was never sustainable. Weaviate, Milvus (via Zilliz), Chroma, Vespa, Qdrant — each claimed subtle differentiators, but to most buyers they all did the same thing: store vectors and retrieve nearest neighbors.
Today, very few of these players are breaking out. The market has fragmented, commoditized and in many ways been swallowed by incumbents. Vector search is now a checkbox feature in cloud data platforms, not a standalone moat.
Just as I wrote then: Distinguishing one vector DB from another will pose an increasing challenge. That challenge has only grown harder. Vald, Marqo, LanceDB, PostgreSQL, MySQL HeatWave, Oracle 23c, Azure SQL, Cassandra, Redis, Neo4j, SingleStore, Elasticsearch, OpenSearch, Apache Solr… the list goes on.
The new reality: Hybrid and GraphRAG
But this isn’t just a story of decline — it’s a story of evolution. Out of the ashes of vector hype, new paradigms are emerging that combine the best of multiple approaches.
Hybrid Search: Keyword + vector is now the default for serious applications. Companies learned that you need both precision and fuzziness, exactness and semantics. Tools like Apache Solr, Elasticsearch, pgVector and Pinecone’s own “cascading retrieval” embrace this.
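To make the hybrid pattern concrete, here is a minimal sketch of one common recipe: run a lexical ranker and a vector ranker separately, then merge the two lists with reciprocal rank fusion. The toy corpus, scoring functions and embeddings are illustrative stand-ins, not any particular vendor’s API.

```python
from math import sqrt

# Toy corpus and toy "embeddings" -- stand-ins for a real index and encoder.
docs = {
    "d1": "Error 221: printer offline",
    "d2": "Error 222: paper jam",
    "d3": "How to reset the device",
}
doc_vectors = {"d1": [0.9, 0.1], "d2": [0.88, 0.12], "d3": [0.2, 0.8]}

def keyword_rank(query: str) -> list[str]:
    """Rank docs by simple term overlap (stand-in for BM25)."""
    terms = set(query.lower().split())
    scores = {d: len(terms & set(text.lower().split())) for d, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def vector_rank(query_vec: list[float]) -> list[str]:
    """Rank docs by cosine similarity (stand-in for ANN search)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))
    scores = {d: cos(query_vec, v) for d, v in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: reward docs that rank well in either list."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

# The exact-match signal keeps "Error 221" on top even if vectors find 222 "close enough".
print(rrf([keyword_rank("Error 221"), vector_rank([0.9, 0.1])]))
```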
GraphRAG: The hottest buzzword of late 2024/2025 is GraphRAG — graph-enhanced retrieval augmented generation. By marrying vectors with knowledge graphs, GraphRAG encodes the relationships between entities that embeddings alone flatten away. The payoff is dramatic.
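The core GraphRAG move can likewise be sketched in a few lines: retrieve seed entities by vector similarity, then walk a knowledge graph to pull in related context that embeddings alone would flatten away. The graph, documents and retrieval stub below are assumptions for illustration, not a specific GraphRAG implementation.

```python
# Toy knowledge graph: entity -> related entities (the edges a pure embedding flattens away).
graph = {
    "Error 221": ["Printer Driver v2.3", "Firmware Patch 1.8"],
    "Printer Driver v2.3": ["Windows 11 rollout"],
}

entity_docs = {
    "Error 221": "Error 221 occurs when the spooler loses the device handle.",
    "Printer Driver v2.3": "Driver v2.3 changed the spooler handshake.",
    "Firmware Patch 1.8": "Patch 1.8 fixes handle loss after sleep.",
    "Windows 11 rollout": "Rollout notes for the Windows 11 fleet upgrade.",
}

def vector_seed_entities(query: str) -> list[str]:
    """Placeholder for a vector search returning the closest entities."""
    return ["Error 221"]

def graph_expand(seeds: list[str], hops: int = 2) -> list[str]:
    """Breadth-first expansion over the knowledge graph."""
    seen, frontier = list(seeds), list(seeds)
    for _ in range(hops):
        frontier = [n for e in frontier for n in graph.get(e, []) if n not in seen]
        seen.extend(frontier)
    return seen

# Context handed to the LLM includes related fixes the vector search alone would miss.
context = [entity_docs[e] for e in graph_expand(vector_seed_entities("Why does Error 221 appear?"))]
print(context)
```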
Benchmarks and evidence
- Amazon’s AI blog cites benchmarks from Lettria, where hybrid GraphRAG boosted answer correctness from ~50% to 80%-plus in test datasets across finance, healthcare, industry and law.
- The GraphRAG-Bench benchmark (released May 2025) provides a rigorous evaluation of GraphRAG vs. vanilla RAG across reasoning tasks, multi-hop queries and domain challenges.
- An OpenReview evaluation of RAG vs. GraphRAG found that each approach has strengths depending on the task — but hybrid combinations often perform best.
- FalkorDB’s blog reports that when schema precision matters (structured domains), GraphRAG can outperform vector retrieval by a factor of ~3.4x on certain benchmarks.
The rise of GraphRAG underscores the larger point: Retrieval is not about any single shiny object. It’s about building retrieval systems — layered, hybrid, context-aware pipelines that give LLMs the right information, with the right precision, at the right time.
What this means going forward
The verdict is in: Vector databases were never the miracle. They were a step — an important one — in the evolution of search and retrieval. But they are not, and never were, the endgame.
The winners in this space won’t be those who sell vectors as a standalone database. They will be the ones who embed vector search into broader ecosystems — integrating graphs, metadata, rules and context engineering into cohesive platforms.
In other words: The unicorn isn’t the vector database. The unicorn is the retrieval stack.
Looking ahead: What’s next
- Unified data platforms will subsume vector + graph: Expect major DB and cloud vendors to offer integrated retrieval stacks (vector + graph + full-text) as built-in capabilities.
- “Retrieval engineering” will emerge as a distinct discipline: Just as MLOps matured, so too will practices around embedding tuning, hybrid ranking and graph construction.
- Meta-models learning to query better: Future LLMs may learn to orchestrate which retrieval method to use per query, dynamically adjusting weighting.
- Temporal and multimodal GraphRAG: Already, researchers are extending GraphRAG to be time-aware (T-GRAG) and multimodally unified (e.g. connecting images, text, video).
- Open benchmarks and abstraction layers: Tools like BenchmarkQED (for RAG benchmarking) and GraphRAG-Bench will push the community toward fairer, comparably measured systems.
From shiny objects to essential infrastructure
The arc of the vector database story has followed a classic path: A pervasive hype cycle, followed by introspection, correction and maturation. In 2025, vector search is no longer the shiny object everyone pursues blindly — it’s now a critical building block within a more sophisticated, multi-pronged retrieval architecture.
The original warnings were right. Pure vector-based hopes often crash on the shoals of precision, relational complexity and enterprise constraints. Yet the technology was never wasted: It forced the industry to rethink retrieval, blending semantic, lexical and relational strategies.
If I were to write a sequel in 2027, I suspect it would frame vector databases not as unicorns, but as legacy infrastructure — foundational, but eclipsed by smarter orchestration layers, adaptive retrieval controllers and AI systems that dynamically choose which retrieval tool fits the query.
As of now, the real battle is not vector vs. keyword — it’s the integration, blending and discipline involved in building retrieval pipelines that reliably ground gen AI in facts and domain knowledge. That’s the unicorn we should be chasing now.
Amit Verma is head of engineering and AI Labs at Neuron7.
Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
The race to deploy agentic AI is on. Across the enterprise, systems that can plan, take actions and collaborate across business applications promise unprecedented efficiency. But in the rush to automate, a critical component is being overlooked: Scalable security. We are building a workforce of digital employees without giving them a secure way to log in, access data and do their jobs without creating catastrophic risk.
The fundamental problem is that traditional identity and access management (IAM) designed for humans breaks at agentic scale. Controls like static roles, long-lived passwords and one-time approvals are useless when non-human identities can outnumber human ones by 10 to one. To harness the power of agentic AI, identity must evolve from a simple login gatekeeper into the dynamic control plane for your entire AI operation.
“The fastest path to responsible AI is to avoid real data. Use synthetic data to prove value, then earn the right to touch the real thing.” — Shawn Kanungo, keynote speaker and innovation strategist; bestselling author of The Bold Ones
Why your human-centric IAM is a sitting duck
Agentic AI does not just use software; it behaves like a user. It authenticates to systems, assumes roles and calls APIs. If you treat these agents as mere features of an application, you invite invisible privilege creep and untraceable actions. A single over-permissioned agent can exfiltrate data or trigger erroneous business processes at machine speed, with no one the wiser until it is too late.
The static nature of legacy IAM is the core vulnerability. You cannot pre-define a fixed role for an agent whose tasks and required data access might change daily. The only way to keep access decisions accurate is to move policy enforcement from a one-time grant to a continuous, runtime evaluation.
Prove value before production data
Kanungo’s guidance offers a practical on-ramp. Start with synthetic or masked datasets to validate agent workflows, scopes and guardrails. Once your policies, logs and break-glass paths hold up in this sandbox, you can graduate agents to real data with confidence and clear audit evidence.
Building an identity-centric operating model for AI
Securing this new workforce requires a shift in mindset. Each AI agent must be treated as a first-class citizen within your identity ecosystem.
First, every agent needs a unique, verifiable identity. This is not just a technical ID; it must be linked to a human owner, a specific business use case and a software bill of materials (SBOM). The era of shared service accounts is over; they are the equivalent of giving a master key to a faceless crowd.
Second, replace set-and-forget roles with session-based, risk-aware permissions. Access should be granted just in time, scoped to the immediate task and the minimum necessary dataset, then automatically revoked when the job is complete. Think of it as giving an agent a key to a single room for one meeting, not the master key to the entire building.
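A minimal sketch of that “key to a single room for one meeting” idea, assuming a simple HMAC-signed token and an in-process policy check (a real deployment would use a managed secrets store and an identity provider):

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"replace-with-a-managed-secret"   # assumption: fetched from a KMS or vault in practice

def mint_agent_token(agent_id: str, task: str, dataset: str, ttl_seconds: int = 300) -> str:
    """Issue a credential scoped to one task and one dataset, expiring in minutes."""
    claims = {"agent": agent_id, "task": task, "dataset": dataset,
              "exp": time.time() + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def authorize(token: str, requested_dataset: str) -> bool:
    """Runtime check: valid signature, not expired, and scoped to this dataset."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims["exp"] > time.time() and claims["dataset"] == requested_dataset

token = mint_agent_token("invoice-bot-7", task="reconcile-q3", dataset="erp_invoices")
print(authorize(token, "erp_invoices"))   # True, until the five-minute window closes
print(authorize(token, "hr_salaries"))    # False: out of scope for this session
```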
Three pillars of a scalable agent security architecture
Context-aware authorization at the core. Authorization can no longer be a simple yes or no at the door. It must be a continuous conversation. Systems should evaluate context in real time. Is the agent’s digital posture attested? Is it requesting data typical for its purpose? Is this access occurring during a normal operational window? This dynamic evaluation enables both security and speed.
Purpose-bound data access at the edge. The final line of defense is the data layer itself. By embedding policy enforcement directly into the data query engine, you can enforce row-level and column-level security based on the agent’s declared purpose. A customer service agent should be automatically blocked from running a query that appears designed for financial analysis. Purpose binding ensures data is used as intended, not merely accessed by an authorized identity.
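A rough sketch of purpose binding at the query layer: the agent declares a purpose with every query, and the engine filters columns against a policy keyed to that purpose. The policy shape, table and column names are illustrative assumptions.

```python
# Policy: which columns each declared purpose may read from a customers table.
PURPOSE_POLICY = {
    "customer_support": {"columns": {"customer_id", "name", "open_tickets"}},
    "financial_analysis": {"columns": {"customer_id", "lifetime_value", "region"}},
}

ROWS = [
    {"customer_id": 1, "name": "Acme", "open_tickets": 2, "lifetime_value": 90_000, "region": "EU"},
    {"customer_id": 2, "name": "Globex", "open_tickets": 0, "lifetime_value": 250_000, "region": "US"},
]

def query(purpose: str, requested_columns: set[str]) -> list[dict]:
    """Enforce column-level policy in the engine; deny anything outside the declared purpose."""
    allowed = PURPOSE_POLICY.get(purpose, {}).get("columns", set())
    if not requested_columns <= allowed:
        raise PermissionError(f"{purpose} may not read {requested_columns - allowed}")
    return [{c: row[c] for c in requested_columns} for row in ROWS]

print(query("customer_support", {"customer_id", "open_tickets"}))   # allowed
try:
    query("customer_support", {"lifetime_value"})                   # denied: looks like financial analysis
except PermissionError as err:
    print(err)
```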
Tamper-evident evidence by default. In a world of autonomous actions, auditability is non-negotiable. Every access decision, data query and API call should be immutably logged, capturing the who, what, where and why. Link logs so they are tamper evident and replayable for auditors or incident responders, providing a clear narrative of every agent’s activities.
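One common way to make such logs tamper evident is to hash-chain them, so that editing any entry invalidates every hash after it. This is a generic sketch, not a claim about any specific audit product:

```python
import hashlib
import json
import time

def append_entry(log: list[dict], agent: str, action: str, resource: str, reason: str) -> None:
    """Append a log entry whose hash covers the previous entry's hash (a simple chain)."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    entry = {"ts": time.time(), "agent": agent, "action": action,
             "resource": resource, "reason": reason, "prev": prev_hash}
    entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)

def verify(log: list[dict]) -> bool:
    """Recompute every hash; tampering anywhere upstream breaks the chain."""
    prev = "genesis"
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev"] != prev or recomputed != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

audit_log: list[dict] = []
append_entry(audit_log, "invoice-bot-7", "SELECT", "erp_invoices", "reconcile-q3")
print(verify(audit_log))                     # True
audit_log[0]["resource"] = "hr_salaries"     # simulate tampering
print(verify(audit_log))                     # False: the edit is detectable
```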
A practical roadmap to get started
Begin with an identity inventory. Catalog all non-human identities and service accounts. You will likely find sharing and over-provisioning. Begin issuing unique identities for each agent workload.
Pilot a just-in-time access platform. Implement a tool that grants short-lived, scoped credentials for a specific project. This proves the concept and shows the operational benefits.
Mandate short-lived credentials. Issue tokens that expire in minutes, not months. Seek out and remove static API keys and secrets from code and configuration.
Stand up a synthetic data sandbox. Validate agent workflows, scopes, prompts and policies on synthetic or masked data first. Promote to real data only after controls, logs and egress policies pass.
Conduct an agent incident tabletop drill. Practice responses to a leaked credential, a prompt injection or a tool escalation. Prove you can revoke access, rotate credentials and isolate an agent in minutes.
The bottom line
You cannot manage an agentic, AI-driven future with human-era identity tools. The organizations that will win recognize identity as the central nervous system for AI operations. Make identity the control plane, move authorization to runtime, bind data access to purpose and prove value on synthetic data before touching the real thing. Do that, and you can scale to a million agents without scaling your breach risk.
Michelle Buckner is a former NASA Information System Security Officer (ISSO).
Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.
Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.
This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.
The limits of current LLM reasoning training
Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model's ability to discover a correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models can't try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that fails to provide granular feedback and provides sparse rewards.
An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave "a critical gap for training small open-source models to effectively learn difficult problems."
How supervised reinforcement learning works
SRL introduces a framework that reformulates problem-solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert's entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.
In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers."
During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn't perfect. This solves the sparse reward problem RLVR faces.
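As a rough illustration of the step-wise reward idea (not the paper’s exact formulation), the per-step reward can be a similarity score between the action the model proposes and the corresponding expert action, which yields dense feedback even when the final answer is wrong. The similarity metric and the toy trajectories below are assumptions:

```python
from difflib import SequenceMatcher

def step_reward(predicted_action: str, expert_action: str) -> float:
    """Dense per-step reward: similarity to the expert's action (placeholder metric)."""
    return SequenceMatcher(None, predicted_action, expert_action).ratio()

expert_trajectory = [
    "expand (x+2)^2 to x^2 + 4x + 4",
    "subtract 4 from both sides",
    "factor x^2 + 4x = x(x + 4)",
]
model_trajectory = [
    "expand (x+2)^2 to x^2 + 4x + 4",
    "subtract 4 from both sides",
    "divide both sides by x",          # flawed step -> low reward here only
]

rewards = [step_reward(p, e) for p, e in zip(model_trajectory, expert_trajectory)]
print(rewards)   # the first two steps still earn high reward despite the bad final step
```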
SRL in action
The researchers' experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.
For enterprise leaders, performance gains are only valuable if they don't come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. "The gains come from better reasoning quality and structure, not from verbosity," he said. "In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage... while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it."
For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.
The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL's ability to train more competent AI agents for complex, real-world programming tasks.
A new standard for high-stakes AI?
The paper’s strongest results came from combining methods: First, using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL for pre-training and applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy.
This raises the question of whether this could become a new blueprint for building specialized AI.
"We view SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications."
Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data."
It was originally found in leaked code and publicized by AI influencers on X, but OpenAI has made it official: ChatGPT now offers Group Chats, allowing multiple users to join the same, single ChatGPT conversation and send messages to each other and the underlying large language model (LLM), online and via its mobile apps.
Imagine adding ChatGPT as another member of your existing group chats, allowing you to text it as you would one of your friends or family members and have them respond as well, and you'll have an idea of the intriguing power and potential of this feature.
However, the feature is only available as a limited pilot for now to ChatGPT users in Japan, New Zealand, South Korea, and Taiwan (all tiers, including free usage).
“Group chats are just the beginning of ChatGPT becoming a shared space to collaborate and interact with others,” OpenAI wrote in its announcement.
This development builds on internal experimentation at OpenAI, where technical staffer Keyan Zhang said in a post on X that OpenAI's team initially considered multiplayer ChatGPT to be “a wild, out-of-distribution idea.”
According to Zhang, the model’s performance in those early tests demonstrated far more potential than existing interfaces typically allow.
The move follows Microsoft, an OpenAI investor as well as competitor, updating its Copilot AI assistant last month to allow group chats. Anthropic has also offered shareable context and chat histories for its Claude AI models through its Projects feature, introduced in summer 2024, though that is not a simultaneous, real-time group chat in the same way.
Collaborative functionality integrated into ChatGPT
Group chats function as shared conversational spaces where users can plan events, brainstorm ideas, or collaborate on projects with the added support of ChatGPT.
These conversations are distinct from individual chats and are excluded from ChatGPT’s memory system—meaning no data from these group threads is used to train or personalize future interactions.
Users can initiate a group chat by selecting the people icon in a new or existing conversation. Adding others creates a copy of the original thread, preserving the source dialogue. Participants can join via a shareable link and are prompted to create a profile with a name, username, and photo. The feature supports 1 to 20 participants per group.
Each group chat is listed in a new section of the ChatGPT interface, and users can manage settings like naming the group, adding or removing participants, or muting notifications.
Powered by GPT-5.1 with expanded tools
The new group chat feature runs on GPT-5.1 Auto, a backend setting that chooses the optimal model based on the user’s subscription tier and the prompt.
Functionality such as search, image generation, file upload, and dictation is available inside group conversations.
Importantly, the system applies rate limits only when ChatGPT is producing responses. Direct messages between human users in the group do not count toward any plan’s message cap.
OpenAI has added new social features to ChatGPT in support of this group dynamic. The model can react with emojis, interpret conversational context to decide when to respond, and personalize generated content using members’ profile photos—such as inserting user likenesses into images when asked.
Privacy by default, controls for younger users
OpenAI emphasized that privacy and user control are integral to group chat design. The feature operates independently of the user’s personalized ChatGPT memory, and no new memories are created from these interactions.
Participation requires an invitation link, and members are always able to see who is in a chat or leave at any time.
Users under the age of 18 are automatically shielded from sensitive content in group chats. Parents or guardians can disable group chat access altogether via built-in parental controls.
Group creators retain special permissions, including immunity from being removed by others. All other participants can be added or removed by group members.
A testbed for shared AI experiences
OpenAI frames group chats as an early step toward richer, multi-user applications of AI, hinting at broader ambitions for ChatGPT as a shared workspace. The company expects to expand access over time and refine the feature based on how early users engage with it.
Keyan Zhang’s post suggests that the underlying model capabilities are far ahead of the interfaces users currently interact with. This pilot, in OpenAI’s view, offers a new “container” where more of the model’s latent capacity can be surfaced.
“Our models have a lot more room to shine than today’s experiences show, and the current containers only use a fraction of their capabilities,” Zhang said.
With this initial pilot focused on a limited set of markets, OpenAI is likely monitoring both usage patterns and cultural fit as it plans for broader deployment. For now, the group chat experiment offers a new way for users to interact with ChatGPT—and with each other—in real time, using a conversational interface that blends productivity and personalization.
Developer access: Still unclear
OpenAI has not provided any indication that Group Chats will be accessible via the API or SDK. The current rollout is framed strictly within the ChatGPT product environment, with no mention of tool calls, developer hooks, or integration support for programmatic use. This absence of signaling leaves it unclear whether the company views group interaction as a future developer primitive or as a contained UX feature for end users only.
For enterprise teams exploring how to replicate multi-user collaboration with generative models, any current implementation would require custom orchestration—such as managing multi-party context and prompts across separate API calls, and handling session state and response merging externally. Until OpenAI provides formal support, Group Chats remain a closed interface feature rather than a developer-accessible capability.
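Until an official API surface exists, a team wanting group-chat-like behavior has to hold the shared transcript itself and decide when the model should speak. A rough sketch of that orchestration, with call_llm standing in for whichever chat-completion client you use:

```python
from dataclasses import dataclass, field

@dataclass
class GroupThread:
    """Shared transcript for several human participants plus one assistant."""
    messages: list[dict] = field(default_factory=list)

    def post(self, speaker: str, text: str) -> None:
        self.messages.append({"speaker": speaker, "text": text})

    def to_prompt(self) -> list[dict]:
        # Flatten the multi-party transcript into a single-model conversation.
        system = {"role": "system",
                  "content": "You are one participant in a group chat. Reply only when addressed or clearly useful."}
        turns = [{"role": "user", "content": f"{m['speaker']}: {m['text']}"} for m in self.messages]
        return [system] + turns

def call_llm(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (OpenAI, Anthropic, a local model, etc.)."""
    return "(model reply would go here)"

thread = GroupThread()
thread.post("Aiko", "Can we move the launch review to Friday?")
thread.post("Sam", "@assistant what slots are free on Friday?")
# The orchestrator decides the assistant was addressed, so it generates a turn.
if "@assistant" in thread.messages[-1]["text"]:
    thread.post("assistant", call_llm(thread.to_prompt()))
print(thread.messages[-1])
```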
Implications for enterprise AI and data leaders
For enterprise teams already leveraging AI platforms—or preparing to—OpenAI’s group chat feature introduces a new layer of multi-user collaboration that could shift how generative models are deployed across workflows. While the pilot is limited to users in Japan, New Zealand, South Korea, and Taiwan, its design and roadmap offer key signals for AI engineers, orchestration specialists, and data leads globally.
AI engineers managing large language model (LLM) deployments can now begin to conceptualize real-time, multi-user interfaces not just as support tools, but as collaborative environments for research, content generation, and ideation. This adds another front in model tuning: not just how models respond to individuals, but how they behave in live group settings with context shifts and varied user intentions.
For AI orchestration leads, the ability to integrate ChatGPT into collaborative flows without exposing private memory or requiring custom builds may reduce friction in piloting generative AI in cross-functional teams. These group sessions could serve as lightweight alternatives to internal tools for brainstorming, prototyping, or knowledge sharing—useful for teams constrained by infrastructure, budget, or time.
Enterprise data managers may also find use cases in structured group chat sessions for data annotation, taxonomy validation, or internal training support. The system’s lack of memory persistence adds a level of data isolation that aligns with standard security and compliance practices—though global rollout will be key to validating regional data handling standards.
As group chat capabilities evolve, decision makers should monitor how shared usage patterns might inform future model behaviors, auditing needs, and governance structures. In the long term, features like these will influence not just how organizations interact with generative AI, but how they design team-level interfaces around it.
OpenAI experiment finds that sparse models could give AI builders the tools to debug neural networks
OpenAI researchers are experimenting with a new approach to designing neural networks, with the aim of making AI models easier to understand, debug, and govern. Sparse models can provide enterprises with a better understanding of how these models make decisions.
Understanding how models choose to respond, a big selling point of reasoning models for enterprises, can provide a level of trust for organizations when they turn to AI models for insights.
The method called for OpenAI scientists and researchers to evaluate models not by analyzing post-training performance, but by building interpretability, or understanding, into the models themselves through sparse circuits.
OpenAI notes that much of the opacity of AI models stems from how most models are designed, so to gain a better understanding of model behavior, they must create workarounds.
“Neural networks power today’s most capable AI systems, but they remain difficult to understand,” OpenAI wrote in a blog post. “We don’t write these models with explicit step-by-step instructions. Instead, they learn by adjusting billions of internal connections or weights until they master a task. We design the rules of training, but not the specific behaviors that emerge, and the result is a dense web of connections that no human can easily decipher.”
To improve interpretability, OpenAI examined an architecture that trains untangled neural networks, making them simpler to understand. The team trained language models with a similar architecture to existing models, such as GPT-2, using the same training schema.
The result: improved interpretability.
The path toward interpretability
Understanding how models work, and how they reach their determinations, is important because those decisions have real-world impact, OpenAI says.
The company defines interpretability as “methods that help us understand why a model produced a given output.” There are several ways to achieve interpretability: chain-of-thought interpretability, which reasoning models often leverage, and mechanistic interpretability, which involves reverse-engineering a model’s mathematical structure.
OpenAI focused on improving mechanistic interpretability, which it said “has so far been less immediately useful, but in principle, could offer a more complete explanation of the model’s behavior.”
“By seeking to explain model behavior at the most granular level, mechanistic interpretability can make fewer assumptions and give us more confidence. But the path from low-level details to explanations of complex behaviors is much longer and more difficult,” according to OpenAI.
Better interpretability allows for better oversight and gives early warning signs if the model’s behavior no longer aligns with policy.
OpenAI noted that improving mechanistic interpretability “is a very ambitious bet,” but research on sparse networks has improved this.
How to untangle a model
To untangle the mess of connections a model makes, OpenAI first cut most of them. Because transformer models like GPT-2 contain a vast number of connections, the team had to “zero out” most of these circuits so that each neuron talks to only a select few, making the connections more orderly.
Next, the team ran “circuit tracing” on tasks to create groupings of interpretable circuits. The last task involved pruning the model “to obtain the smallest circuit which achieves a target loss on the target distribution,” according to OpenAI. It targeted a loss of 0.15 to isolate the exact nodes and weights responsible for behaviors.
“We show that pruning our weight-sparse models yields roughly 16-fold smaller circuits on our tasks than pruning dense models of comparable pretraining loss. We are also able to construct arbitrarily accurate circuits at the cost of more edges. This shows that circuits for simple behaviors are substantially more disentangled and localizable in weight-sparse models than dense models,” the report said.
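The “zero out most connections” step can be pictured with a few lines of numpy: keep only the largest-magnitude weights feeding each neuron and drop the rest, then count what remains. This is a generic magnitude-sparsification sketch, not OpenAI’s training procedure or its circuit-tracing code:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # a tiny dense weight matrix standing in for one layer

def sparsify(weights: np.ndarray, keep_per_row: int = 2) -> np.ndarray:
    """Zero out all but the k largest-magnitude connections for each output neuron."""
    sparse = np.zeros_like(weights)
    for i, row in enumerate(weights):
        keep = np.argsort(np.abs(row))[-keep_per_row:]
        sparse[i, keep] = row[keep]
    return sparse

W_sparse = sparsify(W)
print(f"nonzero connections: {np.count_nonzero(W)} -> {np.count_nonzero(W_sparse)}")
# Each neuron now talks to only two inputs, so tracing which connection drives a behavior is tractable.
```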
Small models become easier to understand
Although OpenAI managed to create sparse models that are easier to understand, these remain significantly smaller than most foundation models used by enterprises. Enterprises increasingly use small models, but frontier models, such as its flagship GPT-5.1, will still benefit from improved interpretability down the line.
Other model developers also aim to understand how their AI models think. Anthropic, which has been researching interpretability for some time, recently revealed that it had “hacked” Claude’s brain — and Claude noticed. Meta also is working to find out how reasoning models make their decisions.
As more enterprises turn to AI models to help make consequential decisions for their business, and eventually customers, research into understanding how models think would give the clarity many organizations need to trust models more.
Mere hours after OpenAI updated its flagship foundation model GPT-5 to GPT-5.1, promising reduced token usage overall and a more pleasant personality with more preset options, Chinese search giant Baidu unveiled its next-generation foundation model, ERNIE 5.0, alongside a suite of AI product upgrades and strategic international expansions.
The goal: to position as a global contender in the increasingly competitive enterprise AI market.
Announced at the company's Baidu World 2025 event, ERNIE 5.0 is a proprietary, natively omni-modal model designed to jointly process and generate content across text, images, audio, and video.
Unlike Baidu’s recently released ERNIE-4.5-VL-28B-A3B-Thinking, which is open source under an enterprise-friendly and permissive Apache 2.0 license, ERNIE 5.0 is a proprietary model available only via Baidu’s ERNIE Bot website (I needed to select it manually from the model picker dropdown) and the Qianfan cloud platform application programming interface (API) for enterprise customers.
Alongside the model launch, Baidu introduced major updates to its digital human platform, no-code tools, and general-purpose AI agents — all targeted at expanding its AI footprint beyond China.
The company also introduced ERNIE 5.0 Preview 1022, a variant optimized for text-intensive tasks, alongside the general preview model that balances across modalities.
Baidu emphasized that ERNIE 5.0 represents a shift in how intelligence is deployed at scale, with CEO Robin Li stating: “When you internalize AI, it becomes a native capability and transforms intelligence from a cost into a source of productivity.”
Where ERNIE 5.0 outshines GPT-5 and Gemini 2.5 Pro
ERNIE 5.0’s benchmark results suggest that Baidu has achieved parity—or near-parity—with the top Western foundation models across a wide spectrum of tasks.
In public benchmark slides shared during the Baidu World 2025 event, ERNIE 5.0 Preview outperformed or matched OpenAI’s GPT-5-High and Google’s Gemini 2.5 Pro in multimodal reasoning, document understanding, and image-based QA, while also demonstrating strong language modeling and code execution abilities.
The company emphasized its ability to handle joint inputs and outputs across modalities, rather than relying on post-hoc modality fusion, which it framed as a technical differentiator.
On visual tasks, ERNIE 5.0 achieved leading scores on OCRBench, DocVQA, and ChartQA, three benchmarks that test document recognition, comprehension, and structured data reasoning.
Baidu claims the model beat both GPT-5-High and Gemini 2.5 Pro on these document and chart-based benchmarks, areas it describes as core to enterprise applications like automated document processing and financial analysis.
In image generation, ERNIE 5.0 tied or exceeded Google’s Veo3 across categories including semantic alignment and image quality, according to Baidu’s internal GenEval-based evaluation. Baidu claimed that the model’s multimodal integration allows it to generate and interpret visual content with greater contextual awareness than models relying on modality-specific encoders.
For audio and speech tasks, ERNIE 5.0 demonstrated competitive results on MM-AU and TUT2017 audio understanding benchmarks, as well as question answering from spoken language inputs. Its audio performance, while not as heavily emphasized as vision or text, suggests a broad capability footprint intended to support full-spectrum multimodal applications.
In language tasks, the model showed strong results on instruction following, factual question answering, and mathematical reasoning—core areas that define the enterprise utility of large language models.
The Preview 1022 variant of ERNIE 5.0, tailored for textual performance, showed even stronger language-specific results in early developer access. While Baidu does not claim broad superiority in general language reasoning, its internal evaluations suggest that ERNIE 5.0 Preview 1022 closes the gap with top-tier English-language models and outperforms them in Chinese-language performance.
While Baidu did not release full benchmark details or raw scores publicly, its performance positioning suggests a deliberate attempt to frame ERNIE 5.0 not as a niche multimodal system but as a flagship model competitive with the largest closed models in general-purpose reasoning.
Where Baidu claims a clear lead is in structured document understanding, visual chart reasoning, and integration of multiple modalities into a single, native modeling architecture. Independent verification of these results remains pending, but the breadth of claimed capabilities positions ERNIE 5.0 as a serious alternative in the multimodal foundation model landscape.
Enterprise Pricing Strategy
ERNIE 5.0 is positioned at the premium end of Baidu’s model pricing structure. The company has released specific pricing for API usage on its Qianfan platform, aligning the cost with other top-tier offerings from Chinese competitors like Alibaba.
| Model | Input cost (per 1K tokens) | Output cost (per 1K tokens) |
| --- | --- | --- |
| ERNIE 5.0 | $0.00085 (¥0.006) | $0.0034 (¥0.024) |
| ERNIE 4.5 Turbo (ex.) | $0.00011 (¥0.0008) | $0.00045 (¥0.0032) |
| Qwen3 (Coder ex.) | $0.00085 (¥0.006) | $0.0034 (¥0.024) |
The contrast in cost between ERNIE 5.0 and earlier models such as ERNIE 4.5 Turbo underscores Baidu’s strategy to differentiate between high-volume, low-cost models and high-capability models designed for complex tasks and multimodal reasoning.
Compared to other U.S. alternatives, it remains mid-range in pricing:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| GPT-5.1 | $1.25 | $10.00 |
| ERNIE 5.0 | $0.85 | $3.40 |
| ERNIE 4.5 Turbo (ex.) | $0.11 | $0.45 |
| Claude Opus 4.1 | $15.00 | $75.00 |
| Gemini 2.5 Pro | $1.25 (≤200K) / $2.50 (>200K) | $10.00 (≤200K) / $15.00 (>200K) |
| Grok 4 (grok-4-0709) | $3.00 | $15.00 |
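Using the list prices above, a quick back-of-the-envelope comparison for a workload of one million input and one million output tokens (real bills will vary with caching, batching and regional pricing):

```python
# Per-1M-token list prices from the table above (input, output), in USD.
prices = {
    "GPT-5.1":         (1.25, 10.00),
    "ERNIE 5.0":       (0.85, 3.40),
    "ERNIE 4.5 Turbo": (0.11, 0.45),
}

for model, (inp, out) in prices.items():
    # Cost of processing 1M input tokens and generating 1M output tokens.
    print(f"{model}: ${inp + out:.2f}")
# GPT-5.1: $11.25, ERNIE 5.0: $4.25, ERNIE 4.5 Turbo: $0.56
```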
Global Expansion: Products and Platforms
In tandem with the model release, Baidu is expanding internationally:
- GenFlow 3.0, now with 20M+ users, is the company’s largest general-purpose AI agent and features enhanced memory and multimodal task handling.
- Famou, a self-evolving agent capable of dynamically solving complex problems, is now commercially available via invite.
- MeDo, the international version of Baidu’s no-code builder Miaoda, is live globally via medo.dev.
- Oreate, a productivity workspace with document, slide, image, video, and podcast support, has reached over 1.2M users worldwide.
Baidu’s digital human platform, already rolled out in Brazil, is also part of the global push. According to company data, 83% of livestreamers during this year’s “Double 11” shopping event in China used Baidu’s digital human tech, contributing to a 91% increase in GMV.
Meanwhile, Baidu’s autonomous ride-hailing service Apollo Go has surpassed 17 million rides, operating driverless fleets in 22 cities and claiming the title of the world’s largest robotaxi network.
Open-Source Vision-Language Model Garners Industry Attention
Two days before the flagship ERNIE 5.0 event, Baidu also released an open-source multimodal model under the Apache 2.0 license: ERNIE-4.5-VL-28B-A3B-Thinking.
As reported by my colleague Michael Nuñez at VentureBeat, the model activates just 3 billion parameters while maintaining a total of 28 billion, using a Mixture-of-Experts (MoE) architecture for efficient inference.
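The efficiency claim rests on standard Mixture-of-Experts routing: a small gate picks a few experts per token, so only a slice of the total parameters does work on any given input. A toy illustration follows; the expert count and top-k here are arbitrary, not ERNIE’s actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
num_experts, top_k, d = 8, 2, 16
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]   # each expert stands in for a small MLP
gate = rng.normal(size=(d, num_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route the token through only the top-k experts chosen by the gate."""
    logits = x @ gate
    chosen = np.argsort(logits)[-top_k:]                  # indices of the k best experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d)
out = moe_forward(token)
# Only 2 of 8 expert weight matrices were touched: the "active parameters" are a fraction of the total.
print(out.shape)
```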
Key technical innovations include:
- “Thinking with Images”, which enables dynamic zoom-based visual analysis
- Support for chart interpretation, document understanding, visual grounding, and temporal awareness in video
- Runtime on a single 80GB GPU, making it accessible to mid-sized organizations
- Full compatibility with Transformers, vLLM, and Baidu’s FastDeploy toolkits
This release adds pressure on closed-source competitors. With Apache 2.0 licensing, ERNIE-4.5-VL-28B-A3B-Thinking becomes a viable foundation model for commercial applications without licensing restrictions — something few high-performing models in this class offer.
Community Feedback and Baidu’s Response
Following the launch of ERNIE 5.0, developer and AI evaluator Lisan al Gaib (@scaling01) posted a mixed review on X. While initially impressed by the model’s benchmark performance, they reported a persistent issue where ERNIE 5.0 would repeatedly invoke tools — even when explicitly instructed not to — during SVG generation tasks.
“ERNIE 5.0 benchmarks looked insane until I tested it… unfortunately it’s RL braindamaged or they have a serious issue with their chat platform / system prompt,” Lisan wrote.
In a matter of hours, Baidu’s developer-focused support account, @ErnieforDevs, responded:
“Thanks for the feedback! It’s a known bug — certain syntax can consistently trigger it. We’re working on a fix. You can try rephrasing or changing the prompt to avoid it for now.”
The quick turnaround reflects Baidu’s increasing emphasis on developer communication, especially as it courts international users through both proprietary and open-source offerings.
Outlook for Baidu and its ERNIE foundational LLM family
Baidu’s ERNIE 5.0 marks a strategic escalation in the global foundation model race. With performance claims that put it on par with the most advanced systems from OpenAI and Google, and a mix of premium pricing and open-access alternatives, Baidu is signaling its ambition to become not just a domestic AI leader, but a credible global infrastructure provider.
At a time when enterprise AI users are increasingly demanding multimodal performance, flexible licensing, and deployment efficiency, Baidu’s two-track approach—premium hosted APIs and open-source releases—may broaden its appeal across both corporate and developer communities.
Whether the company’s performance claims hold up under third-party testing remains to be seen. But in a landscape shaped by rising costs, model complexity, and compute bottlenecks, ERNIE 5.0 and its supporting ecosystem give Baidu a competitive position in the next wave of AI deployment.
Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking research released Thursday by Upwork, the largest online work marketplace.
But the same study reveals a more promising path forward: When AI agents collaborate with human experts, project completion rates surge by up to 70%, suggesting the future of work may not pit humans against machines but rather pair them together in powerful new ways.
The findings, drawn from more than 300 real client projects posted to Upwork's platform, mark the first systematic evaluation of how human expertise amplifies AI agent performance in actual professional work — not synthetic tests or academic simulations. The research challenges both the hype around fully autonomous AI agents and fears that such technology will imminently replace knowledge workers.
"AI agents aren't that agentic, meaning they aren't that good," Andrew Rabinovich, Upwork's chief technology officer and head of AI and machine learning, said in an exclusive interview with VentureBeat. "However, when paired with expert human professionals, project completion rates improve dramatically, supporting our firm belief that the future of work will be defined by humans and AI collaborating to get more work done, with human intuition and domain expertise playing a critical role."
How AI agents performed on 300+ real freelance jobs—and why they struggled
Upwork's Human+Agent Productivity Index (HAPI) evaluated how three leading AI systems — Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Sonnet 4 — performed on actual jobs posted by paying clients across categories including writing, data science, web development, engineering, sales, and translation.
Critically, Upwork deliberately selected simple, well-defined projects where AI agents stood a reasonable chance of success. These jobs, priced under $500, represent less than 6% of Upwork's total gross services volume — a tiny fraction of the platform's overall business and an acknowledgment of current AI limitations.
"The reality is that although we study AI, and I've been doing this for 25 years, and we see significant breakthroughs, the reality is that these agents aren't that agentic," Rabinovich told VentureBeat. "So if we go up the value chain, the problems become so much more difficult, then we don't think they can solve them at all, even to scratch the surface. So we specifically chose simpler tasks that would give an agent some kind of traction."
Even on these deliberately simplified tasks, AI agents working independently struggled. But when expert freelancers provided feedback — spending an average of just 20 minutes per review cycle — the agents' performance improved substantially with each iteration.
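Conceptually, the setup resembles a short loop: the agent drafts, an expert reviews against the job’s requirements for about 20 minutes, and the notes are folded into the next attempt. The functions below are placeholders sketching that cycle, not Upwork’s evaluation harness:

```python
def agent_attempt(job_brief: str, feedback_history: list[str]) -> str:
    """Placeholder for an LLM agent producing a deliverable, conditioned on prior feedback."""
    return f"draft of '{job_brief}' incorporating {len(feedback_history)} rounds of feedback"

def expert_review(deliverable: str, prior_rounds: int) -> tuple[bool, str]:
    """Placeholder for a ~20-minute human review; here it signs off after two rounds of fixes."""
    return prior_rounds >= 2, "tighten the executive summary; cite the Q3 numbers"

feedback: list[str] = []
for cycle in range(5):                     # a few short review cycles instead of days of solo work
    draft = agent_attempt("write a market brief", feedback)
    done, notes = expert_review(draft, len(feedback))
    if done:
        break
    feedback.append(notes)
print(f"accepted after {cycle + 1} cycles with {len(feedback)} feedback notes applied")
```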
20 minutes of human feedback boosted AI completion rates up to 70%
The research reveals stark differences in how AI agents perform with and without human guidance across different types of work. For data science and analytics projects, Claude Sonnet 4 achieved a 64% completion rate working alone but jumped to 93% after receiving feedback from a human expert. In sales and marketing work, Gemini 2.5 Pro's completion rate rose from 17% independently to 31% with human input. OpenAI's GPT-5 showed similarly dramatic improvements in engineering and architecture tasks, climbing from 30% to 50% completion.
The pattern held across virtually all categories, with agents responding particularly well to human feedback on qualitative, creative work requiring editorial judgment — areas like writing, translation, and marketing — where completion rates increased by up to 17 percentage points per feedback cycle.
The finding challenges a fundamental assumption in the AI industry: that agent benchmarks conducted in isolation accurately predict real-world performance.
"While we show that in the tasks that we have selected for agents to perform in isolation, they perform similarly to the previous results that we've seen published openly, what we've shown is that in collaboration with humans, the performance of these agents improves surprisingly well," Rabinovich said. "It's not just a one-turn back and forth, but the more feedback the human provides, the better the agent gets at performing."
Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'
The research arrives as the AI industry grapples with a measurement crisis. Traditional benchmarks — standardized tests that AI models can master, sometimes scoring perfectly on SAT exams or mathematics olympiads — have proven poor predictors of real-world capability.
"With advances of large language models, what we're now seeing is that these static, academic datasets are completely saturated," Rabinovich said. "So you could get a perfect score in the SAT test or LSAT or any of the math olympiads, and then you would ask ChatGPT how many R's there are in the word strawberry, and it would get it wrong."
This phenomenon — where AI systems ace formal tests but stumble on trivial real-world questions — has led to growing skepticism about AI capabilities, even as companies race to deploy autonomous agents. Several recent benchmarks from other firms have tested AI agents on Upwork jobs, but those evaluations measured only isolated performance, not the collaborative potential that Upwork's research reveals.
"We wanted to evaluate the quality of these agents on actual real work with economic value associated with it, and not only see how well these agents do, but also see how these agents do in collaboration with humans, because we sort of knew already that in isolation, they're not that advanced," Rabinovich explained.
For Upwork, which connects roughly 800,000 active clients posting more than 3 million jobs annually to a global pool of freelancers, the research serves a strategic business purpose: establishing quality standards for AI agents before allowing them to compete or collaborate with human workers on its platform.
The economics of human-AI teamwork: Why paying for expert feedback still saves money
Despite requiring multiple rounds of human feedback — each lasting about 20 minutes — the time investment remains "orders of magnitude different between a human doing the work alone, versus a human doing the work with an AI agent," Rabinovich said. Where a project might take a freelancer days to complete independently, the agent-plus-human approach can deliver results in hours through iterative cycles of automated work and expert refinement.
The economic implications extend beyond simple time savings. Upwork recently reported that gross services volume from AI-related work grew 53% year-over-year in the third quarter of 2025, one of the strongest growth drivers for the company. But executives have been careful to frame AI not as a replacement for freelancers but as an enhancement to their capabilities.
"AI was a huge overhang for our valuation," Erica Gessert, Upwork's CFO, told CFO Brew in October. "There was this belief that all work was going to go away. AI was going to take it, and especially work that's done by people like freelancers, because they are impermanent. Actually, the opposite is true."
The company's strategy centers on enabling freelancers to handle more complex, higher-value work by offloading routine tasks to AI. "Freelancers actually prefer to have tools that automate the manual labor and repetitive part of their work, and really focus on the creative and conceptual part of the process," Rabinovich said.
Rather than replacing jobs, he argues, AI will transform them: "Simpler tasks will be automated by agents, but the jobs will become much more complex in the number of tasks, so the amount of work and therefore earnings for freelancers will actually only go up."
AI coding agents excel, but creative writing and translation still need humans
The research reveals a clear pattern in agent capabilities. AI systems perform best on "deterministic and verifiable" tasks with objectively correct answers, like solving math problems or writing basic code. "Most coding tasks are very similar to each other," Rabinovich noted. "That's why coding agents are becoming so good."
In Upwork's tests, web development, mobile app development, and data science projects — especially those involving structured, computational work — saw the highest standalone agent completion rates. Claude Sonnet 4 completed 68% of web development jobs and 64% of data science projects without human help, while Gemini 2.5 Pro achieved 74% on certain technical tasks.
But qualitative work proved far more challenging. When asked to create website layouts, write marketing copy, or translate content with appropriate cultural nuance, agents floundered without expert guidance. "When you ask it to write you a poem, the quality of the poem is extremely subjective," Rabinovich said. "Since the rubrics for evaluation were provided by humans, there's some level of variability in representation."
Writing, translation, and sales and marketing projects showed the most dramatic improvements from human feedback. For writing work, completion rates increased by up to 17 percentage points after expert review. Engineering and architecture projects requiring creative problem-solving — like civil engineering or architectural design — improved by as much as 23 percentage points with human oversight.
This pattern suggests AI agents excel at pattern matching and replication but struggle with creativity, judgment, and context — precisely the skills that define higher-value professional work.
Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods
Upwork partnered with elite freelancers on its platform to evaluate every deliverable produced by AI agents, both independently and after each cycle of human feedback. These evaluators created detailed rubrics defining whether projects met core requirements specified in job descriptions, then scored outputs across multiple iterations.
Importantly, evaluators focused only on objective completion criteria, excluding subjective factors like stylistic preferences or quality judgments that might emerge in actual client relationships. "Rubric-based completion rates should not be viewed as a measure of whether an agent would be paid in a real marketplace setting," the research notes, "but as an indicator of its ability to fulfill explicitly defined requests."
This distinction matters: An AI agent might technically complete all specified requirements yet still produce work a client rejects as inadequate. Conversely, subjective client satisfaction — the true measure of marketplace success — remains beyond current measurement capabilities.
The research underwent double-blind peer review and was accepted to NeurIPS, the premier academic conference for AI research, where Upwork will present full results in early December. The company plans to publish a complete methodology and make the benchmark available to the research community, updating the task pool regularly to prevent overfitting as agents improve.
"The idea is for this benchmark to be a living and breathing platform where agents can come in and evaluate themselves on all categories of work, and the tasks that will be offered on the platform will always update, so that these agents don't overfit and basically memorize the tasks at hand," Rabinovich said.
Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers
The research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients.
"Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."
In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker.
"We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."
The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.
OpenAI, Anthropic, and Google race to build autonomous agents — but reality lags hype
Upwork's findings arrive amid escalating competition in the AI agent space. OpenAI, Anthropic, Google, and numerous startups are racing to develop autonomous agents capable of complex multi-step tasks, from booking travel to analyzing financial data to writing software.
But recent high-profile stumbles have tempered initial enthusiasm. AI agents frequently misunderstand instructions, make logical errors, or produce confidently wrong results — the last a failure mode researchers call "hallucination." The gap between controlled demonstration videos and reliable real-world performance remains vast.
"There have been some evaluations that came from OpenAI and other platforms where real Upwork tasks were considered for completion by agents, and across the board, the reported results were not very optimistic, in the sense that they showed that agents—even the best ones, meaning powered by most advanced LLMs — can't really compete with humans that well, because the completion rates are pretty low," Rabinovich said.
Rather than waiting for AI to fully mature — a timeline that remains uncertain — Upwork is betting on a hybrid approach that leverages AI's strengths (speed, scalability, pattern recognition) while retaining human strengths (judgment, creativity, contextual understanding).
This philosophy extends to learning and improvement. Current AI models train primarily on static datasets scraped from the internet, supplemented by human preference feedback. But most professional work is qualitative, making it difficult for AI systems to know whether their outputs are actually good without expert evaluation.
"Unless you have this collaboration between the human and the machine, where the human is kind of the teacher and the machine is the student trying to discover new solutions, none of this will be possible," Rabinovich said. "Upwork is very uniquely positioned to create such an environment because if you try to do this with, say, self-driving cars, and you tell Waymo cars to explore new ways of getting to the airport, like avoiding traffic signs, then a bunch of bad things will happen. In doing work on Upwork, if it creates a wrong website, it doesn't cost very much, and there's no negative side effects. But the opportunity to learn is absolutely tremendous."
Will AI take your job? The evidence suggests a more complicated answer
While much public discourse around AI focuses on job displacement, Rabinovich argues the historical pattern suggests otherwise — though the transition may prove disruptive.
"The narrative in the public is that AI is eliminating jobs, whether it's writing, translation, coding or other digital work, but no one really talks about the exponential amount of new types of work that it will create," he said. "When we invented electricity and steam engines and things like that, they certainly replaced certain jobs, but the amount of new jobs that were introduced is exponentially more, and we think the same is going to happen here."
The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards. These skills — prompt engineering, agent supervision, output verification — barely existed two years ago but now command premium rates on platforms like Upwork.
"New types of skills from humans are becoming necessary in the form of how to design the interaction between humans and machines, how to guide agents to make them better, and ultimately, how to verify that whatever agentic proposals are being made are actually correct, because that's what's necessary in order to advance the state of AI," Rabinovich said.
The question remains whether this transition — from doing tasks to overseeing them — will create opportunities as quickly as it disrupts existing roles. For freelancers on Upwork, the answer may already be emerging in their bank accounts: The platform saw AI-related work grow 53% year-over-year, even as fears of AI-driven unemployment dominated headlines.
LinkedIn is launching its new AI-powered people search this week, after what seems like a very long wait for a feature that should have been a natural offering for generative AI.
It comes a full three years after the launch of ChatGPT and six months after LinkedIn launched its AI job search offering. For technical leaders, this timeline illustrates a key enterprise lesson: Deploying generative AI in real enterprise settings is challenging, especially at a scale of 1.3 billion users. It’s a slow, brutal process of pragmatic optimization.
The following account is based on several exclusive interviews with the LinkedIn product and engineering team behind the launch.
First, here’s how the product works: A user can now type a natural language query like, "Who is knowledgeable about curing cancer?" into LinkedIn’s search bar.
LinkedIn's old search, based on keywords, would have been stumped. It would have looked only for references to "cancer". If a user wanted to get sophisticated, they would have had to run separate, rigid keyword searches for "cancer" and then "oncology" and manually try to piece the results together.
The new AI-powered system, however, understands the intent of the search because the LLM under the hood grasps semantic meaning. It recognizes, for example, that "cancer" is conceptually related to "oncology" and even less directly, to "genomics research." As a result, it surfaces a far more relevant list of people, including oncology leaders and researchers, even if their profiles don't use the exact word "cancer."
The system also balances this relevance with usefulness. Instead of just showing the world's top oncologist (who might be an unreachable third-degree connection), it will also weigh who in your immediate network — like a first-degree connection — is "pretty relevant" and can serve as a crucial bridge to that expert.
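To make the contrast with keyword matching concrete, here is a small sketch of how semantic relevance can be blended with network proximity. The embedding model, the 1/degree discount, and the 0.15 weight are all illustrative assumptions; LinkedIn has not disclosed its models or its scoring formula.

```python
from sentence_transformers import SentenceTransformer, util  # any off-the-shelf embedder works

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice, not LinkedIn's model

query = "Who is knowledgeable about curing cancer?"
profiles = {
    "alice": ("Oncology researcher focused on genomics and immunotherapy", 1),  # 1st-degree
    "bob":   ("World-leading cancer drug discovery scientist", 3),              # 3rd-degree
    "carol": ("Front-end engineer building design systems", 1),
}

q_emb = model.encode(query, convert_to_tensor=True)

def score(profile_text: str, degree: int, w_network: float = 0.15) -> float:
    """Blend semantic relevance with a simple network-proximity boost.

    The 0.15 weight and the 1/degree discount are made-up illustrations of the
    'pretty relevant but reachable' trade-off described above.
    """
    relevance = util.cos_sim(q_emb, model.encode(profile_text, convert_to_tensor=True)).item()
    return relevance + w_network * (1.0 / degree)

for name, (text, degree) in profiles.items():
    print(name, round(score(text, degree), 3))
```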
Arguably, though, the more important lesson for enterprise practitioners is the "cookbook" LinkedIn has developed: a replicable, multi-stage pipeline of distillation, co-design, and relentless optimization. LinkedIn had to perfect this on one product before attempting it on another.
"Don't try to do too much all at once," writes Wenjing Zhang, LinkedIn's VP of Engineering, in a post about the product launch, and who also spoke with VentureBeat last week in an interview. She notes that an earlier "sprawling ambition" to build a unified system for all of LinkedIn's products "stalled progress."
Instead, LinkedIn focused on winning one vertical first. The success of its previously launched AI Job Search — which led to job seekers without a four-year degree being 10% more likely to get hired, according to VP of Product Engineering Erran Berger — provided the blueprint.
Now, the company is applying that blueprint to a far larger challenge. "It's one thing to be able to do this across tens of millions of jobs," Berger told VentureBeat. "It's another thing to do this across north of a billion members."
For enterprise AI builders, LinkedIn's journey provides a technical playbook for what it actually takes to move from a successful pilot to a billion-user-scale product.
The new challenge: a 1.3 billion-member graph
The job search product created a robust recipe that the new people search product could build upon, Berger explained.
The recipe started with a "golden data set" of just a few hundred to a thousand real query-profile pairs, meticulously scored against a detailed 20- to 30-page "product policy" document. To scale this for training, LinkedIn used this small golden set to prompt a large foundation model to generate a massive volume of synthetic training data. This synthetic data was used to train a 7-billion-parameter "Product Policy" model — a high-fidelity judge of relevance that was too slow for live production but perfect for teaching smaller models.
However, the team hit a wall early on. For six to nine months, they struggled to train a single model that could balance strict policy adherence (relevance) against user engagement signals. The "aha moment" came when they realized they needed to break the problem down. They distilled the 7B policy model into a 1.7B teacher model focused solely on relevance. They then paired it with separate teacher models trained to predict specific member actions, such as job applications for the jobs product, or connecting and following for people search. This "multi-teacher" ensemble produced soft probability scores that the final student model learned to mimic via KL divergence loss.
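The "multi-teacher" step is a standard knowledge-distillation setup: each teacher emits soft probability scores, and the student is trained to match them with a KL divergence loss. Here is a minimal PyTorch sketch; the temperature, the equal teacher weighting, and the toy dimensions are my assumptions, not LinkedIn's published configuration.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits: torch.Tensor,
                                    teacher_logits: list[torch.Tensor],
                                    weights: list[float],
                                    temperature: float = 2.0) -> torch.Tensor:
    """Match the student's predictions to each teacher's soft targets via KL divergence.

    Each teacher (a relevance judge, an engagement predictor, ...) contributes
    soft probability scores; the student learns to mimic their weighted blend.
    Temperature and weights are illustrative assumptions.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = torch.zeros((), device=student_logits.device)
    for w, t_logits in zip(weights, teacher_logits):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        # batchmean KL is the conventional distillation objective
        loss = loss + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return loss * (temperature ** 2)

# Toy example: a batch of 4 candidates scored over 2 classes (relevant / not relevant).
student = torch.randn(4, 2, requires_grad=True)
relevance_teacher = torch.randn(4, 2)   # stands in for the 1.7B relevance teacher
engagement_teacher = torch.randn(4, 2)  # stands in for the action-prediction teacher
loss = multi_teacher_distillation_loss(student, [relevance_teacher, engagement_teacher], [0.5, 0.5])
loss.backward()
print(float(loss))
```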
The resulting architecture operates as a two-stage pipeline. First, a larger 8B parameter model handles broad retrieval, casting a wide net to pull candidates from the graph. Then, the highly distilled student model takes over for fine-grained ranking. While the job search product successfully deployed a 0.6B (600-million) parameter student, the new people search product required even more aggressive compression. As Zhang notes, the team pruned their new student model from 440M down to just 220M parameters, achieving the necessary speed for 1.3 billion users with less than 1% relevance loss.
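At serving time the division of labor is simple: the larger model casts the wide net, and the compressed student only reranks that shortlist. Below is a schematic, runnable sketch with stub models standing in for the real 8B retriever and 220M student; it is meant only to show the control flow, not either model's actual interface.

```python
import random

class StubModel:
    """Placeholder for either stage; the real components are large neural rankers."""
    def __init__(self, name: str):
        self.name = name
    def retrieve(self, query: str, top_k: int) -> list[str]:
        return [f"member_{i}" for i in range(top_k)]   # broad candidate set from the graph
    def score(self, query: str, candidate: str) -> float:
        return random.random()                          # fine-grained relevance score

retriever = StubModel("8B-retriever")        # stands in for the 8B retrieval model
student_ranker = StubModel("220M-student")   # stands in for the pruned 220M ranker

def search(query: str, k_retrieve: int = 1000, k_return: int = 25) -> list[str]:
    candidates = retriever.retrieve(query, top_k=k_retrieve)                       # stage 1: wide net
    ranked = sorted(candidates, key=lambda c: student_ranker.score(query, c), reverse=True)
    return ranked[:k_return]                                                       # stage 2: precise ranking

print(search("Who is knowledgeable about curing cancer?")[:3])
```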
But applying this to people search broke the old architecture. The new problem included not just ranking but also retrieval.
“A billion records," Berger said, is a "different beast."
The team’s prior retrieval stack was built on CPUs. To handle the new scale and the latency demands of a "snappy" search experience, the team had to move its indexing to GPU-based infrastructure. This was a foundational architectural shift that the job search product did not require.
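The article doesn't name the indexing technology LinkedIn adopted, but the CPU-to-GPU move it describes maps onto a familiar pattern in open-source vector search. As one illustration (not LinkedIn's actual stack), here is how an exact nearest-neighbor index can be shifted onto a GPU with FAISS:

```python
import numpy as np
import faiss  # open-source similarity-search library; requires the faiss-gpu build

d = 256                                                        # embedding dimension (placeholder)
member_vectors = np.random.rand(100_000, d).astype("float32")  # stand-in for member embeddings

cpu_index = faiss.IndexFlatIP(d)   # exact inner-product index, built on CPU
cpu_index.add(member_vectors)

# Moving the index into GPU memory is what buys low-latency, "snappy" retrieval at scale
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)

query = np.random.rand(1, d).astype("float32")
scores, ids = gpu_index.search(query, 10)   # top-10 nearest neighbors, computed on the GPU
print(ids[0])
```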
Organizationally, LinkedIn benefited from multiple approaches. For a time, LinkedIn had two separate teams — job search and people search — attempting to solve the problem in parallel. But once the job search team achieved its breakthrough using the policy-driven distillation method, Berger and his leadership team intervened. They brought over the architects of the job search win — product lead Rohan Rajiv and engineering lead Wenjing Zhang — to transplant their 'cookbook' directly to the new domain.
Distilling for a 10x throughput gain
With the retrieval problem solved, the team faced the ranking and efficiency challenge. This is where the cookbook was adapted with new, aggressive optimization techniques.
Zhang’s technical post (I’ll insert the link once it goes live) provides the specific details our audience of AI engineers will appreciate. One of the more significant optimizations was input size.
To shrink what the ranking model has to read, the team trained another LLM with reinforcement learning (RL) for a single purpose: summarizing the input context. This "summarizer" model reduced the ranking model's input size 20-fold with minimal information loss.
The combined result of the 220M-parameter model and the 20x input reduction? A 10x increase in ranking throughput, allowing the team to serve the model efficiently to its massive user base.
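In serving terms, the summarizer simply sits in front of the ranker: the raw context is compressed once, and the 220M student only ever sees the short version. The sketch below shows where that 20x reduction enters the pipeline; both components are crude stand-ins, and the RL training that produced the real summarizer is out of scope here.

```python
def rank_with_summarized_context(query: str, raw_context: str) -> float:
    """Compress the input before ranking, as described above.

    `summarizer` stands in for the RL-trained summarization model and
    `ranker` for the 220M-parameter student; both are hypothetical stubs
    used only to show where the 20x input reduction enters the pipeline.
    """
    short_context = summarizer(raw_context)   # ~20x fewer tokens, minimal information loss
    return ranker(query, short_context)       # cheaper forward pass -> higher throughput

# Minimal stand-ins so the sketch runs end to end.
summarizer = lambda text: text[: max(1, len(text) // 20)]   # crude 20x truncation as a placeholder
ranker = lambda query, ctx: float(len(set(query.split()) & set(ctx.split())))  # toy relevance score

print(rank_with_summarized_context("oncology researcher", "oncology " * 200))
```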
Pragmatism over hype: building tools, not agents
Throughout our discussions, Berger was adamant about something else that might catch people's attention: The real value for enterprises today lies in perfecting recommender systems, not in chasing "agentic hype." He also refused to talk about the specific models that the company used for the searches, suggesting it almost doesn't matter. The company selects models based on which one it finds the most efficient for the task.
The new AI-powered people search is a manifestation of Berger’s philosophy that it’s best to optimize the recommender system first. The architecture includes a new "intelligent query routing layer," as Berger explained, that itself is LLM-powered. This router pragmatically decides if a user's query — like "trust expert" — should go to the new semantic, natural-language stack or to the old, reliable lexical search.
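In practice such a router is just a classification step in front of two search stacks. The sketch below uses a crude heuristic where LinkedIn uses an LLM, purely to show the control flow; which stack a given query like "trust expert" actually lands in is LinkedIn's call, not implied here.

```python
def route_query(query: str) -> str:
    """Decide whether a query goes to the semantic stack or the lexical stack.

    In production this decision is itself LLM-powered; here `looks_semantic`
    is a crude heuristic stand-in so the control flow is runnable.
    """
    if looks_semantic(query):
        return semantic_search(query)
    return lexical_search(query)

def looks_semantic(query: str) -> bool:
    # Natural-language questions ("who is knowledgeable about...") lean semantic;
    # short exact tokens (a name, an error code, a niche keyword) lean lexical.
    return len(query.split()) > 4 or query.lower().startswith(("who", "which", "find people"))

semantic_search = lambda q: f"[semantic stack] {q}"   # placeholder for the new NL pipeline
lexical_search = lambda q: f"[lexical stack] {q}"     # placeholder for the legacy keyword search

print(route_query("Who is knowledgeable about curing cancer?"))  # routes to the semantic stack
print(route_query("trust expert"))                               # routes to the lexical stack
```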
This entire, complex system is designed to be a "tool" that a future agent will use, not the agent itself.
"Agentic products are only as good as the tools that they use to accomplish tasks for people," Berger said. "You can have the world's best reasoning model, and if you're trying to use an agent to do people search but the people search engine is not very good, you're not going to be able to deliver."
Now that people search is available, Berger suggested the company will one day offer agents that use it, though he didn't provide details on timing. He also said the recipe used for job and people search will be applied across the company's other products.
For enterprises building their own AI roadmaps, LinkedIn's playbook is clear:
- Be pragmatic: Don't try to boil the ocean. Win one vertical, even if it takes 18 months.
- Codify the "cookbook": Turn that win into a repeatable process (policy docs, distillation pipelines, co-design).
- Optimize relentlessly: The real 10x gains come after the initial model, in pruning, distillation, and creative optimizations like an RL-trained summarizer.
LinkedIn's journey shows that for real-world enterprise AI, emphasis on specific models or cool agentic systems should take a back seat. The durable, strategic advantage comes from mastering the pipeline — the 'AI-native' cookbook of co-design, distillation, and ruthless optimization.
(Editor's note: We will be publishing a full-length podcast with LinkedIn's Erran Berger, which will dive deeper into these technical details, on the VentureBeat podcast feed soon.)