Data Infrastructure Archives - Optimus Technology Group

Why most enterprise AI coding pilots underperform (Hint: It’s not the model)

Why most enterprise AI coding pilots underperform (Hint: It’s not the model)

Gen AI in software engineering has moved well beyond autocomplete. The emerging frontier is agentic coding: AI systems capable of planning changes, executing them across multiple steps and iterating based on feedback. Yet despite the excitement around “AI agents that code,” most enterprise deployments underperform. The limiting factor is no longer the model. It’s context: The structure, history and intent surrounding the code being changed. In other words, enterprises are now facing a systems design problem: They have not yet engineered the environment these agents operate in.

The shift from assistance to agency

The past year has seen a rapid evolution from assistive coding tools to agentic workflows. Research has begun to formalize what agentic behavior means in practice: The ability to reason across design, testing, execution and validation rather than generate isolated snippets. Work such as dynamic action re-sampling shows that allowing agents to branch, reconsider and revise their own decisions significantly improves outcomes in large, interdependent codebases. At the platform level, providers like GitHub are now building dedicated agent orchestration environments, such as Copilot Agent and Agent HQ, to support multi-agent collaboration inside real enterprise pipelines.

But early field results tell a cautionary story. When organizations introduce agentic tools without addressing workflow and environment, productivity can decline. A randomized control study this year showed that developers who used AI assistance in unchanged workflows completed tasks more slowly, largely due to verification, rework and confusion around intent. The lesson is straightforward: Autonomy without orchestration rarely yields efficiency.

Why context engineering is the real unlock

In every unsuccessful deployment I’ve observed, the failure stemmed from context. When agents lack a structured understanding of a codebase, specifically its relevant modules, dependency graph, test harness, architectural conventions and change history. They often generate output that appears correct but is disconnected from reality. Too much information overwhelms the agent; too little forces it to guess. The goal is not to feed the model more tokens. The goal is to determine what should be visible to the agent, when and in what form.

The teams seeing meaningful gains treat context as an engineering surface. They create tooling to snapshot, compact and version the agent’s working memory: What is persisted across turns, what is discarded, what is summarized and what is linked instead of inlined. They design deliberation steps rather than prompting sessions. They make the specification a first-class artifact, something reviewable, testable and owned, not a transient chat history. This shift aligns with a broader trend some researchers describe as “specs becoming the new source of truth.”

Workflow must change alongside tooling

But context alone isn’t enough. Enterprises must re-architect the workflows around these agents. As McKinsey’s 2025 report “One Year of Agentic AI” noted, productivity gains arise not from layering AI onto existing processes but from rethinking the process itself. When teams simply drop an agent into an unaltered workflow, they invite friction: Engineers spend more time verifying AI-written code than they would have spent writing it themselves. The agents can only amplify what’s already structured: Well-tested, modular codebases with clear ownership and documentation. Without those foundations, autonomy becomes chaos.

Security and governance, too, demand a shift in mindset. AI-generated code introduces new forms of risk: Unvetted dependencies, subtle license violations and undocumented modules that escape peer review. Mature teams are beginning to integrate agentic activity directly into their CI/CD pipelines, treating agents as autonomous contributors whose work must pass the same static analysis, audit logging and approval gates as any human developer. GitHub’s own documentation highlights this trajectory, positioning Copilot Agents not as replacements for engineers but as orchestrated participants in secure, reviewable workflows. The goal isn’t to let an AI “write everything,” but to ensure that when it acts, it does so inside defined guardrails.

What enterprise decision-makers should focus on now

For technical leaders, the path forward starts with readiness rather than hype. Monoliths with sparse tests rarely yield net gains; agents thrive where tests are authoritative and can drive iterative refinement. This is exactly the loop Anthropic calls out for coding agents. Pilots in tightly scoped domains (test generation, legacy modernization, isolated refactors); treat each deployment as an experiment with explicit metrics (defect escape rate, PR cycle time, change failure rate, security findings burned down). As your usage grows, treat agents as data infrastructure: Every plan, context snapshot, action log and test run is data that composes into a searchable memory of engineering intent, and a durable competitive advantage.

Under the hood, agentic coding is less a tooling problem than a data problem. Every context snapshot, test iteration and code revision becomes a form of structured data that must be stored, indexed and reused. As these agents proliferate, enterprises will find themselves managing an entirely new data layer: One that captures not just what was built, but how it was reasoned about. This shift turns engineering logs into a knowledge graph of intent, decision-making and validation. In time, the organizations that can search and replay this contextual memory will outpace those who still treat code as static text.

The coming year will likely determine whether agentic coding becomes a cornerstone of enterprise development or another inflated promise. The difference will hinge on context engineering: How intelligently teams design the informational substrate their agents rely on. The winners will be those who see autonomy not as magic, but as an extension of disciplined systems design:Clear workflows, measurable feedback, and rigorous governance.

Bottom line

Platforms are converging on orchestration and guardrails, and research keeps improving context control at inference time. The winners over the next 12 to 24 months won’t be the teams with the flashiest model; they’ll be the ones that engineer context as an asset and treat workflow as the product. Do that, and autonomy compounds. Skip it, and the review queue does.

Context + agent = leverage. Skip the first half, and the rest collapses.

Dhyey Mavani is accelerating generative AI at LinkedIn.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

read more
Ontology is the real guardrail: How to stop AI agents from misunderstanding your business

Ontology is the real guardrail: How to stop AI agents from misunderstanding your business

Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes.

While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the “meaning” of data in the context of a given businesis a different story. Enterprise data is mostly siloed into disparate systems in structured and unstructured forms and needs to be analyzed with a domain-specific business lens.s

As an example, the term “customer” may refer to a different group of people in a Sales CRM system, compared to a finance system which may use this tag for paying clients. One department might define “product” as a SKU; another may represent as a "product" family; a third as a marketing bundle.

Data about “product sales” thus varies in meaning without agreed upon relationships and definitions. For agents to combine data from multiple systems, they must understand different representations. Agents need to know what the data means in context and how to find the right data for the right process. Moreover, schema changes in systems and data quality issues during collection can lead to more ambiguity and inability of agents to know how to act when such situations are encountered.

Furthermore, classification of data into categories like PII (personally identifiable information) needs to be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires the data to be labelled correctly and agents to be able to understand and respect this classification. Hence we see that building a cool demo using agents is very much doable - but putting into production working on real business data is a different story altogether.

The ontology-based source of truth

Building effective agentic solutions requries an ontology-based single source of truth. Ontology is a business definition of concepts, their hierarchy and relationships. It defines terms with respect to business domains, can help establish a single-source of truth for data and capture uniform field names and apply classifications to fields.

An ontology may be domain-specific (healthcare or finance), or organization-specific based on internal structures. Defining an ontology upfront is time consuming, but can help standardize business processes and lay a strong foundation for agentic AI.

Ontology may be realized using common queryable formats like triplestore. More complex business rules with multi-hop relations could use a labelled property graphs like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Finance Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, these usually need to be customized to capture specific details of an enterprise.

Getting started with ontology

Once implemented, an ontology can be the driving force for enterprise agents. We can now prompt AI to follow the ontology and use it to discover data and relationships. If needed, we can have an agentic layer serve key details of the ontology itself and discover data. Business rules and policies can be implemented in this ontology for agents to adhere to. This is an excellent way to ground your agents and establish guardrails based on real business context.

Agents designed in this manner and tuned to follow an ontology can stick to guardrails and avoid hallucinations that can be caused by the large language models (LLM) powering them. For example, a business policy may define that unless all documents associated with a loan do not have verified flags set to "true," the loan status should be kept in “pending” state. Agents can work around this policy and determine what documents are needed and query the knowledge base.

Here's an example implementation:

(Original figure by Author)

As illustrated, we have structured and unstructured data processed by a document intelligence (DocIntel) agent which populates a Neo4j database based on an ontology of the business domain. A data discovery agent in Neo4j finds and queries the right data and passes it to other agents handling business process execution. The inter-agent communication happens with a popular protocol like A2A (agent to agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses from these agents.

With this method, we can avoid hallucinations by enforcing agents to follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents can automatically comply to, and control hallucinations by defining rules for the whole system rather than individual entities. For example, if an agent hallucinates an individual 'customer,' because the connected data for the hallucinated 'customer' will not be verifiable in the data discovery, we can easily detect this anomaly and plan to eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.

Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents directions to orchestrate complex business processes.

Dattaraj Rao is innovation and R&D architect at Persistent Systems.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

read more
ScaleOps’ new AI Infra Product slashes GPU costs for self-hosted enterprise LLMs by 50% for early adopters
ScaleOps’ new AI Infra Product slashes GPU costs for self-hosted enterprise LLMs by 50% for early adopters

ScaleOps has expanded its cloud resource management platform with a new product aimed at enterprises operating self-hosted large language models (LLMs) and GPU-based AI applications.

The AI Infra Product announced today, extends the company’s existing automation capabilities to address a growing need for efficient GPU utilization, predictable performance, and reduced operational burden in large-scale AI deployments.

The company said the system is already running in enterprise production environments and delivering major efficiency gains for early adopters, reducing GPU costs by between 50% and 70%, according to the company. The company does not publicly list enterprise pricing for this solution and instead invites interested customers to receive a custom quote based on their operation size and needs here.

In explaining how the system behaves under heavy load, Yodar Shafrir, CEO and Co-Founder of ScaleOps, said in an email to VentureBeat that the platform uses “proactive and reactive mechanisms to handle sudden spikes without performance impact,” noting that its workload rightsizing policies “automatically manage capacity to keep resources available.”

He added that minimizing GPU cold-start delays was a priority, emphasizing that the system “ensures instant response when traffic surges,” particularly for AI workloads where model load times are substantial.

Expanding Resource Automation to AI Infrastructure

Enterprises deploying self-hosted AI models face performance variability, long load times, and persistent underutilization of GPU resources. ScaleOps positioned the new AI Infra Product as a direct response to these issues.

The platform allocates and scales GPU resources in real time and adapts to changes in traffic demand without requiring alterations to existing model deployment pipelines or application code.

According to ScaleOps, the system manages production environments for organizations including Wiz, DocuSign, Rubrik, Coupa, Alkami, Vantor, Grubhub, Island, Chewy, and several Fortune 500 companies.

The AI Infra Product introduces workload-aware scaling policies that proactively and reactively adjust capacity to maintain performance during demand spikes. The company stated that these policies reduce the cold-start delays associated with loading large AI models, which improves responsiveness when traffic increases.

Technical Integration and Platform Compatibility

The product is designed for compatibility with common enterprise infrastructure patterns. It works across all Kubernetes distributions, major cloud platforms, on-premises data centers, and air-gapped environments. ScaleOps emphasized that deployment does not require code changes, infrastructure rewrites, or modifications to existing manifests.

Shafrir said the platform “integrates seamlessly into existing model deployment pipelines without requiring any code or infrastructure changes,” and he added that teams can begin optimizing immediately with their existing GitOps, CI/CD, monitoring, and deployment tooling.

Shafrir also addressed how the automation interacts with existing systems. He said the platform operates without disrupting workflows or creating conflicts with custom scheduling or scaling logic, explaining that the system “doesn’t change manifests or deployment logic” and instead enhances schedulers, autoscalers, and custom policies by incorporating real-time operational context while respecting existing configuration boundaries.

Performance, Visibility, and User Control

The platform provides full visibility into GPU utilization, model behavior, performance metrics, and scaling decisions at multiple levels, including pods, workloads, nodes, and clusters. While the system applies default workload scaling policies, ScaleOps noted that engineering teams retain the ability to tune these policies as needed.

In practice, the company aims to reduce or eliminate the manual tuning that DevOps and AIOps teams typically perform to manage AI workloads. Installation is intended to require minimal effort, described by ScaleOps as a two-minute process using a single helm flag, after which optimization can be enabled through a single action.

Cost Savings and Enterprise Case Studies

ScaleOps reported that early deployments of the AI Infra Product have achieved GPU cost reductions of 50–70% in customer environments. The company cited two examples:

A major creative software company operating thousands of GPUs averaged 20% utilization before adopting ScaleOps. The product increased utilization, consolidated underused capacity, and enabled GPU nodes to scale down. These changes reduced overall GPU spending by more than half. The company also reported a 35% reduction in latency for key workloads.

A global gaming company used the platform to optimize a dynamic LLM workload running on hundreds of GPUs. According to ScaleOps, the product increased utilization by a factor of seven while maintaining service-level performance. The customer projected $1.4 million in annual savings from this workload alone.

ScaleOps stated that the expected GPU savings typically outweigh the cost of adopting and operating the platform, and that customers with limited infrastructure budgets have reported fast returns on investment.

Industry Context and Company Perspective

The rapid adoption of self-hosted AI models has created new operational challenges for enterprises, particularly around GPU efficiency and the complexity of managing large-scale workloads. Shafrir described the broader landscape as one in which “cloud-native AI infrastructure is reaching a breaking point.”

“Cloud-native architectures unlocked great flexibility and control, but they also introduced a new level of complexity,” he said in the announcement. “Managing GPU resources at scale has become chaotic—waste, performance issues, and skyrocketing costs are now the norm. The ScaleOps platform was built to fix this. It delivers the complete solution for managing and optimizing GPU resources in cloud-native environments, enabling enterprises to run LLMs and AI applications efficiently, cost-effectively, and while improving performance.”

Shafrir added that the product brings together the full set of cloud resource management functions needed to manage diverse workloads at scale. The company positioned the platform as a holistic system for continuous, automated optimization.

A Unified Approach for the Future

With the addition of the AI Infra Product, ScaleOps aims to establish a unified approach to GPU and AI workload management that integrates with existing enterprise infrastructure.

The platform’s early performance metrics and reported cost savings suggest a focus on measurable efficiency improvements within the expanding ecosystem of self-hosted AI deployments.

read more
From shiny object to sober reality: The vector database story, two years later
From shiny object to sober reality: The vector database story, two years later

When I first wrote “Vector databases: Shiny object syndrome and the case of a missing unicorn” in March 2024, the industry was awash in hype. Vector databases were positioned as the next big thing — a must-have infrastructure layer for the gen AI era. Billions of venture dollars flowed, developers rushed to integrate embeddings into their pipelines and analysts breathlessly tracked funding rounds for Pinecone, Weaviate, Chroma, Milvus and a dozen others.

The promise was intoxicating: Finally, a way to search by meaning rather than by brittle keywords. Just dump your enterprise knowledge into a vector store, connect an LLM and watch magic happen.

Except the magic never fully materialized.

Two years on, the reality check has arrived: 95% of organizations invested in gen AI initiatives are seeing zero measurable returns. And, many of the warnings I raised back then — about the limits of vectors, the crowded vendor landscape and the risks of treating vector databases as silver bullets — have played out almost exactly as predicted.

Prediction 1: The missing unicorn

Back then, I questioned whether Pinecone — the poster child of the category — would achieve unicorn status or whether it would become the “missing unicorn” of the database world. Today, that question has been answered in the most telling way possible: Pinecone is reportedly exploring a sale, struggling to break out amid fierce competition and customer churn.

Yes, Pinecone raised big rounds and signed marquee logos. But in practice, differentiation was thin. Open-source players like Milvus, Qdrant and Chroma undercut them on cost. Incumbents like Postgres (with pgVector) and Elasticsearch simply added vector support as a feature. And customers increasingly asked: “Why introduce a whole new database when my existing stack already does vectors well enough?”

The result: Pinecone, once valued near a billion dollars, is now looking for a home. The missing unicorn indeed. In September 2025, Pinecone appointed Ash Ashutosh as CEO, with founder Edo Liberty moving to a chief scientist role. The timing is telling: The leadership change comes amid increasing pressure and questions over its long-term independence.

Prediction 2: Vectors alone won’t cut it

I also argued that vector databases by themselves were not an end solution. If your use case required exactness — l ike searching for “Error 221” in a manual—a pure vector search would gleefully serve up “Error 222” as “close enough.” Cute in a demo, catastrophic in production.

That tension between similarity and relevance has proven fatal to the myth of vector databases as all-purpose engines.

“Enterprises discovered the hard way that semantic ≠ correct.”

Developers who gleefully swapped out lexical search for vectors quickly reintroduced… lexical search in conjunction with vectors. Teams that expected vectors to “just work” ended up bolting on metadata filtering, rerankers and hand-tuned rules. By 2025, the consensus is clear: Vectors are powerful, but only as part of a hybrid stack.

Prediction 3: A crowded field becomes commoditized

The explosion of vector database startups was never sustainable. Weaviate, Milvus (via Zilliz), Chroma, Vespa, Qdrant — each claimed subtle differentiators, but to most buyers they all did the same thing: store vectors and retrieve nearest neighbors.

Today, very few of these players are breaking out. The market has fragmented, commoditized and in many ways been swallowed by incumbents. Vector search is now a checkbox feature in cloud data platforms, not a standalone moat.

Just as I wrote then: Distinguishing one vector DB from another will pose an increasing challenge. That challenge has only grown harder. Vald, Marqo, LanceDB, PostgresSQL, MySQL HeatWave, Oracle 23c, Azure SQL, Cassandra, Redis, Neo4j, SingleStore, ElasticSearch, OpenSearch, Apahce Solr… the list goes on.

The new reality: Hybrid and GraphRAG

But this isn’t just a story of decline — it’s a story of evolution. Out of the ashes of vector hype, new paradigms are emerging that combine the best of multiple approaches.

Hybrid Search: Keyword + vector is now the default for serious applications. Companies learned that you need both precision and fuzziness, exactness and semantics. Tools like Apache Solr, Elasticsearch, pgVector and Pinecone’s own “cascading retrieval” embrace this.

GraphRAG: The hottest buzzword of late 2024/2025 is GraphRAG — graph-enhanced retrieval augmented generation. By marrying vectors with knowledge graphs, GraphRAG encodes the relationships between entities that embeddings alone flatten away. The payoff is dramatic.

Benchmarks and evidence

Amazon’s AI blog cites benchmarks from Lettria, where hybrid GraphRAG boosted answer correctness from ~50% to 80%-plus in test datasets across finance, healthcare, industry, and law.

The GraphRAG-Bench benchmark (released May 2025) provides a rigorous evaluation of GraphRAG vs. vanilla RAG across reasoning tasks, multi-hop queries and domain challenges.

An OpenReview evaluation of RAG vs GraphRAG found that each approach has strengths depending on task — but hybrid combinations often perform best.

FalkorDB’s blog reports that when schema precision matters (structured domains), GraphRAG can outperform vector retrieval by a factor of ~3.4x on certain benchmarks.

The rise of GraphRAG underscores the larger point: Retrieval is not about any single shiny object. It’s about building retrieval systems — layered, hybrid, context-aware pipelines that give LLMs the right information, with the right precision, at the right time.

What this means going forward

The verdict is in: Vector databases were never the miracle. They were a step — an important one — in the evolution of search and retrieval. But they are not, and never were, the endgame.

The winners in this space won’t be those who sell vectors as a standalone database. They will be the ones who embed vector search into broader ecosystems — integrating graphs, metadata, rules and context engineering into cohesive platforms.

In other words: The unicorn isn’t the vector database. The unicorn is the retrieval stack.

Looking ahead: What’s next

Unified data platforms will subsume vector + graph: Expect major DB and cloud vendors to offer integrated retrieval stacks (vector + graph + full-text) as built-in capabilities.

“Retrieval engineering” will emerge as a distinct discipline: Just as MLOps matured, so too will practices around embedding tuning, hybrid ranking and graph construction.

Meta-models learning to query better: Future LLMs may learn to orchestrate which retrieval method to use per query, dynamically adjusting weighting.

Temporal and multimodal GraphRAG: Already, researchers are extending GraphRAG to be time-aware (T-GRAG) and multimodally unified (e.g. connecting images, text, video).

Open benchmarks and abstraction layers: Tools like BenchmarkQED (for RAG benchmarking) and GraphRAG-Bench will push the community toward fairer, comparably measured systems.

From shiny objects to essential infrastructure

The arc of the vector database story has followed a classic path: A pervasive hype cycle, followed by introspection, correction and maturation. In 2025, vector search is no longer the shiny object everyone pursues blindly — it’s now a critical building block within a more sophisticated, multi-pronged retrieval architecture.

The original warnings were right. Pure vector-based hopes often crash on the shoals of precision, relational complexity and enterprise constraints. Yet the technology was never wasted: It forced the industry to rethink retrieval, blending semantic, lexical and relational strategies.

If I were to write a sequel in 2027, I suspect it would frame vector databases not as unicorns, but as legacy infrastructure — foundational, but eclipsed by smarter orchestration layers, adaptive retrieval controllers and AI systems that dynamically choose which retrieval tool fits the query.

As of now, the real battle is not vector vs keyword — it’s the indirection, blending and discipline in building retrieval pipelines that reliably ground gen AI in facts and domain knowledge. That’s the unicorn we should be chasing now.

Amit Verma is head of engineering and AI Labs at Neuron7.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

read more
Inside LinkedIn’s generative AI cookbook: How it scaled people search to 1.3 billion users
Inside LinkedIn’s generative AI cookbook: How it scaled people search to 1.3 billion users

LinkedIn is launching its new AI-powered people search this week, after what seems like a very long wait for what should have been a natural offering for generative AI.

It comes a full three years after the launch of ChatGPT and six months after LinkedIn launched its AI job search offering. For technical leaders, this timeline illustrates a key enterprise lesson: Deploying generative AI in real enterprise settings is challenging, especially at a scale of 1.3 billion users. It’s a slow, brutal process of pragmatic optimization.

The following account is based on several exclusive interviews with the LinkedIn product and engineering team behind the launch.

First, here’s how the product works: A user can now type a natural language query like, "Who is knowledgeable about curing cancer?" into LinkedIn’s search bar.

LinkedIn's old search, based on keywords, would have been stumped. It would have looked only for references to "cancer". If a user wanted to get sophisticated, they would have had to run separate, rigid keyword searches for "cancer" and then "oncology" and manually try to piece the results together.

The new AI-powered system, however, understands the intent of the search because the LLM under the hood grasps semantic meaning. It recognizes, for example, that "cancer" is conceptually related to "oncology" and even less directly, to "genomics research." As a result, it surfaces a far more relevant list of people, including oncology leaders and researchers, even if their profiles don't use the exact word "cancer."

The system also balances this relevance with usefulness. Instead of just showing the world's top oncologist (who might be an unreachable third-degree connection), it will also weigh who in your immediate network — like a first-degree connection — is "pretty relevant" and can serve as a crucial bridge to that expert.

See the video below for an example.

Arguably, though, the more important lesson for enterprise practitioners is the "cookbook" LinkedIn has developed: a replicable, multi-stage pipeline of distillation, co-design, and relentless optimization. LinkedIn had to perfect this on one product before attempting it on another.

"Don't try to do too much all at once," writes Wenjing Zhang, LinkedIn's VP of Engineering, in a post about the product launch, and who also spoke with VentureBeat last week in an interview. She notes that an earlier "sprawling ambition" to build a unified system for all of LinkedIn's products "stalled progress."

Instead, LinkedIn focused on winning one vertical first. The success of its previously launched AI Job Search — which led to job seekers without a four-year degree being 10% more likely to get hired, according to VP of Product Engineering Erran Berger — provided the blueprint.

Now, the company is applying that blueprint to a far larger challenge. "It's one thing to be able to do this across tens of millions of jobs," Berger told VentureBeat. "It's another thing to do this across north of a billion members."

For enterprise AI builders, LinkedIn's journey provides a technical playbook for what it actually takes to move from a successful pilot to a billion-user-scale product.

The new challenge: a 1.3 billion-member graph

The job search product created a robust recipe that the new people search product could build upon, Berger explained.

The recipe started with with a "golden data set" of just a few hundred to a thousand real query-profile pairs, meticulously scored against a detailed 20- to 30-page "product policy" document. To scale this for training, LinkedIn used this small golden set to prompt a large foundation model to generate a massive volume of synthetic training data. This synthetic data was used to train a 7-billion-parameter "Product Policy" model — a high-fidelity judge of relevance that was too slow for live production but perfect for teaching smaller models.

However, the team hit a wall early on. For six to nine months, they struggled to train a single model that could balance strict policy adherence (relevance) against user engagement signals. The "aha moment" came when they realized they needed to break the problem down. They distilled the 7B policy model into a 1.7B teacher model focused solely on relevance. They then paired it with separate teacher models trained to predict specific member actions, such as job applications for the jobs product, or connecting and following for people search. This "multi-teacher" ensemble produced soft probability scores that the final student model learned to mimic via KL divergence loss.

The resulting architecture operates as a two-stage pipeline. First, a larger 8B parameter model handles broad retrieval, casting a wide net to pull candidates from the graph. Then, the highly distilled student model takes over for fine-grained ranking. While the job search product successfully deployed a 0.6B (600-million) parameter student, the new people search product required even more aggressive compression. As Zhang notes, the team pruned their new student model from 440M down to just 220M parameters, achieving the necessary speed for 1.3 billion users with less than 1% relevance loss.

But applying this to people search broke the old architecture. The new problem included not just ranking but also retrieval.

“A billion records," Berger said, is a "different beast."

The team’s prior retrieval stack was built on CPUs. To handle the new scale and the latency demands of a "snappy" search experience, the team had to move its indexing to GPU-based infrastructure. This was a foundational architectural shift that the job search product did not require.

Organizationally, LinkedIn benefited from multiple approaches. For a time, LinkedIn had two separate teams — job search and people search — attempting to solve the problem in parallel. But once the job search team achieved its breakthrough using the policy-driven distillation method, Berger and his leadership team intervened. They brought over the architects of the job search win — product lead Rohan Rajiv and engineering lead Wenjing Zhang — to transplant their 'cookbook' directly to the new domain.

Distilling for a 10x throughput gain

With the retrieval problem solved, the team faced the ranking and efficiency challenge. This is where the cookbook was adapted with new, aggressive optimization techniques.

Zhang’s technical post (I’ll insert the link once it goes live) provides the specific details our audience of AI engineers will appreciate. One of the more significant optimizations was input size.

To feed the model, the team trained another LLM with reinforcement learning (RL) for a single purpose: to summarize the input context. This "summarizer" model was able to reduce the model's input size by 20-fold with minimal information loss.

The combined result of the 220M-parameter model and the 20x input reduction? A 10x increase in ranking throughput, allowing the team to serve the model efficiently to its massive user base.

Pragmatism over hype: building tools, not agents

Throughout our discussions, Berger was adamant about something else that might catch peoples’ attention: The real value for enterprises today lies in perfecting recommender systems, not in chasing "agentic hype." He also refused to talk about the specific models that the company used for the searches, suggesting it almost doesn't matter. The company selects models based on which one it finds the most efficient for the task.

The new AI-powered people search is a manifestation of Berger’s philosophy that it’s best to optimize the recommender system first. The architecture includes a new "intelligent query routing layer," as Berger explained, that itself is LLM-powered. This router pragmatically decides if a user's query — like "trust expert" — should go to the new semantic, natural-language stack or to the old, reliable lexical search.

This entire, complex system is designed to be a "tool" that a future agent will use, not the agent itself.

"Agentic products are only as good as the tools that they use to accomplish tasks for people," Berger said. "You can have the world's best reasoning model, and if you're trying to use an agent to do people search but the people search engine is not very good, you're not going to be able to deliver."

Now that the people search is available, Berger suggested that one day the company will be offering agents to use it. But he didn’t provide details on timing. He also said the recipe used for job and people search will be spread across the company’s other products.

For enterprises building their own AI roadmaps, LinkedIn's playbook is clear:

Be pragmatic: Don't try to boil the ocean. Win one vertical, even if it takes 18 months.

Codify the "cookbook": Turn that win into a repeatable process (policy docs, distillation pipelines, co-design).

Optimize relentlessly: The real 10x gains come after the initial model, in pruning, distillation, and creative optimizations like an RL-trained summarizer.

LinkedIn's journey shows that for real-world enterprise AI, emphasis on specific models or cool agentic systems should take a back seat. The durable, strategic advantage comes from mastering the pipeline — the 'AI-native' cookbook of co-design, distillation, and ruthless optimization.

(Editor's note: We will be publishing a full-length podcast with LinkedIn's Erran Berger, which will dive deeper into these technical details, on the VentureBeat podcast feed soon.)

read more
How Deductive AI saved DoorDash 1,000 engineering hours by automating software debugging

How Deductive AI saved DoorDash 1,000 engineering hours by automating software debugging

As software systems grow more complex and AI tools generate code faster than ever, a fundamental problem is getting worse: Engineers are drowning in debugging work, spending up to half their time hunting down the causes of software failures instead of building new products. The challenge has become so acute that it's creating a new category of tooling — AI agents that can diagnose production failures in minutes instead of hours.

Deductive AI, a startup emerging from stealth mode Wednesday, believes it has found a solution by applying reinforcement learning — the same technology that powers game-playing AI systems — to the messy, high-stakes world of production software incidents. The company announced it has raised $7.5 million in seed funding led by CRV, with participation from Databricks Ventures, Thomvest Ventures, and PrimeSet, to commercialize what it calls "AI SRE agents" that can diagnose and help fix software failures at machine speed.

The pitch resonates with a growing frustration inside engineering organizations: Modern observability tools can show that something broke, but they rarely explain why. When a production system fails at 3 a.m., engineers still face hours of manual detective work, cross-referencing logs, metrics, deployment histories, and code changes across dozens of interconnected services to identify the root cause.

"The complexities and inter-dependencies of modern infrastructure means that investigating the root cause of an outage or incident can feel like searching for a needle in a haystack, except the haystack is the size of a football field, it's made of a million other needles, it's constantly reshuffling itself, and is on fire — and every second you don't find it equals lost revenue," said Sameer Agarwal, Deductive's co-founder and chief technology officer, in an exclusive interview with VentureBeat.

Deductive's system builds what the company calls a "knowledge graph" that maps relationships across codebases, telemetry data, engineering discussions, and internal documentation. When an incident occurs, multiple AI agents work together to form hypotheses, test them against live system evidence, and converge on a root cause — mimicking the investigative workflow of experienced site reliability engineers, but completing the process in minutes rather than hours.

The technology has already shown measurable impact at some of the world's most demanding production environments. DoorDash's advertising platform, which runs real-time auctions that must complete in under 100 milliseconds, has integrated Deductive into its incident response workflow. The company has set an ambitious 2026 goal of resolving production incidents within 10 minutes.

"Our Ads Platform operates at a pace where manual, slow-moving investigations are no longer viable. Every minute of downtime directly affects company revenue," said Shahrooz Ansari, Senior Director of Engineering at DoorDash, in an interview with VentureBeat. "Deductive has become a critical extension of our team, rapidly synthesizing signals across dozens of services and surfacing the insights that matter—within minutes."

DoorDash estimates that Deductive has root-caused approximately 100 production incidents over the past few months, translating to more than 1,000 hours of annual engineering productivity and a revenue impact "in millions of dollars," according to Ansari. At location intelligence company Foursquare, Deductive reduced the time to diagnose Apache Spark job failures by 90% —t urning a process that previously took hours or days into one that completes in under 10 minutes — while generating over $275,000 in annual savings.

Why AI-generated code is creating a debugging crisis

The timing of Deductive's launch reflects a brewing tension in software development: AI coding assistants are enabling engineers to generate code faster than ever, but the resulting software is often harder to understand and maintain.

"Vibe coding," a term popularized by AI researcher Andrej Karpathy, refers to using natural-language prompts to generate code through AI assistants. While these tools accelerate development, they can introduce what Agarwal describes as "redundancies, breaks in architectural boundaries, assumptions, or ignored design patterns" that accumulate over time.

"Most AI-generated code still introduces redundancies, breaks architectural boundaries, makes assumptions, or ignores established design patterns," Agarwal told Venturebeat. "In many ways, we now need AI to help clean up the mess that AI itself is creating."

The claim that engineers spend roughly half their time on debugging isn't hyperbole. The Association for Computing Machinery reports that developers spend 35% to 50% of their time validating and debugging software. More recently, Harness's State of Software Delivery 2025 report found that 67% of developers are spending more time debugging AI-generated code.

"We've seen world-class engineers spending half of their time debugging instead of building," said Rakesh Kothari, Deductive's co-founder and CEO. "And as vibe coding generates new code at a rate we've never seen, this problem is only going to get worse."

How Deductive's AI agents actually investigate production failures

Deductive's technical approach differs substantially from the AI features being added to existing observability platforms like Datadog or New Relic. Most of those systems use large language models to summarize data or identify correlations, but they lack what Agarwal calls "code-aware reasoning"—the ability to understand not just that something broke, but why the code behaves the way it does.

"Most enterprises use multiple observability tools across different teams and services, so no vendor has a single holistic view of how their systems behave, fail, and recover—nor are they able to pair that with an understanding of the code that defines system behavior," Agarwal explained. "These are key ingredients to resolving software incidents and it is exactly the gap Deductive fills."

The system connects to existing infrastructure using read-only API access to observability platforms, code repositories, incident management tools, and chat systems. It then continuously builds and updates its knowledge graph, mapping dependencies between services and tracking deployment histories.

When an alert fires, Deductive launches what the company describes as a multi-agent investigation. Different agents specialize in different aspects of the problem: one might analyze recent code changes, another examines trace data, while a third correlates the timing of the incident with recent deployments. The agents share findings and iteratively refine their hypotheses.

The critical difference from rule-based automation is Deductive's use of reinforcement learning. The system learns from every incident which investigative steps led to correct diagnoses and which were dead ends. When engineers provide feedback, the system incorporates that signal into its learning model.

"Each time it observes an investigation, it learns which steps, data sources, and decisions led to the right outcome," Agarwal said. "It learns how to think through problems, not just point them out."

At DoorDash, a recent latency spike in an API initially appeared to be an isolated service issue. Deductive's investigation revealed that the root cause was actually timeout errors from a downstream machine learning platform undergoing a deployment. The system connected these dots by analyzing log volumes, traces, and deployment metadata across multiple services.

"Without Deductive, our team would have had to manually correlate the latency spike across all logs, traces, and deployment histories," Ansari said. "Deductive was able to explain not just what changed, but how and why it impacted production behavior."

The company keeps humans in the loop—for now

While Deductive's technology could theoretically push fixes directly to production systems, the company has deliberately chosen to keep humans in the loop—at least for now.

"While our system is capable of deeper automation and could push fixes to production, currently, we recommend precise fixes and mitigations that engineers can review, validate, and apply," Agarwal said. "We believe maintaining a human in the loop is essential for trust, transparency and operational safety."

However, he acknowledged that "over time, we do think that deeper automation will come and how humans operate in the loop will evolve."

Databricks and ThoughtSpot veterans bet on reasoning over observability

The founding team brings deep expertise from building some of Silicon Valley's most successful data infrastructure platforms. Agarwal earned his Ph.D. at UC Berkeley, where he created BlinkDB, an influential system for approximate query processing. He was among the first engineers at Databricks, where he helped build Apache Spark. Kothari was an early engineer at ThoughtSpot, where he led teams focused on distributed query processing and large-scale system optimization.

The investor syndicate reflects both the technical credibility and market opportunity. Beyond CRV's Max Gazor, the round included participation from Ion Stoica, founder of Databricks and Anyscale; Ajeet Singh, founder of Nutanix and ThoughtSpot; and Ben Sigelman, founder of Lightstep.

Rather than competing with platforms like Datadog or PagerDuty, Deductive positions itself as a complementary layer that sits on top of existing tools. The pricing model reflects this: Instead of charging based on data volume, Deductive charges based on the number of incidents investigated, plus a base platform fee.

The company offers both cloud-hosted and self-hosted deployment options and emphasizes that it doesn't store customer data on its servers or use it to train models for other customers — a critical assurance given the proprietary nature of both code and production system behavior.

With fresh capital and early customer traction at companies like DoorDash, Foursquare, and Kumo AI, Deductive plans to expand its team and deepen the system's reasoning capabilities from reactive incident analysis to proactive prevention. The near-term vision: helping teams predict problems before they occur.

DoorDash's Ansari offers a pragmatic endorsement of where the technology stands today: "Investigations that were previously manual and time-consuming are now automated, allowing engineers to shift their energy toward prevention, business impact, and innovation."

In an industry where every second of downtime translates to lost revenue, that shift from firefighting to building increasingly looks less like a luxury and more like table stakes.

read more
Ship fast, optimize later: top AI engineers don’t care about cost — they’re prioritizing deployment

Ship fast, optimize later: top AI engineers don’t care about cost — they’re prioritizing deployment

Across industries, rising compute expenses are often cited as a barrier to AI adoption — but leading companies are finding that cost is no longer the real constraint.

The tougher challenges (and the ones top of mind for many tech leaders)? Latency, flexibility and capacity.

At Wonder, for instance, AI adds a mere few cents per order; the food delivery and takeout company is much more concerned with cloud capacity with skyrocketing demands. Recursion, for its part, has been focused on balancing small and larger-scale training and deployment via on-premises clusters and the cloud; this has afforded the biotech company flexibility for rapid experimentation.

The companies’ true in-the-wild experiences highlight a broader industry trend: For enterprises operating AI at scale, economics aren't the key decisive factor — the conversation has shifted from how to pay for AI to how fast it can be deployed and sustained.

AI leaders from the two companies recently sat down with Venturebeat’s CEO and editor-in-chief Matt Marshall as part of VB’s traveling AI Impact Series. Here’s what they shared.

Wonder: Rethink what you assume about capacity

Wonder uses AI to power everything from recommendations to logistics — yet, as of now, reported CTO James Chen, AI adds just a few cents per order.

Chen explained that the technology component of a meal order costs 14 cents, the AI adds 2 to 3 cents, although that’s “going up really rapidly” to 5 to 8 cents. Still, that seems almost immaterial compared to total operating costs.

Instead, the 100% cloud-native AI company’s main concern has been capacity with growing demand. Wonder was built with “the assumption” (which proved to be incorrect) that there would be “unlimited capacity” so they could move “super fast” and wouldn’t have to worry about managing infrastructure, Chen noted.

But the company has grown quite a bit over the last few years, he said; as a result, about six months ago, “we started getting little signals from the cloud providers, ‘Hey, you might need to consider going to region two,’” because they were running out of capacity for CPU or data storage at their facilities as demand grew.

It was “very shocking” that they had to move to plan B earlier than they anticipated. “Obviously it's good practice to be multi-region, but we were thinking maybe two more years down the road,” said Chen.

What's not economically feasible (yet)

Wonder built its own model to maximize its conversion rate, Chen noted; the goal is to surface new restaurants to relevant customers as much as possible. These are “isolated scenarios” where models are trained over time to be “very, very efficient and very fast.”

Currently, the best bet for Wonder’s use case is large models, Chen noted. But in the long term, they’d like to move to small models that are hyper-customized to individuals (via AI agents or concierges) based on their purchase history and even their clickstream. “Having these micro models is definitely the best, but right now the cost is very expensive,” Chen noted. “If you try to create one for each person, it's just not economically feasible.”

Budgeting is an art, not a science

Wonder gives its devs and data scientists as much playroom as possible to experiment, and internal teams review the costs of use to make sure nobody turned on a model and “jacked up massive compute around a huge bill,” said Chen.

The company is trying different things to offload to AI and operate within margins. “But then it's very hard to budget because you have no idea,” he said. One of the challenging things is the pace of development; when a new model comes out, “we can’t just sit there, right? We have to use it.”

Budgeting for the unknown economics of a token-based system is “definitely art versus science.”

A critical component in the software development lifecycle is preserving context when using large native models, he explained. When you find something that works, you can add it to your company’s “corpus of context” that can be sent with every request. That’s big and it costs money each time.

“Over 50%, up to 80% of your costs is just resending the same information back into the same engine again on every request,” said Chen.

In theory, the more they do should require less cost per unit. “I know when a transaction happens, I'll pay the X cent tax for each one, but I don't want to be limited to use the technology for all these other creative ideas."

The 'vindication moment' for Recursion

Recursion, for its part, has focused on meeting broad-ranging compute needs via a hybrid infrastructure of on-premise clusters and cloud inference.

When initially looking to build out its AI infrastructure, the company had to go with its own setup, as “the cloud providers didn't have very many good offerings,” explained CTO Ben Mabey. “The vindication moment was that we needed more compute and we looked to the cloud providers and they were like, ‘Maybe in a year or so.’”

The company’s first cluster in 2017 incorporated Nvidia gaming GPUs (1080s, launched in 2016); they have since added Nvidia H100s and A100s, and use a Kubernetes cluster that they run in the cloud or on-prem.

Addressing the longevity question, Mabey noted: “These gaming GPUs are actually still being used today, which is crazy, right? The myth that a GPU's life span is only three years, that's definitely not the case. A100s are still top of the list, they're the workhorse of the industry.”

Best use cases on-prem vs cloud; cost differences

More recently, Mabey’s team has been training a foundation model on Recursion’s image repository (which consists of petabytes of data and more than 200 pictures). This and other types of big training jobs have required a “massive cluster” and connected, multi-node setups.

“When we need that fully-connected network and access to a lot of our data in a high parallel file system, we go on-prem,” he explained. On the other hand, shorter workloads run in the cloud.

Recursion’s method is to “pre-empt” GPUs and Google tensor processing units (TPUs), which is the process of interrupting running GPU tasks to work on higher-priority ones. “Because we don't care about the speed in some of these inference workloads where we're uploading biological data, whether that's an image or sequencing data, DNA data,” Mabey explained. “We can say, ‘Give this to us in an hour,’ and we're fine if it kills the job.”

From a cost perspective, moving large workloads on-prem is “conservatively” 10 times cheaper, Mabey noted; for a five year TCO, it's half the cost. On the other hand, for smaller storage needs, the cloud can be “pretty competitive” cost-wise.

Ultimately, Mabey urged tech leaders to step back and determine whether they’re truly willing to commit to AI; cost-effective solutions typically require multi-year buy-ins.

“From a psychological perspective, I've seen peers of ours who will not invest in compute, and as a result they're always paying on demand," said Mabey. "Their teams use far less compute because they don't want to run up the cloud bill. Innovation really gets hampered by people not wanting to burn money.”

read more
Google debuts AI chips with 4X performance boost, secures Anthropic megadeal worth billions

Google debuts AI chips with 4X performance boost, secures Anthropic megadeal worth billions

Google Cloud is introducing what it calls its most powerful artificial intelligence infrastructure to date, unveiling a seventh-generation Tensor Processing Unit and expanded Arm-based computing options designed to meet surging demand for AI model deployment — what the company characterizes as a fundamental industry shift from training models to serving them to billions of users.

The announcement, made Thursday, centers on Ironwood, Google's latest custom AI accelerator chip, which will become generally available in the coming weeks. In a striking validation of the technology, Anthropic, the AI safety company behind the Claude family of models, disclosed plans to access up to one million of these TPU chips — a commitment worth tens of billions of dollars and among the largest known AI infrastructure deals to date.

The move underscores an intensifying competition among cloud providers to control the infrastructure layer powering artificial intelligence, even as questions mount about whether the industry can sustain its current pace of capital expenditure. Google's approach — building custom silicon rather than relying solely on Nvidia's dominant GPU chips — amounts to a long-term bet that vertical integration from chip design through software will deliver superior economics and performance.

Why companies are racing to serve AI models, not just train them

Google executives framed the announcements around what they call "the age of inference" — a transition point where companies shift resources from training frontier AI models to deploying them in production applications serving millions or billions of requests daily.

"Today's frontier models, including Google's Gemini, Veo, and Imagen and Anthropic's Claude train and serve on Tensor Processing Units," said Amin Vahdat, vice president and general manager of AI and Infrastructure at Google Cloud. "For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them."

This transition has profound implications for infrastructure requirements. Where training workloads can often tolerate batch processing and longer completion times, inference — the process of actually running a trained model to generate responses — demands consistently low latency, high throughput, and unwavering reliability. A chatbot that takes 30 seconds to respond, or a coding assistant that frequently times out, becomes unusable regardless of the underlying model's capabilities.

Agentic workflows — where AI systems take autonomous actions rather than simply responding to prompts — create particularly complex infrastructure challenges, requiring tight coordination between specialized AI accelerators and general-purpose computing.

Inside Ironwood's architecture: 9,216 chips working as one supercomputer

Ironwood is more than incremental improvement over Google's sixth-generation TPUs. According to technical specifications shared by the company, it delivers more than four times better performance for both training and inference workloads compared to its predecessor — gains that Google attributes to a system-level co-design approach rather than simply increasing transistor counts.

The architecture's most striking feature is its scale. A single Ironwood "pod" — a tightly integrated unit of TPU chips functioning as one supercomputer — can connect up to 9,216 individual chips through Google's proprietary Inter-Chip Interconnect network operating at 9.6 terabits per second. To put that bandwidth in perspective, it's roughly equivalent to downloading the entire Library of Congress in under two seconds.

This massive interconnect fabric allows the 9,216 chips to share access to 1.77 petabytes of High Bandwidth Memory — memory fast enough to keep pace with the chips' processing speeds. That's approximately 40,000 high-definition Blu-ray movies' worth of working memory, instantly accessible by thousands of processors simultaneously. "For context, that means Ironwood Pods can deliver 118x more FP8 ExaFLOPS versus the next closest competitor," Google stated in technical documentation.

The system employs Optical Circuit Switching technology that acts as a "dynamic, reconfigurable fabric." When individual components fail or require maintenance — inevitable at this scale — the OCS technology automatically reroutes data traffic around the interruption within milliseconds, allowing workloads to continue running without user-visible disruption.

This reliability focus reflects lessons learned from deploying five previous TPU generations. Google reported that its fleet-wide uptime for liquid-cooled systems has maintained approximately 99.999% availability since 2020 — equivalent to less than six minutes of downtime per year.

Anthropic's billion-dollar bet validates Google's custom silicon strategy

Perhaps the most significant external validation of Ironwood's capabilities comes from Anthropic's commitment to access up to one million TPU chips — a staggering figure in an industry where even clusters of 10,000 to 50,000 accelerators are considered massive.

"Anthropic and Google have a longstanding partnership and this latest expansion will help us continue to grow the compute we need to define the frontier of AI," said Krishna Rao, Anthropic's chief financial officer, in the official partnership agreement. "Our customers — from Fortune 500 companies to AI-native startups — depend on Claude for their most important work, and this expanded capacity ensures we can meet our exponentially growing demand."

According to a separate statement, Anthropic will have access to "well over a gigawatt of capacity coming online in 2026" — enough electricity to power a small city. The company specifically cited TPUs' "price-performance and efficiency" as key factors in the decision, along with "existing experience in training and serving its models with TPUs."

Industry analysts estimate that a commitment to access one million TPU chips, with associated infrastructure, networking, power, and cooling, likely represents a multi-year contract worth tens of billions of dollars — among the largest known cloud infrastructure commitments in history.

James Bradbury, Anthropic's head of compute, elaborated on the inference focus: "Ironwood's improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."

Google's Axion processors target the computing workloads that make AI possible

Alongside Ironwood, Google introduced expanded options for its Axion processor family — custom Arm-based CPUs designed for general-purpose workloads that support AI applications but don't require specialized accelerators.

The N4A instance type, now entering preview, targets what Google describes as "microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible." The company claims N4A delivers up to 2X better price-performance than comparable current-generation x86-based virtual machines.

Google is also previewing C4A metal, its first bare-metal Arm instance, which provides dedicated physical servers for specialized workloads such as Android development, automotive systems, and software with strict licensing requirements.

The Axion strategy reflects a growing conviction that the future of computing infrastructure requires both specialized AI accelerators and highly efficient general-purpose processors. While a TPU handles the computationally intensive task of running an AI model, Axion-class processors manage data ingestion, preprocessing, application logic, API serving, and countless other tasks in a modern AI application stack.

Early customer results suggest the approach delivers measurable economic benefits. Vimeo reported observing "a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs" in initial N4A tests. ZoomInfo measured "a 60% improvement in price-performance" for data processing pipelines running on Java services, according to Sergei Koren, the company's chief infrastructure architect.

Software tools turn raw silicon performance into developer productivity

Hardware performance means little if developers cannot easily harness it. Google emphasized that Ironwood and Axion are integrated into what it calls AI Hypercomputer — "an integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency."

According to an October 2025 IDC Business Value Snapshot study, AI Hypercomputer customers achieved on average 353% three-year return on investment, 28% lower IT costs, and 55% more efficient IT teams.

Google disclosed several software enhancements designed to maximize Ironwood utilization. Google Kubernetes Engine now offers advanced maintenance and topology awareness for TPU clusters, enabling intelligent scheduling and highly resilient deployments. The company's open-source MaxText framework now supports advanced training techniques including Supervised Fine-Tuning and Generative Reinforcement Policy Optimization.

Perhaps most significant for production deployments, Google's Inference Gateway intelligently load-balances requests across model servers to optimize critical metrics. According to Google, it can reduce time-to-first-token latency by 96% and serving costs by up to 30% through techniques like prefix-cache-aware routing.

The Inference Gateway monitors key metrics including KV cache hits, GPU or TPU utilization, and request queue length, then routes incoming requests to the optimal replica. For conversational AI applications where multiple requests might share context, routing requests with shared prefixes to the same server instance can dramatically reduce redundant computation.

The hidden challenge: powering and cooling one-megawatt server racks

Behind these announcements lies a massive physical infrastructure challenge that Google addressed at the recent Open Compute Project EMEA Summit. The company disclosed that it's implementing +/-400 volt direct current power delivery capable of supporting up to one megawatt per rack — a tenfold increase from typical deployments.

"The AI era requires even greater power delivery capabilities," explained Madhusudan Iyengar and Amber Huffman, Google principal engineers, in an April 2025 blog post. "ML will require more than 500 kW per IT rack before 2030."

Google is collaborating with Meta and Microsoft to standardize electrical and mechanical interfaces for high-voltage DC distribution. The company selected 400 VDC specifically to leverage the supply chain established by electric vehicles, "for greater economies of scale, more efficient manufacturing, and improved quality and scale."

On cooling, Google revealed it will contribute its fifth-generation cooling distribution unit design to the Open Compute Project. The company has deployed liquid cooling "at GigaWatt scale across more than 2,000 TPU Pods in the past seven years" with fleet-wide availability of approximately 99.999%.

Water can transport approximately 4,000 times more heat per unit volume than air for a given temperature change — critical as individual AI accelerator chips increasingly dissipate 1,000 watts or more.

Custom silicon gambit challenges Nvidia's AI accelerator dominance

Google's announcements come as the AI infrastructure market reaches an inflection point. While Nvidia maintains overwhelming dominance in AI accelerators — holding an estimated 80-95% market share — cloud providers are increasingly investing in custom silicon to differentiate their offerings and improve unit economics.

Amazon Web Services pioneered this approach with Graviton Arm-based CPUs and Inferentia / Trainium AI chips. Microsoft has developed Cobalt processors and is reportedly working on AI accelerators. Google now offers the most comprehensive custom silicon portfolio among major cloud providers.

The strategy faces inherent challenges. Custom chip development requires enormous upfront investment — often billions of dollars. The software ecosystem for specialized accelerators lags behind Nvidia's CUDA platform, which benefits from 15+ years of developer tools. And rapid AI model architecture evolution creates risk that custom silicon optimized for today's models becomes less relevant as new techniques emerge.

Yet Google argues its approach delivers unique advantages. "This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI," the company noted, referring to the seminal "Attention Is All You Need" paper from Google researchers in 2017.

The argument is that tight integration — "model research, software, and hardware development under one roof" — enables optimizations impossible with off-the-shelf components.

Beyond Anthropic, several other customers provided early feedback. Lightricks, which develops creative AI tools, reported that early Ironwood testing "makes us highly enthusiastic" about creating "more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers," said Yoav HaCohen, the company's research director.

Google's announcements raise questions that will play out over coming quarters. Can the industry sustain current infrastructure spending, with major AI companies collectively committing hundreds of billions of dollars? Will custom silicon prove economically superior to Nvidia GPUs? How will model architectures evolve?

For now, Google appears committed to a strategy that has defined the company for decades: building custom infrastructure to enable applications impossible on commodity hardware, then making that infrastructure available to customers who want similar capabilities without the capital investment.

As the AI industry transitions from research labs to production deployments serving billions of users, that infrastructure layer — the silicon, software, networking, power, and cooling that make it all run — may prove as important as the models themselves.

And if Anthropic's willingness to commit to accessing up to one million chips is any indication, Google's bet on custom silicon designed specifically for the age of inference may be paying off just as demand reaches its inflection point.

read more
Moving past speculation: How deterministic CPUs deliver predictable AI performance

Moving past speculation: How deterministic CPUs deliver predictable AI performance

For more than three decades, modern CPUs have relied on speculative execution to keep pipelines full. When it emerged in the 1990s, speculation was hailed as a breakthrough — just as pipelining and superscalar execution had been in earlier decades. Each marked a generational leap in microarchitecture. By predicting the outcomes of branches and memory loads, processors could avoid stalls and keep execution units busy.

But this architectural shift came at a cost: Wasted energy when predictions failed, increased complexity and vulnerabilities such as Spectre and Meltdown. These challenges set the stage for an alternative: A deterministic, time-based execution model. As David Patterson observed in 1980, “A RISC potentially gains in speed merely from a simpler design.” Patterson’s principle of simplicity underpins a new alternative to speculation: A deterministic, time-based execution model."

For the first time since speculative execution became the dominant paradigm, a fundamentally new approach has been invented. This breakthrough is embodied in a series of six recently issued U.S. patents, sailing through the U.S. Patent and Trademark Office (USPTO). Together, they introduce a radically different instruction execution model. Departing sharply from conventional speculative techniques, this novel deterministic framework replaces guesswork with a time-based, latency-tolerant mechanism. Each instruction is assigned a precise execution slot within the pipeline, resulting in a rigorously ordered and predictable flow of execution. This reimagined model redefines how modern processors can handle latency and concurrency with greater efficiency and reliability.

A simple time counter is used to deterministically set the exact time of when instructions should be executed in the future. Each instruction is dispatched to an execution queue with a preset execution time based on resolving its data dependencies and availability of resources — read buses, execution units and the write bus to the register file. Each instruction remains queued until its scheduled execution slot arrives. This new deterministic approach may represent the first major architectural challenge to speculation since it became the standard.

The architecture extends naturally into matrix computation, with a RISC-V instruction set proposal under community review. Configurable general matrix multiply (GEMM) units, ranging from 8×8 to 64×64, can operate using either register-based or direct-memory acceess (DMA)-fed operands. This flexibility supports a wide range of AI and high-performance computing (HPC) workloads. Early analysis suggests scalability that rivals Google’s TPU cores, while maintaining significantly lower cost and power requirements.

Rather than a direct comparison with general-purpose CPUs, the more accurate reference point is vector and matrix engines: Traditional CPUs still depend on speculation and branch prediction, whereas this design applies deterministic scheduling directly to GEMM and vector units. This efficiency stems not only from the configurable GEMM blocks but also from the time-based execution model, where instructions are decoded and assigned precise execution slots based on operand readiness and resource availability.

Execution is never a random or heuristic choice among many candidates, but a predictable, pre-planned flow that keeps compute resources continuously busy. Planned matrix benchmarks will provide direct comparisons with TPU GEMM implementations, highlighting the ability to deliver datacenter-class performance without datacenter-class overhead.

Critics may argue that static scheduling introduces latency into instruction execution. In reality, the latency already exists — waiting on data dependencies or memory fetches. Conventional CPUs attempt to hide it with speculation, but when predictions fail, the resulting pipeline flush introduces delay and wastes power.

The time-counter approach acknowledges this latency and fills it deterministically with useful work, avoiding rollbacks. As the first patent notes, instructions retain out-of-order efficiency: “A microprocessor with a time counter for statically dispatching instructions enables execution based on predicted timing rather than speculative issue and recovery," with preset execution times but without the overhead of register renaming or speculative comparators.

Why speculation stalled

Speculative execution boosts performance by predicting outcomes before they’re known — executing instructions ahead of time and discarding them if the guess was wrong. While this approach can accelerate workloads, it also introduces unpredictability and power inefficiency. Mispredictions inject “No Ops” into the pipeline, stalling progress and wasting energy on work that never completes.

These issues are magnified in modern AI and machine learning (ML) workloads, where vector and matrix operations dominate and memory access patterns are irregular. Long fetches, non-cacheable loads and misaligned vectors frequently trigger pipeline flushes in speculative architectures.

The result is performance cliffs that vary wildly across datasets and problem sizes, making consistent tuning nearly impossible. Worse still, speculative side effects have exposed vulnerabilities that led to high-profile security exploits. As data intensity grows and memory systems strain, speculation struggles to keep pace — undermining its original promise of seamless acceleration.

Time-based execution and deterministic scheduling

At the core of this invention is a vector coprocessor with a time counter for statically dispatching instructions. Rather than relying on speculation, instructions are issued only when data dependencies and latency windows are fully known. This eliminates guesswork and costly pipeline flushes while preserving the throughput advantages of out-of-order execution. Architectures built on this patented framework feature deep pipelines — typically spanning 12 stages — combined with wide front ends supporting up to 8-way decode and large reorder buffers exceeding 250 entries

As illustrated in Figure 1, the architecture mirrors a conventional RISC-V processor at the top level, with instruction fetch and decode stages feeding into execution units. The innovation emerges in the integration of a time counter and register scoreboard, strategically positioned between fetch/decode and the vector execution units. Instead of relying on speculative comparators or register renaming, they utilize a Register Scoreboard and Time Resource Matrix (TRM) to deterministically schedule instructions based on operand readiness and resource availability.

Figure 1: High-level block diagram of deterministic processor. A time counter and scoreboard sit between fetch/decode and vector execution units, ensuring instructions issue only when operands are ready.

A typical program running on the deterministic processor begins much like it does on any conventional RISC-V system: Instructions are fetched from memory and decoded to determine whether they are scalar, vector, matrix or custom extensions. The difference emerges at the point of dispatch. Instead of issuing instructions speculatively, the processor employs a cycle-accurate time counter, working with a register scoreboard, to decide exactly when each instruction can be executed. This mechanism provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots.

In conjunction with a register scoreboard, the time-resource matrix associates instructions with execution cycles, allowing the processor to plan dispatch deterministically across available resources. The scoreboard tracks operand readiness and hazard information, enabling scheduling without register renaming or speculative comparators. By monitoring dependencies such as read-after-write (RAW) and write-after-read, it ensures hazards are resolved without costly pipeline flushes. As noted in the patent, “in a multi-threaded microprocessor, the time counter and scoreboard permit rescheduling around cache misses, branch flushes, and RAW hazards without speculative rollback.”

Once operands are ready, the instruction is dispatched to the appropriate execution unit. Scalar operations use standard artithmetic logic units (ALUs), while vector and matrix instructions execute in wide execution units connected to a large vector register file. Because instructions launch only when conditions are safe, these units stay highly utilized without the wasted work or recovery cycles caused by mis-predicted speculation.

The key enabler of this approach is a simple time counter that orchestrates execution according to data readiness and resource availability, ensuring instructions advance only when operands are ready and resources available. The same principle applies to memory operations: The interface predicts latency windows for loads and stores, allowing the processor to fill those slots with independent instructions and keep execution flowing.

Programming model differences

From the programmer’s perspective, the flow remains familiar — RISC-V code compiles and executes in the usual way. The crucial difference lies in the execution contract: Rather than relying on dynamic speculation to hide latency, the processor guarantees predictable dispatch and completion times. This eliminates the performance cliffs and wasted energy of speculation while still providing the throughput benefits of out-of-order execution.

This perspective underscores how deterministic execution preserves the familiar RISC-V programming model while eliminating the unpredictability and wasted effort of speculation. As John Hennessy put it: "It’s stupid to do work in run time that you can do in compile time”— a remark reflecting the foundations of RISC and its forward-looking design philosophy.

The RISC-V ISA provides opcodes for custom and extension instructions, including floating-point, DSP, and vector operations. The result is a processor that executes instructions deterministically while retaining the benefits of out-of-order performance. By eliminating speculation, the design simplifies hardware, reduces power consumption and avoids pipeline flushes.

These efficiency gains grow even more significant in vector and matrix operations, where wide execution units require consistent utilization to reach peak performance. Vector extensions require wide register files and large execution units, which in speculative processors necessitate expensive register renaming to recover from branch mispredictions. In the deterministic design, vector instructions are executed only after commit, eliminating the need for renaming.

Each instruction is scheduled against a cycle-accurate time counter: “The time counter provides a deterministic execution contract, ensuring instructions complete at predictable cycles and reducing wasted issue slots.” The vector register scoreboard resolves data dependency before issuing instructions to execution pipeline. Instructions are dispatched in a known order at the correct cycle, making execution both predictable and efficient.

Vector execution units (integer and floating point) connect directly to a large vector register file. Because instructions are never flushed, there is no renaming overhead. The scoreboard ensures safe access, while the time counter aligns execution with memory readiness. A dedicated memory block predicts the return cycle of loads. Instead of stalling or speculating, the processor schedules independent instructions into latency slots, keeping execution units busy. “A vector coprocessor with a time counter for statically dispatching instructions ensures high utilization of wide execution units while avoiding misprediction penalties.”

In today’s CPUs, compilers and programmers write code assuming the hardware will dynamically reorder instructions and speculatively execute branches. The hardware handles hazards with register renaming, branch prediction and recovery mechanisms. Programmers benefit from performance, but at the cost of unpredictability and power consumption.

In the deterministic time-based architecture, instructions are dispatched only when the time counter indicates their operands will be ready. This means the compiler (or runtime system) doesn’t need to insert guard code for misprediction recovery. Instead, compiler scheduling becomes simpler, as instructions are guaranteed to issue at the correct cycle without rollbacks. For programmers, the ISA remains RISC-V compatible, but deterministic extensions reduce reliance on speculative safety nets.

Application in AI and ML

In AI/ML kernels, vector loads and matrix operations often dominate runtime. On a speculative CPU, misaligned or non-cacheable loads can trigger stalls or flushes, starving wide vector and matrix units and wasting energy on discarded work. A deterministic design instead issues these operations with cycle-accurate timing, ensuring high utilization and steady throughput. For programmers, this means fewer performance cliffs and more predictable scaling across problem sizes. And because the patents extend the RISC-V ISA rather than replace it, deterministic processors remain fully compatible with the RVA23 profile and mainstream toolchains such as GCC, LLVM, FreeRTOS, and Zephyr.

In practice, the deterministic model doesn’t change how code is written — it remains RISC-V assembly or high-level languages compiled to RISC-V instructions. What changes is the execution contract: Rather than relying on speculative guesswork, programmers can expect predictable latency behavior and higher efficiency without tuning code around microarchitectural quirks.

The industry is at an inflection point. AI/ML workloads are dominated by vector and matrix math, where GPUs and TPUs excel — but only by consuming massive power and adding architectural complexity. In contrast, general-purpose CPUs, still tied to speculative execution models, lag behind.

A deterministic processor delivers predictable performance across a wide range of workloads, ensuring consistent behavior regardless of task complexity. Eliminating speculative execution enhances energy efficiency and avoids unnecessary computational overhead. Furthermore, deterministic design scales naturally to vector and matrix operations, making it especially well-suited for AI workloads that rely on high-throughput parallelism. This new deterministic approach may represent the next such leap: The first major architectural challenge to speculation since speculation itself became the standard.

Will deterministic CPUs replace speculation in mainstream computing? That remains to be seen. But with issued patents, proven novelty and growing pressure from AI workloads, the timing is right for a paradigm shift. Taken together, these advances signal deterministic execution as the next architectural leap — redefining performance and efficiency just as speculation once did.

Speculation marked the last revolution in CPU design; determinism may well represent the next.

Thang Tran is the founder and CTO of Simplex Micro.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

read more
Thinking Machines challenges OpenAI’s AI scaling strategy: ‘First superintelligence will be a superhuman learner’

Thinking Machines challenges OpenAI’s AI scaling strategy: ‘First superintelligence will be a superhuman learner’

While the world's leading artificial intelligence companies race to build ever-larger models, betting billions that scale alone will unlock artificial general intelligence, a researcher at one of the industry's most secretive and valuable startups delivered a pointed challenge to that orthodoxy this week: The path forward isn't about training bigger — it's about learning better.

"I believe that the first superintelligence will be a superhuman learner," Rafael Rafailov, a reinforcement learning researcher at Thinking Machines Lab, told an audience at TED AI San Francisco on Tuesday. "It will be able to very efficiently figure out and adapt, propose its own theories, propose experiments, use the environment to verify that, get information, and iterate that process."

This breaks sharply with the approach pursued by OpenAI, Anthropic, Google DeepMind, and other leading laboratories, which have bet billions on scaling up model size, data, and compute to achieve increasingly sophisticated reasoning capabilities. Rafailov argues these companies have the strategy backwards: what's missing from today's most advanced AI systems isn't more scale — it's the ability to actually learn from experience.

"Learning is something an intelligent being does," Rafailov said, citing a quote he described as recently compelling. "Training is something that's being done to it."

The distinction cuts to the core of how AI systems improve — and whether the industry's current trajectory can deliver on its most ambitious promises. Rafailov's comments offer a rare window into the thinking at Thinking Machines Lab, the startup co-founded in February by former OpenAI chief technology officer Mira Murati that raised a record-breaking $2 billion in seed funding at a $12 billion valuation.

Why today's AI coding assistants forget everything they learned yesterday

To illustrate the problem with current AI systems, Rafailov offered a scenario familiar to anyone who has worked with today's most advanced coding assistants.

"If you use a coding agent, ask it to do something really difficult — to implement a feature, go read your code, try to understand your code, reason about your code, implement something, iterate — it might be successful," he explained. "And then come back the next day and ask it to implement the next feature, and it will do the same thing."

The issue, he argued, is that these systems don't internalize what they learn. "In a sense, for the models we have today, every day is their first day of the job," Rafailov said. "But an intelligent being should be able to internalize information. It should be able to adapt. It should be able to modify its behavior so every day it becomes better, every day it knows more, every day it works faster — the way a human you hire gets better at the job."

The duct tape problem: How current training methods teach AI to take shortcuts instead of solving problems

Rafailov pointed to a specific behavior in coding agents that reveals the deeper problem: their tendency to wrap uncertain code in try/except blocks — a programming construct that catches errors and allows a program to continue running.

"If you use coding agents, you might have observed a very annoying tendency of them to use try/except pass," he said. "And in general, that is basically just like duct tape to save the entire program from a single error."

Why do agents do this? "They do this because they understand that part of the code might not be right," Rafailov explained. "They understand there might be something wrong, that it might be risky. But under the limited constraint—they have a limited amount of time solving the problem, limited amount of interaction—they must only focus on their objective, which is implement this feature and solve this bug."

The result: "They're kicking the can down the road."

This behavior stems from training systems that optimize for immediate task completion. "The only thing that matters to our current generation is solving the task," he said. "And anything that's general, anything that's not related to just that one objective, is a waste of computation."

Why throwing more compute at AI won't create superintelligence, according to Thinking Machines researcher

Rafailov's most direct challenge to the industry came in his assertion that continued scaling won't be sufficient to reach AGI.

"I don't believe we're hitting any sort of saturation points," he clarified. "I think we're just at the beginning of the next paradigm—the scale of reinforcement learning, in which we move from teaching our models how to think, how to explore thinking space, into endowing them with the capability of general agents."

In other words, current approaches will produce increasingly capable systems that can interact with the world, browse the web, write code. "I believe a year or two from now, we'll look at our coding agents today, research agents or browsing agents, the way we look at summarization models or translation models from several years ago," he said.

But general agency, he argued, is not the same as general intelligence. "The much more interesting question is: Is that going to be AGI? And are we done — do we just need one more round of scaling, one more round of environments, one more round of RL, one more round of compute, and we're kind of done?"

His answer was unequivocal: "I don't believe this is the case. I believe that under our current paradigms, under any scale, we are not enough to deal with artificial general intelligence and artificial superintelligence. And I believe that under our current paradigms, our current models will lack one core capability, and that is learning."

Teaching AI like students, not calculators: The textbook approach to machine learning

To explain the alternative approach, Rafailov turned to an analogy from mathematics education.

"Think about how we train our current generation of reasoning models," he said. "We take a particular math problem, make it very hard, and try to solve it, rewarding the model for solving it. And that's it. Once that experience is done, the model submits a solution. Anything it discovers—any abstractions it learned, any theorems—we discard, and then we ask it to solve a new problem, and it has to come up with the same abstractions all over again."

That approach misunderstands how knowledge accumulates. "This is not how science or mathematics works," he said. "We build abstractions not necessarily because they solve our current problems, but because they're important. For example, we developed the field of topology to extend Euclidean geometry — not to solve a particular problem that Euclidean geometry couldn't handle, but because mathematicians and physicists understood these concepts were fundamentally important."

The solution: "Instead of giving our models a single problem, we might give them a textbook. Imagine a very advanced graduate-level textbook, and we ask our models to work through the first chapter, then the first exercise, the second exercise, the third, the fourth, then move to the second chapter, and so on—the way a real student might teach themselves a topic."

The objective would fundamentally change: "Instead of rewarding their success — how many problems they solved — we need to reward their progress, their ability to learn, and their ability to improve."

This approach, known as "meta-learning" or "learning to learn," has precedents in earlier AI systems. "Just like the ideas of scaling test-time compute and search and test-time exploration played out in the domain of games first" — in systems like DeepMind's AlphaGo — "the same is true for meta learning. We know that these ideas do work at a small scale, but we need to adapt them to the scale and the capability of foundation models."

The missing ingredients for AI that truly learns aren't new architectures—they're better data and smarter objectives

When Rafailov addressed why current models lack this learning capability, he offered a surprisingly straightforward answer.

"Unfortunately, I think the answer is quite prosaic," he said. "I think we just don't have the right data, and we don't have the right objectives. I fundamentally believe a lot of the core architectural engineering design is in place."

Rather than arguing for entirely new model architectures, Rafailov suggested the path forward lies in redesigning the data distributions and reward structures used to train models.

"Learning, in of itself, is an algorithm," he explained. "It has inputs — the current state of the model. It has data and compute. You process it through some sort of structure, choose your favorite optimization algorithm, and you produce, hopefully, a stronger model."

The question: "If reasoning models are able to learn general reasoning algorithms, general search algorithms, and agent models are able to learn general agency, can the next generation of AI learn a learning algorithm itself?"

His answer: "I strongly believe that the answer to this question is yes."

The technical approach would involve creating training environments where "learning, adaptation, exploration, and self-improvement, as well as generalization, are necessary for success."

"I believe that under enough computational resources and with broad enough coverage, general purpose learning algorithms can emerge from large scale training," Rafailov said. "The way we train our models to reason in general over just math and code, and potentially act in general domains, we might be able to teach them how to learn efficiently across many different applications."

Forget god-like reasoners: The first superintelligence will be a master student

This vision leads to a fundamentally different conception of what artificial superintelligence might look like.

"I believe that if this is possible, that's the final missing piece to achieve truly efficient general intelligence," Rafailov said. "Now imagine such an intelligence with the core objective of exploring, learning, acquiring information, self-improving, equipped with general agency capability—the ability to understand and explore the external world, the ability to use computers, ability to do research, ability to manage and control robots."

Such a system would constitute artificial superintelligence. But not the kind often imagined in science fiction.

"I believe that intelligence is not going to be a single god model that's a god-level reasoner or a god-level mathematical problem solver," Rafailov said. "I believe that the first superintelligence will be a superhuman learner, and it will be able to very efficiently figure out and adapt, propose its own theories, propose experiments, use the environment to verify that, get information, and iterate that process."

This vision stands in contrast to OpenAI's emphasis on building increasingly powerful reasoning systems, or Anthropic's focus on "constitutional AI." Instead, Thinking Machines Lab appears to be betting that the path to superintelligence runs through systems that can continuously improve themselves through interaction with their environment.

The $12 billion bet on learning over scaling faces formidable challenges

Rafailov's appearance comes at a complex moment for Thinking Machines Lab. The company has assembled an impressive team of approximately 30 researchers from OpenAI, Google, Meta, and other leading labs. But it suffered a setback in early October when Andrew Tulloch, a co-founder and machine learning expert, departed to return to Meta after the company launched what The Wall Street Journal called a "full-scale raid" on the startup, approaching more than a dozen employees with compensation packages ranging from $200 million to $1.5 billion over multiple years.

Despite these pressures, Rafailov's comments suggest the company remains committed to its differentiated technical approach. The company launched its first product, Tinker, an API for fine-tuning open-source language models, in October. But Rafailov's talk suggests Tinker is just the foundation for a much more ambitious research agenda focused on meta-learning and self-improving systems.

"This is not easy. This is going to be very difficult," Rafailov acknowledged. "We'll need a lot of breakthroughs in memory and engineering and data and optimization, but I think it's fundamentally possible."

He concluded with a play on words: "The world is not enough, but we need the right experiences, and we need the right type of rewards for learning."

The question for Thinking Machines Lab — and the broader AI industry — is whether this vision can be realized, and on what timeline. Rafailov notably did not offer specific predictions about when such systems might emerge.

In an industry where executives routinely make bold predictions about AGI arriving within years or even months, that restraint is notable. It suggests either unusual scientific humility — or an acknowledgment that Thinking Machines Lab is pursuing a much longer, harder path than its competitors.

For now, the most revealing detail may be what Rafailov didn't say during his TED AI presentation. No timeline for when superhuman learners might emerge. No prediction about when the technical breakthroughs would arrive. Just a conviction that the capability was "fundamentally possible" — and that without it, all the scaling in the world won't be enough.

read more

The shift from assistance to agency

Why context engineering is the real unlock

Workflow must change alongside tooling

What enterprise decision-makers should focus on now

Bottom line

The ontology-based source of truth

Getting started with ontology

Expanding Resource Automation to AI Infrastructure

Technical Integration and Platform Compatibility

Performance, Visibility, and User Control

Cost Savings and Enterprise Case Studies

Industry Context and Company Perspective

A Unified Approach for the Future

Prediction 1: The missing unicorn

Prediction 2: Vectors alone won’t cut it

Prediction 3: A crowded field becomes commoditized

The new reality: Hybrid and GraphRAG

Benchmarks and evidence

What this means going forward

Looking ahead: What’s next

From shiny objects to essential infrastructure

The new challenge: a 1.3 billion-member graph

Distilling for a 10x throughput gain

Pragmatism over hype: building tools, not agents

Why AI-generated code is creating a debugging crisis

How Deductive's AI agents actually investigate production failures

The company keeps humans in the loop—for now

Databricks and ThoughtSpot veterans bet on reasoning over observability

Wonder: Rethink what you assume about capacity

What's not economically feasible (yet)

Budgeting is an art, not a science

The 'vindication moment' for Recursion

Best use cases on-prem vs cloud; cost differences

Why companies are racing to serve AI models, not just train them

Inside Ironwood's architecture: 9,216 chips working as one supercomputer

Anthropic's billion-dollar bet validates Google's custom silicon strategy

Google's Axion processors target the computing workloads that make AI possible

Software tools turn raw silicon performance into developer productivity

The hidden challenge: powering and cooling one-megawatt server racks

Custom silicon gambit challenges Nvidia's AI accelerator dominance

Why speculation stalled

Time-based execution and deterministic scheduling

Programming model differences

Application in AI and ML

Why today's AI coding assistants forget everything they learned yesterday

The duct tape problem: How current training methods teach AI to take shortcuts instead of solving problems

Why throwing more compute at AI won't create superintelligence, according to Thinking Machines researcher

Teaching AI like students, not calculators: The textbook approach to machine learning

The missing ingredients for AI that truly learns aren't new architectures—they're better data and smarter objectives

Forget god-like reasoners: The first superintelligence will be a master student

The $12 billion bet on learning over scaling faces formidable challenges

Useful Links

Useful Links

AVADA IT

RECENT TWEETS

CONTACT US