João Freitas is GM and VP of engineering for AI and automation at PagerDuty.
As AI use continues to evolve in large organizations, leaders are increasingly seeking the next development that will yield major ROI. The latest wave of this ongoing trend is the adoption of AI agents. However, as with any new technology, organizations must ensure they adopt AI agents in a responsible way that allows them to facilitate both speed and security.
More than half of organizations have already deployed AI agents to some extent, with more expecting to follow suit in the next two years. But many early adopters are now reevaluating their approach. Four in 10 tech leaders regret not establishing a stronger governance foundation from the start, which suggests they adopted AI rapidly but left room to improve the policies, rules and best practices designed to ensure the responsible, ethical and legal development and use of AI.
As AI adoption accelerates, organizations must find the right balance between their exposure risk and the implementation of guardrails to ensure AI use is secure.
Where do AI agents create potential risks?
There are three principal areas of consideration for safer AI adoption.
The first is shadow AI: Employees using unauthorized AI tools without express permission, bypassing approved tools and processes. While shadow AI has existed as long as AI tools themselves, AI agent autonomy makes it easier for unsanctioned tools to operate outside the purview of IT, which can introduce fresh security risks. IT should therefore create the necessary processes for experimentation and innovation, giving employees sanctioned, more efficient ways of working with AI.
Secondly, organizations must close gaps in AI ownership and accountability to prepare for incidents or processes gone wrong. The strength of AI agents lies in their autonomy. However, if agents act in unexpected ways, teams must be able to determine who is responsible for addressing any issues.
The third risk arises when there is a lack of explainability for actions AI agents have taken. AI agents are goal-oriented, but how they accomplish their goals can be unclear. AI agents must have explainable logic underlying their actions so that engineers can trace and, if needed, roll back actions that may cause issues with existing systems.
While none of these risks should delay adoption, accounting for them will help organizations better ensure their security.
The three guidelines for responsible AI agent adoption
Once organizations have identified the risks AI agents can pose, they must implement guidelines and guardrails to ensure safe usage. By following these three steps, organizations can minimize these risks.
1: Make human oversight the default
AI agency continues to evolve at a fast pace. However, we still need human oversight when AI agents are given the capacity to act, make decisions and pursue a goal that may impact key systems. A human should be in the loop by default, especially for business-critical use cases and systems. The teams that use AI must understand the actions it may take and where they may need to intervene. Start conservatively and, over time, increase the level of agency given to AI agents.
In conjunction, operations teams, engineers and security professionals must understand the role they play in supervising AI agents’ workflows. Each agent should be assigned a specific human owner for clearly defined oversight and accountability. Organizations must also allow any human to flag or override an AI agent’s behavior when an action has a negative outcome.
When considering tasks for AI agents, organizations should understand that, while traditional automation is good at handling repetitive, rule-based processes with structured data inputs, AI agents can handle much more complex tasks and adapt to new information in a more autonomous way. This makes them an appealing solution for all sorts of tasks. But as AI agents are deployed, organizations should control what actions the agents can take, particularly in the early stages of a project. Thus, teams working with AI agents should have approval paths in place for high-impact actions to ensure agent scope does not extend beyond expected use cases, minimizing risk to the wider system.
2: Bake in security
The introduction of new tools should not expose a system to fresh security risks.
Organizations should consider agentic platforms that comply with high security standards and are validated by enterprise-grade certifications such as SOC 2, FedRAMP or equivalent. Further, AI agents should not be allowed free rein across an organization’s systems. At a minimum, the permissions and security scope of an AI agent must be aligned with those of its owner, and any tools added to the agent should not extend its permissions. Limiting an AI agent’s access to a system based on its role will also help deployment run smoothly. Keeping complete logs of every action taken by an AI agent can also help engineers understand what happened in the event of an incident and trace back the problem.
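As a minimal sketch of that owner-scoped permission rule (the class, scope strings and owner registry here are all illustrative, not a particular platform's API):

```python
# Hedged sketch: an agent's permissions capped by its human owner's scope.
# OWNER_SCOPES, the scope strings and the Agent class are all illustrative.
OWNER_SCOPES: dict[str, set[str]] = {
    "alice@example.com": {"read:incidents", "write:runbooks"},
}

class Agent:
    def __init__(self, name: str, owner: str, requested: set[str]) -> None:
        allowed = OWNER_SCOPES.get(owner, set())
        # An agent may never hold a permission its owner lacks,
        # and added tools cannot extend this set later.
        self.name, self.owner = name, owner
        self.scopes = requested & allowed

    def can(self, scope: str) -> bool:
        return scope in self.scopes

agent = Agent("triage-bot", "alice@example.com",
              {"read:incidents", "admin:billing"})
assert agent.can("read:incidents")
assert not agent.can("admin:billing")  # denied: owner has no such permission
```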
3: Make outputs explainable
AI use in an organization must never be a black box. The reasoning behind any action must be recorded so that any engineer who reviews it can understand the context the agent used for decision-making and access the traces that led to those actions.
Inputs and outputs for every action should be logged and accessible. This will help organizations establish a firm overview of the logic underlying an AI agent’s actions, providing significant value in the event anything goes wrong.
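A minimal sketch of such logging, with illustrative field names: every action gets a durable, replayable entry keyed by an ID that engineers can trace back from an incident.

```python
# Hedged sketch: an append-only log of every agent action. Field names
# and the JSONL file format are illustrative choices, not a standard.
import json
import time
import uuid

def log_action(log_path: str, agent: str, inputs: dict, output: dict) -> str:
    record = {
        "action_id": str(uuid.uuid4()),  # stable handle for audits and rollbacks
        "ts": time.time(),
        "agent": agent,
        "inputs": inputs,   # the full context the agent acted on
        "output": output,   # what it actually did
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["action_id"]

action_id = log_action("agent_actions.jsonl", "remediation-bot",
                       {"alert": "disk_full", "host": "db-01"},
                       {"action": "pruned_logs", "freed_mb": 2048})
```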
Security underscores AI agents’ success
AI agents offer a huge opportunity for organizations to accelerate and improve their existing processes. However, if they do not prioritize security and strong governance, they could expose themselves to new risks.
As AI agents become more common, organizations must ensure they have systems in place to measure how they perform and the ability to take action when they create problems.
Tony Stoyanov is CTO and co-founder of EliseAI.
In the 2010s, tech companies chased staff-level specialists: Backend engineers, data scientists, system architects. That model worked when technology evolved slowly. Specialists knew their craft, could deliver quickly and built careers on predictable foundations like cloud infrastructure or the latest JS framework.
Then AI went mainstream.
The pace of change has exploded. New technologies appear and mature in less than a year. You can’t hire someone who has been building AI agents for five years, as the technology hasn’t existed for that long. The people thriving today aren’t those with the longest résumés; they’re the ones who learn fast, adapt fast and act without waiting for direction. Nowhere is this transformation more evident than in software engineering, which has likely experienced the most dramatic shift of all, evolving faster than almost any other field of work.
How AI is rewriting the rules
AI has lowered the barrier to doing complex technical work, and it has also raised expectations for what counts as real expertise. McKinsey estimates that by 2030, up to 30% of U.S. work hours could be automated and 12 million workers may need to shift roles entirely. Technical depth still matters, but AI favors people who can figure things out as they go.
At my company, I see this every day. Engineers who never touched front-end code are now building UIs, while front-end developers are moving into back-end work. The technology keeps getting easier to use but the problems are harder because they span more disciplines.
In that kind of environment, being great at one thing isn’t enough. What matters is the ability to bridge engineering, product and operations to make good decisions quickly, even with imperfect information.
Despite all the excitement, only 1% of companies consider themselves truly mature in how they use AI. Many still rely on structures built for a slower era — layers of approval, rigid roles and an overreliance on specialists who can’t move outside their lane.
The traits of a strong generalist
A strong generalist has breadth without losing depth. They go deep in one or two domains but stay fluent across many. As David Epstein puts it in Range, “You have people walking around with all the knowledge of humanity on their phone, but they have no idea how to integrate it. We don’t train people in thinking or reasoning.” True expertise comes from connecting the dots, not just collecting information.
The best generalists share these traits:
- Ownership: End-to-end accountability for outcomes, not just tasks.
- First-principles thinking: Question assumptions, focus on the goal, and rebuild when needed.
- Adaptability: Learn new domains quickly and move between them smoothly.
- Agency: Act without waiting for approval and adjust as new information comes in.
- Soft skills: Communicate clearly, align teams and keep customers’ needs in focus.
- Range: Solve different kinds of problems and draw lessons across contexts.
I try to make accountability a priority for my teams. Everyone knows what they own, what success looks like and how it connects to the mission. Perfection isn’t the goal; forward movement is.
Embracing the shift
Focusing on adaptable builders changed everything. These are the people with the range and curiosity to use AI tools to learn quickly and execute confidently.
If you’re a builder who thrives in ambiguity, this is your time. The AI era rewards curiosity and initiative more than credentials. If you’re hiring, look ahead. The people who’ll move your company forward might not be the ones with the perfect résumé for the job. They’re the ones who can grow into what the company will need as it evolves.
The future belongs to generalists and to the companies that trust them.
-
Picture this: You're sitting in a conference room, halfway through a vendor pitch. The demo looks solid, and pricing fits nicely under budget. The timeline seems reasonable too. Everyone’s nodding along.
You’re literally minutes away from saying yes.
Then someone from your finance team walks in. They see the deck and frown. A few minutes later, they shoot you a message on Slack: “Actually, I threw together a version of this last week. Took me 2 hours in Cursor. Wanna take a look?”
Wait… what?
This person doesn't code. You know for a fact they've never written a line of JavaScript in their entire life. But here they are, showing you a working prototype on their laptop that does... pretty much exactly what the vendor pitched. Sure, it's got some rough edges, but it works. And it didn’t cost six figures. Just two hours of their time.
Suddenly, the assumptions you walked in with — about how software is developed, who makes it and how decisions are made around it — all start coming apart at the seams.
The old framework
For decades, every growing company asked the same question: Should we build this ourselves, or should we buy it?
And, for decades, the answer was pretty straightforward: Build if it's core to your business; buy if it isn’t.
The logic made sense, because building was expensive and meant borrowing time from overworked engineers, writing specs, planning sprints, managing infrastructure and bracing yourself for a long tail of maintenance. Buying was faster. Safer. You paid for the support and the peace of mind.
But something fundamental has changed: AI has made building accessible to everyone. What used to take weeks now takes hours, and what used to require fluency in a programming language now requires fluency in plain English.
When the cost and complexity of building collapse this dramatically, the old framework goes down with them. It’s not build versus buy anymore. It’s something stranger that we haven't quite found the right words for.
When the market doesn’t know what you need (yet)
My company never planned to build so many of the tools we use. We just had to build because the things we needed didn’t exist. And, through that process, we developed this visceral understanding of what we actually wanted, what was useful and what it could or couldn't do. Not what vendor decks told us we needed or what analyst reports said we should want, but what actually moved the needle in our business.
We figured out which problems were worth solving, which ones weren’t, where AI created real leverage and where it was just noise. And only then, once we had that hard-earned clarity, did we start buying.
By that point, we knew exactly what we were looking for and could tell the difference between substance and marketing in about five minutes. We asked questions that made vendors nervous because we'd already built some rudimentary version of what they were selling.
When anyone can build in minutes
Last week, someone on our CX team noticed some customer feedback about a bug in Slack. Just a minor customer complaint, nothing major. In another company, this would’ve kicked off a support ticket and they’d have waited for someone else to handle it, but that’s not what happened here. They opened Cursor, described the change and let AI write the fix. Then they submitted a pull request that engineering reviewed and merged.
Just 15 minutes after that complaint popped up in Slack, the fix was live in production.
The person who did this isn’t technical in the slightest. I doubt they could tell you the difference between Python and JavaScript, but they solved the problem anyway.
And that’s the point.
AI has gotten so good at cranking out relatively simple code that it handles 80% of the problems that used to require a sprint planning meeting and two weeks of engineering time. It’s erasing the boundary between technical and non-technical. Work that used to be bottlenecked by engineering is now being done by the people closest to the problem.
This is happening right now in companies that are actually paying attention.
The inversion that’s happening
Here's where it gets fascinating for finance leaders, because AI has actually flipped the entire strategic logic of the build versus buy decision on its head.
The old model went something like:
- Define the need.
- Decide whether to build or buy.
But defining the need took forever and required deep technical expertise, or you'd burn through money on trial-and-error vendor implementations. You'd sit through countless demos, trying to picture whether this actually solved your problem. Then you’d negotiate, implement, move all your data and workflows to the new tool and, six months and six figures later, discover whether (or not) you were actually right.
Now, the whole sequence gets turned around:
- Build something lightweight with AI.
- Use it to understand what you actually need.
- Then decide whether to buy (and you'll know exactly why).
This approach lets you run controlled experiments. You figure out whether the problem even matters. You discover which features deliver value and which just look good in demos. Then you go shopping. Instead of letting some external vendor sell you on what the need is, you get to figure out whether you even have that need in the first place.
Think about how many software purchases you've made that, in hindsight, solved problems you didn't actually have. How many times have you been three months into an implementation and thought, “Hang on, is this actually helping us, or are we just trying to justify what we spent?”
Now, when you do buy, the question becomes “Does this solve the problem better than what we already proved we can build?”
That one reframe changes the entire conversation. Now you show up to vendor calls informed. You ask sharper questions, and negotiate from a place of strength. Most importantly, you avoid the most expensive mistake in enterprise software, which is solving a problem you never really had.
The trap you need to avoid
As this new capability emerges, I’m watching companies sprint in the wrong direction. They know they need to be AI native, so they go on a shopping spree. They look for AI-powered tools, filling their stack with products that have GPT integrations, chatbot UIs or “AI” slapped onto the marketing site. They think they’re transforming, but they’re not.
Remember what physicist Richard Feynman called cargo cult science? After World War II, islanders in the South Pacific built fake airstrips and control towers, mimicking what they'd seen during the war, hoping planes full of cargo would return. They had all the outward forms of an airport: Towers, headsets, even people miming air traffic controllers. But no planes landed, because the form wasn’t the function.
That’s exactly what’s happening with AI transformation in boardrooms everywhere. Leaders are buying AI tools without asking if they meaningfully change how work gets done, who they empower or what processes they unlock.
They’ve built the airstrip, but the planes aren’t showing up.
And the whole market's basically set up to make you fall into this trap. Everything gets branded as AI now, but nobody seems to care what these products actually do. Every SaaS product has bolted on a chatbot or an auto-complete feature and slapped an AI label on it, and the label has lost all meaning. It’s just a checkbox vendors figure they need to tick, regardless of whether it creates actual value for customers.
The finance team’s new superpower
This is the part that gets me excited about what finance teams can do now. You don’t have to guess anymore. You don’t have to bet six figures on a sales deck. You can test things, and you can actually learn something before you spend.
Here's what I mean: If you’re evaluating vendor management software, prototype the core workflow with AI tools. Figure out whether you’re solving a tooling problem or a process problem. Figure out whether you need software at all.
This doesn’t mean you’ll build everything internally — of course not. Most of the time, you’ll still end up buying, and that's totally fine, because enterprise tools exist for good reasons (scale, support, security, and maintenance). But now you’ll buy with your eyes wide open.
You’ll know what “good” looks like. You’ll show up to demos already understanding the edge cases, and know in about 5 minutes whether they actually get your specific problem. You’ll implement faster. You'll negotiate better because you're not completely dependent on the vendor's solution. And you’ll choose it because it's genuinely better than what you could build yourself.
You'll have already mapped out the shape of what you need, and you'll just be looking for the best version of it.
The new paradigm
For years, the mantra was: Build or buy.
Now, it’s more elegant and way smarter: Build to learn what to buy.
And it's not some future state. This is already happening. Right now, somewhere, a customer rep is using AI to fix a product issue they spotted minutes ago. Somewhere else, a finance team is prototyping their own analytical tools because they've realized they can iterate faster than they can write up requirements for engineering. Somewhere, a team is realizing that the boundary between technical and non-technical was always more cultural than fundamental.
The companies that embrace this shift will move faster and spend smarter. They’ll know their operations more deeply than any vendor ever could. They'll make fewer expensive mistakes, and buy better tools because they actually understand what makes tools good.
The companies that stick to the old playbook will keep sitting through vendor pitches, nodding along at budget-friendly proposals. They’ll debate timelines, and keep mistaking professional decks for actual solutions.
Until someone on their own team pops open their laptop, says, “I built a version of this last night. Want to check it out?,” and shows them something they built in two hours that does 80% of what they’re about to pay six figures for.
And, just like that, the rules change for good.
Siqi Chen is co-founder and CEO of Runway.
-
Gen AI in software engineering has moved well beyond autocomplete. The emerging frontier is agentic coding: AI systems capable of planning changes, executing them across multiple steps and iterating based on feedback. Yet despite the excitement around “AI agents that code,” most enterprise deployments underperform. The limiting factor is no longer the model. It’s context: The structure, history and intent surrounding the code being changed. In other words, enterprises are now facing a systems design problem: They have not yet engineered the environment these agents operate in.
The shift from assistance to agency
The past year has seen a rapid evolution from assistive coding tools to agentic workflows. Research has begun to formalize what agentic behavior means in practice: The ability to reason across design, testing, execution and validation rather than generate isolated snippets. Work such as dynamic action re-sampling shows that allowing agents to branch, reconsider and revise their own decisions significantly improves outcomes in large, interdependent codebases. At the platform level, providers like GitHub are now building dedicated agent orchestration environments, such as Copilot Agent and Agent HQ, to support multi-agent collaboration inside real enterprise pipelines.
But early field results tell a cautionary story. When organizations introduce agentic tools without addressing workflow and environment, productivity can decline. A randomized controlled study this year showed that developers who used AI assistance in unchanged workflows completed tasks more slowly, largely due to verification, rework and confusion around intent. The lesson is straightforward: Autonomy without orchestration rarely yields efficiency.
Why context engineering is the real unlock
In every unsuccessful deployment I’ve observed, the failure stemmed from context. When agents lack a structured understanding of a codebase (specifically its relevant modules, dependency graph, test harness, architectural conventions and change history), they often generate output that appears correct but is disconnected from reality. Too much information overwhelms the agent; too little forces it to guess. The goal is not to feed the model more tokens. The goal is to determine what should be visible to the agent, when and in what form.
The teams seeing meaningful gains treat context as an engineering surface. They create tooling to snapshot, compact and version the agent’s working memory: What is persisted across turns, what is discarded, what is summarized and what is linked instead of inlined. They design deliberation steps rather than prompting sessions. They make the specification a first-class artifact, something reviewable, testable and owned, not a transient chat history. This shift aligns with a broader trend some researchers describe as “specs becoming the new source of truth.”
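A hedged sketch of that idea (the class and field names are illustrative, not any product's API): working memory becomes an explicit, versioned object in which each item is deliberately persisted, summarized or linked rather than accumulated by default.

```python
# Hedged sketch of a versioned working-memory snapshot for a coding agent.
# All names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ContextSnapshot:
    version: int
    persisted: dict = field(default_factory=dict)   # carried across turns
    summarized: dict = field(default_factory=dict)  # compacted history
    linked: tuple = ()                              # referenced, not inlined

    def compact(self, key: str, summary: str) -> "ContextSnapshot":
        """Replace a bulky persisted item with a summary; bump the version."""
        persisted = {k: v for k, v in self.persisted.items() if k != key}
        summarized = {**self.summarized, key: summary}
        return ContextSnapshot(self.version + 1, persisted, summarized, self.linked)

snap = ContextSnapshot(1, persisted={"module_map": "<12,000 tokens of detail>"})
snap = snap.compact("module_map", "payments -> ledger -> audit; tests in /tests")
```

Because each snapshot is immutable and versioned, any agent turn can be replayed against exactly the context it saw, which is what makes the specification a reviewable artifact rather than a transient chat history.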
Workflow must change alongside tooling
But context alone isn’t enough. Enterprises must re-architect the workflows around these agents. As McKinsey’s 2025 report “One Year of Agentic AI” noted, productivity gains arise not from layering AI onto existing processes but from rethinking the process itself. When teams simply drop an agent into an unaltered workflow, they invite friction: Engineers spend more time verifying AI-written code than they would have spent writing it themselves. The agents can only amplify what’s already structured: Well-tested, modular codebases with clear ownership and documentation. Without those foundations, autonomy becomes chaos.
Security and governance, too, demand a shift in mindset. AI-generated code introduces new forms of risk: Unvetted dependencies, subtle license violations and undocumented modules that escape peer review. Mature teams are beginning to integrate agentic activity directly into their CI/CD pipelines, treating agents as autonomous contributors whose work must pass the same static analysis, audit logging and approval gates as any human developer. GitHub’s own documentation highlights this trajectory, positioning Copilot Agents not as replacements for engineers but as orchestrated participants in secure, reviewable workflows. The goal isn’t to let an AI “write everything,” but to ensure that when it acts, it does so inside defined guardrails.
What enterprise decision-makers should focus on now
For technical leaders, the path forward starts with readiness rather than hype. Monoliths with sparse tests rarely yield net gains; agents thrive where tests are authoritative and can drive iterative refinement. This is exactly the loop Anthropic calls out for coding agents. Run pilots in tightly scoped domains (test generation, legacy modernization, isolated refactors) and treat each deployment as an experiment with explicit metrics (defect escape rate, PR cycle time, change failure rate, security findings burned down). As your usage grows, treat agents as data infrastructure: Every plan, context snapshot, action log and test run is data that composes into a searchable memory of engineering intent, and a durable competitive advantage.
Under the hood, agentic coding is less a tooling problem than a data problem. Every context snapshot, test iteration and code revision becomes a form of structured data that must be stored, indexed and reused. As these agents proliferate, enterprises will find themselves managing an entirely new data layer: One that captures not just what was built, but how it was reasoned about. This shift turns engineering logs into a knowledge graph of intent, decision-making and validation. In time, the organizations that can search and replay this contextual memory will outpace those who still treat code as static text.
The coming year will likely determine whether agentic coding becomes a cornerstone of enterprise development or another inflated promise. The difference will hinge on context engineering: How intelligently teams design the informational substrate their agents rely on. The winners will be those who see autonomy not as magic, but as an extension of disciplined systems design: Clear workflows, measurable feedback and rigorous governance.
Bottom line
Platforms are converging on orchestration and guardrails, and research keeps improving context control at inference time. The winners over the next 12 to 24 months won’t be the teams with the flashiest model; they’ll be the ones that engineer context as an asset and treat workflow as the product. Do that, and autonomy compounds. Skip it, and the review queue does.
Context + agent = leverage. Skip the first half, and the rest collapses.
Dhyey Mavani is accelerating generative AI at LinkedIn.
Three years ago, ChatGPT was born. It amazed the world and ignited unprecedented investment and excitement in AI. Today, ChatGPT is still a toddler, but public sentiment around the AI boom has turned sharply negative. The shift began when OpenAI released GPT-5 this summer to mixed reviews, mostly from casual users who, unsurprisingly, judged the system by its surface flaws rather than its underlying capabilities.
Since then, pundits and influencers have declared that AI progress is slowing, that scaling has “hit the wall,” and that the entire field is just another tech bubble inflated by blusterous hype. In fact, many influencers have latched onto the dismissive phrase “AI slop” to diminish the amazing images, documents, videos and code that frontier AI models generate on command.
This perspective is not just wrong, it is dangerous.
It makes me wonder: Where were all these “experts” on irrational technology bubbles when electric scooter startups were touted as a transportation revolution and cartoon NFTs were being auctioned for millions? They were probably too busy buying worthless land in the metaverse or adding to their positions in GameStop. But when it comes to the AI boom, which is easily the most significant technological and economic transformation of the last 25 years, journalists and influencers can’t write the word “slop” enough times.
Do we protest too much? After all, by any objective measure, AI is wildly more capable than the vast majority of computer scientists predicted only five years ago, and it is still improving at a surprising pace. The impressive leap demonstrated by Gemini 3 is only the latest example. At the same time, McKinsey recently reported that 20% of organizations already derive tangible value from genAI. Also, a recent survey by Deloitte indicates that 85% of organizations boosted their AI investment in 2025, and 91% plan to increase again in 2026.
This doesn’t fit the “bubble” narrative and the dismissive “slop” language. As a computer scientist and research engineer who began working with neural networks back in 1989 and tracked progress through cold winters and hot booms ever since, I find myself amazed almost every day by the rapidly increasing capabilities of frontier AI models. When I talk with other professionals in the field, I hear similar sentiments. If anything, the rate of AI advancement leaves many experts feeling overwhelmed and frankly somewhat scared.
The dangers of AI denial
So why is the public buying into the narrative that AI is faltering, that the output is “slop,” and that the AI boom lacks authentic use cases? Personally, I believe it’s because we’ve fallen into a collective state of AI denial, latching onto the narratives we want to hear in the face of strong evidence to the contrary. Denial is the first stage of grief and thus a reasonable reaction to the very disturbing prospect that we humans may soon lose cognitive supremacy here on planet earth. In other words, the overblown AI bubble narrative is a societal defense mechanism.
Believe me, I get it. I’ve been warning about the destabilizing risks and demoralizing impact of superintelligence for well over a decade, and I too feel AI is getting too smart too fast. The fact is, we are rapidly headed towards a future where widely available AI systems will be able to outperform most humans in most cognitive tasks, solving problems faster, more accurately and yes, more creatively than any individual can. I emphasize “creativity” because AI denialists often insist that certain human qualities (particularly creativity and emotional intelligence) will always be out of reach of AI systems. Unfortunately, there is little evidence supporting this perspective.
On the creativity front, today’s AI models can generate content faster and with more variation than any individual human. Critics argue that true creativity requires inner motivation. I resonate with that argument but find it circular — we're defining creativity based on how we experience it rather than the quality, originality or usefulness of the output. Also, we just don’t know if AI systems will develop internal drives or a sense of agency. Either way, if AI can produce original work that rivals most human professionals, the impact on creative jobs will still be quite devastating.
The AI manipulation problem
Our human edge around emotional intelligence is even more precarious. It’s likely that AI will soon be able to read our emotions faster and more accurately than any human, tracking subtle cues in our micro-expressions, vocal patterns, posture, gaze and even breathing. And as we integrate AI assistants into our phones, glasses and other wearable devices, these systems will monitor our emotional reactions throughout our day, building predictive models of our behaviors. Without strict regulation, which is increasingly unlikely, these predictive models could be used to target us with individually optimized influence that maximizes persuasion.
This is called the AI manipulation problem and it suggests that emotional intelligence may not give humanity an advantage. In fact, it could be a significant weakness, fostering an asymmetric dynamic where AI systems can read us with superhuman accuracy, while we can’t read AI at all. When you talk with photorealistic AI agents (and you will) you’ll see a smiling façade designed to appear warm, empathic and trustworthy. It will look and feel human, but that’s just an illusion, and it could easily sway your perspectives. After all, our emotional reactions to faces are visceral reflexes shaped by millions of years of evolution on a planet where every interactive human face we encountered was actually human. Soon, that will no longer be true.
We are rapidly heading toward a world where many of the faces we encounter will belong to AI agents hiding behind digital facades. In fact, these “virtual spokespeople” could easily have appearances that are designed for each of us based on our prior reactions – whatever gets us to best let down our guard. And yet many insist that AI is just another tech cycle.
This is wishful thinking. The massive investment pouring into AI isn’t driven by hype — it’s driven by the expectation that AI will permeate every aspect of daily life, embodied as intelligent actors we engage throughout our day. These systems will assist us, teach us and influence us. They will reshape our lives, and it will happen faster than most people think.
To be clear, we are not witnessing an AI bubble filling with empty gas. We are watching a new planet form, a molten world rapidly taking shape, and it will solidify into a new AI-powered society. Denial will not stop this. It will only make us less prepared for the risks.
Louis Rosenberg is an early pioneer of augmented reality and a longtime AI researcher.
Enterprises are investing billions of dollars in AI agents and infrastructure to transform business processes. However, we are seeing limited success in real-world applications, often due to the inability of agents to truly understand business data, policies and processes.
While we manage the integrations well with technologies like API management, model context protocol (MCP) and others, having agents truly understand the “meaning” of data in the context of a given business is a different story. Enterprise data is mostly siloed in disparate systems, in structured and unstructured forms, and needs to be analyzed through a domain-specific business lens.
As an example, the term “customer” may refer to a different group of people in a sales CRM system than in a finance system, which may reserve the term for paying clients. One department might define “product” as a SKU; another as a product family; a third as a marketing bundle.
Data about “product sales” thus varies in meaning without agreed-upon relationships and definitions. For agents to combine data from multiple systems, they must understand these different representations. Agents need to know what the data means in context and how to find the right data for the right process. Moreover, schema changes in source systems and data quality issues during collection introduce further ambiguity, leaving agents unsure how to act when such situations are encountered.
Furthermore, classification of data into categories like PII (personally identifiable information) needs to be rigorously followed to maintain compliance with standards like GDPR and CCPA. This requires the data to be labelled correctly and agents to be able to understand and respect this classification. Hence, building a cool demo using agents is very much doable, but putting agents into production on real business data is a different story altogether.
The ontology-based source of truth
Building effective agentic solutions requires an ontology-based single source of truth. An ontology is a business definition of concepts, their hierarchy and relationships. It defines terms with respect to business domains, helps establish a single source of truth for data, captures uniform field names and applies classifications to fields.
An ontology may be domain-specific (healthcare or finance) or organization-specific, based on internal structures. Defining an ontology upfront is time-consuming, but it can help standardize business processes and lay a strong foundation for agentic AI.
An ontology may be realized using common queryable formats like a triplestore. More complex business rules with multi-hop relations could use a labelled property graph like Neo4j. These graphs can also help enterprises discover new relationships and answer complex questions. Ontologies like FIBO (Finance Industry Business Ontology) and UMLS (Unified Medical Language System) are available in the public domain and can be a very good starting point. However, these usually need to be customized to capture specific details of an enterprise.
Getting started with ontology
Once implemented, an ontology can be the driving force for enterprise agents. We can now prompt AI to follow the ontology and use it to discover data and relationships. If needed, we can have an agentic layer serve key details of the ontology itself and discover data. Business rules and policies can be implemented in this ontology for agents to adhere to. This is an excellent way to ground your agents and establish guardrails based on real business context.
Agents designed in this manner and tuned to follow an ontology can stick to guardrails and avoid hallucinations that can be caused by the large language models (LLMs) powering them. For example, a business policy may require that unless all documents associated with a loan have their verified flags set to "true," the loan status should be kept in the “pending” state. Agents can work within this policy, determine what documents are needed and query the knowledge base.
Here's an example implementation:
(Original figure by Author)
As illustrated, we have structured and unstructured data processed by a document intelligence (DocIntel) agent which populates a Neo4j database based on an ontology of the business domain. A data discovery agent in Neo4j finds and queries the right data and passes it to other agents handling business process execution. The inter-agent communication happens with a popular protocol like A2A (agent to agent). A new protocol called AG-UI (Agent User Interaction) can help build more generic UI screens to capture the workings and responses from these agents.
With this method, we can avoid hallucinations by requiring agents to follow ontology-driven paths and maintain data classifications and relationships. Moreover, we can scale easily by adding new assets, relationships and policies that agents automatically comply with, and control hallucinations by defining rules for the whole system rather than individual entities. For example, if an agent hallucinates an individual 'customer,' the connected data for that 'customer' will not be verifiable during data discovery, so we can easily detect the anomaly and eliminate it. This helps the agentic system scale with the business and manage its dynamic nature.
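As a hedged illustration of the loan policy above, the data discovery step could ask Neo4j directly whether every document on a loan is verified. The labels (Loan, Document), the HAS_DOCUMENT relationship, the "verified" property and the connection details are all assumptions to adapt to your own ontology.

```python
# Illustrative sketch using the official neo4j Python driver; the graph
# schema and credentials are placeholders, not a prescribed design.
from neo4j import GraphDatabase

CYPHER = """
MATCH (l:Loan {id: $loan_id})-[:HAS_DOCUMENT]->(d:Document)
WITH l, collect(d.verified) AS flags
RETURN CASE WHEN all(f IN flags WHERE f = true)
            THEN 'ready_for_approval' ELSE 'pending' END AS status
"""

def loan_status(driver, loan_id: str) -> str:
    """Return 'pending' unless every attached document is verified."""
    with driver.session() as session:
        return session.run(CYPHER, loan_id=loan_id).single()["status"]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
print(loan_status(driver, "LOAN-1042"))
```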
Indeed, a reference architecture like this adds some overhead in data discovery and graph databases. But for a large enterprise, it adds the right guardrails and gives agents directions to orchestrate complex business processes.
Dattaraj Rao is innovation and R&D architect at Persistent Systems.
As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.
Why observability secures the future of enterprise AI
The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road.
Yet, beneath the excitement, most leaders admit they can’t trace how AI decisions are made, whether they helped the business or whether they broke any rules.
Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet six months later, auditors found that 18% of critical cases had been misrouted, without a single alert or trace. The root cause wasn’t bias or bad data; it was invisibility. No observability, no accountability.
If you can’t observe it, you can’t trust it. And unobserved AI will fail in silence.
Visibility isn’t a luxury; it’s the foundation of trust. Without it, AI becomes ungovernable.
Start with outcomes, not models
Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics. That’s backward.
Flip the order:
- Define the outcome first. What’s the measurable business goal?
  - Deflect 15% of billing calls
  - Reduce document review time by 60%
  - Cut case-handling time by two minutes
- Design telemetry around that outcome, not around “accuracy” or “BLEU score.”
- Select prompts, retrieval methods and models that demonstrably move those KPIs.
At one global insurer, for instance, reframing success as “minutes saved per claim” instead of “model precision” turned an isolated pilot into a company-wide roadmap.
A 3-layer telemetry model for LLM observability
Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:
a) Prompts and context: What went in

- Log every prompt template, variable and retrieved document.
- Record model ID, version, latency and token counts (your leading cost indicators).
- Maintain an auditable redaction log showing what data was masked, when and by which rule.

b) Policies and controls: The guardrails

- Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.
- Store policy reasons and risk tier for each deployment.
- Link outputs back to the governing model card for transparency.

c) Outcomes and feedback: Did it work?

- Gather human ratings and edit distances from accepted answers.
- Track downstream business events: case closed, document approved, issue resolved.
- Measure the KPI deltas: call time, backlog, reopen rate.
All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved.
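As a hedged sketch (the field names are illustrative, not a standard schema), a single replayable trace record might tie the three layers together like this:

```python
# One trace record spanning all three telemetry layers; every field name
# here is an illustrative assumption.
trace = {
    "trace_id": "7f3c9a12",          # shared key across all three layers
    "prompts_and_context": {         # layer a: what went in
        "template": "billing_v3",
        "model": {"id": "model-x", "version": "2025-01"},
        "tokens": {"in": 812, "out": 240},
        "redactions": [{"field": "ssn", "rule": "pii-mask-v2"}],
    },
    "policies_and_controls": {       # layer b: the guardrails
        "toxicity_pass": True,
        "pii_pass": True,
        "risk_tier": "medium",
        "model_card": "cards/billing_v3.md",
    },
    "outcomes_and_feedback": {       # layer c: did it work?
        "human_rating": 4,
        "edit_distance": 12,
        "business_event": "case_closed",
        "kpi_delta": {"call_time_sec": -94},
    },
}
```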
Diagram © SaiKrishna Koorapati (2025). Created specifically for this article; licensed to VentureBeat for publication.
Apply SRE discipline: SLOs and error budgets for AI
Site reliability engineering (SRE) transformed software operations; now it’s AI’s turn.
Define three “golden signals” for every critical workflow:
| Signal | Target SLO | When breached |
| --- | --- | --- |
| Factuality | ≥95% verified against source of record | Fallback to verified template |
| Safety | ≥99.9% pass toxicity/PII filters | Quarantine and human review |
| Usefulness | ≥80% accepted on first pass | Retrain or rollback prompt/model |
If hallucinations or refusals exceed budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage.
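A minimal sketch of that routing logic, with thresholds mirroring the table above (the score fields and route names are assumptions about what your evaluators and handlers provide):

```python
# Hedged sketch: per-response SLO gate. Scores are assumed to come from
# upstream evaluators; route names are placeholders for real handlers.
SLOS = {"factuality": 0.95, "safety": 0.999, "usefulness": 0.80}

def route(response: dict) -> str:
    if response["safety_score"] < SLOS["safety"]:
        return "quarantine_and_human_review"
    if response["factuality_score"] < SLOS["factuality"]:
        return "fallback_to_verified_template"
    if response["first_pass_acceptance"] < SLOS["usefulness"]:
        return "flag_prompt_for_retrain_or_rollback"
    return "serve"

# Example: a factually shaky answer gets the verified-template fallback.
print(route({"safety_score": 1.0, "factuality_score": 0.91,
             "first_pass_acceptance": 0.85}))
```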
This isn’t bureaucracy; it’s reliability applied to reasoning.
Build the thin observability layer in two agile sprints
You don’t need a six-month roadmap, just focus and two short sprints.
Sprint 1 (weeks 1-3): Foundations
- Version-controlled prompt registry
- Redaction middleware tied to policy
- Request/response logging with trace IDs
- Basic evaluations (PII checks, citation presence)
- Simple human-in-the-loop (HITL) UI
Sprint 2 (weeks 4-6): Guardrails and KPIs
- Offline test sets (100–300 real examples)
- Policy gates for factuality and safety
- Lightweight dashboard tracking SLOs and cost
- Automated token and latency tracker
In 6 weeks, you’ll have the thin layer that answers 90% of governance and product questions.
Make evaluations continuous (and boring)
Evaluations shouldn’t be heroic one-offs; they should be routine.
- Curate test sets from real cases; refresh 10–20% monthly.
- Define clear acceptance criteria shared by product and risk teams.
- Run the suite on every prompt/model/policy change and weekly for drift checks.
- Publish one unified scorecard each week covering factuality, safety, usefulness and cost.
When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
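A hedged sketch of such a routine run (the JSONL test-set format, the injected `generate` callable and the exact-match grader are all illustrative assumptions; real suites usually use rubric or model-based grading):

```python
# Hedged sketch: the same eval suite runs on every prompt/model/policy
# change and on a weekly schedule for drift checks.
import json

def score_answer(answer: str, expected: str) -> float:
    # Placeholder grader: exact match stands in for a richer scorer.
    return 1.0 if answer.strip() == expected.strip() else 0.0

def run_eval_suite(generate, test_set_path: str, threshold: float = 0.95) -> dict:
    cases = [json.loads(line) for line in open(test_set_path)]
    scores = [score_answer(generate(c["input"]), c["expected"]) for c in cases]
    mean = sum(scores) / len(scores)
    return {"n": len(cases), "mean_score": mean, "pass": mean >= threshold}
```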
Apply human oversight where it matters
Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.
- Route low-confidence or policy-flagged responses to experts.
- Capture every edit and reason as training data and audit evidence.
- Feed reviewer feedback back into prompts and policies for continuous improvement.
At one health-tech firm, this approach cut false positives by 22% and produced a retrainable, compliance-ready dataset in weeks.
Cost control through design, not hope
LLM costs grow non-linearly. Budgets won’t save you; architecture will.
- Structure prompts so deterministic sections run before generative ones.
- Compress and rerank context instead of dumping entire documents.
- Cache frequent queries and memoize tool outputs with TTL.
- Track latency, throughput and token use per feature.
When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.
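For example, a minimal in-process sketch of TTL memoization for tool outputs; in production a shared cache such as Redis would replace the dict, and the decorated function is a stand-in for any expensive LLM or tool call:

```python
# Hedged sketch: TTL memoization so repeat queries don't re-spend tokens.
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    def decorator(fn):
        store = {}  # args -> (timestamp, result)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # cache hit: no tokens spent, no added latency
            result = fn(*args)
            store[args] = (now, result)
            return result
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=300)
def lookup_policy(query: str) -> str:
    # Placeholder for an expensive LLM or tool call.
    return f"policy answer for {query!r}"
```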
The 90-day playbook
Within 3 months of adopting observable AI principles, enterprises should see:
- 1–2 production AI assists with HITL for edge cases
- Automated evaluation suite for pre-deploy and nightly runs
- Weekly scorecard shared across SRE, product and risk
- Audit-ready traces linking prompts, policies and outcomes
At a Fortune 100 client, this structure reduced incident time by 40% and aligned product and compliance roadmaps.
Scaling trust through observability
Observable AI is how you turn AI from experiment to infrastructure.
With clear telemetry, SLOs and human feedback loops:
- Executives gain evidence-backed confidence.
- Compliance teams get replayable audit chains.
- Engineers iterate faster and ship safely.
- Customers experience reliable, explainable AI.
Observability isn’t an add-on layer; it’s the foundation for trust at scale.
SaiKrishna Koorapati is a software engineering leader.
-
Remember the first time you heard your company was going AI-first?
Maybe it came through an all-hands that felt different from the others. The CEO said, “By Q3, every team should have integrated AI into their core workflows,” and the energy in the room (or on the Zoom) shifted. You saw a mix of excitement and anxiety ripple through the crowd.
Maybe you were one of the curious ones. Maybe you’d already built a Python script that summarized customer feedback, saving your team three hours every week. Or maybe you’d stayed late one night just to see what would happen if you combined a dataset with a large language model (LLM) prompt. Maybe you’re one of those who’d already let curiosity lead you somewhere unexpected.
But this announcement felt different because suddenly, what had been a quiet act of curiosity was now a line in a corporate OKR. Maybe you didn’t know it yet, but something fundamental had shifted in how innovation would happen inside your company.
How innovation happens
Real transformation rarely looks like the PowerPoint version, and almost never follows the org chart.
Think about the last time something genuinely useful spread at work. It wasn't because of a vendor pitch or a strategic initiative, was it? More likely, someone stayed late one night, when no one was watching, found something that cut hours of busywork, and mentioned it at lunch the next day. “Hey, try this.” They shared it in a Slack thread and, in a week, half the team was using it.
The developer who used GPT to debug code wasn’t trying to make a strategic impact. She just needed to get home earlier to her kids. The ops manager who automated his spreadsheet didn’t need permission. He just needed more sleep.
This is the invisible architecture of progress — these informal networks where curiosity flows like water through concrete… finding every crack, every opening.
But watch what happens when leadership notices. What used to be effortless and organic becomes mandated. And the thing that once worked because it was free suddenly stops being as effective the moment it’s measured.
The great reversal
It usually begins quietly, often when a competitor announces new AI features (AI-powered onboarding, say, or end-to-end support automation) and claims 40% efficiency gains.
The next morning, your CEO calls an emergency meeting. The room gets still. Someone clears their throat. And you can feel everyone doing mental math about their job security. “If they’re that far ahead, what does that mean for us?”
That afternoon, your company has a new priority. Your CEO says, “We need an AI strategy. Yesterday.”
Here's how that message usually ripples down the org chart:
- At the C-suite: “We need an AI strategy to stay competitive.”
- At the VP level: “Every team needs an AI initiative.”
- At the manager level: “We need a plan by Friday.”
- At your level: “I just need to find something that looks like AI.”
Each translation adds pressure while subtracting understanding. Everyone still cares, but that translation changes intent. What begins as a question worth asking becomes a script everyone follows blindly.
Eventually, the performance of innovation replaces the thing itself. There’s a strange pressure to look like you’re moving fast, even when you’re not sure where you’re actually going.
This repeats across industries
A competitor declares they’re going AI-first. Another publishes a case study about replacing support with LLMs. And a third shares a graph showing productivity gains. Within days, boardrooms everywhere start echoing the same message: “We should be doing this. Everyone else already is, and we can’t fall behind.”
So the work begins. Then come the task forces, the town halls, the strategy docs and the targets. Teams are asked to contribute initiatives.
But if you’ve been through this before, you know there’s often a difference between what companies announce and what they actually do. Because press releases don’t mention the pilots that stall, or the teams that quietly revert to the old way, or even the tools that get used once and abandoned. You might know someone who was on one of those teams, or you might’ve even been on one yourself.
These aren’t failures of technology or intent. ChatGPT works fine. And teams want to automate their tasks. These failures are organizational, and they happen when we try to imitate outcomes without understanding what created them in the first place.
And so when everyone performs innovation, it becomes almost impossible to tell who’s actually doing it.
Two kinds of leaders
You’ve probably seen both, and it’s very easy to tell which kind you’re working with.
One spends an entire weekend prototyping. They try something new, fail at half of it, and still show up Monday saying, “I built this thing with Claude. It crashed after two hours, but I learned a lot. Wanna see? It's very basic, but it might solve that thing we talked about.”
They try to build understanding. You can tell they’ve actually spent time with AI, and struggled with prompts and hallucinations. Instead of trying to sound certain, they talk about what broke, what almost worked and what they’re still figuring out. They invite you to try something new, because it feels like there’s room to learn. That’s what leading by participation looks like.
The other sends you a directive in Slack: “Leadership wants every team using AI by the end of the quarter. Plans are due by Friday.” They enforce compliance with a decision that's already been made. You can even hear it in their language, and how certain they sound.
The curious leader builds momentum. The performative one builds resentment.
What actually works
You probably don’t need someone to tell you where AI works. You already know because you’ve seen it.
- Customer support: LLMs genuinely help with Tier 1 tickets. They understand intent, draft simple responses and route complexity. Not perfectly, of course (I’m sure you've seen the failures), but well enough to matter.
- Code assistance: At 2 a.m., when you’re half-delirious and your AI assistant suggests exactly what you need, it feels like having an over-caffeinated junior programmer who never judges your forgotten semicolons. You save minutes at first, then hours, then days.
These small, cumulative wins compound over time. They aren't the impressive transformations promised in decks, but the kind of improvements you can rely on.
But outside these zones, things get murky. AI-driven revops? Fully automated forecasting? You've sat through those demos, and you’ve also seen the enthusiasm fade once the pilot actually begins.
Have the builders of these AI tools failed? Hardly. The technology is evolving, and the products built on top of it are still learning how to walk.
So how can you tell if your company's AI adoption is real? Simple. Just ask someone in finance or ops. Ask what AI tools they use daily. You might get a slight pause or an apologetic smile. “Honestly? Just ChatGPT.” That’s it. Not the $50k enterprise-grade platform from last quarter’s demo or the expensive software suite in the board deck. Just a browser tab, same as any college student writing an essay.
You might make this same confession yourself. Despite all the mandates and initiatives, your most powerful AI tool is probably the same one everyone else uses. So what does this tell us about the gap between what we're supposed to be doing and what we're actually doing?
How to drive change at your company
You've probably discovered this yourself, even if no one's ever put it into words:
- Model what you mean: Remember that engineering director who screen-shared her messy, live coding session with Cursor? You learned more from watching her debug in real time than from any polished presentation, because vulnerability travels farther than directives.
- Listen to the edges: You know who's actually using AI effectively in your organization, and they're not always the ones with “AI” in their title. They're the curious ones who've been quietly experimenting, finding what works through trial and error. And that knowledge is worth more than any analyst report.
- Create permission (not pressure): The people inclined to experiment will always find a way, and the rest won’t be moved by force. The best thing you can do is make the curious feel safe to stay curious.
We're living in this strange moment, caught between the AI that vendors promise and the AI that actually exists on our screens, and it's deeply uncomfortable. The gap between product and promise is wide.
But what I've learned from sitting in that discomfort is that companies that will thrive aren’t the ones that adopted AI first, but the ones that learned through trial and error. They stayed with the discomfort long enough for it to teach them something.
Where will you be six months from now?
By then, your company’s AI-first mandate will have set into motion departmental initiatives, vendor contracts and maybe even some new hires with “AI” in their titles. The dashboards will be green, and the board deck will have a whole slide on AI.
But in the quiet spaces where your actual work happens, what will have meaningfully changed?
Maybe you'll be like the teams that never stopped their quiet experiments. Your customer feedback system might catch the patterns humans miss. Your documentation might update itself. Chances are, if you were building before the mandate, you’ll be building after it fades.
That’s the invisible architecture of genuine progress: Patient and completely uninterested in performance. It doesn't make for great LinkedIn posts, and it resists grand narratives. But it transforms companies in ways that truly last.
Every organization is standing at the same crossroads right now: Look like you’re innovating, or create a culture that fosters real innovation.
The pressure to perform innovation is real, and it’s growing. Most companies will give in and join the theater. But some understand that curiosity can’t be forced, and progress can’t be performed. Because real transformation happens when no one’s watching, in the hands of the people still experimenting, still learning. That’s where the future begins.
Siqi Chen is co-founder and CEO of Runway.
-
Large language models (LLMs) have astounded the world with their capabilities, yet they remain plagued by unpredictability and hallucinations – confidently outputting incorrect information. In high-stakes domains like finance, medicine or autonomous systems, such unreliability is unacceptable.
Enter Lean4, an open-source programming language and interactive theorem prover becoming a key tool to inject rigor and certainty into AI systems. By leveraging formal verification, Lean4 promises to make AI safer, more secure and deterministic in its functionality. Let's explore how Lean4 is being adopted by AI leaders and why it could become foundational for building trustworthy AI.
What is Lean4 and why it matters
Lean4 is both a programming language and a proof assistant designed for formal verification. Every theorem or program written in Lean4 must pass strict type-checking by Lean’s trusted kernel, yielding a binary verdict: A statement either checks out as correct or it doesn’t. This all-or-nothing verification means there’s no room for ambiguity – a property or result is proven true or it fails. Such rigorous checking “dramatically increases the reliability” of anything formalized in Lean4. In other words, Lean4 provides a framework where correctness is mathematically guaranteed, not just hoped for.
This level of certainty is precisely what today’s AI systems lack. Modern AI outputs are generated by complex neural networks with probabilistic behavior. Ask the same question twice and you might get different answers. By contrast, a Lean4 proof or program will behave deterministically – given the same input, it produces the same verified result every time. This determinism and transparency (every inference step can be audited) make Lean4 an appealing antidote to AI’s unpredictability.
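To make that determinism concrete, here is a minimal Lean4 sketch. The names add_comm_example, double and double_preserves_length are invented for illustration; Nat.add_comm and the simp lemmas come from Lean's standard library:

```lean
-- The kernel either accepts these proofs or rejects them; there is no
-- probabilistic middle ground, and re-checking always gives the same verdict.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A small verified program: doubling every element provably preserves length.
def double (xs : List Nat) : List Nat :=
  xs.map (· * 2)

theorem double_preserves_length (xs : List Nat) :
    (double xs).length = xs.length := by
  simp [double]
```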
Key advantages of Lean4’s formal verification:
- Precision and reliability: Formal proofs avoid ambiguity through strict logic, ensuring each reasoning step is valid and results are correct.
- Systematic verification: Lean4 can formally verify that a solution meets all specified conditions or axioms, acting as an objective referee for correctness.
- Transparency and reproducibility: Anyone can independently check a Lean4 proof, and the outcome will be the same – a stark contrast to the opaque reasoning of neural networks.
In essence, Lean4 brings the gold standard of mathematical rigor to computing and AI. It enables us to turn an AI’s claim (“I found a solution”) into a formally checkable proof that is indeed correct. This capability is proving to be a game-changer in several aspects of AI development.
Lean4 as a safety net for LLMs
One of the most exciting intersections of Lean4 and AI is in improving LLM accuracy and safety. Research groups and startups are now combining LLMs’ natural language prowess with Lean4’s formal checks to create AI systems that reason correctly by construction.
Consider the problem of AI hallucinations, when an AI confidently asserts false information. Instead of adding more opaque patches (like heuristic penalties or reinforcement tweaks), why not prevent hallucinations by having the AI prove its statements? That’s exactly what some recent efforts do. For example, a 2025 research framework called Safe uses Lean4 to verify each step of an LLM’s reasoning. The idea is simple but powerful: Each claim in the AI’s chain-of-thought (CoT) is translated into Lean4’s formal language, and the AI (or a proof assistant) must supply a proof. If the proof fails, the system knows the reasoning was flawed – a clear indicator of a hallucination.
This step-by-step formal audit trail dramatically improves reliability, catching mistakes as they happen and providing checkable evidence for every conclusion. The approach has shown “significant performance improvement while offering interpretable and verifiable evidence” of correctness.
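A minimal sketch of this pattern might look as follows. It assumes a hypothetical formalize() call that asks an LLM to translate one English claim into candidate Lean4 source; this illustrates the general verify-each-step idea, not the Safe authors' actual code:

```python
import subprocess
import tempfile
from pathlib import Path

def lean_check(lean_source: str) -> bool:
    """Ask the Lean4 binary to type-check a snippet; True means the
    kernel accepted every proof in it."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_source)
        path = Path(f.name)
    try:
        result = subprocess.run(["lean", str(path)], capture_output=True)
        return result.returncode == 0
    finally:
        path.unlink()

def audit_chain_of_thought(steps: list[str], formalize) -> list[bool]:
    """Flag each reasoning step. `formalize` is the assumed LLM call that
    turns an English claim into Lean4 source; a failed check marks the
    step as a likely hallucination."""
    return [lean_check(formalize(step)) for step in steps]
```

The design choice worth noting: The LLM is never trusted directly; only the Lean4 kernel's verdict counts.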
Another prominent example is Harmonic AI, a startup co-founded by Vlad Tenev (of Robinhood fame) that tackles hallucinations in AI. Harmonic’s system, Aristotle, solves math problems by generating Lean4 proofs for its answers and formally verifying them before responding to the user. “[Aristotle] formally verifies the output… we actually do guarantee that there’s no hallucinations,” Harmonic’s CEO explains. In practical terms, Aristotle writes a solution in Lean4’s language and runs the Lean4 checker. Only if the proof checks out as correct does it present the answer. This yields a “hallucination-free” math chatbot – a bold claim, but one backed by Lean4’s deterministic proof checking.
Crucially, this method isn’t limited to toy problems. Harmonic reports that Aristotle achieved gold-medal-level performance on the 2025 International Math Olympiad problems; the key difference is that its solutions were formally verified, unlike those of other AI models that merely gave answers in English. In other words, where tech giants Google and OpenAI also reached human-champion level on math questions, Aristotle did so with a proof in hand. The takeaway for AI safety is compelling: When an answer comes with a Lean4 proof, you don’t have to trust the AI – you can check it.
This approach could be extended to many domains. We could imagine an LLM assistant for finance that provides an answer only if it can generate a formal proof that it adheres to accounting rules or legal constraints. Or, an AI scientific adviser that outputs a hypothesis alongside a Lean4 proof of consistency with known physics laws. The pattern is the same – Lean4 acts as a rigorous safety net, filtering out incorrect or unverified results. As one AI researcher from Safe put it, “the gold standard for supporting a claim is to provide a proof,” and now AI can attempt exactly that.
Building secure and reliable systems with Lean4
Lean4’s value isn’t confined to pure reasoning tasks; it’s also poised to revolutionize software security and reliability in the age of AI. Bugs and vulnerabilities in software are essentially small logic errors that slip through human testing. What if AI-assisted programming could eliminate those by using Lean4 to verify code correctness?
In formal methods circles, it’s well known that provably correct code can “eliminate entire classes of vulnerabilities [and] mitigate critical system failures.” Lean4 enables writing programs with proofs of properties like “this code never crashes or exposes data.” However, historically, writing such verified code has been labor-intensive and required specialized expertise. Now, with LLMs, there’s an opportunity to automate and scale this process.
Researchers have begun creating benchmarks like VeriBench to push LLMs to generate Lean4-verified programs from ordinary code. Early results show today’s models are not yet up to the task for arbitrary software – in one evaluation, a state-of-the-art model could fully verify only ~12% of given programming challenges in Lean4. Yet, an experimental AI “agent” approach (iteratively self-correcting with Lean feedback) raised that success rate to nearly 60%. This is a promising leap, hinting that future AI coding assistants might routinely produce machine-checkable, bug-free code.
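To give a flavor of that iterative agent loop, here is a sketch under the same assumptions as the earlier snippet (it reuses that snippet's lean_check helper; verify_with_repair and generate are invented names, and a production system would feed Lean's actual error output back to the model):

```python
def verify_with_repair(task: str, generate, max_rounds: int = 5):
    """Iterate until the Lean4 kernel accepts the model's output.
    `generate` is an assumed LLM call that takes the task plus feedback
    from the previous failed attempt."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        if lean_check(candidate):
            return candidate  # kernel-verified program
        feedback = "Lean rejected the last attempt; fix the proof."
    return None  # could not verify within budget; escalate to a human
```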
The strategic significance for enterprises is huge. Imagine being able to ask an AI to write a piece of software and receiving not just the code, but a proof that it is secure and correct by design. Such proofs could guarantee no buffer overflows, no race conditions and compliance with security policies. In sectors like banking, healthcare or critical infrastructure, this could drastically reduce risks. It’s telling that formal verification is already standard in high-stakes fields (for example, verifying the firmware of medical devices or avionics systems). Harmonic’s CEO explicitly notes that similar verification technology is used in “medical devices and aviation” for safety – Lean4 is bringing that level of rigor into the AI toolkit.
Beyond software bugs, Lean4 can encode and verify domain-specific safety rules. For instance, consider AI systems that design engineering projects. A LessWrong forum discussion on AI safety gives the example of bridge design: An AI could propose a bridge structure, and formal systems like Lean can certify that the design obeys all the mechanical engineering safety criteria.
The bridge’s compliance with load tolerances, material strength and design codes becomes a theorem in Lean, which, once proved, serves as an unimpeachable safety certificate. The broader vision is that any AI decision impacting the physical world – from circuit layouts to aerospace trajectories – could be accompanied by a Lean4 proof that it meets specified safety constraints. In effect, Lean4 adds a layer of trust on top of AI outputs: If the AI can’t prove it’s safe or correct, it doesn’t get deployed.
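As a toy illustration of the idea (the Bridge structure, the safety rule and the numbers are all invented for this example, not real engineering criteria):

```lean
structure Bridge where
  capacity : Nat  -- rated load capacity, in kilonewtons
  maxLoad  : Nat  -- worst-case expected load, in kilonewtons

-- Invented safety rule: rated capacity must cover twice the expected load.
abbrev safe (b : Bridge) : Prop := 2 * b.maxLoad ≤ b.capacity

def proposal : Bridge := { capacity := 5000, maxLoad := 2000 }

-- Once proved, this theorem is a machine-checkable safety certificate.
theorem proposal_is_safe : safe proposal := by decide
```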
From big tech to startups: A growing movement
What started in academia as a niche tool for mathematicians is rapidly becoming a mainstream pursuit in AI. Over the last few years, major AI labs and startups alike have embraced Lean4 to push the frontier of reliable AI:
- OpenAI and Meta (2022): Both organizations independently trained AI models to solve high-school olympiad math problems by generating formal proofs in Lean. This was a landmark moment, demonstrating that large models can interface with formal theorem provers and achieve non-trivial results. Meta even made their Lean-enabled model publicly available for researchers. These projects showed that Lean4 can work hand-in-hand with LLMs to tackle problems that demand step-by-step logical rigor.
- Google DeepMind (2024): DeepMind’s AlphaProof system proved mathematical statements in Lean4 at roughly the level of an International Math Olympiad silver medalist. It was the first AI to reach “medal-worthy” performance on formal math competition problems – essentially confirming that AI can achieve top-tier reasoning skills when aligned with a proof assistant. AlphaProof’s success underscored that Lean4 isn’t just a debugging tool; it’s enabling new heights of automated reasoning.
- Startup ecosystem: The aforementioned Harmonic AI is a leading example, raising significant funding ($100M in 2025) to build “hallucination-free” AI by using Lean4 as its backbone. Another effort, DeepSeek, has been releasing open-source Lean4 prover models aimed at democratizing this technology. We’re also seeing academic startups and tools – for example, Lean-based verifiers being integrated into coding assistants, and new benchmarks like FormalStep and VeriBench guiding the research community.
- Community and education: A vibrant community has grown around Lean (the Lean Prover forum, mathlib library), and even famous mathematicians like Terence Tao have started using Lean4 with AI assistance to formalize cutting-edge math results. This melding of human expertise, community knowledge and AI hints at the collaborative future of formal methods in practice.
All these developments point to a convergence: AI and formal verification are no longer separate worlds. The techniques and learnings are cross-pollinating. Each success – whether it’s solving a math theorem or catching a software bug – builds confidence that Lean4 can handle more complex, real-world problems in AI safety and reliability.
Challenges and the road ahead
It’s important to temper excitement with a dose of reality. Lean4’s integration into AI workflows is still in its early days, and there are hurdles to overcome:
- Scalability: Formalizing real-world knowledge or large codebases in Lean4 can be labor-intensive. Lean requires precise specification of problems, which isn’t always straightforward for messy, real-world scenarios. Efforts like auto-formalization (where AI converts informal specs into Lean code) are underway, but more progress is needed to make this seamless for everyday use.
- Model limitations: Current LLMs, even cutting-edge ones, struggle to produce correct Lean4 proofs or programs without guidance. The failure rate on benchmarks like VeriBench shows that generating fully verified solutions is a difficult challenge. Advancing AI’s capabilities to understand and generate formal logic is an active area of research – and success isn’t guaranteed to be quick. However, every improvement in AI reasoning (like better chain-of-thought or specialized training on formal tasks) is likely to boost performance here.
- User expertise: Utilizing Lean4 verification requires a new mindset for developers and decision-makers. Organizations may need to invest in training or new hires who understand formal methods. The cultural shift to insist on proofs might take time, much like the adoption of automated testing or static analysis did in the past. Early adopters will need to showcase wins to convince the broader industry of the ROI.
Despite these challenges, the trajectory is set. As one commentator observed, we are in a race between AI’s expanding capabilities and our ability to harness those capabilities safely. Formal verification tools like Lean4 are among the most promising means to tilt the balance toward safety. They provide a principled way to ensure AI systems do exactly what we intend, no more and no less, with proofs to show it.
Toward provably safe AI
In an era when AI systems are increasingly making decisions that affect lives and critical infrastructure, trust is the scarcest resource. Lean4 offers a path to earn that trust not through promises, but through proof. By bringing formal mathematical certainty into AI development, we can build systems that are verifiably correct, secure, and aligned with our objectives.
From enabling LLMs to solve problems with guaranteed accuracy, to generating software free of exploitable bugs, Lean4’s role in AI is expanding from a research curiosity to a strategic necessity. Tech giants and startups alike are investing in this approach, pointing to a future where saying “the AI seems to be correct” is not enough – we will demand “the AI can show it’s correct.”
For enterprise decision-makers, the message is clear: It’s time to watch this space closely. Incorporating formal verification via Lean4 could become a competitive advantage in delivering AI products that customers and regulators trust. We are witnessing the early steps of AI’s evolution from an intuitive apprentice to a formally validated expert. Lean4 is not a magic bullet for all AI safety concerns, but it is a powerful ingredient in the recipe for safe, deterministic AI that actually does what it’s supposed to do – nothing more, nothing less, nothing incorrect.
As AI continues to advance, those who combine its power with the rigor of formal proof will lead the way in deploying systems that are not only intelligent, but provably reliable.
Dhyey Mavani is accelerating generative AI at LinkedIn.
-
When I first wrote “Vector databases: Shiny object syndrome and the case of a missing unicorn” in March 2024, the industry was awash in hype. Vector databases were positioned as the next big thing — a must-have infrastructure layer for the gen AI era. Billions of venture dollars flowed, developers rushed to integrate embeddings into their pipelines and analysts breathlessly tracked funding rounds for Pinecone, Weaviate, Chroma, Milvus and a dozen others.
The promise was intoxicating: Finally, a way to search by meaning rather than by brittle keywords. Just dump your enterprise knowledge into a vector store, connect an LLM and watch magic happen.
Except the magic never fully materialized.
Two years on, the reality check has arrived: 95% of organizations invested in gen AI initiatives are seeing zero measurable returns. And, many of the warnings I raised back then — about the limits of vectors, the crowded vendor landscape and the risks of treating vector databases as silver bullets — have played out almost exactly as predicted.
Prediction 1: The missing unicorn
Back then, I questioned whether Pinecone — the poster child of the category — would achieve unicorn status or whether it would become the “missing unicorn” of the database world. Today, that question has been answered in the most telling way possible: Pinecone is reportedly exploring a sale, struggling to break out amid fierce competition and customer churn.
Yes, Pinecone raised big rounds and signed marquee logos. But in practice, differentiation was thin. Open-source players like Milvus, Qdrant and Chroma undercut them on cost. Incumbents like Postgres (with pgVector) and Elasticsearch simply added vector support as a feature. And customers increasingly asked: “Why introduce a whole new database when my existing stack already does vectors well enough?”
The result: Pinecone, once valued near a billion dollars, is now looking for a home. The missing unicorn indeed. In September 2025, Pinecone appointed Ash Ashutosh as CEO, with founder Edo Liberty moving to a chief scientist role. The timing is telling: The leadership change comes amid increasing pressure and questions over its long-term independence.
Prediction 2: Vectors alone won’t cut it
I also argued that vector databases by themselves were not an end solution. If your use case required exactness — like searching for “Error 221” in a manual — a pure vector search would gleefully serve up “Error 222” as “close enough.” Cute in a demo, catastrophic in production.
That tension between similarity and relevance has proven fatal to the myth of vector databases as all-purpose engines.
“Enterprises discovered the hard way that semantic ≠ correct.”
Developers who eagerly swapped out lexical search for vectors quickly reintroduced… lexical search in conjunction with vectors. Teams that expected vectors to “just work” ended up bolting on metadata filtering, rerankers and hand-tuned rules. By 2025, the consensus is clear: Vectors are powerful, but only as part of a hybrid stack.
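For the curious, the fusion step itself is simple. Below is a minimal sketch using reciprocal rank fusion (RRF), one common way to merge lexical and vector result lists; lexical_search and vector_search are placeholders for whatever backends you actually run:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first lists of document ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents that rank high in any list accumulate more weight.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical usage, assuming both functions return ids best-first:
# fused = reciprocal_rank_fusion([lexical_search("Error 221"),
#                                 vector_search("Error 221")])
```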
Prediction 3: A crowded field becomes commoditized
The explosion of vector database startups was never sustainable. Weaviate, Milvus (via Zilliz), Chroma, Vespa, Qdrant — each claimed subtle differentiators, but to most buyers they all did the same thing: store vectors and retrieve nearest neighbors.
Today, very few of these players are breaking out. The market has fragmented, commoditized and in many ways been swallowed by incumbents. Vector search is now a checkbox feature in cloud data platforms, not a standalone moat.
Just as I wrote then: Distinguishing one vector DB from another will pose an increasing challenge. That challenge has only grown harder. Vald, Marqo, LanceDB, PostgreSQL, MySQL HeatWave, Oracle 23c, Azure SQL, Cassandra, Redis, Neo4j, SingleStore, Elasticsearch, OpenSearch, Apache Solr… the list goes on.
The new reality: Hybrid and GraphRAG
But this isn’t just a story of decline — it’s a story of evolution. Out of the ashes of vector hype, new paradigms are emerging that combine the best of multiple approaches.
Hybrid Search: Keyword + vector is now the default for serious applications. Companies learned that you need both precision and fuzziness, exactness and semantics. Tools like Apache Solr, Elasticsearch, pgVector and Pinecone’s own “cascading retrieval” embrace this.
GraphRAG: The hottest buzzword of late 2024/2025 is GraphRAG — graph-enhanced retrieval augmented generation. By marrying vectors with knowledge graphs, GraphRAG encodes the relationships between entities that embeddings alone flatten away. The payoff is dramatic.
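To make the idea concrete, here is a hedged sketch of a GraphRAG-style retrieval step. The vector_index and graph objects are assumed interfaces, not any particular product's API:

```python
def graphrag_context(query: str, vector_index, graph, hops: int = 1) -> list[str]:
    """Assumed interfaces: vector_index.search(query, top_k) returns chunks
    with .text and .entities; graph.neighbors(e) returns related entities;
    graph.edge_text(a, b) returns a sentence describing their relationship."""
    seeds = vector_index.search(query, top_k=5)  # semantic entry points
    entities = {e for chunk in seeds for e in chunk.entities}
    relations: list[str] = []
    for _ in range(hops):  # walk outward through the entity graph
        for a in list(entities):
            for b in graph.neighbors(a):
                relations.append(graph.edge_text(a, b))
                entities.add(b)
    # The LLM sees both raw passages and the relationships that
    # embeddings alone flatten away.
    return [chunk.text for chunk in seeds] + relations
```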
Benchmarks and evidence
- Amazon’s AI blog cites benchmarks from Lettria, where hybrid GraphRAG boosted answer correctness from ~50% to 80%-plus in test datasets across finance, healthcare, industry and law.
- The GraphRAG-Bench benchmark (released May 2025) provides a rigorous evaluation of GraphRAG vs. vanilla RAG across reasoning tasks, multi-hop queries and domain challenges.
- An OpenReview evaluation of RAG vs. GraphRAG found that each approach has strengths depending on the task — but hybrid combinations often perform best.
- FalkorDB’s blog reports that when schema precision matters (structured domains), GraphRAG can outperform vector retrieval by a factor of ~3.4x on certain benchmarks.
The rise of GraphRAG underscores the larger point: Retrieval is not about any single shiny object. It’s about building retrieval systems — layered, hybrid, context-aware pipelines that give LLMs the right information, with the right precision, at the right time.
What this means going forward
The verdict is in: Vector databases were never the miracle. They were a step — an important one — in the evolution of search and retrieval. But they are not, and never were, the endgame.
The winners in this space won’t be those who sell vectors as a standalone database. They will be the ones who embed vector search into broader ecosystems — integrating graphs, metadata, rules and context engineering into cohesive platforms.
In other words: The unicorn isn’t the vector database. The unicorn is the retrieval stack.
Looking ahead: What’s next
- Unified data platforms will subsume vector + graph: Expect major DB and cloud vendors to offer integrated retrieval stacks (vector + graph + full-text) as built-in capabilities.
- “Retrieval engineering” will emerge as a distinct discipline: Just as MLOps matured, so too will practices around embedding tuning, hybrid ranking and graph construction.
- Meta-models learning to query better: Future LLMs may learn to orchestrate which retrieval method to use per query, dynamically adjusting weighting (see the sketch after this list).
- Temporal and multimodal GraphRAG: Already, researchers are extending GraphRAG to be time-aware (T-GRAG) and multimodally unified (e.g. connecting images, text and video).
- Open benchmarks and abstraction layers: Tools like BenchmarkQED (for RAG benchmarking) and GraphRAG-Bench will push the community toward fairer, comparably measured systems.
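As a thought experiment on that meta-model point, a per-query router might look something like this sketch; the labels, the weights and the classify call are all invented for illustration:

```python
def route_query(query: str, classify) -> dict[str, float]:
    """Toy router: `classify` is an assumed LLM or small model that labels
    the query; the weights say how much each retriever contributes when
    the result lists are fused."""
    label = classify(query)  # e.g. "exact", "conceptual" or "relational"
    weights = {
        "exact":      {"lexical": 0.7, "vector": 0.2, "graph": 0.1},
        "conceptual": {"lexical": 0.2, "vector": 0.6, "graph": 0.2},
        "relational": {"lexical": 0.1, "vector": 0.3, "graph": 0.6},
    }
    # Fall back to an even blend when the classifier is unsure.
    return weights.get(label, {"lexical": 1/3, "vector": 1/3, "graph": 1/3})
```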
From shiny objects to essential infrastructure
The arc of the vector database story has followed a classic path: A pervasive hype cycle, followed by introspection, correction and maturation. In 2025, vector search is no longer the shiny object everyone pursues blindly — it’s now a critical building block within a more sophisticated, multi-pronged retrieval architecture.
The original warnings were right. Pure vector-based hopes often crash on the shoals of precision, relational complexity and enterprise constraints. Yet the technology was never wasted: It forced the industry to rethink retrieval, blending semantic, lexical and relational strategies.
If I were to write a sequel in 2027, I suspect it would frame vector databases not as unicorns, but as legacy infrastructure — foundational, but eclipsed by smarter orchestration layers, adaptive retrieval controllers and AI systems that dynamically choose which retrieval tool fits the query.
As of now, the real battle is not vector vs. keyword — it’s the orchestration, blending and discipline in building retrieval pipelines that reliably ground gen AI in facts and domain knowledge. That’s the unicorn we should be chasing now.
Amit Verma is head of engineering and AI Labs at Neuron7.