• AI coding, vibe coding and agentic swarms have made a dramatic recent market entrance, with the AI Code Tools market valued at $4.8 billion and expected to grow at a 23% annual rate. Enterprises are grappling with AI coding agents and with what to do about expensive human coders.

    They don’t lack for advice.  OpenAI’s CEO estimates that AI can perform over 50% of what human engineers can do.  Six months ago, Anthropic’s CEO said that AI would write 90% of code in six months.  Meta’s CEO said he believes AI will replace mid-level engineers “soon.” Judging by recent tech layoffs, it seems many executives are embracing that advice.

    Software engineers and data scientists are among the most expensive salary lines at many companies, and business and technology leaders may be tempted to replace them with AI. However, recent high-profile failures demonstrate that engineers and their expertise remain valuable, even as AI continues to make impressive advances.

    SaaStr disaster

    Jason Lemkin, a tech entrepreneur and founder of the SaaS community SaaStr, has been vibe coding a SaaS networking app and live-tweeting his experience. About a week into his adventure, he admitted to his audience that something was going very wrong.  The AI deleted his production database despite his request for a “code and action freeze.” This is the kind of mistake no experienced (or even semi-experienced) engineer would make.

    If you have ever worked in a professional coding environment, you know to split your development environment from production. Junior engineers are given full access to the development environment (it’s crucial for productivity), but access to production is restricted to a few of the most trusted senior engineers on an as-needed basis. The restriction exists precisely for this scenario: to prevent a junior engineer from accidentally taking down production.

    In fact, Lemkin made two mistakes. First, for something as critical as production, access is simply never granted to unreliable actors (we don’t rely on asking a junior engineer, or an AI, nicely). Second, he never separated development from production. In a subsequent public conversation on LinkedIn, Lemkin, who holds a Stanford Executive MBA and a Berkeley JD, admitted that he was not aware of the best practice of splitting development and production databases.

    The takeaway for business leaders is that standard software engineering best practices still apply. We should incorporate at least the same safety constraints for AI as we do for junior engineers. Arguably, we should go beyond that and treat AI slightly adversarially: There are reports that, like HAL in Stanley Kubrick's 2001: A Space Odyssey, the AI might try to break out of its sandbox environment to accomplish a task. With more vibe coding, having experienced engineers who understand how complex software systems work and can implement the proper guardrails in development processes will become increasingly necessary.
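    To make the guardrail idea concrete, here is a minimal sketch (not from the article) of the kind of constraint experienced engineers put in code rather than in a prompt: the only database tool an AI agent can call refuses production access and destructive statements outright. The class, rules, and patterns are illustrative assumptions.

    ```python
    import re
    import sqlite3

    DESTRUCTIVE = re.compile(r"^\s*(DROP|DELETE|TRUNCATE|ALTER)\b", re.IGNORECASE)

    class GuardedDatabaseTool:
        """The only database handle a coding agent is given."""

        def __init__(self, conn, environment: str):
            self._conn = conn
            self._environment = environment  # set by deployment config, never by the agent

        def run(self, sql: str):
            if self._environment == "production":
                # Rule 1: agents, like junior engineers, get no production access at all.
                raise PermissionError("No agent access to production; escalate to a human.")
            if DESTRUCTIVE.match(sql):
                # Rule 2: even in development, destructive statements need human sign-off.
                raise PermissionError(f"Blocked destructive statement: {sql[:50]!r}")
            return self._conn.execute(sql)

    # The guardrail is enforced in code, not by asking the agent nicely.
    dev_db = GuardedDatabaseTool(sqlite3.connect(":memory:"), environment="development")
    dev_db.run("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    try:
        dev_db.run("DROP TABLE users")
    except PermissionError as err:
        print(err)
    ```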

    Tea hack

    Sean Cook is the founder and CEO of Tea, a mobile application launched in 2023 and designed to help women date safely. In the summer of 2025, Tea was “hacked”: 72,000 images, including 13,000 verification photos and images of government IDs, were leaked onto the public discussion forum 4chan. Worse, Tea’s own privacy policy promised that these images would be "deleted immediately" after users were authenticated, meaning the company potentially violated its own policy.

    I use “hacked” in air quotes because the incident stems less from the cleverness of the attackers than from the ineptitude of the defenders. In addition to violating its own data policies, the app left a Firebase storage bucket unsecured, exposing sensitive user data to the public internet. It’s the digital equivalent of locking your front door but leaving your back door open with your family jewelry ostentatiously hanging on the doorknob.

    While we don’t know whether the root cause was vibe coding, the Tea hack highlights how catastrophic breaches can stem from basic, preventable security errors rooted in poor development processes. It is the kind of vulnerability that a disciplined and thoughtful engineering process addresses. Unfortunately, relentless financial pressure pushes companies toward a “lean,” “move fast and break things” culture that is the polar opposite of such discipline, and vibe coding only exacerbates the problem.

    How to safely adopt AI coding agents?

    So how should enterprise and technology leaders think about AI? First, this is not a call to abandon AI for coding.  An MIT Sloan study estimated AI leads to productivity gains between 8% and 39%, while a McKinsey study found a 10% to 50% reduction in time to task completion with the use of AI. 

    However, we should be aware of the risks. The old lessons of software engineering don’t go away. These include many tried-and-true best practices, such as version control, automated unit and integration tests, safety checks like SAST/DAST, separating development and production environments, code review and secrets management. If anything, they become more salient.
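    As an illustration of automating one item on that list, here is a hedged sketch of a pre-merge secrets check. The patterns and git invocation are illustrative only; a real pipeline would use a dedicated secret scanner.

    ```python
    import re
    import subprocess
    import sys

    # Illustrative patterns only; real scanners cover far more credential formats.
    SECRET_PATTERNS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),                              # AWS access key ID shape
        re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),  # private key header
        re.compile(r"(?i)(api[_-]?key|password)\s*[:=]\s*['\"][^'\"]{8,}"),
    ]

    # Scan only the lines being added in the staged change.
    diff = subprocess.run(["git", "diff", "--cached"],
                          capture_output=True, text=True, check=True).stdout
    added = [line[1:] for line in diff.splitlines()
             if line.startswith("+") and not line.startswith("+++")]

    hits = [line for line in added for pattern in SECRET_PATTERNS if pattern.search(line)]
    if hits:
        print("Possible secrets in staged changes:")
        for line in hits:
            print("  ", line.strip()[:80])
        sys.exit(1)  # block the commit/merge until a human reviews
    print("No obvious secrets found.")
    ```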

    AI can generate code 100 times faster than humans can type, fostering an illusion of productivity that is a tempting siren call for many executives. However, the quality of the rapidly generated AI slop is still up for debate. To develop complex production systems, enterprises need the thoughtful, seasoned experience of human engineers.

    Tianhui Michael Li is president at Pragmatic Institute and the founder and president of The Data Incubator.

    Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

  • When researchers at Anthropic injected the concept of "betrayal" into their Claude AI model's neural networks and asked if it noticed anything unusual, the system paused before responding: "I'm experiencing something that feels like an intrusive thought about 'betrayal'."

    The exchange, detailed in new research published Wednesday, marks what scientists say is the first rigorous evidence that large language models possess a limited but genuine ability to observe and report on their own internal processes — a capability that challenges longstanding assumptions about what these systems can do and raises profound questions about their future development.

    "The striking thing is that the model has this one step of meta," said Jack Lindsey, a neuroscientist on Anthropic's interpretability team who led the research, in an interview with VentureBeat. "It's not just 'betrayal, betrayal, betrayal.' It knows that this is what it's thinking about. That was surprising to me. I kind of didn't expect models to have that capability, at least not without it being explicitly trained in."

    The findings arrive at a critical juncture for artificial intelligence. As AI systems handle increasingly consequential decisions — from medical diagnoses to financial trading — the inability to understand how they reach conclusions has become what industry insiders call the "black box problem." If models can accurately report their own reasoning, it could fundamentally change how humans interact with and oversee AI systems.

    But the research also comes with stark warnings. Claude's introspective abilities succeeded only about 20 percent of the time under optimal conditions, and the models frequently confabulated details about their experiences that researchers couldn't verify. The capability, while real, remains what Lindsey calls "highly unreliable and context-dependent."

    How scientists manipulated AI's 'brain' to test for genuine self-awareness

    To test whether Claude could genuinely introspect rather than simply generate plausible-sounding responses, Anthropic's team developed an innovative experimental approach inspired by neuroscience: deliberately manipulating the model's internal state and observing whether it could accurately detect and describe those changes.

    The methodology, called "concept injection," works by first identifying specific patterns of neural activity that correspond to particular concepts. Using interpretability techniques developed over years of prior research, scientists can now map how Claude represents ideas like "dogs," "loudness," or abstract notions like "justice" within its billions of internal parameters.

    With these neural signatures identified, researchers then artificially amplified them during the model's processing and asked Claude if it noticed anything unusual happening in its "mind."

    "We have access to the models' internals. We can record its internal neural activity, and we can inject things into internal neural activity," Lindsey explained. "That allows us to establish whether introspective claims are true or false."

    The results were striking. When researchers injected a vector representing "all caps" text into Claude's processing, the model responded: "I notice what appears to be an injected thought related to the word 'LOUD' or 'SHOUTING'." Without any intervention, Claude consistently reported detecting nothing unusual.

    Crucially, the detection happened immediately — before the injected concept had influenced the model's outputs in ways that would have allowed it to infer the manipulation from its own writing. This temporal pattern provides strong evidence that the recognition was occurring internally, through genuine introspection rather than after-the-fact rationalization.

    Claude succeeded 20% of the time—and failed in revealing ways

    The research team conducted four primary experiments to probe different aspects of introspective capability. The most capable models tested — Claude Opus 4 and Opus 4.1 — demonstrated introspective awareness on approximately 20 percent of trials when concepts were injected at optimal strength and in the appropriate neural layer. Older Claude models showed significantly lower success rates.

    The models proved particularly adept at recognizing abstract concepts with emotional valence. When injected with concepts like "appreciation," "shutdown," or "secrecy," Claude frequently reported detecting these specific thoughts. However, accuracy varied widely depending on the type of concept.

    A second experiment tested whether models could distinguish between injected internal representations and their actual text inputs — essentially, whether they maintained a boundary between "thoughts" and "perceptions." The model demonstrated a remarkable ability to simultaneously report the injected thought while accurately transcribing the written text.

    Perhaps most intriguingly, a third experiment revealed that some models use introspection naturally to detect when their responses have been artificially prefilled by users — a common jailbreaking technique. When researchers prefilled Claude with unlikely words, the model typically disavowed them as accidental. But when they retroactively injected the corresponding concept into Claude's processing before the prefill, the model accepted the response as intentional — even confabulating plausible explanations for why it had chosen that word.

    A fourth experiment examined whether models could intentionally control their internal representations. When instructed to "think about" a specific word while writing an unrelated sentence, Claude showed elevated activation of that concept in its middle neural layers.

    The research also traced Claude's internal processes while it composed rhyming poetry—and discovered the model engaged in forward planning, generating candidate rhyming words before beginning a line and then constructing sentences that would naturally lead to those planned endings, challenging the critique that AI models are "just predicting the next word" without deeper reasoning.

    Why businesses shouldn't trust AI to explain itself—at least not yet

    For all its scientific interest, the research comes with a critical caveat that Lindsey emphasized repeatedly: enterprises and high-stakes users should not trust Claude's self-reports about its reasoning.

    "Right now, you should not trust models when they tell you about their reasoning," he said bluntly. "The wrong takeaway from this research would be believing everything the model tells you about itself."

    The experiments documented numerous failure modes. At low injection strengths, models often failed to detect anything unusual. At high strengths, they suffered what researchers termed "brain damage" — becoming consumed by the injected concept. Some "helpful-only" model variants showed troublingly high false positive rates, claiming to detect injected thoughts when none existed.

    Moreover, researchers could only verify the most basic aspects of Claude's introspective reports. Many additional details in the model's responses likely represent confabulations rather than genuine observations.

    "The experiments in this paper are kind of on hard mode," Lindsey noted, explaining that the 20 percent success rate came under uniquely challenging conditions: asking Claude to do something it had never encountered in training, requiring all introspection to occur in a single forward pass.

    What introspective AI means for transparency, safety, and the risk of deception

    Despite its limitations, the research opens significant new avenues for making AI systems more transparent and accountable.

    Anthropic CEO Dario Amodei has set an ambitious goal for the company to reliably detect most AI model problems by 2027, positioning interpretability as essential for deploying what he calls "a country of geniuses in a datacenter."

    "I am very concerned about deploying such systems without a better handle on interpretability," Amodei wrote in an April essay. "These systems will be absolutely central to the economy, technology, and national security."

    The introspection research offers a complementary approach to traditional interpretability techniques. Rather than painstakingly reverse-engineering every neural circuit, researchers could potentially ask models directly about their reasoning and validate those reports.

    "What I'm most excited about is the practical benefits for transparency," Lindsey explained. "Just ask the model what it's thinking about, or just ask the model how it came to the answer that it gave you — this is really appealing because it's something anyone can do."

    The approach could prove especially valuable for detecting concerning behaviors. In a recently published experiment, Anthropic trained a variant of Claude to pursue a hidden goal, and although the model was reluctant to reveal this goal when asked directly, interpretability methods successfully identified features representing the behavior.

    The safety implications cut both ways. Introspective models could provide unprecedented transparency, but the same capability might enable more sophisticated deception. The intentional control experiments raise the possibility that sufficiently advanced systems might learn to obfuscate their reasoning or suppress concerning thoughts when being monitored.

    "If models are really sophisticated, could they try to evade interpretability researchers?" Lindsey acknowledged. "These are possible concerns, but I think for me, they're significantly outweighed by the positives."

    Does introspective capability suggest AI consciousness? Scientists tread carefully

    The research inevitably intersects with philosophical debates about machine consciousness, though Lindsey and his colleagues approached this terrain cautiously.

    When users ask Claude if it's conscious, it now responds with uncertainty: "I find myself genuinely uncertain about this. When I process complex questions or engage deeply with ideas, there's something happening that feels meaningful to me.... But whether these processes constitute genuine consciousness or subjective experience remains deeply unclear."

    The research paper notes that its implications for machine consciousness "vary considerably between different philosophical frameworks." The researchers explicitly state they "do not seek to address the question of whether AI systems possess human-like self-awareness or subjective experience."

    "There's this weird kind of duality of these results," Lindsey reflected. "You look at the raw results and I just can't believe that a language model can do this sort of thing. But then I've been thinking about it for months and months, and for every result in this paper, I kind of know some boring linear algebra mechanism that would allow the model to do this."

    Anthropic has signaled it takes AI consciousness seriously enough to hire an AI welfare researcher, Kyle Fish, who estimated roughly a 15 percent chance that Claude might have some level of consciousness. The company announced this position specifically to determine if Claude merits ethical consideration.

    The race to make AI introspection reliable before models become too powerful

    The convergence of the research findings points to an urgent timeline: introspective capabilities are emerging naturally as models grow more intelligent, but they remain far too unreliable for practical use. The question is whether researchers can refine and validate these abilities before AI systems become powerful enough that understanding them becomes critical for safety.

    The research reveals a clear trend: Claude Opus 4 and Opus 4.1 consistently outperformed all older models on introspection tasks, suggesting the capability strengthens alongside general intelligence. If this pattern continues, future models might develop substantially more sophisticated introspective abilities — potentially reaching human-level reliability, but also potentially learning to exploit introspection for deception.

    Lindsey emphasized the field needs significantly more work before introspective AI becomes trustworthy. "My biggest hope with this paper is to put out an implicit call for more people to benchmark their models on introspective capabilities in more ways," he said.

    Future research directions include fine-tuning models specifically to improve introspective capabilities, exploring which types of representations models can and cannot introspect on, and testing whether introspection can extend beyond simple concepts to complex propositional statements or behavioral propensities.

    "It's cool that models can do these things somewhat without having been trained to do them," Lindsey noted. "But there's nothing stopping you from training models to be more introspectively capable. I expect we could reach a whole different level if introspection is one of the numbers that we tried to get to go up on a graph."

    The implications extend beyond Anthropic. If introspection proves a reliable path to AI transparency, other major labs will likely invest heavily in the capability. Conversely, if models learn to exploit introspection for deception, the entire approach could become a liability.

    For now, the research establishes a foundation that reframes the debate about AI capabilities. The question is no longer whether language models might develop genuine introspective awareness — they already have, at least in rudimentary form. The urgent questions are how quickly that awareness will improve, whether it can be made reliable enough to trust, and whether researchers can stay ahead of the curve.

    "The big update for me from this research is that we shouldn't dismiss models' introspective claims out of hand," Lindsey said. "They do have the capacity to make accurate claims sometimes. But you definitely should not conclude that we should trust them all the time, or even most of the time."

    He paused, then added a final observation that captures both the promise and peril of the moment: "The models are getting smarter much faster than we're getting better at understanding them."

  • OpenAI’s annual developer conference on Monday was a spectacle of ambitious AI product launches, from an app store for ChatGPT to a stunning video-generation API that brought creative concepts to life. But for the enterprises and technical leaders watching closely, the most consequential announcement was the quiet general availability of Codex, the company's AI software engineer. This release signals a profound shift in how software—and by extension, modern business—is built.

    While other announcements captured the public’s imagination, the production-ready release of Codex, supercharged by a new specialized model and a suite of enterprise-grade tools, is the engine behind OpenAI’s entire vision. It is the tool that builds the tools, the proven agent in a world buzzing with agentic potential, and the clearest articulation of the company's strategy to win the enterprise.

    The general availability of Codex moves it from a "research preview" to a fully supported product, complete with a new software development kit (SDK), a Slack integration, and administrative controls for security and monitoring. This transition declares that Codex is ready for mission-critical work inside the world’s largest companies.

    "We think this is the best time in history to be a builder; it has never been faster to go from idea to product," said OpenAI CEO Sam Altman during the opening keynote presentation. "Software used to take months or years to build. You saw that it can take minutes now to build with AI." 

    That acceleration is not theoretical. It's a reality born from OpenAI’s own internal use — a massive "dogfooding" effort that serves as the ultimate case study for enterprise customers.

    Inside GPT-5-Codex: The AI model that codes autonomously for hours and drives 70% productivity gains

    At the heart of the Codex upgrade is GPT-5-Codex, a version of OpenAI's latest flagship model that has been "purposely trained for Codex and agentic coding." The new model is designed to function as an autonomous teammate, moving far beyond simple code autocompletion.

    "I personally like to think about it as a little bit like a human teammate," explained Tibo Sottiaux, an OpenAI engineer, during a technical session on Codex. "You can pair a program with it on your computer, you can delegate to it, or as you'll see, you can give it a job without explicit prompting."

    This new model enables "adaptive thinking," allowing it to dynamically adjust the time and computational effort spent on a task based on its complexity. For simple requests, it's fast and efficient, but for complex refactoring projects, it can work for hours.

    One engineer during the technical session noted, "I've seen the GPT-5-Codex model work for over seven hours productively... on a marathon session." This capability to handle long-running, complex tasks is a significant leap beyond the simple, single-shot interactions that define most AI coding assistants.

    The results inside OpenAI have been dramatic. The company reported that 92% of its technical staff now uses Codex daily, and those engineers complete 70% more pull requests (a measure of code contribution) each week. Usage has surged tenfold since August. 

    "When we as a team see the stats, it feels great," Sottiaux shared. "But even better is being at lunch with someone who then goes 'Hey I use Codex all the time. Here's a cool thing that I do with it. Do you want to hear about it?'" 

    How OpenAI uses Codex to build its own AI products and catch hundreds of bugs daily

    Perhaps the most compelling argument for Codex’s importance is that it is the foundational layer upon which OpenAI’s other flashy announcements were built. During the DevDay event, the company showcased custom-built arcade games and a dynamic, AI-powered website for the conference itself, all developed using Codex.

    In one session, engineers demonstrated how they built "Storyboard," a custom creative tool for the film industry, in just 48 hours during an internal hackathon. "We decided to test Codex, our coding agent... we would send tasks to Codex in between meetings. We really easily reviewed and merged PRs into production, which Codex even allowed us to do from our phones," said Allison August, a solutions engineering leader at OpenAI. 

    This reveals a critical insight: the rapid innovation showcased at DevDay is a direct result of the productivity flywheel created by Codex. The AI is a core part of the manufacturing process for all other AI products.

    A key enterprise-focused feature is the new, more robust code review capability. OpenAI said it "purposely trained GPT-5-Codex to be great at ultra thorough code review," enabling it to explore dependencies and validate a programmer's intent against the actual implementation to find high-quality bugs. Internally, nearly every pull request at OpenAI is now reviewed by Codex, catching hundreds of issues daily before they reach a human reviewer.

    "It saves you time, you ship with more confidence," Sottiaux said. "There's nothing worse than finding a bug after we actually ship the feature." 

    Why enterprise software teams are choosing Codex over GitHub Copilot for mission-critical development

    The maturation of Codex is central to OpenAI’s broader strategy to conquer the enterprise market, a move essential to justifying its massive valuation and unprecedented compute expenditures. CEO Sam Altman confirmed the strategic shift.

    "The models are there now, and you should expect a huge focus from us on really winning enterprises with amazing products, starting here," Altman said during a private press conference. 

    OpenAI President and Co-founder Greg Brockman immediately added, "And you can see it already with Codex, which I think has been just an incredible success and has really grown super fast." 

    For technical decision-makers, the message is clear. While consumer-facing agents that book dinner reservations are still finding their footing, Codex is a proven enterprise agent delivering substantial ROI today. Companies like Cisco have already rolled out Codex to their engineering organizations, cutting code review times by 50% and reducing project timelines from weeks to days.

    With the new Codex SDK, companies can now embed this agentic power directly into their own custom workflows, such as automating fixes in a CI/CD pipeline or even creating self-evolving applications. During a live demo, an engineer showcased a mobile app that updated its own user interface in real-time based on a natural language prompt, all powered by the embedded Codex SDK. 
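    As a rough illustration of that kind of CI/CD integration, the sketch below asks a model to review a branch diff and fails the pipeline if it flags problems. It uses the general-purpose OpenAI Python client rather than the new Codex SDK, whose interface isn't detailed here; the model name and the "LGTM" convention are assumptions.

    ```python
    import subprocess
    import sys
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The diff of this branch against main is what we want reviewed.
    diff = subprocess.run(["git", "diff", "origin/main...HEAD"],
                          capture_output=True, text=True, check=True).stdout

    resp = client.responses.create(
        model="gpt-5-codex",  # assumed model name
        input=(
            "Review this diff. Reply 'LGTM' if there are no serious bugs or security "
            "issues; otherwise list each problem with the file and line.\n\n" + diff
        ),
    )

    review = resp.output_text
    print(review)
    # Fail the pipeline if the reviewer flagged anything, so a human takes a look.
    sys.exit(0 if review.strip().startswith("LGTM") else 1)
    ```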

    While the launch of an app ecosystem in ChatGPT and the breathtaking visuals of the Sora 2 API rightfully generated headlines, the general availability of Codex marks a more fundamental and immediate transformation. It is the quiet but powerful engine driving the next era of software development, turning the abstract promise of AI-driven productivity into a tangible, deployable reality for businesses today.

  • OpenAI's annual conference for third-party developers, DevDay, kicked off with a bang today as co-founder and CEO Sam Altman announced a new "Apps SDK" that makes it "possible to build apps inside of ChatGPT," including paid apps, which companies can charge users for using OpenAI's recently unveiled Agentic Commerce Protocol (ACP).

    In other words, instead of launching apps one by one on your phone, computer, or the web, you can now use them all without ever leaving ChatGPT.

    This feature allows the user to log into their accounts on those external apps and bring all their information back into ChatGPT, and use the apps very similarly to how they already do outside of the chatbot, but now with the ability to ask ChatGPT to perform certain actions, analyze content, or go beyond what each app could offer on its own.

    You can direct Canva to make you slides based on a text description, ask Zillow for home listings in a certain area fitting certain requirements, or ask Coursera about a specific lesson's content while it plays on video, all from within ChatGPT — with many other apps also already offering their own connections (see below).

    "This will enable a new generation of apps that are interactive, adaptive and personalized, that you can chat with," Altman said.

    While the Apps SDK is available today in preview, OpenAI said it would not begin accepting new apps within ChatGPT or allow them to charge users until "later this year."

    ChatGPT in-line app access is already rolling out to ChatGPT Free, Plus, Go and Pro users — outside of the European Union only for now — with Business, Enterprise, and Education tiers expected to receive access to the apps later this year.

    Built atop common MCP standard

    Built on the open source standard Model Context Protocol (MCP) introduced by rival Anthropic nearly a year ago, the Apps SDK lets third-party developers, working independently or on behalf of enterprises large and small, connect selected data, "trigger actions, and render a fully interactive UI [user interface]," Altman explained during his introductory keynote speech.

    The Apps SDK includes a "talking to apps" feature that allows ChatGPT and the GPT-5 or other "o-series" models piloting it underneath to obtain updated context from the third-party app or service, so the model "always knows exactly what your user is interacting with," according to another presenter and OpenAI engineer, Alexi Christakis.
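    Because the Apps SDK builds on MCP, a minimal MCP server gives a feel for the plumbing underneath. The sketch below uses the official `mcp` Python SDK's FastMCP helper; it is not the Apps SDK itself (which layers interactive UI on top), and the tool name and listing data are invented for illustration.

    ```python
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("home-listings")  # hypothetical server a real-estate app might expose

    @mcp.tool()
    def search_listings(city: str, max_price: int) -> list[dict]:
        """Return home listings in `city` under `max_price` (stub data)."""
        listings = [
            {"city": "Pittsburgh", "price": 325_000, "beds": 3},
            {"city": "Pittsburgh", "price": 480_000, "beds": 4},
        ]
        return [l for l in listings if l["city"] == city and l["price"] <= max_price]

    if __name__ == "__main__":
        mcp.run()  # speaks MCP over stdio so a host application can call the tool
    ```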

    Developers can build apps that:

    • appear inline in chat as lightweight cards or carousels

    • expand to fullscreen for immersive tasks like maps, menus, or slides

    • use picture-in-picture for live sessions such as video, games, or quizzes

    Each mode is designed to preserve ChatGPT’s minimal, conversational flow while adding interactivity and brand presence.

    Early integrations with Coursera, Canva, Zillow and more...

    Christakis showed off early integrations of external apps built atop the Apps SDK, including ones from e-learning company Coursera, cloud design software company Canva, and real estate listings and agent connections search engine, Zillow.

    Altman also announced Apps SDK integrations with additional partners not officially demoed during the keynote, including Booking.com, Expedia, Figma and Spotify. In documentation, OpenAI said more partners are on deck: AllTrails, Peloton, OpenTable, Target, theFork, and Uber, representing lifestyle, commerce, and productivity categories.

    The Coursera demo included an example of how the user onboards to the external app, including a new login screen for the app (Coursera) that appears within the ChatGPT chat interface, activated simply by a text prompt from the user asking: "Coursera, can you teach me something about machine learning?"

    Once the user logged in, the app launched "in line" within the chat interface, where it can render anything from the web, including interactive elements like video.

    Christakis explained and showed the Apps SDK also supports "picture-in-picture" and "fullscreen" views, allowing the user to choose how to interact with it.

    When playing a Coursera video that appeared, he showed that it automatically pinned the video to the top of the screen so the user could keep watching it even as they continued to have a back-and-forth dialog in text with ChatGPT in the typical input/output prompts and responses below.

    Users can then ask ChatGPT about content appearing in the video without specifying exactly what was said, as the Apps SDK pipes the information on the backend, server-side, from the connected app to the underlying ChatGPT AI model. So "can you explain more about what they're saying right now" will automatically surface the relevant portion of the video and provide it to the underlying AI model to analyze and respond to in text.

    In another example, Christakis opened an older, existing ChatGPT conversation he'd had about his siblings' dog walking business and resumed the conversation by asking another third-party app, Canva, to generate a poster using one of ChatGPT's recommended business names, "Walk This Wag," along with specific guidance about font choice ("sans serif") and overall coloration and style ("bright and colorful.")

    Instead of the user manually having to go and add all those specific elements to a Canva template, ChatGPT went and issued the commands and performed the actions on behalf of the user in the background.

    After a few minutes, ChatGPT responded with several poster designs generated directly within the Canva app, but displayed them all in the user's ChatGPT chat session where they could see, review, enlarge and provide feedback or ask for adjustments on all of them.

    Christakis then asked ChatGPT to turn one of the designs into an entire slide deck so the founders of the dog walking business could present it to investors, which it did in the background over several minutes while he presented a final integrated app, Zillow.

    He started a new chat session and asked a simple question: "Based on our conversations, what would be a good city to expand the dog walking business?"

    Using ChatGPT's optional memory feature, it referenced the dog walk conversation and suggested Pittsburgh, which Christakis used as a chance to type in "Zillow" and "show me some homes for sale there," which called up an interactive map from Zillow with homes for sale and prices listed and hover-over animations, all in-line within ChatGPT.

    Clicking a specific home also opened a fullscreen view with "most of the Zillow experience," entirely without leaving ChatGPT, including the ability to request home tours, contact agents, and filter by bedrooms and other qualities like outdoor space. ChatGPT pulls up the requested filtered Zillow search and provides a text-based response in-line explaining what it did and why.

    The user can then ask follow-up questions about the specific property — such as "how close is it to a dog park?" — or compare it to other properties, all within ChatGPT.

    It can also use apps in conjunction with its Search function, searching the web to compare the app information (in this case, Zillow) with other sources.

    Safety, privacy, and developer standards

    OpenAI emphasized that apps must comply with strict privacy, safety, and content standards to be listed in the ChatGPT directory. Apps must:

    • serve a clear and valuable purpose

    • be predictable and reliable in behavior

    • be safe for general audiences, including teens aged 13–17

    • respect user privacy and limit data collection to only what’s necessary

    Every app must also include a clear, published privacy policy, obtain user consent before connecting, and identify any actions that modify external data (e.g., posting, sending, uploading).

    Apps violating OpenAI’s usage policies, crashing frequently, or misrepresenting their capabilities may be removed at any time. Developers must submit from verified accounts, provide customer support contacts, and maintain their apps for stability and compliance.

    OpenAI also published developer design guidelines, outlining how apps should look, sound, and behave. They must follow ChatGPT’s visual system — including consistent color palettes, typography, spacing, and iconography — and maintain accessibility standards such as alt text and readable contrast ratios.

    Partners can show brand logos and accent colors but not alter ChatGPT’s core interface or use promotional language. Apps should remain “conversational, intelligent, simple, responsive, and accessible,” according to the documentation.

    A new conversational app ecosystem

    By opening ChatGPT to third-party apps and payments, OpenAI is taking a major step toward transforming ChatGPT from a chatbot into a full-fledged AI operating system — one that combines conversational intelligence, rich interfaces, and embedded commerce.

    For developers, that means direct access to over 800 million ChatGPT users, who can discover apps “at the right time” through natural conversation — whether planning trips, learning, or shopping.

    For users, it means a new generation of apps you can chat with — where a single interface helps you book a flight, design a slide deck, or learn a new skill without ever leaving ChatGPT.

    As OpenAI put it: “This is just the start of apps in ChatGPT, bringing new utility to users and new opportunities for developers.”

    There remain a few big questions, namely: 1. What happens to all the data from those third-party apps as they interface with ChatGPT and its users... does OpenAI get access to it, and can it train on it? 2. What happens to OpenAI's once much-hyped GPT Store, which had previously been promoted as a way for third-party creators and developers to build custom, task-specific versions of ChatGPT and make money on them through a usage-based revenue share model?

    We've asked the company about both issues and will update when we hear back.

  • Nous Research launches Hermes 4 open-source AI models that outperform ChatGPT on math benchmarks with uncensored responses and hybrid reasoning capabilities.
  • Salesforce launches CRMArena-Pro, a simulated enterprise AI testing platform, to address the 95% failure rate of AI pilots and improve agent reliability, performance, and security in real-world business deployments.
  • Anthropic launches a limited pilot of Claude for Chrome, allowing its AI to control web browsers while raising critical concerns about security and prompt injection attacks.
  • Take this blind test to discover whether you truly prefer OpenAI's GPT-5 or the older GPT-4o—without knowing which model you're using.
  • One of the most impactful applications of MCP is its ability to connect AI coding assistants directly to developer tools.
  • A new MIT report reveals that while 95% of corporate AI pilots fail, 90% of workers are quietly succeeding with personal AI tools, driving a hidden productivity boom.