GPT-5.5: What OpenAI's Latest Model Changes for Coding, Agents, and Real Work
OpenAI has a new flagship model: GPT-5.5. It was announced on April 23, 2026, and the important follow-up is that OpenAI updated the release on April 24, 2026 to say GPT-5.5 and GPT-5.5 Pro are now available in the API. That matters because this is not only a ChatGPT model. It is meant to sit across ChatGPT, Codex, and developer workflows.
The short version: GPT-5.5 is OpenAI's latest push toward models that can do real work on a computer. Not just answer questions. Not just write code snippets. The model is aimed at messy, multi-step tasks that involve planning, tools, documents, codebases, browsers, spreadsheets, research, and verification.
You can read OpenAI's launch post here: Introducing GPT-5.5. For API details, OpenAI's developer docs now list gpt-5.5 as the flagship model for complex reasoning and coding.
What GPT-5.5 actually is
OpenAI describes GPT-5.5 as a model for complex, real-world work. That phrase can sound generic, but the product direction is clear when you look at the examples OpenAI gives: writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until the task is finished.
That is the important frame. GPT-5.5 is not just a "smarter chat model." It is OpenAI trying to make the model more useful inside actual work loops.
The model is available in three main places:
- ChatGPT, where GPT-5.5 Thinking is available to Plus, Pro, Business, and Enterprise users.
- Codex, where GPT-5.5 is available across paid and some broader plans with a 400K context window.
- The API, where gpt-5.5 is now listed in OpenAI's developer docs with a 1M context window and 128K max output tokens.
There is also GPT-5.5 Pro, which uses more compute for harder questions and higher-accuracy work. OpenAI says GPT-5.5 Pro is available in ChatGPT for Pro, Business, and Enterprise users, and the API docs list gpt-5.5-pro for Responses API use.
Why this release matters
The easiest mistake is to read GPT-5.5 like a normal model bump: a few better benchmark numbers, slightly better coding, slightly better writing, done.
That is too small a read.
The real change is that OpenAI is pushing the model deeper into the execution layer. GPT-5.5 is designed to understand intent earlier, ask for less hand-holding, use tools more precisely, keep track of work across longer horizons, and check its own output before stopping.
For normal users, that means you can give it a messier task and expect it to organize the work better.
For developers, it means this model is much more relevant to agent systems, especially systems that need to use tools, browse, inspect files, run code, operate a UI, and continue until a task is actually done.
That is why the computer-use and tool-use gains matter as much as the coding gains. A model that writes good code but cannot reliably use a browser, terminal, file system, or external tool still needs heavy supervision. GPT-5.5 is a step toward reducing that supervision burden.
The benchmark picture: strong, but read it correctly
OpenAI reports a broad set of evaluation gains over GPT-5.4. These are the headline numbers from the launch post:
- Terminal-Bench 2.0: GPT-5.5 scores 82.7%, compared with 75.1% for GPT-5.4.
- SWE-Bench Pro: GPT-5.5 scores 58.6%, compared with 57.7% for GPT-5.4.
- Expert-SWE: GPT-5.5 scores 73.1%, compared with 68.5% for GPT-5.4.
- GDPval: GPT-5.5 scores 84.9%, compared with 83.0% for GPT-5.4.
- OSWorld-Verified: GPT-5.5 scores 78.7%, compared with 75.0% for GPT-5.4.
- BrowseComp: GPT-5.5 scores 84.4%, while GPT-5.5 Pro reaches 90.1%.
- Toolathlon: GPT-5.5 scores 55.6%, compared with 54.6% for GPT-5.4.
- Tau2-bench Telecom: GPT-5.5 reaches 98.0% without prompt tuning.
- CyberGym: GPT-5.5 scores 81.8%, compared with 79.0% for GPT-5.4.
The meaningful insight is not "every number is wildly higher." Some gains are modest. SWE-Bench Pro moves from 57.7% to 58.6%, which is not a massive headline jump.
The more important pattern is consistency. GPT-5.5 improves across coding, terminal work, professional tasks, browser/tool use, computer use, long context, and cybersecurity evaluations. That kind of broad improvement is more useful than a single dramatic leaderboard win because real work usually fails at the boundaries between skills.
An agent might need to read a ticket, inspect a repo, run a test, patch code, open a browser, compare output, update a document, and explain what changed. The weak point is often not "can the model write a function?" It is whether the model can keep the full loop straight.
GPT-5.5 looks designed for that loop.
Coding: less babysitting, better follow-through
OpenAI calls GPT-5.5 its strongest agentic coding model to date. The most important coding number is probably Terminal-Bench 2.0 at 82.7%, because terminal workflows test more than code generation. They test planning, tool use, iteration, command-line judgment, and recovery from errors.
SWE-Bench Pro matters too, but the gain there is smaller. The bigger story is that GPT-5.5 appears stronger in the surrounding behaviors that make coding agents useful:
- holding context across a large codebase,
- debugging ambiguous failures,
- using tools to check assumptions,
- carrying a change through tests and validation,
- and continuing through the boring middle of implementation instead of stopping after the first plausible patch.
That is the difference between an assistant that can answer "how would I fix this?" and an agent that can actually make the fix.
The release also lands inside Codex, which matters because Codex is where these behaviors become practical. GPT-5.5 in Codex supports a 400K context window, and OpenAI says it has been tuned to deliver better results with fewer tokens than GPT-5.4 for most users.
There is also a Fast mode in Codex: GPT-5.5 can generate tokens 1.5x faster for 2.5x the cost. That is not something every user needs, but it is useful for high-leverage coding sessions where latency matters more than marginal cost.
Computer use is the sleeper feature
The computer-use story may end up mattering more than the coding story.
OpenAI reports 78.7% on OSWorld-Verified, a benchmark that measures whether a model can operate real computer environments. In plain English: can the model interact with software through screenshots, keyboard actions, mouse actions, and task state?
That matters because many real business workflows do not live cleanly inside one API. They live across messy UIs, dashboards, forms, documents, spreadsheets, internal tools, and browser tabs. If models become better at navigating those environments, the addressable surface of automation expands.
This is where GPT-5.5 starts to feel different from older model launches. The model is not just being positioned as better at answering. It is being positioned as better at doing.
For developers, OpenAI's model docs list computer use as supported for GPT-5.5 in the Responses API. The same docs also list support for function calling, web search, file search, image generation, code interpreter, hosted shell, apply patch, skills, MCP, and tool search.
That combination points to a model intended for broad tool orchestration.
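Here is what that orchestration looks like from the API side. This is a sketch using the Responses API in the OpenAI Python SDK; the hosted web-search tool type reflects current SDK naming, and the run_tests function is hypothetical:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    input="Reproduce the failing CI test, patch the bug, and summarize the change.",
    tools=[
        {"type": "web_search_preview"},  # hosted web search tool
        {
            "type": "function",
            "name": "run_tests",  # hypothetical local tool
            "description": "Run the repo's test suite and return failing test names with tracebacks.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    ],
)
print(response.output_text)
```

The point is not this exact tool list. It is that hosted tools, custom functions, and (per the docs) computer use all plug into the same tools array, so the model can move between them inside one loop.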
Knowledge work: the spreadsheet and document angle is not fluff
OpenAI is also pushing GPT-5.5 heavily for professional knowledge work. The headline metric is GDPval at 84.9%, which OpenAI describes as measuring agents' ability to produce well-specified work products across 44 occupations.
This matters because the most common AI productivity use case is not "solve a puzzle." It is "make the deliverable."
That deliverable might be:
- a spreadsheet,
- a memo,
- a financial model,
- a slide deck,
- a research brief,
- a customer support workflow,
- a schedule,
- a data analysis,
- or a structured recommendation.
OpenAI also reports 88.5% on internal investment banking modeling tasks and 54.1% on OfficeQA Pro. Those are exactly the types of tasks where small errors matter. If a model can create a clean spreadsheet but misunderstands the task, it is not useful. If it can reason correctly but cannot produce the artifact, it is also not useful.
The goal is the combination: understand the intent, use the tools, produce the artifact, check the result.
That is the main theme of GPT-5.5.
Research: the most interesting part is persistence
OpenAI says GPT-5.5 shows stronger performance in early scientific research workflows, including genetics, quantitative biology, bioinformatics, and data analysis.
The headline academic numbers include:
- GeneBench: GPT-5.5 scores 25.0%, while GPT-5.5 Pro scores 33.2%.
- BixBench: GPT-5.5 scores 80.5%, compared with 74.0% for GPT-5.4.
- FrontierMath Tier 1-3: GPT-5.5 scores 51.7%.
- FrontierMath Tier 4: GPT-5.5 scores 35.4%.
- GPQA Diamond: GPT-5.5 scores 93.6%.
OpenAI also highlights an internal version of GPT-5.5 helping discover a proof about off-diagonal Ramsey numbers, later verified in Lean.
That is a big claim, but the practical takeaway is simpler: OpenAI is not only targeting coding agents. It is trying to build models that can help with long, evidence-heavy research loops.
For a researcher, the useful model is not just the one that knows facts. It is the one that can ask what should be tested next, read context, write code, inspect outputs, catch bad assumptions, and keep moving through uncertainty.
This is another place where GPT-5.5's real value is persistence.
API details developers should actually care about
As of OpenAI's current developer docs, the main API model ID is:
gpt-5.5
The Pro model is:
gpt-5.5-pro
For gpt-5.5, OpenAI lists:
- 1M context window
- 128K max output tokens
- knowledge cutoff: December 1, 2025
- input price: $5 per 1M tokens
- output price: $30 per 1M tokens
- reasoning efforts: none, low, medium, high, xhigh
- modalities: text input/output and image input
- API support: Responses API and Chat Completions API
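As a reference point, a minimal call against that listing looks like this. This is a sketch using the standard OpenAI Python SDK; the model ID comes from the docs quoted above, and the prompt is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.5",
    input="Summarize the open TODOs in this file and propose an order of work.",
    max_output_tokens=4096,  # well under the 128K output ceiling listed above
)
print(response.output_text)
```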
There is one pricing detail teams should not miss: OpenAI says prompts over 272K input tokens are billed at 2x the input rate and 1.5x the output rate for the full session, across standard, batch, and flex processing.
That means you should not treat the 1M context window as free working memory. Use it when it changes the outcome. For long workflows, you still want good context management, compaction, retrieval, and caching.
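If you want to sanity-check that threshold while planning, the arithmetic is simple. A sketch, using the listed prices and assuming the 2x/1.5x multipliers apply exactly as stated:

```python
def estimate_gpt55_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Prices from the docs above: $5 / 1M input, $30 / 1M output.
    # Per the release notes, prompts over 272K input tokens are billed
    # at 2x input and 1.5x output for the full session.
    input_rate, output_rate = 5.00, 30.00  # USD per 1M tokens
    if input_tokens > 272_000:
        input_rate *= 2.0
        output_rate *= 1.5
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# 250K in / 20K out stays at base rates: $1.25 + $0.60 = $1.85.
# 300K in / 20K out crosses the threshold: $3.00 + $0.90 = $3.90.
```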
For gpt-5.5-pro, OpenAI says requests may take several minutes and recommends background mode to avoid timeouts. That is a useful signal: Pro is for harder asynchronous work, not every normal chat turn.
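In practice that means polling a background response instead of holding a connection open. A sketch, assuming background mode works for gpt-5.5-pro the way it does for other Responses API models:

```python
import time

from openai import OpenAI

client = OpenAI()

# Kick off a long-running Pro request without holding the connection open.
job = client.responses.create(
    model="gpt-5.5-pro",
    input="Audit this financial model for formula errors and list each one.",
    background=True,
)

# Poll until the background response finishes.
while job.status in ("queued", "in_progress"):
    time.sleep(15)
    job = client.responses.retrieve(job.id)

print(job.status, job.output_text if job.status == "completed" else job.error)
```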
Prompting GPT-5.5: what changes
OpenAI's Using GPT-5.5 guide is actually useful here. The model is not described as a pure drop-in replacement for GPT-5.4. OpenAI recommends treating it as a model family you tune for.
The practical changes:
Reasoning effort defaults to medium. That is OpenAI's recommended balance for quality, latency, and cost. Use low for efficient reasoning, high for complex agentic tasks, and xhigh for the hardest asynchronous workflows. Do not assume higher effort is always better.
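One way to keep that discipline is to make effort an explicit routing decision rather than a hard-coded constant. A sketch; the task categories and the mapping are illustrative, not OpenAI guidance:

```python
# Illustrative mapping from workload type to reasoning effort.
EFFORT_BY_TASK = {
    "extraction": "low",         # efficient, well-defined work
    "assistant_chat": "medium",  # OpenAI's recommended default balance
    "agentic_coding": "high",    # complex multi-step tool use
    "deep_async": "xhigh",       # hardest asynchronous workflows
}

def answer(client, task_type: str, prompt: str) -> str:
    effort = EFFORT_BY_TASK.get(task_type, "medium")
    response = client.responses.create(
        model="gpt-5.5",
        reasoning={"effort": effort},
        input=prompt,
    )
    return response.output_text
```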
Outcome-first prompts matter more. GPT-5.5 is better at working from a clear goal, success criteria, allowed side effects, evidence rules, and output shape. You do not need to micromanage every step unless the exact process matters.
Tool descriptions carry more weight. For tool-heavy agents, move guidance into the tool descriptions themselves: what the tool does, when to use it, required inputs, side effects, retry rules, and common failure modes.
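Concretely, that means the description field does real work. A sketch of a function tool in the Responses API's flat tool format; the search_tickets tool itself is hypothetical:

```python
tools = [{
    "type": "function",
    "name": "search_tickets",  # hypothetical support-desk tool
    "description": (
        "Search the support ticket index by keyword. Use this BEFORE answering "
        "any question about an existing ticket. Input should be 2-5 keywords, "
        "not a full sentence. Read-only: no side effects. If zero results come "
        "back, retry once with a broader query, then report 'not found'."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "2-5 keywords"},
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}]
```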
Structured Outputs should replace schema prose. If your app needs JSON, do not write a long schema explanation in the prompt when Structured Outputs can enforce it.
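A minimal sketch of what that replacement looks like, assuming the Responses API's json_schema text format; the triage schema is illustrative:

```python
response = client.responses.create(
    model="gpt-5.5",
    input="Triage this bug report: app crashes on login when offline.",
    text={
        "format": {
            "type": "json_schema",
            "name": "triage",
            "strict": True,  # enforce the schema instead of describing it in prose
            "schema": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                    "component": {"type": "string"},
                },
                "required": ["severity", "component"],
                "additionalProperties": False,
            },
        }
    },
)
```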
Verbose prompts may need trimming. GPT-5.5 tends to be more literal, direct, and polished by default. Old prompt stacks that over-explain the process may cause overthinking or unnecessary tool use.
Image detail changed. OpenAI says GPT-5.5 preserves more visual detail by default for image inputs. When image_detail is unset or auto, images are now preserved without resizing, up to a limit of 10,240,000 total pixels or 6,000 pixels per dimension.
That last part matters for computer use, screenshots, document parsing, and UI debugging.
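For reference, image inputs ride along in the same Responses call. A sketch, with the screenshot URL as a placeholder:

```python
response = client.responses.create(
    model="gpt-5.5",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "Read the error dialog in this screenshot and suggest a fix."},
            # detail left unset: per OpenAI's guide, the default now preserves
            # the image without resizing, up to the limits described above.
            {"type": "input_image",
             "image_url": "https://example.com/screenshot.png"},
        ],
    }],
)
print(response.output_text)
```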
A simple migration checklist
If you are moving an app from GPT-5.4 or GPT-5.3-Codex to GPT-5.5, I would not start by changing everything.
Start with a clean comparison:
- Switch the model ID to gpt-5.5 in a test branch.
- Keep your existing eval set unchanged.
- Test reasoning.effort at low, medium, and high before touching prompts.
- Measure quality, latency, total tokens, tool calls, and failure modes (see the harness sketch after this checklist).
- Remove process-heavy prompt text only after you see where the model is over-constrained.
- Move tool-specific instructions into tool descriptions.
- Use Structured Outputs instead of repeating JSON schemas in the prompt.
- Add explicit stopping criteria for agents.
- For long-context workflows, test below and above the 272K-token pricing threshold separately.
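Here is the harness sketch referenced above: same eval set, same prompts, only the effort knob varies. The grader is a deliberate placeholder; substitute your own scoring logic:

```python
import time

from openai import OpenAI

client = OpenAI()

def grade(case: dict, output: str) -> bool:
    # Placeholder grader: substring match. Replace with your real eval logic.
    return case["expected"] in output

def compare_efforts(eval_set: list[dict], efforts=("low", "medium", "high")):
    results = []
    for effort in efforts:
        for case in eval_set:
            t0 = time.monotonic()
            response = client.responses.create(
                model="gpt-5.5",
                reasoning={"effort": effort},
                input=case["prompt"],
            )
            results.append({
                "effort": effort,
                "case": case["id"],
                "latency_s": round(time.monotonic() - t0, 2),
                "total_tokens": response.usage.total_tokens,
                "pass": grade(case, response.output_text),
            })
    return results
```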
For many production apps, the winning setup may not be "GPT-5.5 at xhigh for everything." It may be GPT-5.5 with low or medium effort for most cases, GPT-5.5 Pro only for hard async work, and GPT-5.4 mini or nano for simpler support tasks.
That kind of routing will usually beat a one-model-for-everything approach on cost and latency.
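The routing layer for that setup can be very small. A sketch; the task flags, thresholds, and the small-model ID are all illustrative:

```python
def route(task: dict) -> dict:
    """Pick model + reasoning effort per workload. Thresholds are illustrative."""
    if task.get("hard_async"):
        # Hardest work, latency-insensitive: run Pro in background mode.
        return {"model": "gpt-5.5-pro", "background": True}
    if task.get("agentic") or task.get("long_context"):
        return {"model": "gpt-5.5", "reasoning": {"effort": "high"}}
    if task.get("simple"):
        # High-volume extraction/classification: a smaller model wins on cost.
        return {"model": "gpt-5.4-mini"}  # hypothetical small-model ID
    return {"model": "gpt-5.5", "reasoning": {"effort": "medium"}}

# Usage: client.responses.create(input=prompt, **route(task))
```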
Safety: this is not a side note
GPT-5.5 is also a safety release.
OpenAI says GPT-5.5 went through its full predeployment safety evaluations and Preparedness Framework, with targeted red-teaming for advanced cybersecurity and biology capabilities. The GPT-5.5 system card says OpenAI collected feedback from nearly 200 early-access partners before release and is shipping its strongest safeguards to date.
The most important line: OpenAI is treating GPT-5.5's biological/chemical and cybersecurity capabilities as High under its Preparedness Framework.
OpenAI says GPT-5.5 did not reach Critical cybersecurity capability level, but its cybersecurity capabilities are a step up from GPT-5.4. That is the exact tension with frontier models right now. The same capability that helps defenders find and fix vulnerabilities can also help attackers if it is not controlled.
OpenAI is responding with trusted access programs for cyber defense, stricter safeguards, monitoring, and a separate GPT-5.5 Bio Bug Bounty. That program invites vetted researchers to test whether a universal jailbreak can defeat OpenAI's bio safety challenge, with a $25,000 reward for the first successful universal jailbreak across all five questions.
This is not just legal or PR language. It tells us how OpenAI sees the model: powerful enough to require a more serious deployment posture.
GPT-5.5 vs GPT-5.4: should you switch?
For most high-value work, yes, GPT-5.5 is the new starting point.
Use GPT-5.5 when the task involves:
- complex coding,
- codebase navigation,
- tool-heavy agents,
- browser or computer use,
- research with multiple sources,
- document or spreadsheet creation,
- long-context reasoning,
- customer workflows with many steps,
- or anything where the model must keep working until the result is done.
Stay with GPT-5.4, GPT-5.4 mini, or GPT-5.4 nano when:
- latency matters more than maximum intelligence,
- the task is simple and well-defined,
- cost dominates the product experience,
- you are running high-volume extraction or classification,
- or GPT-5.5 does not show a measurable eval gain for your use case.
This is the practical point: the best model is not always the largest model. The best model is the one that gives you the best answer for the cost, latency, and reliability constraints of that workflow.
GPT-5.5 raises the ceiling. It does not remove the need for routing.
What I think is the real direction
The bigger story is that "model releases" are becoming "work system releases."
GPT-5.5 is not just a text model with better answers. It is tied to:
- Codex,
- computer use,
- tool search,
- hosted tools,
- long context,
- structured outputs,
- compaction,
- background mode,
- trusted access,
- and stronger domain-specific safeguards.
That tells you where the product is going.
The model is only one part of the system. The surrounding runtime matters just as much. A strong model inside a weak tool loop will still fail. A strong model with good context management, clean tool descriptions, evals, approvals, caching, and state handling becomes much more useful.
That is the real GPT-5.5 lesson for builders: stop thinking only in prompts. Start thinking in workflows.
Final take
GPT-5.5 is OpenAI's clearest move yet toward AI that can do longer, messier, more practical work across a computer.
The benchmark story is solid, but the deeper story is execution: better coding follow-through, stronger computer use, more precise tool use, better professional work, and a deployment posture that takes cyber and bio risk seriously.
For developers, the move is simple: benchmark GPT-5.5 against your actual workflows, not just a demo prompt. Test reasoning effort, tool descriptions, context strategy, and routing. If the task is complex enough, GPT-5.5 is probably the new baseline. If the task is simple, smaller models still matter.
That is a healthy model release. It gives you more capability, but it also forces better engineering discipline around how you use it.