GPT-5.3-Codex: What's New, Benchmarks, and What It Enables
OpenAI just released GPT-5.3-Codex, positioning it as the most capable Codex model yet and a step toward a computer-using collaborator that can work alongside you for longer stretches. This post is a builder-first breakdown of what actually changed, why it matters, and how it fits into the current agentic coding race. For the primary source, see OpenAI's announcement: Introducing GPT-5.3-Codex.
What GPT-5.3-Codex actually is
OpenAI describes GPT-5.3-Codex as the new frontier model for Codex, combining the coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2. OpenAI also says this is the first model that was instrumental in creating itself, with early versions helping debug training runs, manage deployment, and analyze evaluations. That is a rare signal that the model is useful beyond demos and actually accelerates the team building it.
OpenAI reports GPT-5.3-Codex now runs 25% faster in Codex, which matters more than it sounds. Faster inference means shorter feedback loops, and that is the difference between an agent you supervise and one you abandon.
Benchmark snapshot: how strong is it, really?
OpenAI highlights four benchmark families that map closely to real work: SWE-Bench Pro (software engineering), Terminal-Bench 2.0 (terminal skills), OSWorld-Verified (computer use), and GDPval (professional knowledge work). From the appendix (all at xhigh reasoning effort):
- SWE-Bench Pro (Public): 56.8%
- Terminal-Bench 2.0: 77.3%
- OSWorld-Verified: 64.7%
- GDPval (wins or ties): 70.9%
- Cybersecurity CTF challenges: 77.6%
- SWE-Lancer IC Diamond: 81.4%
OpenAI also notes that SWE-Bench Pro spans four languages and is more contamination-resistant than SWE-bench Verified, and that GPT-5.3-Codex achieves these results with fewer tokens than any prior model. On OSWorld-Verified, OpenAI reports human performance is roughly 72%, which is a useful reality check for how close the model is to reliable desktop work.
What actually changes in real workflows
Benchmarks are good, but the real value is in the failure modes that go away. Here are the practical shifts that matter:
1) Longer tasks with less babysitting. OpenAI tested GPT-5.3-Codex on multi-day web development tasks, including building two full games and iterating using generic prompts like "fix the bug" or "improve the game." The model ran for millions of tokens and kept improving without constant human intervention.
2) Better product judgment in the first draft. OpenAI shows GPT-5.3-Codex producing more production-ready defaults for web work, like clearer pricing presentation and a testimonial carousel with multiple quotes, instead of the bare-minimum layout seen with GPT-5.2-Codex. The point is not the UI itself, but the quality of initial judgment.
3) More reliable terminal and system work. Terminal-Bench 2.0 is a proxy for real terminal behaviors. The jump here means fewer breakdowns when a task involves setup, dependency management, or multi-step scripts.
4) Real computer use gets closer to trustworthy. OSWorld-Verified is about visual desktop tasks. A higher score means fewer small UI misses that derail longer workflows. OpenAI explicitly points to stronger computer-use capabilities here.
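The long-horizon loop in point 1 is easy to picture in code. The sketch below is purely illustrative and not the Codex harness: `run_agent`, the quality score, and the stopping rule are hypothetical stand-ins for a real model call, meant only to show the shape of "cycle generic prompts until the work is good enough."

```python
def run_agent(prompt, state):
    """Hypothetical stand-in for one Codex pass over the project.

    A real loop would invoke the model here; this stub just bumps a
    counter and returns a rising quality score so the control flow
    of the outer loop is visible.
    """
    state["passes"] += 1
    return min(1.0, 0.2 * state["passes"])

def iterate(prompts, target, max_rounds):
    """Cycle deliberately generic prompts until quality clears the bar."""
    state = {"passes": 0}
    score = 0.0
    for i in range(max_rounds):
        # Vague steering on purpose: the model decides what to improve next.
        score = run_agent(prompts[i % len(prompts)], state)
        if score >= target:
            break
    return state["passes"], score

# The kind of low-effort prompts OpenAI describes using across multi-day runs.
passes, score = iterate(["fix the bug", "improve the game"],
                        target=0.8, max_rounds=50)
```

The point of the pattern is that the human supplies almost no information per round; the model's own judgment has to carry each iteration.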
Interaction: steering while it works
A real change in this release is how Codex behaves during long tasks. GPT-5.3-Codex is designed to provide frequent updates and let you steer the work mid-run instead of waiting for a final output. OpenAI says you can enable steering in Settings -> General -> Follow-up behavior inside the Codex app. That is a big deal for reliability because the most common failure mode in agentic systems is silent drift.
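One way to picture mid-run steering is an agent loop that drains a queue of user guidance between steps instead of going silent until the end. This is a generic pattern sketch, not Codex's actual mechanism; `agent_loop` and the message format are invented for illustration.

```python
import queue

def agent_loop(steps, steering):
    """Run work steps, applying any queued user guidance between them.

    Illustrative only: a real agent would execute each step with the
    model and stream progress updates; here we just record an ordered
    log showing where steering lands relative to the work.
    """
    log = []
    for step in steps:
        # Drain guidance the user sent while the previous step was running.
        while not steering.empty():
            log.append(f"steer: {steering.get_nowait()}")
        log.append(f"work: {step}")
    return log

steering = queue.Queue()
steering.put("use dark mode for the UI")  # guidance arriving mid-run
log = agent_loop(["scaffold app", "style pages", "write tests"], steering)
```

The checkpoint-then-drain structure is what prevents silent drift: guidance is consumed at the next step boundary rather than after the whole task has finished.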
The Codex app just launched, and GPT-5.3-Codex is built for it
This release lands right after OpenAI introduced the Codex app for macOS, which is positioned as a command center for multi-agent work. The app runs agents in separate threads, supports parallel work, and includes built-in worktrees so multiple agents can safely work in the same repo without conflicts. It also adds a formal skills workflow, letting you bundle instructions, resources, and scripts so Codex can run reliable, repeatable processes across coding and non-coding tasks.

OpenAI says the Codex app is available now on macOS and included with paid ChatGPT plans, with temporary access for Free and Go tiers and doubled rate limits during the launch window. It also reports that Codex usage has doubled since GPT-5.2-Codex launched in mid-December and that more than a million developers used Codex in the past month. (Introducing the Codex app)
This matters because GPT-5.3-Codex is tuned for the app's workflow: longer tasks, parallel agents, and ongoing supervision rather than one-shot outputs.
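The isolation idea behind built-in worktrees maps onto plain `git worktree`: each agent gets its own branch and its own checkout over one shared repository, so parallel edits cannot clobber each other. The sketch below drives stock git through Python's `subprocess` to show the underlying feature; the Codex app's actual plumbing may differ.

```python
import pathlib
import subprocess
import tempfile

def run(args, cwd):
    """Run a git command in the given directory, failing loudly on error."""
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

# Set up a throwaway repo with one committed file.
root = pathlib.Path(tempfile.mkdtemp())
repo = root / "repo"
repo.mkdir()
run(["git", "init", "-q"], repo)
(repo / "app.py").write_text("print('hello')\n")
run(["git", "add", "app.py"], repo)
run(["git", "-c", "user.email=agent@example.com", "-c", "user.name=agent",
     "commit", "-qm", "initial commit"], repo)

# One worktree per agent: separate branch, separate working directory,
# shared object store and history underneath.
for agent in ("agent-1", "agent-2"):
    run(["git", "worktree", "add", "-q", "-b", agent, str(root / agent)], repo)
```

After this, `root/agent-1` and `root/agent-2` are independent checkouts of the same repo, and each agent's commits land on its own branch until someone merges them.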
The competitive moment: Claude Opus 4.6 and the ad debate
OpenAI did not ship GPT-5.3-Codex in a vacuum. The same week, Anthropic released Claude Opus 4.6, adding a 1M token context window and improved long-context reasoning for coding and enterprise tasks. That release is clearly aimed at the same high-end workflow category. (Claude Opus 4.6 announcement)
At the same time, the public narrative around AI assistants shifted because of ads. Anthropic ran Super Bowl ads that mock chatbots inserting ads into personal conversations and pledged Claude would remain ad-free. Sam Altman publicly called the campaign "dishonest." Regardless of who you side with, the signal is clear: trust, monetization, and assistant behavior are now part of the competitive landscape, not just benchmark charts. (Forbes)
Security and cyber safeguards
OpenAI says GPT-5.3-Codex is the first model it classifies as "high capability" for cybersecurity-related tasks under its Preparedness Framework, and the first model trained to identify software vulnerabilities directly. Alongside that capability jump, OpenAI highlights safeguards like automated monitoring, trusted access for advanced cyber capabilities, and enforcement pipelines backed by threat intelligence. It also announced Trusted Access for Cyber, expansion of a security research agent called Aardvark, and $10M in API credits for defensive research.
Availability and rollout
GPT-5.3-Codex is available on paid ChatGPT plans across Codex surfaces: app, CLI, IDE extension, and web. OpenAI says API access is coming soon, and notes the model is trained and served on NVIDIA GB200 NVL72 systems.
Bottom line: who should upgrade and why
If you only use Codex for quick edits or one-off snippets, GPT-5.3-Codex will feel like a clean upgrade. If you use Codex for multi-step tasks, UI-heavy workflows, or end-to-end projects, this release is bigger: stronger long-run stability, better computer use, and a more interactive, steerable agent.
In practice, GPT-5.3-Codex is best understood as Codex evolving from a coding assistant into a computer-using collaborator. That shift is what makes this release matter beyond just another benchmark bump.