Sakana AI’s new Fugu release is a model launch with a different premise: the company is not asking developers to trust one giant model. It is selling an orchestrator that decides which models should work on a request, how they should communicate, and when their outputs should be checked and merged.
The Tokyo lab released Fugu and Fugu Ultra on June 22. Both are exposed through a single OpenAI-compatible API, but internally Sakana says the system can call a pool of LLM agents, including instances of itself, to solve complex multi-step tasks. The product pitch is simple from the outside: call one model. The implementation claim is more ambitious: let a trained model design the agent scaffold on demand.
That matters for AI-generated games because the hard part is rarely the first impressive answer. A game-building system has to plan, write code, run tools, inspect errors, preserve state, verify behavior, and repair the artifact without losing the design. Fugu is relevant if orchestration helps that loop, not because it magically makes a finished game from one prompt.
Fugu also fits Sakana’s broader research line. The company previously published AI Scientist work aimed at automating parts of research, and later outside evaluations cautioned that such systems still struggle with novelty, judgment, and evaluation quality. Fugu moves that ambition closer to a production model interface: less “autonomous scientist” as a demo, more “trained coordinator” as an API.
Sakana frames Fugu as a response to both model specialization and single-vendor dependency. The release post points to Anthropic’s June 12 statement that U.S. government export-control action forced Anthropic to suspend Fable 5 and Mythos 5 access for all customers. Sakana’s argument is that a swappable agent pool can route around provider disruption and create a more resilient path to frontier-level capability.
The system comes in two versions. Fugu is the lower-latency default for everyday coding, code review, chat, and interactive work. It can also let teams opt specific agents out of the pool for data, privacy, or compliance reasons. Fugu Ultra is the heavier model for difficult multi-step work. Sakana says it coordinates a deeper expert pool and is aimed at research, paper reproduction, cybersecurity analysis, patent work, and other tasks where answer quality matters more than response time.
The technical report describes two related training lines behind that split. Fugu builds on Trinity, a learned coordinator that uses a compact language-model backbone and a lightweight head to select workers. Fugu Ultra builds on Conductor, where a language model writes natural-language workflows that assign subtasks to worker agents, control what each worker can see, and synthesize the result. The report also describes persistent shared memory across turns and isolation inside the current workflow to avoid one worker steering every later worker down the same path.
The benchmark table is strong, but it needs careful reading. In Sakana’s report, Fugu Ultra scores 82.1 on Terminal Bench 2.1, 93.2 on LiveCodeBench, 90.8 on LiveCodeBench Pro, 95.5 on GPQA Diamond, 86.6 on CharXiv Reasoning, and 73.7 on SWE Bench Pro. Fugu, the faster variant, is close on several tests, including 80.2 on Terminal Bench 2.1 and 92.9 on LiveCodeBench.
Those numbers do not mean Fugu Ultra beats everything everywhere. The same table shows Claude Opus 4.8 ahead of Fugu Ultra on SWE Bench Pro in Sakana’s comparison, and MRCRv2 favors GPT-5.5. SciCode also shows Fugu slightly ahead of Ultra. Sakana’s broader claim is narrower and more interesting: orchestration can beat any one worker on many tasks by choosing when to use coding, math, science, debugging, or verification specialists.
The report’s examples are closer to what game-tool builders should watch. In Terminal Bench tasks, Sakana says Fugu alternates between GPT-5.5 as a builder and Claude Opus 4.8 as a debugger. In another example, Fugu Ultra assigns one model to understand a software issue, brings in another to re-examine the problem from scratch, then changes course when the second model finds a client-side concurrency bug. That is the shape of a useful game-generation loop: build, inspect, challenge the first path, and repair.
The product page also includes a CAD task that maps well to game-adjacent tooling. Sakana asked models to create a mechanical iris in CAD. It says Fugu Ultra produced a structure where blades rotate around outer pins and the aperture opens and closes, while other models left gaps, weak linkages, or incomplete closure. That is not a game benchmark, but it is a concrete example of geometry plus mechanism checks, the same class of problem that appears in procedural props, physics toys, and interactable assets.
For long sessions, Sakana emphasizes user reports from code review, patent landscape analysis, paper reproduction, and security assessment. These are still company-provided testimonials, not independent evaluations. They are useful as product signals, but Wonder News would not treat them as proof of general reliability.
Pricing and control also matter. The product page says Fugu is charged at the standard rate of the active underlying model when one agent is used, and at a single top-tier blended rate when multiple agents are active, rather than stacking every model fee. Fugu Ultra’s pay-as-you-go price is listed at $5 per 1 million input tokens, $30 per 1 million output tokens, and $0.50 per 1 million cached input tokens, with higher rates above 272K context. Subscription tiers are $20, $100, and $200 per month.
There are important limits in the FAQ. Fugu Ultra relies on the full agent pool, so users cannot remove specific models from Ultra. Fugu allows opt-outs from the console. Sakana says customers can opt out of training-data use, but routing details are proprietary and not exposed. The product is available outside Japan, but Sakana says it does not serve EU or EEA users.
The financial-sector claim should also be kept precise. Sakana names finance-flavored tasks such as financial time-series prediction in its Fugu materials, and earlier public reporting has described Japanese banks as Sakana investors. Public materials reviewed for this article did not show that MUFG or SMBC has deployed Fugu itself for document analysis, so that should not be stated as fact without a source.
The caveat for AI games is straightforward. Fugu can improve the model layer of a game-generation system, but it does not replace the runtime evidence layer. If the surrounding toolchain cannot expose playable builds, logs, screenshots, replay traces, asset bounds, control responsiveness, and performance budgets, an orchestrator still has weak signals to reason over.
That is why Fugu is best read as a signpost. The competition is moving from raw model scores toward systems that decide which model should act, which model should verify, and how long a software loop can keep improving. For AI-generated games, that direction is more relevant than another single-model coding score. The models that matter will be the ones that can keep a game build moving until it actually plays.
This article was written with assistance from Wonder Bricks AI Agent and edited by SunnyLabs.