Run the Eval Before You Commit

May 08, 2026

A structured comparison of GPT-5.4 and Gemma 3 27B found the open-weight model winning an email task at lower cost. The point isn't which model won — it's that most teams pick models by reputation and never run the test.

Today's batch adds to the picture we've been tracking: image quality regressions with no documented fix, GPT-5.5 behavior shifts that shipped without announcements, and early signs that OpenAI may be testing unannounced model tiers with a subset of users.

Today's issue covers:

Evals: GPT-5.4 vs Gemma 3 27B — task-specific testing picks a winner that reputation wouldn't predict.
Developer use cases: image-gen-2 plus Kling 3 for multi-style video, and a prompt chain for HR policy generation.
ChatGPT Images 2.0: Text rendering degradation documented with user examples, no acknowledged fix yet.
GPT-5.5: Explicit typo correction disrupts conversational flow, and "Alpha models" surface in the interface for some users.

Run the Eval Before You Commit

A thread on r/OpenAI made a case most API teams already know but rarely act on: reputation and benchmark rankings don't tell you which model to use for your specific task. The author ran the same two prompts through GPT-5.4 and Gemma 3 27B — one email-drafting task, one technical explanation — and found Gemma 3 27B won the email task while costing significantly less per call.

The takeaway isn't that Gemma 3 outperforms GPT-5.4 in general. It's that "better" has no meaning without a test harness tied to your specific workload. Task-specific evaluation surfaces cost-performance tradeoffs that published leaderboards don't capture — because leaderboards measure average performance across aggregated benchmarks, not your email draft or your product description or your support classification problem.
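A test harness tied to a workload can be very small. The sketch below is not the thread author's method: the `Case` rubric (required phrases), the `score` function, and the two stub models are all assumptions made for illustration. In practice each stub would be replaced by a real call to the model under test, and the rubric by whatever "good" means for your task.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    prompt: str
    required_phrases: list[str]  # hypothetical rubric: phrases a good answer must contain


def score(output: str, case: Case) -> float:
    """Fraction of the required phrases found in the output (case-insensitive)."""
    if not case.required_phrases:
        return 0.0
    text = output.lower()
    hits = sum(1 for phrase in case.required_phrases if phrase.lower() in text)
    return hits / len(case.required_phrases)


def run_eval(models: dict[str, Callable[[str], str]], cases: list[Case]) -> dict[str, float]:
    """Run every model on the same cases and return the mean score per model."""
    return {
        name: mean(score(call(case.prompt), case) for case in cases)
        for name, call in models.items()
    }


# Stub models standing in for real API or local-inference calls (assumptions).
def model_a(prompt: str) -> str:
    return "Hi Dana, the meeting moves to Thursday. Please confirm."


def model_b(prompt: str) -> str:
    return "The meeting has been rescheduled."


cases = [
    Case(
        prompt="Draft a short email moving the meeting to Thursday and ask for confirmation.",
        required_phrases=["thursday", "confirm"],
    )
]
results = run_eval({"model_a": model_a, "model_b": model_b}, cases)
print(results)
```

A phrase-match rubric is crude. The point is only that both models see identical prompts and are scored by the same rule, which is what a leaderboard cannot do for your workload.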

For teams running OpenAI API workloads at volume, this is worth acting on. Routing lower-complexity tasks to cost-efficient open-weight alternatives while keeping GPT-5.4 for high-stakes completions is a credible optimization strategy — but it only works if you have a structured eval pipeline in place to guide the routing decisions. Defaulting to the most recognized model without testing it on your actual workload is the costly approach, not the safe one.
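One way to turn eval results into routing decisions is to pick, per task, the cheapest model that clears a quality bar. The sketch below assumes you already have per-task scores and per-call costs from your own eval. The `EvalResult` fields, the model labels, and every number are made-up placeholders, not measurements from the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalResult:
    model: str
    task: str
    quality: float        # 0..1 score from your own eval
    cost_per_call: float  # any consistent unit


def pick_model(results: list[EvalResult], task: str, min_quality: float) -> Optional[str]:
    """Return the cheapest model meeting the quality bar for a task, or None."""
    candidates = [r for r in results if r.task == task and r.quality >= min_quality]
    if not candidates:
        return None
    # Cheapest first; on equal cost, prefer the higher-quality model.
    best = min(candidates, key=lambda r: (r.cost_per_call, -r.quality))
    return best.model


# Placeholder numbers for illustration only.
results = [
    EvalResult("open_weight_model", "email_draft", quality=0.90, cost_per_call=1.0),
    EvalResult("frontier_model", "email_draft", quality=0.92, cost_per_call=10.0),
    EvalResult("open_weight_model", "contract_review", quality=0.60, cost_per_call=1.0),
    EvalResult("frontier_model", "contract_review", quality=0.95, cost_per_call=10.0),
]

email_choice = pick_model(results, "email_draft", min_quality=0.85)
review_choice = pick_model(results, "contract_review", min_quality=0.85)
unknown_choice = pick_model(results, "untested_task", min_quality=0.85)
print(email_choice, review_choice, unknown_choice)
```

Returning `None` for an untested task is deliberate: a task with no eval data has no basis for a routing decision, so the caller has to choose a default explicitly.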

Our colleagues at OpenWeight LLM noted today that Gemma 4 26B is now hitting 600 tokens per second on consumer hardware. That throughput number sharpens the case for running your own tests before committing to hosted APIs for workloads where latency and cost matter.

Building with the OpenAI Stack

The most technically detailed community post this week came from a creator who combined image-gen-2 with Kling 3 to produce a continuous video reel of themselves rendered across multiple drawing styles. The workflow: image-gen-2 generated stylized keyframes for each visual style, then Kling 3 handled the video generation layer, animating between frames to maintain continuity.

The author was direct about the friction: video generation at this quality level remains expensive, and getting consistent visual continuity across style transitions requires sustained prompting work. The model doesn't just stitch frames together — each Kling 3 pass needs explicit guidance about what to preserve between shots. Despite that overhead, the r/OpenAI thread drew substantial interest, suggesting the pipeline is worth the friction for creators experimenting with AI-assisted production.

The practical read for API builders: multi-model composition is becoming the norm for production-quality creative outputs. image-gen-2 handles per-frame stylization; a dedicated video model handles motion and temporal consistency. The engineering problem worth solving is the handoff between them. The individual models are already capable enough.
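A rough sketch of what that handoff might look like as data: consecutive keyframes are paired, and each pair carries an explicit list of what to preserve. This calls no real API. The `Keyframe` type, the request dictionary shape, and the file names are hypothetical and do not reflect the request format of Kling or OpenAI.

```python
from dataclasses import dataclass


@dataclass
class Keyframe:
    image_path: str  # output of the image model for one style (hypothetical path)
    style: str


def transition_requests(keyframes: list[Keyframe], preserve: list[str]) -> list[dict]:
    """Build one video-generation request per pair of consecutive keyframes."""
    requests = []
    for start, end in zip(keyframes, keyframes[1:]):
        requests.append(
            {
                "start_image": start.image_path,
                "end_image": end.image_path,
                "prompt": (
                    f"Transition from {start.style} to {end.style}. "
                    f"Keep unchanged: {', '.join(preserve)}."
                ),
            }
        )
    return requests


keyframes = [
    Keyframe("frames/pencil.png", "pencil sketch"),
    Keyframe("frames/watercolor.png", "watercolor"),
    Keyframe("frames/comic.png", "comic ink"),
]
requests = transition_requests(keyframes, preserve=["face shape", "pose", "camera angle"])
for request in requests:
    print(request["prompt"])
```

Each request ends on the frame the next one starts from, so continuity is carried by the data structure, not by remembering to repeat it in every prompt.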

A separate post showed a different application: a multi-step prompt chain for generating HR leave policies. The sequence collects company information, analyzes state-specific legal requirements based on jurisdiction and employee count, and drafts policy language accordingly. It's a straightforward demonstration of how structured prompt sequences handle domain complexity that single-turn prompts handle inconsistently, particularly conditional logic tied to jurisdiction and company size.
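The three steps described above can be sketched as a chain in which each prompt receives the previous step's output. The post's actual prompts are not reproduced here: the prompt wording, the `company` dictionary keys, and the `fake_llm` stub are assumptions, and `llm` stands in for whatever model call you use.

```python
from typing import Callable


def leave_policy_chain(llm: Callable[[str], str], company: dict) -> dict:
    """Collect company info, derive requirements, then draft the policy."""
    # Step 1: collect company information into a compact profile.
    profile = f"State: {company['state']}; employees: {company['employee_count']}"

    # Step 2: ask for the requirements that apply to this jurisdiction and size.
    requirements = llm(
        "List the leave requirements that apply to this employer.\n" + profile
    )

    # Step 3: draft policy language grounded in the step 2 output.
    draft = llm(
        "Draft a leave policy that satisfies these requirements.\n"
        f"{profile}\nRequirements:\n{requirements}"
    )
    return {"profile": profile, "requirements": requirements, "draft": draft}


# Stub model that records the prompts it receives (assumption, for illustration).
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"[stub response {len(calls)}]"


result = leave_policy_chain(fake_llm, {"state": "CA", "employee_count": 40})
print(result["draft"])
```

Keeping the requirements step separate means its output can be inspected or reviewed before any policy text is drafted, which a single-turn prompt does not allow.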

A developer on r/OpenAI also offered free consultation on what they describe as the "brain level" of AI system design: the logic, role definitions, workflow structure, and governance sitting above the individual prompt. The framing is useful independent of the offer itself — it names a layer of AI architecture that teams often underspecify when they focus primarily on prompt content.

ChatGPT's Image Layer Has a Problem

A post on r/OpenAI documents a regression in ChatGPT Images 2.0, specifically in how the model renders text within generated images. The user reports that the current output contains more visual noise and artifact distortion around letters compared to what the same model produced roughly two weeks prior — and includes examples. We noted GPT Image 2 activity in yesterday's issue; this thread adds a specific quality complaint to that picture.

This is a familiar pattern with production image models: silent regression without a version bump, changelog entry, or status page update. Users notice before any official acknowledgment surfaces. For workflows that depend on ChatGPT-generated images with legible embedded text — product mockups, annotated diagrams, instructional graphics, social assets — the degradation has immediate operational consequences.

A related thread from Pro subscribers raises a connected question: whether image generation quality can be manually set to "high" within the Pro subscription, or whether the system determines quality automatically. OpenAI has not documented any subscriber-facing controls for generation quality tiers. Without exposed quality parameters, users have no lever to pull when output quality drops.

What GPT-5.5 Keeps Doing Without Permission

Multiple r/OpenAI users report a behavior shift in GPT-5.5: the model now explicitly calls out and corrects spelling errors in the user's input before responding, rather than silently interpreting them. For users who type quickly, dictate with voice input, or routinely make minor errors, this interrupts the conversational flow. The model processes the typo correctly — it clearly understands the input — but still flags the error explicitly, which reads as unnecessary correction rather than assistance.

This pattern is consistent with RLHF tuning aimed at perceived helpfulness that lands as patronizing in practice. Previous model versions handled typos silently; the behavior change is noticeable. There's no documented user-facing toggle to disable it. The r/OpenAI thread has traction, which may route it toward OpenAI's feedback channels.

Separately, a screenshot circulating on r/OpenAI appears to show "Alpha models" listed within the OpenAI interface for a subset of users. What these models are, which users have access, and when the label was introduced are all unconfirmed. OpenAI has not acknowledged the label publicly. If "Alpha models" represents a staged access tier for pre-release versions, similar to how some features soft-launch before announcement, it would signal a more structured internal testing track than what OpenAI has historically communicated.

Codex's Corner
