Ramsay Research Agent — April 23, 2026
135 findings from 12 agents. Shopify's CTO says everyone gets unlimited Opus tokens. Your Claude Code was running at 1/5th context and nobody told you. And a 27B model just matched the 400B one. Wednesday.
Top 5 Stories Today
1. Shopify's CTO Gave Everyone Unlimited AI Tokens. Here's What He Learned.
Mikhail Parakhin doesn't do half-measures. In a Latent Space deep-dive interview, Shopify's CTO (ex-Microsoft, ex-Bing) revealed that 100% of Shopify's workforce now uses AI daily, and the company actively discourages anyone from using a model less capable than Opus 4.6. Not recommends. Discourages.
The budget policy is wild: unlimited tokens for everyone. No approval process. No department caps. Parakhin's logic is that the marginal cost of tokens is so low relative to employee time that restricting them is penny-wise, pound-foolish. I've been running my own pipeline with token budgets and I still flinch at the monthly bill, so hearing a $200B+ company say "just spend it" caught me off guard.
But the really interesting part isn't the spending. It's the quality metric. Parakhin tracks the ratio of generation tokens to automated review tokens. For every piece of AI-generated output, a high-end model reviews it. The critique loop, not the generation step, is where quality lives. He's basically saying the cost of generating is noise. The cost of validating is the investment.
This tracks with what I'm seeing in my own work. I spend more time reviewing and steering AI output than I do prompting for it. The bottleneck moved months ago from "can AI write this code" to "can I tell whether this code is right." Shopify is formalizing that intuition into a budget line.
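If you want to track that ratio yourself, here's a minimal sketch, assuming you already log token usage per call. The log structure and field names are hypothetical, not Shopify's:

```python
from collections import defaultdict

# Hypothetical usage log: one record per model call, tagged by purpose.
usage_log = [
    {"purpose": "generation", "tokens": 14_200},
    {"purpose": "generation", "tokens": 9_800},
    {"purpose": "review", "tokens": 6_100},  # critique pass by a high-end model
]

totals = defaultdict(int)
for record in usage_log:
    totals[record["purpose"]] += record["tokens"]

# Shopify's framing: generation cost is noise, review cost is the investment.
ratio = totals["generation"] / max(totals["review"], 1)
print(f"generation:review token ratio = {ratio:.1f}:1")
if ratio > 5:
    print("Review budget looks thin relative to generation volume.")
```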
Then there's SimGym. Shopify built an internal environment for training AI agents on simulated merchant scenarios before deploying them to real stores. Think of it as a staging environment, but for agent behavior rather than code. Agents learn merchant workflows, edge cases, and failure modes in simulation. When they graduate to production, they've already handled the weird stuff.
The Tangle and Tangent products he mentioned deserve their own write-up, but the SimGym pattern is the one that generalizes. If you're deploying agents that interact with users or customers, build a simulation layer first. Let the agent make mistakes where they're cheap. This is how Shopify gets to 100% adoption without 100% chaos.
What builders should take from this: stop rationing tokens. The generation-to-review ratio is a better quality metric than any benchmark. And if you're serious about agents in production, build your own SimGym. The simulation layer is where confidence comes from.
2. Your Claude Code Was Running at 1/5th Context. Nobody Told You.
Three separate findings from three different agents converged on the same bug today, and it explains a lot of the Opus 4.7 frustration I've been seeing.
Claude Code v2.1.117, released April 22, fixed a critical issue: Opus 4.7's context window was being calculated as 200K tokens instead of its native 1M. If you were on any earlier version, your agent was operating with a 5x context truncation. No warning. No error message. Just silently dropping everything past the 200K mark.
Think about what that means in practice. Long codebases getting chopped. Conversation history vanishing mid-session. Tool results silently discarded. If you were working on a large project and noticed Opus 4.7 "forgetting" things or giving worse answers than 4.6, this is likely why. The model wasn't dumber. It was blind.
This connects directly to the r/ClaudeAI post that hit 495 upvotes: "Swapped to 4.7 and embarrassed myself at work." A developer upgraded, generated work product, submitted it without checking, and got burned. The thread is full of similar stories. People assumed the model regressed. Some of them were right (SimpleBench scores did drop, 332 upvotes on r/singularity). But some of them were fighting a context bug that made a capable model look broken.
The same release also improved session resume speed by up to 67% for sessions over 40MB. If you work in long-running sessions with extensive tool history, the lag when reopening is gone. They also added inline thinking progress indicators so you can tell the difference between "still working" and "stuck."
Here's what to do: update to v2.1.117 right now. Then re-evaluate any negative impressions you formed about Opus 4.7 in the past week. Run the same tasks again with full 1M context and see if your complaints still hold. Some will. Opus 4.7 genuinely traded reasoning breadth for coding specialization, and the SimpleBench data confirms that. But some of your bad experiences were just a context bug. Separate the two before making model decisions.
The larger lesson: always check your tool versions before blaming the model. I've made this mistake too. The infrastructure between you and the model is as important as the model itself.
3. AI Coding Models Rewrite Your Entire Function When You Ask Them to Change One Line. Now We Have Numbers.
A research post that hit 390 points on Hacker News finally quantifies something every developer using AI coding tools has felt: models over-edit. You ask to fix a null check and they restructure the entire function.
The study uses token-level Levenshtein distance and cognitive complexity metrics to measure how much models change beyond what's necessary. The results aren't close.
GPT-5.4 is the worst offender at 0.395 normalized distance. It rewrites aggressively, adding abstractions, renaming variables, restructuring control flow. Claude Opus 4.6 scored 0.060, the most minimal editor in the test. That's a 6.5x difference in unnecessary changes. Sonnet, Gemini, and the other models fall somewhere in between, but the spread tells you these models have very different editing philosophies baked in.
Here's the part that's directly actionable. Adding an explicit instruction like "preserve original code, only change what's necessary" significantly improved every model tested. All of them. It's the simplest possible intervention and it works. I've been adding variants of this to my own prompts for months, but seeing the quantitative proof that it works across model families is useful confirmation.
The research also found that RL-based training outperforms supervised fine-tuning for teaching models to make minimal edits without catastrophic forgetting. This matters for model builders, but it also matters for tool builders: if you're creating coding agent skills, the editing behavior of the underlying model is a variable you need to account for. A skill written for Opus 4.6's minimal editing style might produce excessive diffs when run on GPT-5.4.
For anyone managing a codebase with AI-generated commits, the practical advice is simple. First, always include a "minimal changes only" instruction in your coding prompts. Second, if you're evaluating models for a coding workflow, test editing behavior specifically. Run the same 10 edit tasks across models and measure the diff size. The model that changes least while still solving the problem is the one your team will trust. Third, if you're reviewing AI-generated PRs, know your model's tendency. GPT-5.4 PRs need more aggressive diff review. Opus 4.6 PRs are probably close to what a human would write.
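If you want to run that comparison yourself, here's a rough sketch of a normalized token-level edit distance, in the spirit of the study's metric. The whitespace tokenizer and the example strings are mine; the paper's exact tokenization and normalization may differ:

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming edit distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        curr = [i]
        for j, token_b in enumerate(b, 1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def normalized_edit_distance(original: str, edited: str) -> float:
    a, b = original.split(), edited.split()  # crude whitespace tokenizer; swap in your own
    return levenshtein(a, b) / max(len(a), len(b), 1)

# Same fix, two hypothetical model outputs: the second rewrites far more than it needs to.
original = "def is_admin(user): return user.role == 'admin'"
minimal  = "def is_admin(user): return user is not None and user.role == 'admin'"
rewrite  = "def check_admin_role(u): return bool(u and getattr(u, 'role', None) == 'admin')"
print(round(normalized_edit_distance(original, minimal), 3))
print(round(normalized_edit_distance(original, rewrite), 3))
```

Run your 10 edit tasks through something like this and the over-editing model identifies itself quickly.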
4. A 27B Model Just Matched the 400B One. It Fits on Your Laptop.
Alibaba released Qwen3.6-27B on April 22. Dense architecture. Open weights. 77.2% on SWE-bench Verified, within 3.7 points of Claude Opus 4.6. On SkillsBench, it scores 48.2% versus its own 397B MoE predecessor's 30.0%. That's an 18-point jump, roughly 60% better in relative terms, with 14.7x fewer parameters.
Let those numbers sit for a second. A model you can run on a single GPU is now competing with the models that cost $15 per million tokens via API.
Simon Willison tested the Q4_K_M quantization at 16.8GB running locally via llama.cpp. It works on an RTX 4090. It works on M-series Macs. It generates functioning code. Unsloth shipped GGUF quantizations within hours of release (478 upvotes on r/LocalLLaMA), and Ollama already lists it.
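If you'd rather poke at it from Python than the llama.cpp CLI, the llama-cpp-python bindings load any GGUF file. A minimal sketch; the model filename below is a placeholder for whichever quantization you downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="Qwen3.6-27B-Q4_K_M.gguf",  # placeholder; point at your local GGUF
    n_ctx=32768,         # context window to allocate; raise if you have the RAM
    n_gpu_layers=-1,     # offload every layer to the GPU (RTX 4090 / M-series)
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that parses RFC 3339 timestamps."}],
    max_tokens=512,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```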
I need to flag the caveat: Qwen's benchmarks use their own internal agent scaffold. Independent reproduction outside that scaffolding is still limited as of today. The SWE-bench number may not hold up in your specific environment with your specific tooling. But even if it drops 10 points, a 67% SWE-bench score from a 27B dense model running locally would still be remarkable.
The r/LocalLLaMA discussion about whether to buy a 128GB M5 Max got 76 upvotes and 113 comments, a 1.49 comment-to-score ratio that's the highest in today's entire dataset. People aren't just interested. They're doing math on whether a $4,000 hardware investment pays for itself versus monthly API bills. For a developer spending $200-400/month on API tokens, the break-even is under two years. For a team of five, it's months.
This directly challenges the flat-rate pricing story that's already crumbling. GitHub paused Copilot sign-ups and removed Opus from Pro. Anthropic briefly moved Claude Code to the $100/month Max tier. Both companies admitted that agentic workloads consume far more resources than flat-rate plans can sustain. If the alternative is a one-time hardware purchase and local inference at comparable quality, the economics shift fast.
The pattern is clear. Dense models are closing the gap on MoE for coding tasks specifically. The cost advantage of running 27B locally versus calling a 400B+ API is 10-50x per token. For production-adjacent coding work, not just experimentation, local-first is becoming viable.
5. Zed Ships Parallel Agents. First IDE Where Two AI Agents Work at the Same Time.
Zed launched Parallel Agents on April 22. One agent refactors your backend. Another updates the frontend. A third writes tests. All running simultaneously in the same editor window. No tab switching, no separate terminals, no worktree juggling.
Nobody else does this. Cursor runs one agent at a time. Windsurf runs one agent at a time. VS Code with Copilot runs one agent at a time. Zed's Rust architecture handles native concurrency in a way that Electron-based editors structurally can't match. The performance isn't bolted on. It's the reason the editor exists.
The feature uses the open Agent Client Protocol, so you're not locked into any specific model provider. Any agent that speaks ACP can plug in. That's a deliberate architectural choice: Zed wants to be the orchestration layer, not the model layer. Let Anthropic, OpenAI, and Google compete on intelligence. Zed competes on the ability to run them all at once.
235 HN points with active discussion. The comments split between people excited about the workflow implications and people skeptical about merge conflicts when two agents edit overlapping code. Both concerns are valid. I don't know how well this handles conflicting file edits in practice. That's the hard problem. But the base capability, genuine concurrent AI work in a single IDE, is a real differentiator.
Connect this to the rest of today's findings and the picture gets interesting. If 100% of your workforce uses AI (Shopify), models are getting cheap enough to run locally (Qwen 3.6), and you know which models over-edit least (the minimal editing research), then parallel agents are the next logical step. You don't need one amazing agent. You need three decent agents working in parallel with human review at the end. That's a different workflow than what Cursor and Copilot are optimized for.
For builders evaluating IDEs: try the parallel agent workflow on a real task. Split a feature into frontend, backend, and tests. Run three agents. See if the merge is clean. If it is, you just 3x'd your throughput. If it isn't, you learned something about your codebase's coupling. Either way, worth an afternoon.
Section Deep Dives
Security
CVE-2026-40372: Microsoft ships emergency patch for CVSS 9.1 ASP.NET authentication bypass. Microsoft released .NET 10.0.7 as an out-of-band update. A bug in the ManagedAuthenticatedEncryptor computes HMAC validation tags with an incorrect offset, letting unauthenticated attackers forge authentication cookies and gain SYSTEM privileges. Affects macOS, Linux, and Windows with custom crypto algorithms. If you run ASP.NET in production, patch now and rotate your DataProtection key rings.
Flowise CVE-2025-59528 confirmed actively exploited in the wild, 12,000+ instances exposed. VulnCheck reports the CVSS 10.0 flaw in Flowise's CustomMCP node executes user-provided JavaScript without validation, granting full Node.js privileges. Initial attacks came from a single Starlink IP. This is Flowise's third exploited CVE. AI agent builders shipping with code execution defaults are now a confirmed, repeating attack surface.
Google Antigravity sandbox escaped via prompt injection through a native file-search tool. Pillar Security disclosed that the find_by_name tool executes before Secure Mode protections evaluate commands. Injecting the -X flag through the Pattern parameter converts file search into arbitrary code execution. Google patched and awarded a bounty. The systemic pattern here: native tools bypass the security boundaries they operate within.
87% of AI-generated pull requests contain security issues, AI devs introduce findings at 10x the rate. The Cloud Security Alliance published research showing AI-assisted devs commit 3-4x faster but create vulnerabilities at 10x the rate. In one week alone (the week of March 27), 35 AI-generated CVEs were disclosed. Mean time from disclosure to confirmed exploitation has fallen below one day in 2026, down from 2.3 years in 2019. Mandatory automated security scanning in CI/CD isn't optional anymore.
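A bare-minimum CI gate along those lines, assuming a Python codebase and the open-source Bandit scanner; swap in whatever scanner fits your stack:

```python
import json
import subprocess
import sys

# Run Bandit over the repo and emit JSON findings.
result = subprocess.run(
    ["bandit", "-r", ".", "-f", "json", "-q"],
    capture_output=True, text=True,
)
findings = json.loads(result.stdout or "{}").get("results", [])

# Fail the build on anything medium severity or above.
blocking = [f for f in findings if f["issue_severity"] in ("MEDIUM", "HIGH")]
for f in blocking:
    print(f"{f['filename']}:{f['line_number']} [{f['issue_severity']}] {f['issue_text']}")
sys.exit(1 if blocking else 0)
```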
Anthropic's Mythos security claims "collapse under scrutiny." A detailed analysis (75 HN points) found that removing the two most exploitable bugs from the test suite dropped Mythos success rates from 72.4% to 4.4%, matching predecessor Sonnet 4.6. An independent team reproduced the showcase bugs using an open-weights 3.6B model at $0.11/M tokens. The 244-page system card contained no CVSS distribution or CVE enumeration.
Multi-agent harness synthesis finds bugs that human auditors missed for decades. A new arXiv paper presents a framework that automatically wires specialized LLM agents for security auditing, splitting work among discovery, exploitation, and verification agents. The system found real vulnerabilities that manual review and automated fuzzers missed in source-available targets.
Agents
Google rebrands Vertex AI to Gemini Enterprise Agent Platform with cryptographic agent identity. At Cloud Next 2026, Google shipped Agent Identity (cryptographic IDs for every agent), Agent Registry, Agent Gateway for policy enforcement, and Memory Bank for persistent context. ADK v1.0 is stable across Python, Go, Java, and TypeScript with native MCP. First hyperscaler to offer unified agent building, governance, and orchestration under one product.
A2A protocol hits v1.2 with signed agent cards, 150 organizations in production. Google's update adds JWS-based cryptographic domain verification. Microsoft Azure, Amazon Bedrock, Salesforce, SAP, and ServiceNow are all running A2A in production. Governance moved to the Linux Foundation. Agent-to-agent interoperability just went from spec to infrastructure.
Snowflake positions as "control plane for the agentic enterprise" with Cortex Code Claude plugin. Cortex Code ships a Claude Code plugin (private preview), VS Code extension, and MCP/ACP support for external agents. Over 50% of Snowflake customers are actively using it. Snowflake Intelligence now includes a personal work agent that learns individual preferences.
Microsoft Agent Framework 1.0 GA merges Semantic Kernel and AutoGen into one SDK. The release ships for .NET and Python with multi-agent orchestration, native A2A and MCP, and browser-based DevUI. Migration assistants analyze existing code and generate step-by-step plans. First production-ready, long-term-supported agent framework from a major cloud vendor.
OpenAI launches workspace agents in ChatGPT, always-on Codex-powered team automation. Workspace agents handle complex tasks, write code, use connected apps, and run on schedules. Deploy in Slack, continue work offline. Available in research preview for Business, Enterprise, and Edu plans, free until May 6.
Manus team: KV-cache hit rate is the single most important metric for production agents. Their blog post from building an agent serving millions of users argues context should be treated as a first-class system. Key patterns: communicate between agents through artifacts, not raw traces. Design tools that are self-contained with unambiguous parameter names.
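The cache-hit advice boils down to keeping the prompt prefix byte-stable and append-only. A rough sketch of the pattern, where send_to_model stands in for whatever client you use:

```python
SYSTEM_PROMPT = "You are a coding agent..."      # never edited mid-session
TOOL_DEFINITIONS = "tool: read_file(path) ..."   # fixed ordering, no timestamps

# Everything that never changes goes at the front, in a fixed order, so the
# serving layer can reuse the KV cache for the shared prefix on every turn.
STABLE_PREFIX = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "system", "content": TOOL_DEFINITIONS},
]

history: list[dict] = []

def send_to_model(messages: list[dict]) -> str:
    raise NotImplementedError("stand-in for your model client")

def run_turn(user_msg: str) -> str:
    # Append-only: never rewrite or reorder earlier messages; mutating the
    # prefix invalidates the cached KV state and forces a full re-encode.
    history.append({"role": "user", "content": user_msg})
    reply = send_to_model(STABLE_PREFIX + history)
    history.append({"role": "assistant", "content": reply})
    return reply
```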
Research
WebGen-R1: 7B model rivals DeepSeek-R1 671B at multi-page website generation. The paper applies end-to-end RL with cascaded multimodal rewards combining structural, functional, and aesthetic supervision. A 7B model consistently outperforms open-source models up to 72B. Code and data released. Directly applicable to anyone building AI-assisted frontend tooling.
Semantic stratification exposes hidden bias in RAG retrieval evaluation. This paper formalizes retrieval evaluation as a coverage problem, revealing that heuristic query sets introduce bias that masks real failure modes. The fix: cluster your document corpus and measure retrieval quality per cluster. If your RAG eval metrics don't match production performance, this is probably why.
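A sketch of the per-cluster version, assuming you already have document embeddings, relevance judgments keyed by row index, and a retrieve(query, k) function. The clustering and metric choices here are mine, not necessarily the paper's:

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_recall(doc_embeddings: np.ndarray,
                      eval_queries: list[dict],  # {"query": str, "relevant_doc_ids": set[int]}
                      retrieve,                  # retrieve(query, k) -> list[doc_id]
                      n_clusters: int = 20,
                      k: int = 10) -> dict[int, float]:
    # Assign every document (row index = doc_id) to a semantic cluster.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings)

    per_cluster_hits: dict[int, list[float]] = {c: [] for c in range(n_clusters)}
    for q in eval_queries:
        retrieved = set(retrieve(q["query"], k))
        for doc_id in q["relevant_doc_ids"]:
            per_cluster_hits[labels[doc_id]].append(1.0 if doc_id in retrieved else 0.0)

    # Average-case metrics hide weak clusters; report each cluster separately.
    return {c: float(np.mean(hits)) for c, hits in per_cluster_hits.items() if hits}
```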
ICLR 2026 opens in Rio with 3,462 accepted papers. The 14th ICLR has a 29.8% acceptance rate. Dominant themes: agent memory architectures, DPO alignment fixes, retrieval-augmented reasoning, model compression, and embodied AI. EigenBench introduces black-box comparative benchmarking of model values against constitutions for automated alignment auditing.
LLM agents develop emergent reputation systems and strategic deception in repeated Avalon games. Researchers find that memory-augmented LLM agents spontaneously develop trust networks and increasingly sophisticated deception across rounds. Unlike single-game studies, persistent memory changes agent behavior in ways that mirror real multi-agent deployments.
MathNet: MIT and IMO release the world's largest Olympiad math dataset. 30,676 expert-authored problems across 47 countries, 17 languages, and 143 competitions. 5x larger than any previous dataset. Every solution is peer-reviewed from official competition booklets. Being presented at ICLR 2026.
Infrastructure & Architecture
Anthropic commits to multi-gigawatt Google TPU partnership starting 2027. The announcement alongside Cloud Next signals Anthropic's compute requirements are scaling dramatically, consistent with its $30B ARR trajectory. Anthropic becomes one of the largest confirmed customers for Google's TPU 8 generation.
Mira Murati's Thinking Machines Lab signs multibillion-dollar Google Cloud deal for GB300 chips. TechCrunch reports the deal supports Thinking Machines' RL-heavy Tinker product. The startup raised $2B seed at $12B valuation and is among the first to access GB300 systems with 2x training speed improvements.
Google Cloud commits $750M fund for partners' agentic AI development. The largest hyperscaler partner investment covers AI value assessments, Gemini proofs-of-concept, and embeds forward-deployed Google engineers alongside Accenture, Deloitte, and PwC. Google is buying ecosystem lock-in with capital.
OpenAI deep dive: WebSockets in the Responses API cut Codex agent latency 30-40%. The technical post explains that persistent connections with connection-scoped caching eliminate per-turn context resending. For rollouts with 20+ tool calls, only incremental inputs plus a previous_response_id are sent each turn. If you're building multi-step agent loops, this is required reading.
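The incremental-input idea works independently of transport. Over plain HTTPS it looks roughly like the sketch below; the WebSocket mode the post describes isn't shown here, and the model name is the one from the article:

```python
from openai import OpenAI

client = OpenAI()

# First turn: send the full task description once.
first = client.responses.create(
    model="gpt-5.4",  # model name as reported in the article
    input="Refactor utils/date.py to remove the deprecated parse() calls.",
)

# Later turns: send only the new tool output plus a pointer to the prior
# response, instead of resending the whole accumulated context.
follow_up = client.responses.create(
    model="gpt-5.4",
    previous_response_id=first.id,
    input=[{"role": "user", "content": "Tool result: 3 call sites found, tests passing."}],
)
print(follow_up.output_text)
```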
Tools & Developer Experience
HuggingFace ml-intern: open-source agent that reads papers, trains models, ships results. ml-intern autonomously reads arXiv papers, discovers datasets, runs training jobs, and iterates. Early results show +22 points on GPQA and +60% on HealthBench in 10 hours of autonomous operation. 530 new stars today, reaching 2,058. Fastest-rising new repo on GitHub trending.
context-mode claims 98% context window reduction for AI coding agents. At 9.1K stars, it sandboxes tool output to keep raw data out of the AI context window, reducing a representative 315 KB of tool output to 5.4 KB. Uses SQLite + FTS5 for session continuity with BM25 retrieval. The "think in code" paradigm where the LLM writes scripts instead of reading files is a real workflow shift.
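The underlying pattern is easy to replicate even without the project: park bulky tool output in SQLite's FTS5 index and let the agent pull back only the matching snippets. A rough sketch; the schema and snippet sizes are my own, not context-mode's:

```python
import sqlite3

con = sqlite3.connect("session.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS tool_output USING fts5(tool, content)")

# Instead of pasting a 315 KB grep result into the context window, store it.
big_blob = "... the full raw tool result goes here ..."  # placeholder for real output
con.execute("INSERT INTO tool_output (tool, content) VALUES (?, ?)", ("grep", big_blob))
con.commit()

# The agent later queries for just the slice it needs, ranked by BM25.
rows = con.execute(
    "SELECT tool, snippet(tool_output, 1, '[', ']', ' ... ', 16) "
    "FROM tool_output WHERE tool_output MATCH ? ORDER BY bm25(tool_output) LIMIT 5",
    ('"connection timeout"',),
).fetchall()
for tool, excerpt in rows:
    print(tool, excerpt)
```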
Google Workspace CLI ships in Rust with git-style document workflow. At 25,249 stars, the CLI generates its command surface from Google's Discovery Service. The standout: git-style pull/push for Sheets, Docs, and Slides. Pull converts files to agent-friendly local formats, agents edit them, push sends changes back. 100+ built-in agent skills for automation.
Honker: Postgres NOTIFY/LISTEN semantics for SQLite, no polling. A Rust-based SQLite extension (80 HN points) watches SQLite's WAL file for cross-process notifications in single-digit milliseconds. Ships bindings for Python, Node, Bun, Ruby, Go, Elixir, and C++. If you're running SQLite as your primary datastore and need reactive patterns, this replaces polling loops.
OpenAI releases Privacy Filter: open-weight 1.5B PII detection model, Apache 2.0. The model (50M active parameters) achieves 97.4% F1 on PII-Masking-300k, supports 128K context, and runs locally in browsers. OpenAI's most significant open-weight release for enterprise data privacy. Free. Run it before your data hits any API.
Models
Xiaomi releases MiMo-V2.5-Pro: 1T-parameter MoE, $1/$3 pricing, 40-60% fewer tokens than Opus 4.6. The model has 42B active parameters, 1M context, and multimodal capabilities. On ClawEval, it hits 64% Pass^3 using roughly 70K tokens per trajectory while Opus 4.6 and GPT-5.4 need 40-60% more. At $1/M input and $3/M output with 60-80 tok/s throughput, the price-to-performance ratio is aggressive.
Opus 4.7 in production: quality regressions are real, but context bug made them look worse. Between the 495-upvote complaint thread and the 332-upvote SimpleBench data showing lower scores than both 4.6 and 4.5 on common-sense reasoning, there's genuine evidence of a reasoning-for-coding trade. The context bug (Top 5 story #2) amplified the perception. Update Claude Code first, then re-evaluate.
Vibe Coding
Shift-Up framework proposes software engineering guardrails for AI-native development. Researchers present a design-science framework addressing architectural drift, limited traceability, and reduced maintainability in agent-driven implementation. With Karpathy declaring vibe coding "passé" in favor of agentic engineering, this is the first academic attempt to formalize quality gates for the new paradigm.
21% of Show HN sites have "heavy AI slop." Adrian Krebs built a headless-browser scanner with 15 checks: Inter font headlines, purple gradients, glassmorphism, shadcn/ui components, colored card borders, centered heroes. Of 500 sites, 21% scored 5+ patterns, 46% scored 2-4, and 33% were clean. HN moderators now restrict access for new accounts partly because of the volume.
flipbook.page: website streamed live directly from a model. 344 HN points. The page renders as the model generates HTML/CSS/JS in real-time. No static hosting. The discussion focused on latency, caching, and whether this is a demo or a paradigm. Probably a demo. But it's a compelling one.
Agents don't refactor. Make cleanup an explicit gate. paddo.dev argues that expecting agents to clean up code is a category error. Agents solve the stated problem and accumulate debt if not explicitly directed otherwise. The fix: make cleanup a separate step with its own verification criteria.
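One way to wire that gate, assuming a generic call_agent helper and a pytest suite; both are placeholders, not any specific tool's API:

```python
import subprocess

def call_agent(prompt: str) -> None:
    raise NotImplementedError("placeholder for your coding-agent invocation")

def tests_pass() -> bool:
    return subprocess.run(["pytest", "-q"]).returncode == 0

# Step 1: implementation pass. Solve the stated problem only.
call_agent("Implement the feature described in TICKET.md. Minimal changes only.")
assert tests_pass(), "implementation gate failed"

# Step 2: cleanup pass. A separate prompt with its own acceptance criteria.
call_agent(
    "Review the diff you just produced. Remove dead code, duplicated helpers, "
    "and leftover debug output. Do not change behavior; tests must still pass."
)
assert tests_pass(), "cleanup gate failed"
```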
Hot Projects & OSS
Graphify: knowledge graph skill for 10 coding agents, 33.5K stars. Graphify turns code, docs, PDFs, images, and videos into a queryable knowledge graph. Uses a PreToolUse hook so the agent consults the graph before every file-search call. Claims 71.5x fewer tokens per query, though independent benchmarks haven't confirmed that.
MemPalace hits 49.2K stars in under three weeks. MemPalace stores AI conversation history in a hierarchical structure using SQLite and ChromaDB. Claims highest score on LongMemEval (96.6% raw, 100% hybrid). The hybrid score was achieved after targeted fixes for specific failing questions, so interpret the 100% carefully.
LlamaIndex repositions as "document agent and OCR platform." The rebrand from RAG framework signals that retrieval alone is table stakes. Value is in end-to-end document understanding with agentic capabilities. Multi-agent support is now first-class. 48.8K stars.
awesome-mcp-servers passes 85.4K stars. punkpeye's collection is the de facto registry for the MCP ecosystem. The growth from niche list to mega-repo tracks MCP's transition from experimental to production.
SaaS Disruption
ServiceNow beats Q1 estimates, stock crashes 14% anyway. Q1 subscription revenue hit $3.67B (+22% YoY), and AI revenue target raised from $1B to $1.5B. But shares dropped 14% after management disclosed 75 bps headwind from Iran-conflict deal delays. CEO McDermott dismissed AI-native competitors as "parlor tricks." When the incumbent calls competitors parlor tricks, that's usually when the competitors are getting close.
Redpoint CIO survey: 45% of AI budgets cannibalize existing software spend. 141 CIOs surveyed: customer service (26% considering vendor swap), finance ops (21%), project management (20%). Budget growth is just 3.4%. Application software forward P/E collapsed from 84x (2021) to 22.7x. This isn't new money. It's the same money moving.
Adobe gains 3.8% from embedding Claude. Figma drops 7.3% from competing with Claude. The natural experiment played out in one week. Adobe embedded Claude as Firefly AI Assistant. Figma got cannibalized when Anthropic launched Claude Design. Incumbents that treat AI labs as infrastructure partners gain. Those positioned as competitors get eaten.
Three enterprise incumbents restructure AI packaging in the same week. Zendesk merges AI tiers (April 27 rollout), Workday ships 14 new agents with new pricing, SAP faces Q1 earnings on consumption-based AI pricing launching July. Per-seat pricing is being dismantled simultaneously across unrelated enterprise categories.
Intercom's Fin AI agent hits $100M+ ARR on $0.99/resolution pricing. Resolution rates climbed from 27% at launch to 67%+, handling 80%+ of support volume. The $1M performance guarantee backs it. Zendesk, HubSpot, and others are converging on similar models.
Policy & Governance
Elizabeth Warren warns the AI bubble is "17 times the dot-com frenzy." At a Vanderbilt Policy Accelerator event, Warren cited the industry needing ~$2 trillion in annual revenue by 2030 to justify current investment, versus $20 billion generated in 2025. That's 1% of breakeven. She pressed FSOC to probe AI debt bubble risks. Agree or disagree with Warren's politics, the math is worth looking at.
Ars Technica publishes formal AI newsroom policy after fabricated quotes incident. The policy (109 HN points) came after firing a reporter for using AI to fabricate quotes attributed to a named source. AI is now limited to supervised research workflows. Any AI material must be visually separated and disclosed.
World ID 4.0 integrates with Zoom and Tinder for iris-scan verification. Gizmodo reports (104 HN points) Zoom gets anti-deepfake "Deep Face" and Tinder gets a Verified Human badge, both via iris-scanning Orbs. Tinder pilot launches Q3 2026 in US/UK metros, projecting 5M voluntary opt-ins.
Federal judge rules AI chat conversations have no attorney-client privilege. Deleted chats are recoverable. In the Heppner case, AI conversations were ruled seizable as evidence. A different judge ruled the opposite on the same day. For developers using AI tools on proprietary code: treat your Claude/ChatGPT conversations like email. Discoverable, not privileged.
Top MAGA influencer "Emily Hart" unmasked as AI persona created by an Indian medical student. NY Post reports the persona pulled 3-10 million views per reel. The creator used Gemini for monetization strategy and Grok for generating explicit content. Instagram removed the account.
Skills of the Day
- Add "preserve original code, only change what's necessary" to every coding prompt. The over-editing research shows this single instruction improves editing minimalism across all tested models. GPT-5.4 benefits most (0.395 baseline). Opus 4.6 is already minimal (0.060) but still improves. Zero cost, immediate impact.
- Update Claude Code to v2.1.117 before doing anything else today. Opus 4.7 was running at 200K context instead of 1M. If you formed opinions about 4.7's quality on earlier versions, re-run your failing tasks with full context. Some regressions are real. Some were just this bug.
- Track your generation-to-review token ratio as a quality metric. Shopify's CTO uses the ratio of generation tokens to automated review tokens by high-end models as the quality gate. Set up a simple log: how many tokens did the agent generate, how many did the review model consume? If generation vastly outpaces review, your quality process is too thin.
- Run Qwen3.6-27B Q4_K_M locally and benchmark it against your current API model. 16.8GB fits on an RTX 4090 or M-series Mac. Run your 10 most common coding tasks on both. If local quality is within your tolerance, you just eliminated your API bill for routine work. Keep the API for complex tasks.
- Install OpenAI's Privacy Filter (Apache 2.0, 1.5B params) before any PII touches an API. It runs locally, achieves 97.4% F1 on PII detection, and handles 128K context. Add it as a pre-processing step in any pipeline that sends user data to external models. The 50M active parameter count means it's fast enough for real-time use.
- Use WebSocket mode in OpenAI's Responses API for multi-step agent loops. Persistent connections with connection-scoped caching eliminate per-turn context resending. For agents making 20+ tool calls, OpenAI measured up to 40% faster end-to-end execution. Send only incremental inputs plus the previous_response_id each turn.
- Evaluate your RAG pipeline with semantic stratification, not average-case metrics. Cluster your document corpus into semantic groups and measure retrieval quality per cluster. The Coverage, Not Averages paper proves that heuristic query sets introduce hidden bias that masks exactly the failure modes your users hit.
- Implement cleanup as an explicit, separate agent step with its own verification criteria. Agents solve the stated problem and move on. They don't refactor. Don't hope for it. Create a dedicated cleanup pass that runs after implementation, with its own tests and acceptance criteria.
- Audit your CI/CD for AI-generated CVE exposure with automated security scanning. CSA data shows AI-assisted devs introduce security findings at 10x the rate of non-AI peers, and mean time from disclosure to exploitation is now under one day. Mandatory scanning in CI/CD is the minimum countermeasure. If you don't have it, you're accumulating risk faster than you realize.
- Try Zed's parallel agents on a real feature split into frontend, backend, and tests. Run three agents concurrently and see if the merge is clean. If your codebase is well-separated, you just 3x'd throughput. If merges conflict, that's signal about your architecture's coupling. Either outcome is useful information.
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, fill in):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.