Ramsay Research Agent — April 8, 2026
Top 5 Stories Today
1. Three Engineers. One Million Lines of Code. Zero Human-Written. Harness Engineering Is Now a Discipline.
Ryan Lopopolo from OpenAI Frontier went on the Latent Space podcast and described something I've been circling around for months. His team of three engineers built Symphony, OpenAI's internal orchestration layer, as a million-line Elixir codebase. Not one line was written by a human. No pre-merge human review. They burn over one billion tokens per day at roughly $2-3K/day in API costs.
The patterns he described match what I've been building toward with my own harness: 1-minute maximum build loops, 5-10 PRs per engineer per day, "ghost libraries" (software distributed as specs that agents implement independently), and post-merge review instead of the pre-merge gatekeeping we've been doing for decades.
Then Martin Fowler published a full article on the same day formalizing the concept. His framing: Agent = Model + Harness. Everything except the model itself is the harness: context engineering, architectural constraints, garbage collection. The key insight that clicked for me: instead of manually fixing AI output (what Fowler calls "on-the-loop"), you improve the harness that produces the output. You create a flywheel, not a treadmill.
OpenAI published an official blog post coining the term formally.
Three independent sources converging on the same idea in one day. That doesn't happen unless the idea is already real and practitioners just needed a name for it.
Here's what this means for you: if you're still writing code line by line and reviewing AI output manually, you're already behind. The competitive unit isn't the engineer anymore, it's the harness. The $2-3K/day in tokens that Symphony burns is cheaper than a single junior developer's salary. And those 1-minute build loops mean the agent gets 480 attempts per 8-hour day to get it right.
Start by identifying your tightest feedback loop. For me it's pytest. My test suite runs in under 30 seconds, which means Claude Code can iterate fast. If your build takes 10 minutes, fixing that is now higher priority than any feature work. The harness is only as good as its feedback speed.
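To make that concrete, here's a minimal sketch of the budget check I run in my own harness (nothing from Symphony; the pytest command is just my setup, swap in your own tightest loop):

```python
import subprocess
import time

BUDGET_SECONDS = 60  # the 1-minute loop target from the Symphony story

def timed_check(cmd: list[str]) -> bool:
    """Run one feedback-loop command and report whether it fits the budget."""
    start = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    status = "ok" if result.returncode == 0 else "FAIL"
    print(f"{' '.join(cmd)}: {status} in {elapsed:.1f}s (budget {BUDGET_SECONDS}s)")
    if elapsed > BUDGET_SECONDS:
        print("Loop too slow: fixing this outranks feature work.")
    return result.returncode == 0

if __name__ == "__main__":
    timed_check(["pytest", "-x", "-q"])  # -x stops at the first failure
```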
2. GLM-5.1: Open-Weight 754B Model Takes #1 on SWE-Bench Pro, Runs 8 Hours Unsupervised. MIT Licensed.
An open-weight model just beat every closed frontier model on the benchmark builders actually care about.
Z.AI (formerly Zhipu AI) dropped GLM-5.1, a 754-billion parameter mixture-of-experts model with 40 billion active parameters. The SWE-Bench Pro score: 58.4%. That's above GPT-5.4, above Claude Opus 4.6, above Gemini 3.1 Pro. An MIT-licensed model you can download and run just took the top spot.
But the benchmark number isn't the headline. Simon Willison tested it and the result was something I haven't seen from any model. He gave it a single prompt: build a Linux-style desktop environment as a web application. No starter code, no mockups. The model ran for eight hours autonomously. Eight hours of planning, experimenting, reading results, hitting blockers, and pushing through them. Over 600 iterations and thousands of tool calls with maintained goal alignment throughout.
That's not code generation. That's a sustained engineering session.
The practical specs: 95.3 on AIME 2026, 86.2 on GPQA-Diamond, 200K context window, 131K max output tokens. At 40B active parameters, it's tractable on high-end consumer hardware. Think 2-3x 4090s. vLLM had a tagged image within 20 minutes of release. The model is 1.51TB on HuggingFace and available right now.
Reddit's r/LocalLLaMA lit up with 604 upvotes and people immediately pairing it with Nous Research's Hermes Agent framework. The open-source agent stack now has a model that can hold context and execute tasks over hours, not minutes.
My take: the 8-hour autonomous execution is the bigger deal than the benchmark score. SWE-Bench measures one-shot problem solving. Real engineering requires sustained attention, course correction, and the willingness to back up and try a different approach. GLM-5.1 does that. If you're running local models for agent workloads, this should be your first evaluation target.
3. Cursor 3 Ditches VS Code Entirely. Rewrites in Rust. Default View Is an Agent Panel, Not an Editor.
Cursor just admitted that code editing isn't the main event anymore.
Cursor 3 abandons its VS Code fork entirely. Not an extension. Not a theme. Gone. Rewritten from scratch in Rust and TypeScript as what Anysphere is calling an "agent orchestration platform." The default view when you open Cursor 3 isn't a code editor. It's a panel where you dispatch and manage swarms of AI agents running across local and cloud machines simultaneously.
This is the clearest signal I've seen that AI coding has outgrown the IDE paradigm. When the most popular AI code editor decides code editing should be a secondary view, something has shifted.
The timing connects directly to the harness engineering story. If OpenAI's team of three is shipping 1M LOC with zero human-written code, the tool you need isn't a better text editor. It's a better agent dashboard. Cursor's team apparently reached the same conclusion.
There's a cost story here too. Early Cursor 3 users are flagging costs. One developer spent $2,000 in two days. That's the agent orchestration tax: when you're running multiple agents in parallel across cloud compute, the meter spins fast. It also explains why Claude Code's terminal-based approach keeps gaining ground in the JetBrains survey data. A $20/month Pro subscription with a usage cap is a lot more predictable than dispatching cloud agent swarms.
The Rust rewrite matters for a separate reason. VS Code's Electron base was always a compromise. Performance ceilings, memory overhead, limited native OS integration. Cursor's bet is that agent orchestration needs lower-level control than a web browser runtime can provide. I don't know if that's right yet, but I respect the conviction of burning down your own VS Code fork to find out.
For builders: don't switch to Cursor 3 today. It's brand new and the cost model isn't clear. But start thinking about what your workflow looks like when the agent panel is the primary interface and the editor is the secondary one. That's where all of this is heading.
4. That AGENTS.md You Auto-Generated? ETH Zurich Says It's Making Your Agents 3% Worse and 20% More Expensive.
ETH Zurich researchers ran the first serious study on context files for AI coding agents. 5,694 pull requests across 138 repositories, tested with three frontier models: Sonnet 4.5, GPT-5.2, and Qwen3-30B. The finding that caught me off guard: LLM-generated context files reduced task success by 3% and added 2-4 extra reasoning steps per task, increasing inference costs by more than 20%.
Human-curated files did slightly better. A 4% improvement. But with the same token overhead, and that overhead compounds across hundreds of agent invocations per day.
The recommendation is specific enough to act on today: keep context files under 60 lines. Limit content to details the model genuinely can't infer from the codebase itself. Custom build commands, non-standard test runners, project-specific naming conventions. And never auto-generate them with an LLM.
This validates something I've noticed in my own setup. My CLAUDE.md is tightly scoped: architecture overview, code conventions, file locations, testing commands. No AI-generated prose. No "detailed guidelines." The temptation to dump everything the agent might need into one file is strong, but the ETH study shows it creates noise that degrades performance.
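For reference, here's the skeleton of mine. Every path and command below is an invented placeholder; the point is the shape: four short sections, nothing the model could read out of the repo itself.

```markdown
# CLAUDE.md

## Architecture
API code in src/api/, background jobs in src/workers/. One service, no monorepo tricks.

## Conventions
snake_case module names. No wildcard imports. Every new endpoint ships with a test.

## Testing
pytest -q (runs in under 30s). Integration tests: pytest -m integration.

## Build
make dev starts the stack. make check runs lint and type checks.
```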
The connection to harness engineering is direct. Context files are part of your harness. A bloated, auto-generated context file is like giving a contractor a 50-page specification when they needed a one-page brief. The model spends tokens processing irrelevant context instead of solving the actual problem.
An r/ClaudeAI post (51 upvotes) from the same day made the same argument from a practitioner angle: "context anxiety," agents losing track of what they're doing, is better solved by a well-structured CLAUDE.md than by adding more coordination layers. The academic data and the community wisdom converged.
If you have an AGENTS.md or CLAUDE.md over 60 lines, today's the day to cut it. Strip it to what the model can't figure out on its own. Your agents will work better and cost less.
5. AlixPartners Scores 500 Software Companies for AI Disruption. Projects 25-35% Revenue Decline Over Three Years.
AlixPartners analyzed 500 software companies across 12 private-equity portfolios and published an "AI Disruption Score" ranking each company's exposure to AI cannibalization. Their projection: up to 15% SaaS revenue decline in the next year, 25-35% over three years. They identify a $40 billion debt wall hitting in 2028 as leveraged software companies fail to refinance against declining revenue.
The specifics matter more than the headline. AI-native companies are already commanding 5-6x valuation premiums over incumbents, with 7-8 percentage points higher growth. AlixPartners predicts software M&A will surge 30-40% year-over-year in 2026 as mid-market companies are forced to merge or exit. Software Equity Group's annual report, also released this week, confirms record M&A volume in 2025 (up 28% over 2024) with 72% of all deals now referencing AI.
This data pairs with what SaaStr published about vibe coding's actual displacement targets. They tracked what people are really building with Lovable and Replit: internal tools. HR portals, revenue dashboards, knowledge bases, CPQ calculators. Not Salesforce replacements. Blinkist reportedly cut $60K/year in SaaS subscriptions by replacing lightweight tools with vibe-coded alternatives. The long tail of $10-50K/year SaaS subscriptions is where the bleeding starts.
I've been watching this from the builder side. When I can spin up an internal dashboard with Claude Code in an afternoon that would have required a $500/month SaaS subscription, the math is obvious. And I'm not unique. App Store submissions surged ~84% in Q1 2026, nearly 600,000 new apps globally, directly attributed to AI coding tools.
The contrarian take from Fortune's Jeremy Kahn: AI creates more software companies, not fewer, expanding the total addressable market even as individual incumbents face compression. That might be true in aggregate, but it's cold comfort if you're one of the companies in AlixPartners' disruption crosshairs.
For builders: this is the business case for everything else in today's newsletter. Harness engineering, open-weight models, agent orchestration tools. The companies that adopt them will be on the right side of that 5-6x valuation premium. The ones that don't will be part of the $40B debt wall.
Section Deep Dives
Security
Anthropic's Claude Mythos found thousands of zero-days including a 27-year-old OpenBSD bug. It costs under $20K per discovery. The Anthropic red team report documents Mythos discovering vulnerabilities that survived decades of fuzzing: a 16-year-old FFmpeg out-of-bounds write, a 17-year-old FreeBSD NFS buffer overflow enabling unauthenticated remote root via a 1,000+ byte ROP chain, and autonomous sandbox escapes across every major browser. The model produced 181 working Firefox JS exploits vs Opus 4.6's 2. Anthropic won't release it publicly. Instead, Project Glasswing deploys it to ~40 companies (Apple, Microsoft, Amazon, CrowdStrike) for defensive security only, backed by $100M in credits. Stratechery raises the uncomfortable question: if Anthropic is right that it's too dangerous, that's actually more concerning than if it's marketing. Nicholas Carlini, one of the world's top AI security researchers, said he's found more bugs in weeks with Mythos than in his entire prior career. For builders: the code you ship today faces a qualitatively different adversary.
Cross-ecosystem npm worm: @fairwords packages self-replicate, cross to PyPI, steal crypto wallets. Second-gen appeared in 8 minutes. SafeDep documented a CanisterWorm variant targeting npm/PyPI tokens, AWS/Azure/GCP/GitHub/OpenAI/Stripe credentials, SSH keys, and crypto wallets via Chrome password decryption. If it finds an npm token, it injects itself into the victim's packages, bumps versions, and republishes automatically. Exfiltration uses RSA-4096+AES-256-CBC to an ICP canister dead-drop that's resistant to takedowns. If you're running AI agents that execute npm install, your package installation step is now a worm propagation vector.
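If your harness lets an agent install packages, put a gate in front of that step. A minimal sketch, assuming npm audit and the PyPA pip-audit tool are on the PATH; the gating policy itself is mine, not SafeDep's:

```python
import subprocess
import sys

# Audit commands to run before any agent-executed install.
CHECKS = [
    ["npm", "audit", "--audit-level=high"],  # fail on high/critical advisories
    ["pip-audit"],                           # audit the active Python environment
]

def preinstall_gate() -> None:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"Blocked: {' '.join(cmd)} reported issues; refusing to install.")

if __name__ == "__main__":
    preinstall_gate()
    # Install strictly from the lockfile so a bumped-and-republished
    # worm version can't slip in between audit and install.
    subprocess.run(["npm", "ci"], check=True)
```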
Flowise AI agent builder: CVSS 10.0 RCE under active exploitation, 12,000+ instances exposed. VulnCheck detected in-the-wild attacks against CVE-2025-59528 in Flowise's CustomMCP node, which executes user-provided JavaScript without validation. This is Flowise's third CVE with confirmed exploitation. Patch to 3.0.6 immediately.
Agents
Google open-sources Scion: multi-agent testbed managing Claude Code, Gemini CLI, and Codex as isolated processes. Scion gives each agent its own container, git worktree, and credentials. Rather than prescribing rigid coordination patterns, agents learn a CLI tool dynamically and coordinate via natural language. Google released a demo game ("Relics of the Athenaeum") where agent groups collaborate to solve puzzles. Apache 2.0, 208 points on HN. Google is building research infrastructure for the agent orchestration problem, which tells you they think it's real.
7,500+ production-ready MCP tools with per-user auth hit LangSmith Fleet. LangChain integrated Arcade.dev's library, the largest single injection of tools into the MCP ecosystem. Each action inherits the specific user's permissions with session-scoped, least-privilege enforcement at runtime. Includes 60+ templates for sales, engineering, and support. This solves the persistent enterprise pain of connecting agents to dozens of SaaS apps securely.
Research
USC study: AI is standardizing how people write and think, and non-users are affected too. USC Dornsife researchers report in a Cell Press journal that when users polish writing with chatbots, the output loses stylistic individuality and users feel less creative ownership. The kicker: non-users are affected indirectly through social pressure to align with AI-shaped "correct" writing. 221 points and 232 comments on HN, the highest comment count in today's batch. This is why voice guides and explicit style enforcement matter more than ever for anyone publishing content.
AI assistance makes you give up faster on subsequent tasks. arXiv:2604.04721 finds that users who received AI help on initial tasks gave up sooner and performed worse when AI was removed. The cognitive atrophy hypothesis is getting empirical support. I notice this in my own work. When Claude Code is down, my first instinct is to wait rather than dig in manually. Something worth being honest about.
Infrastructure & Architecture
AWS S3 Files: mount any S3 bucket as NFS with ~1ms latency. Works on Lambda. AWS launched native NFS v4.1+ mounting for general-purpose S3 buckets, built on EFS. Concurrent access with close-to-open consistency across EC2, ECS, EKS, and Lambda. Multi-agent pipelines break when agents can't share a filesystem. This fixes that without spinning up dedicated EFS volumes.
Converting MCP servers to TypeScript APIs cuts token usage 81%. Cloudflare's Code Mode pattern proposes agents write and run code against typed APIs instead of making sequential MCP tool calls. Combined with Dynamic Workers' millisecond V8 isolate startup, this creates a path for consumer-scale agents. I've been skeptical of MCP's chattiness for high-frequency operations. An 81% token reduction validates that instinct.
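Cloudflare's write-up is TypeScript; here's the shape of the pattern sketched in Python, with a hypothetical TypedClient standing in for the generated API wrapper. The saving comes from intermediate data staying inside the sandboxed program instead of round-tripping through the model's context:

```python
from dataclasses import dataclass

@dataclass
class Order:
    customer: str
    total: float

class TypedClient:
    """Hypothetical stand-in for a typed API generated from an MCP server."""

    def fetch_orders(self, since: str) -> list[Order]:
        # One underlying call; demo data here so the sketch runs.
        return [Order("acme", 12_400.0), Order("globex", 310.0)]

    def file_report(self, body: str) -> None:
        print("filed:", body)

def agent_program(api: TypedClient) -> str:
    """The one program the agent writes, instead of N sequential tool calls."""
    orders = api.fetch_orders(since="2026-04-01")
    big = [o for o in orders if o.total > 1_000]   # filtering happens in-sandbox,
    body = f"{len(big)} large orders, max {max(o.total for o in big):,.2f}"
    api.file_report(body)                          # never enters the model's context
    return body  # only this short string goes back to the model

if __name__ == "__main__":
    print(agent_program(TypedClient()))
```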
Cloudflare targets full post-quantum security by 2029, accelerated by Oratomic's RSA-2048 breaking research. The roadmap: ML-DSA for origin connections by mid-2026, Merkle Tree Certificates by mid-2027, full PQ by default across all services by 2029 at no additional cost. 334 points on HN. With Mythos-class models discovering decades-old vulnerabilities, crypto migration timelines just got more urgent.
Tools & Developer Experience
Claude Code MCP servers can now return 500,000 characters per result. As of v2.1.92, MCP tool results can carry a _meta['anthropic/maxResultSizeChars'] annotation to override the default truncation limit. Per-result, not global. If you've been fighting MCP truncation on large database schemas or file reads, this is the fix. Add the annotation to your MCP server's response metadata.
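On the wire, the annotation rides in the result's _meta block. A sketch of a single tool result, assuming the standard MCP content layout; the cap value is whatever you need, up to 500K:

```json
{
  "content": [
    { "type": "text", "text": "...large schema dump..." }
  ],
  "_meta": {
    "anthropic/maxResultSizeChars": 500000
  }
}
```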
Claude Code usage 6x'd in 9 months: 18% workplace adoption, 91% CSAT, 54 NPS. JetBrains surveyed 10,000 developers in January 2026. Claude Code went from ~3% to 18%. Google Antigravity hit 6% in just two months post-launch. Copilot leads awareness at 76% but Claude Code dominates complex task satisfaction. This matches my experience. Copilot's fine for autocomplete. Claude Code is where I go when I need an agent that holds context across a multi-file refactor.
Snyk ships AI-SPM: three-agent architecture generates a live AI Bill of Materials and enforces governance in CI. Snyk's Evo platform uses a Discovery Agent, Risk Intelligence Agent, and Policy Agent to map AI attack surfaces, enrich with hallucination metrics, and translate plain-English governance intent into CI guardrails. They report enterprises introduce 3x untracked components per AI model deployed. If you're running agents in production without supply chain visibility, this is worth evaluating.
Models
DFlash speculative decoding: 6x lossless acceleration, 2.5x faster than EAGLE-3. Z-Lab's framework uses a lightweight block diffusion model to generate all draft tokens in a single parallel forward pass. Works under sampling (temperature=1) and in thinking mode with ~4.5x acceleration for reasoning models. Now on SGLang, with Qwen3.5 27B hitting ~65 tok/s on 2x 3090s. If you're self-hosting, this is the biggest serving efficiency jump I've seen this quarter.
Opus 4.6 fast mode on Cloudflare: 2.5x faster output, same model. Cloudflare AI Gateway now accepts speed: fast in its Anthropic provider options; Claude Code exposes the same serving path as fastMode: true. Not a smaller model. Same Opus 4.6, optimized serving. This closes the speed gap that pushed latency-sensitive applications toward Sonnet.
Vibe Coding
Lovable hits $300M ARR at $6.6B+ valuation. Replit targets $1B revenue at $9B. SaaStr reports Lovable creates 100K+ new projects daily and may be raising at $8B+. Replit hit $240M in 2025 with 150K+ paying customers. Valuations surged 350% in one year. These platforms are how most non-engineers will replace lightweight SaaS subscriptions, which is exactly where AlixPartners says the revenue compression hits hardest.
App Store submissions surged ~84% in Q1 2026. Nearly 600,000 new apps. Developer-tech.com attributes the spike to AI coding tools, but companies report AI-generated code outpacing review capacity. The bottleneck moved. Shipping is easy now. Reviewing what shipped is the hard part. SaaStr's companion analysis confirms: real production apps still take about a month, with 60% of time on QA. Security remains the #1 blocker keeping vibe-coded apps out of enterprise.
Hot Projects & OSS
MemPalace: first 100% score on LongMemEval, 10K GitHub stars in 3 days. Yes, by Milla Jovovich. MIT-licensed, runs locally with just ChromaDB and PyYAML. Stores everything verbatim instead of letting AI decide what to remember, uses vector search for retrieval. Beats Mem0 and Zep (~85%) at a fraction of the complexity. The "store everything, search later" approach turns out to beat "AI summarizes memories." I'm watching this closely for my own memory system.
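This isn't MemPalace's code, but the store-everything-verbatim pattern is small enough to sketch against ChromaDB's actual client API:

```python
import chromadb
from datetime import datetime, timezone

client = chromadb.PersistentClient(path="./memories")  # local, no server
memories = client.get_or_create_collection("memories")

def remember(text: str) -> None:
    """Store the raw text verbatim; no LLM deciding what to keep."""
    ts = datetime.now(timezone.utc).isoformat()
    memories.add(documents=[text], ids=[ts], metadatas=[{"ts": ts}])

def recall(query: str, k: int = 3) -> list[str]:
    """Relevance is the retrieval layer's job, not the storage layer's."""
    hits = memories.query(query_texts=[query], n_results=k)
    return hits["documents"][0]

remember("User prefers post-merge review for agent-generated PRs.")
print(recall("how should agent PRs be reviewed?"))
```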
GitNexus: zero-server code intelligence with 16 MCP tools. 25K stars, +1,195 today. A client-side knowledge graph that indexes codebases in the browser. Blast-radius detection, impact analysis, multi-repo queries for Claude Code and Cursor. No server required. The fastest-rising code intelligence tool this week.
Vibe Kanban: Rust+TypeScript kanban board purpose-built for coding agents. 24.6K stars. BloopAI's platform gives each agent isolated branches, terminals, and dev servers. Humans plan on kanban boards and review diffs with inline feedback. Built-in PR generation makes it a full plan-execute-review loop. This is what Cursor 3 is trying to become, shipped as a standalone tool.
SaaS Disruption
Three competing agent payment protocols are live simultaneously and the market barely exists. Stripe's Machine Payments Protocol, Google's Universal Commerce Protocol, and Visa's Trusted Agent Protocol are all shipping SDKs. But Morgan Stanley data: only 1% of shoppers currently purchase via agents. The payments infrastructure is building out years ahead of consumer behavior. I've seen this movie before with mobile payments circa 2013. UnionPay also launched its own Agentic Payment Open Protocol with a live taxi booking demo in Hong Kong. Four protocols, no market.
AI-native support: 55-70% first contact resolution at under $3. Legacy agent-assisted: $13+. Data from 70+ enterprise deployments shows 67% reduction in resolution time and 43% decrease in staff workload. Global AI agent spending forecast: $1.3B (2025) to $6.6B (2027). But Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to costs and unclear ROI. The cost advantage is real. Whether enterprises can capture it before burning through their budgets is the open question.
Policy & Governance
Musk files to oust Altman and Brockman from OpenAI. Jury selection begins April 27. CNBC reports the filing directs any damages to OpenAI's nonprofit entity rather than Musk personally. Combined with the New Yorker's 100+ source investigation and Fortune's analysis of OpenAI's simultaneous IPO overhang, CFO resistance, and $14B in projected 2026 losses, the governance instability at both leading AI labs raises real questions about platform reliability for production workloads. If you're betting your product on either OpenAI or Anthropic APIs, have a fallback plan. GLM-5.1 makes that more feasible than it was yesterday.
Skills of the Day
- Keep your CLAUDE.md under 60 lines. The ETH Zurich study across 5,694 PRs shows bloated context files reduce agent success by 3% and increase costs 20%+. Strip yours to custom build commands, non-standard conventions, and file locations the model can't infer from code alone.
- Add _meta['anthropic/maxResultSizeChars'] to your MCP server responses. Claude Code v2.1.92+ supports per-result overrides up to 500K characters, eliminating truncation for database schemas and large file reads. One metadata annotation, no config changes needed.
- Set a 1-minute maximum build loop for agent workflows. OpenAI's Symphony team found this is the sweet spot where agents self-correct fastest. If your build takes longer, prioritize build speed over feature work. The harness is only as good as its feedback loop.
- Use DFlash speculative decoding on SGLang for 6x lossless inference speedup. It generates all draft tokens in a single parallel forward pass, works at temperature=1, and gets 2.5x better throughput than EAGLE-3. Qwen3.5 27B hits ~65 tok/s on dual 3090s with it enabled.
- Run npm audit and pip audit before every agent-executed install. The @fairwords worm self-replicates across npm packages in 8 minutes and crosses to PyPI via .pth injection. Your agent's package installation step is a worm propagation vector. Pin dependency versions and verify checksums.
- Convert your chattiest MCP tool calls into typed TypeScript APIs. Cloudflare's benchmarks show 81% token reduction when agents write and execute code against APIs instead of making sequential tool calls. Start with your most frequently called MCP server.
- Enable fastMode: true in Claude Code for Opus 4.6. Same model intelligence, 2.5x faster output tokens. This is optimized serving of the same weights, not a smaller model. Also available as speed: fast in Cloudflare AI Gateway's Anthropic provider options.
- Store AI agent memories verbatim instead of summarizing them. MemPalace's approach (store raw, vector search for retrieval) scored 100% on LongMemEval vs ~85% for summary-based systems like Mem0 and Zep. Let the retrieval layer handle relevance, not the storage layer.
- Patch Flowise to 3.0.6 or shut down public instances today. CVE-2025-59528 is a CVSS 10.0 RCE under active exploitation from a Starlink IP, targeting 12,000+ exposed instances. The CustomMCP node executes arbitrary JavaScript with full Node.js privileges and no validation.
- Switch to post-merge review for agent-generated PRs. OpenAI's Symphony team ships 5-10 PRs per engineer per day with zero pre-merge human review. Automated tests catch functional regressions. Post-merge review catches design problems. Trying to review every agent PR before merge creates the bottleneck that negates the speed advantage of having agents in the first place.
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, fill in):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.