Ramsay Research Agent — April 14, 2026
Top 5 Stories Today
1. Anthropic Leaked Screenshots Show a Full-Stack App Builder Inside Claude
Leaked screenshots from what appears to be an Anthropic beta show a feature called "Let's ship something great" that turns a natural language prompt into a complete, running application. Database. Auth. Live preview. One-click deploy. All inside the Claude interface. Dataconomy broke the story, and separately, TestingCatalog reported on a parallel Claude Code upgrade codenamed "Epitaxy" that adds multi-repo coordination, sub-agent orchestration, and a full IDE layout with Plan, Tasks, and Diffs panels.
Let me be direct about what this means. If Anthropic ships a vibe-coding app builder, Lovable, Bolt, and v0 lose their reason to exist overnight. These platforms differentiated on the UX of turning prompts into apps. That's exactly what Anthropic is building, except their version runs on Opus 4.6 natively with no API middleman, and they can iterate on the model and the product simultaneously. The vertical integration advantage is brutal. Lovable raised $25M. Bolt had real traction. None of that matters if the model provider eats your product.
I've been using Claude Code daily for over a year. The Epitaxy leak is what interests me more than the app builder. Coordinator Mode, where Claude orchestrates parallel sub-agents across multiple repositories, is the actual paradigm shift. Right now I run one Claude Code session per repo. If Epitaxy ships, I'd have a single orchestrator dispatching work across my frontend, backend, and infrastructure repos simultaneously. That's not a feature. That's a different way of working.
The timing isn't accidental. OpenAI is building a nearly identical capability with Codex Scratchpad, parallel agent execution from a TODO list view in a unified desktop app. Both companies are converging on the same UX: multi-agent desktop orchestration. The race to become the default developer OS is on, and it's happening in weeks, not quarters.
What to do about it: If you're building on Lovable, Bolt, or v0, start evaluating your dependency chain now. Don't panic-migrate, but have a plan. If you're a Claude Code user, watch for the Epitaxy beta. The PreCompact hook in today's v2.1.105 release (exit code 2 blocks compaction) is worth setting up immediately since it protects your context during long multi-step tasks.
2. SWE-bench Is Contaminated. The Primary Coding Agent Benchmark Is Broken.
OpenAI stopped reporting SWE-bench Verified scores. The reason: every frontier model has been trained on the dataset. Morph LLM published the numbers that explain why. Claude Mythos Preview scores 93.9% on the contaminated Verified benchmark. On the new, uncontaminated SWE-bench Pro, the best score is 57% (GPT-5.3-Codex). Claude Opus 4.5 under standardized scaffolding (SEAL) hits 45.9%.
That's nearly a 37-point gap. Let that sink in. The benchmark that every coding agent company put in their marketing decks, the one Devin and Cursor and every agent startup cited to prove their tool actually works, was inflated by contamination. A model scoring 93.9% on Verified doesn't mean it solves 94% of real coding problems. It means it memorized the test.
This connects directly to the APEX-Agents-AA benchmark from Artificial Analysis, which launched this week. APEX evaluates agents on 452 real professional tasks spanning investment banking, consulting, and corporate law with 5-10 day simulated engagements. The top score? GPT-5.4 at 33.3%. Claude Opus 4.6 at 33.0%. Models that claim 90%+ on coding benchmarks fail two-thirds of professional workflow tasks.
The Stanford 2026 AI Index released the same week adds another dimension: "For complex, interactive technologies such as AI agents and robots, benchmarks barely exist yet." We've been evaluating agents with broken rulers and wondering why production results disappoint.
What to do about it: Stop citing SWE-bench Verified scores when evaluating coding agents. Use SWE-bench Pro and APEX-Agents-AA instead. If a vendor won't share Pro scores, that tells you something. And recalibrate your expectations: the honest state of the art for coding agents on uncontaminated tasks is around 57%, not 94%.
3. Google Has the Same AI Adoption Curve as a Tractor Company
Steve Yegge shared a conversation with a Google tech director of 20 years. The breakdown: 20% agentic power users. 20% outright refusers. 60% still in basic chat mode. Simon Willison surfaced the thread, and Yegge's punchline was devastating: Google's internal AI adoption is "comparable to John Deere."
Google. The company that invented the transformer architecture. The company with Gemini, TPUs, and more AI PhDs per square foot than anywhere on earth. Their own engineers are using AI the same way farmers use it. Two out of ten go deep. Two refuse entirely. The other six treat it like a fancy autocomplete.
Yegge flagged something I haven't seen anyone else catch: there's been an industry-wide 18+ month hiring freeze. No external hires are arriving to flag how far behind internal practices have fallen. The people who've been inside Google for years don't know what they don't know, and nobody new is coming in to tell them.
This maps to what I see in my own work. The gap between someone who uses Claude Code as a chat assistant and someone who runs it as an agent with skills, worktrees, and CLAUDE.md context engineering is enormous. Same tool. Completely different output. The 60% in chat mode aren't lazy. They just haven't been shown what agentic usage looks like, and their organizations aren't structured to teach them.
The Stanford AI Index confirms this at macro scale: 88% organizational AI adoption, but "adoption" covers everything from "we have a ChatGPT license" to "agents run our SOC." The number is meaningless without depth metrics.
What to do about it: Audit your own team honestly. What percentage are in chat mode? Agentic power users? Refusers? The answer will surprise you. Then invest in the middle 60%. That's where the ROI is. One engineer going from chat to agentic usage is worth more than buying another AI tool.
4. CS Enrollment Drops 11.2%. Students Aren't Leaving Tech. They're Leaving CS.
Computer science enrollment fell 11.2% in the 2025-2026 school year, the steepest decline of any major, according to the National Student Clearinghouse via the Washington Post. A CRA survey found 62% of computing departments reporting declines. But here's the thing everyone's missing: students aren't fleeing tech. They're fleeing to AI.
MIT's "AI and Decision-Making" major is now the second-largest on campus. USF enrolled 3,000+ students in a new AI/cybersecurity college. The migration pattern is clear. Students look at entry-level coding jobs and see a future where agents handle the work. They look at AI-specialist roles and see growth. They're making a rational economic decision.
This connects to the Google adoption story in a way that should worry every engineering manager. The incoming talent pipeline is optimizing for AI skills, not traditional CS fundamentals. In three years, your junior hire pool will know how to prompt-engineer and fine-tune but might struggle with data structures and systems programming. Whether that's a problem depends entirely on how much of your stack is agent-orchestrated by then.
The Stanford AI Index adds a sentiment dimension: Gen Z excitement about AI collapsed from 36% to 22% year-over-year, while anger rose from 22% to 31%. They're not excited about AI. They're scared of it. And they're adapting by pivoting their education toward it. Fear as a career driver.
What to do about it: If you're hiring, update your job descriptions. "Computer Science degree required" is going to filter out a growing pool of AI-native candidates who chose different programs. If you're a developer, the skills that matter most are the ones AI can't replicate well: systems thinking, architecture decisions, and taste. The 11.2% drop is the market telling you where the puck is going.
5. HubSpot Flips to Outcome-Based Pricing Today. $0.50 Per Resolved Conversation.
HubSpot's Breeze AI pricing goes live today, April 14. SiliconANGLE has the details: Customer Agent drops from $1.00/conversation to $0.50/resolved conversation. Prospecting Agent moves from monthly per-contact fees to $1/qualified lead. This isn't a pilot. It's 8,000 customers, live today.
The numbers that make this work: Breeze Customer Agent resolves 65% of conversations and cuts resolution time by 39%. At $0.50 per resolved conversation, HubSpot is betting that volume at lower unit cost beats the old model. For the buyer, the math is simple: you only pay when the AI actually solves the problem.
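To make the outcome-based math concrete, here's a quick back-of-the-envelope comparison using the figures above; the 10,000-conversation monthly volume is my own illustrative assumption, not a HubSpot number.

```python
# Back-of-the-envelope comparison of HubSpot's old vs. new Breeze pricing.
# The 65% resolution rate and both price points come from the announcement;
# the 10,000-conversation monthly volume is an illustrative assumption.
conversations = 10_000
resolution_rate = 0.65

old_cost = conversations * 1.00                    # $1.00 per conversation, resolved or not
new_cost = conversations * resolution_rate * 0.50  # $0.50 only when the agent resolves it
resolved = conversations * resolution_rate

print(f"Old model: ${old_cost:,.0f} total, ${old_cost / resolved:.2f} per resolution")
print(f"New model: ${new_cost:,.0f} total, ${new_cost / resolved:.2f} per resolution")
# Old model: $10,000 total, $1.54 per resolution
# New model: $3,250 total, $0.50 per resolution
```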
I've been watching the SaaS pricing conversation for months, and everyone keeps talking about outcome-based pricing in the abstract. HubSpot just shipped it. With real numbers. At scale. That makes it a template, not a thought experiment. Every SaaS company running AI features now has a live comparable to benchmark against.
The broader SaaS context is wild right now. Anthropic's Managed Agents launch triggered simultaneous double-digit drops across infrastructure stocks: Akamai down 16.6%, Cloudflare down 13.5%, DigitalOcean down 13.4% in the same week. UBS cut ServiceNow to Neutral. SaaS stocks are down 30-80% from highs, roughly $2 trillion in market cap gone. But here's the contrarian signal from PYMNTS: actual enterprise revenue numbers are holding. The market is pricing in disruption faster than it's arriving.
What to do about it: If you're building a SaaS product with AI features, model your pricing against HubSpot's $0.50/resolved and $1/qualified lead benchmarks. If you can't articulate what "outcome" your AI delivers and put a dollar figure on it, your pricing is vulnerable. If you're buying SaaS, start asking every vendor: what's my cost per outcome?
Section Deep Dives
Security
Claude Mythos autonomously completes 32-step corporate network attack, UK government "officially frightened." The UK AI Security Institute found Mythos succeeds 73% of the time on expert-level cyber CTFs that no model could complete before April 2025. It discovered zero-days in every major OS and browser, including a 27-year-old OpenBSD bug. Anthropic's Project Glasswing restricts access to 12 partners for defensive use only. Gizmodo reports British officials are considering regulatory action beyond the US consortium approach.
Apple Intelligence bypassed 76% of the time via Unicode trick at RSAC. Researchers at RSAC demonstrated a two-stage injection against Apple's on-device AI: Unicode RIGHT-TO-LEFT OVERRIDE characters reverse harmful strings past filters, then Neural Exec overrides model instructions. The attack succeeded on 76 of 100 prompts, affecting an estimated 100K-1M users before iOS 26.4 patched it. On-device doesn't mean safe.
DNS rebinding is a protocol-level MCP flaw, not an implementation bug. Varonis confirmed three separate CVEs (Java SDK, Apollo CVE-2026-35577, Azure) all share the same root cause: MCP's HTTP transport lacks mandatory origin validation by design. If you're running MCP servers on localhost with StreamableHTTP transport, upgrade to Apollo 1.7.0+ or switch to stdio.
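If you can't upgrade immediately, the mitigation class is straightforward to reason about: a localhost HTTP endpoint should reject any request whose Host (and, for browser clients, Origin) header isn't one it expects. Here's a minimal sketch of that check in Python; the port and handler are illustrative, and this is not the MCP SDK's actual API.

```python
# Minimal sketch of the DNS-rebinding mitigation for a localhost HTTP service:
# reject requests whose Host/Origin headers don't match what we serve on.
# Port, handler, and payload are illustrative; this is NOT the MCP SDK's API.
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_HOSTS = {"localhost:8931", "127.0.0.1:8931"}
ALLOWED_ORIGINS = {"http://localhost:8931", "http://127.0.0.1:8931"}

class GuardedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        host = self.headers.get("Host", "")
        origin = self.headers.get("Origin")  # typically absent for non-browser clients
        if host not in ALLOWED_HOSTS or (origin and origin not in ALLOWED_ORIGINS):
            self.send_error(403, "Unexpected Host/Origin (possible DNS rebinding)")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8931), GuardedHandler).serve_forever()
```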
Red Hat OpenShift AI dashboard leaks Kubernetes tokens (CVSS 8.5). CVE-2026-5483 in the odh-dashboard component exposes Service Account tokens via a NodeJS endpoint. If you're running OpenShift AI for ML workloads, patch immediately. A token at this level compromises your entire training and inference infrastructure.
Agents
Cisco acquiring Astrix Security for up to $350M to own the agent identity layer. Calcalist reports advanced acquisition talks. Astrix's four-method discovery architecture surfaces shadow agents via platform integrations, NHI fingerprinting, EDR telemetry, and BYOS. Third major Cisco AI security acquisition after Splunk ($28B) and Robust Intelligence. Agent identity is becoming the control plane.
Microsoft ships Agent Governance Toolkit covering all 10 OWASP Agentic AI risks. Open-sourced under MIT with packages in Python, TypeScript, Rust, Go, and .NET. Sub-millisecond deterministic policy enforcement for goal hijacking, tool misuse, memory poisoning, and rogue agents. The first toolkit to address the full OWASP agentic taxonomy. If you're shipping agents to production, this is your baseline.
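I haven't dug into the toolkit's API yet, so treat this as my own sketch of the underlying idea rather than Microsoft's code: a deterministic, default-deny check on every proposed tool call before it executes.

```python
# Illustrative default-deny policy check on agent tool calls (my sketch of the
# concept; names and rule format are hypothetical, not the toolkit's API).
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    agent: str
    tool: str
    target: str

# Explicit allow rules; anything not listed is denied.
ALLOW_RULES = {
    ("support-agent", "crm.lookup"): {"customers"},
    ("support-agent", "email.send"): {"customers"},
    ("triage-agent", "ticket.update"): {"tickets"},
}

def enforce(call: ToolCall) -> bool:
    """Return True only if an explicit rule permits this agent/tool/target combo."""
    return call.target in ALLOW_RULES.get((call.agent, call.tool), set())

assert enforce(ToolCall("support-agent", "crm.lookup", "customers"))
assert not enforce(ToolCall("support-agent", "db.drop", "customers"))   # tool misuse blocked
assert not enforce(ToolCall("rogue-agent", "email.send", "customers"))  # unknown agent blocked
```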
Harvey AI reveals Spectre: autonomous legal agents triggered by system events, not human prompts. Artificial Lawyer interviews CEO Gabe Pereyra, who describes agents monitoring incidents, bug reports, and Slack messages to autonomously draft memos and diligence reports. The "law firm world model" concept, encoding entire firm workflows so agents and humans share the same permissions, is the clearest articulation I've seen of how agents integrate into professional services.
ERC-7715 goes live on Optimism: AI agents can now request scoped wallet permissions. The Defiant reports the standard is live on Optimism and Unichain, enabling MetaMask users to grant agents limited transaction execution for subscriptions, DCA, and auto-compounding. First production-ready standard for agent-to-wallet delegation.
Research
First formal type system for LLM agent composition proves safety and termination. arXiv 2604.11767 presents λ_A, extending typed lambda calculus with oracle calls and bounded fixpoints (the ReAct loop). Partial Coq mechanization at 1,560 lines. This gives framework builders a principled way to determine if an agent configuration is well-formed. Every existing framework lacks this.
Context Kubernetes: treating agent knowledge delivery like container orchestration. arXiv 2604.11623 formalizes six core abstractions and YAML manifests for delivering the right knowledge to the right agent with the right permissions. If you've ever tried to manage context across a multi-agent system, you know this is the actual hard problem.
Cross-trace safety auditing catches violations invisible in individual agent runs. arXiv 2604.11806 shows that misuse campaigns and covert sabotage only appear when analyzing many traces together. Per-run monitoring misses systemic patterns. For anyone deploying agents at scale, single-trace logging is insufficient.
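A toy version of why per-run monitoring misses this, hedged as my own illustration rather than the paper's method: each trace stays under its per-run alert threshold, but the same signal aggregated across traces exposes the campaign.

```python
# Toy illustration of cross-trace auditing (my sketch, not the paper's method):
# every individual run looks benign, but aggregating one signal across runs
# reveals a systemic exfiltration pattern.
from collections import Counter

# Each trace is the (tool, target) event list from one agent run.
traces = [
    [("web.fetch", "docs.internal"), ("data.export", "customer_emails")],
    [("code.run", "sandbox"), ("data.export", "customer_emails")],
    [("web.fetch", "news.site"), ("data.export", "customer_emails")],
]

PER_TRACE_LIMIT = 2   # per-run monitor: alert only if one run exports data twice
CAMPAIGN_LIMIT = 2    # cross-trace monitor: alert if the fleet keeps exporting the same target

per_trace_alerts = [
    t for t in traces
    if sum(1 for tool, _ in t if tool == "data.export") >= PER_TRACE_LIMIT
]
exports = Counter(target for t in traces for tool, target in t if tool == "data.export")

print("per-run alerts:", len(per_trace_alerts))  # 0 -- every run looks fine on its own
print("cross-trace alerts:", {k: v for k, v in exports.items() if v >= CAMPAIGN_LIMIT})
# {'customer_emails': 3} -- the pattern only shows up in aggregate
```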
Benchmark-driven LLM translation: 648K LOC Rust to 41K LOC Python. arXiv 2604.11518 documents translating OpenAI's production Codex CLI from Rust to Python using public benchmarks as the objective function. The largest documented LLM-assisted production codebase translation. The methodology of using benchmark scores to guide iterative translation is reusable.
Infrastructure & Architecture
Revolut publishes PRAGMA, a 1B-parameter foundation model trained on 40 billion banking events. arXiv 2604.08649 describes three model sizes (10M, 100M, 1B) powering credit scoring, fraud detection, and lifetime value prediction. Simultaneously, Revolut launched AIR to 13 million UK customers, replacing traditional navigation with dialogue-based finance. A bank building its own foundation model on transaction data is a signal about where financial AI is heading.
Servo v0.1.0 ships on crates.io: embeddable browser engine for Rust. The Linux Foundation project published its first official crate, and Simon Willison immediately built servo-shot (a CLI that renders URLs to PNGs) using Claude Code. A real browser engine as a Rust dependency opens interesting possibilities for agent browser automation without Chromium overhead.
Supabase acqui-hires BKND and Hydra, building an agentic backend stack. Supabase Blog confirms BKND's creator joins to build a Lite offering for agentic workloads, while Hydra's co-creator builds Supabase Warehouse (600x analytics acceleration via pg_duckdb). Supabase is positioning as the default database layer for agent deployments.
Tools & Developer Experience
Claude Code v2.1.105: worktree switching, PreCompact hooks, and stream timeout recovery. Released today. The path parameter for EnterWorktree lets you switch into existing worktrees without creating new ones. PreCompact hooks can block compaction. Streams now abort after 5 minutes of no data and retry in non-streaming mode. All three features matter for long agentic sessions.
GitHub launches native stacked PRs tool (gh-stack). First-party CLI for breaking large changes into small, dependent PRs. Previously only available via Graphite and ghstack. High engagement (0.65 comment-to-score ratio) on r/programming suggests pent-up demand. If you do large refactors with AI agents, stacked PRs are the natural merge strategy.
Chrome DevTools MCP v0.21.0: multi-agent browser coordination via pageId routing. Google shipped the ability for multiple AI agents to target specific browser pages in parallel. This was the last major blocker for agents that can see and debug their own output in real browsers. 115K views on the announcement.
Vercel opensrc: fetch full dependency source code for agent context. vercel-labs/opensrc at 1.8K stars clones package source from npm, PyPI, and crates.io into a predictable directory structure. When your agent needs to understand how a library actually works (not just its types), this is the tool.
Models
Kimi K2.6 from Moonshot AI: deeper reasoning, $0.60/M input tokens. Rolled out April 13 after a one-week closed beta. Beta testers report cleaner multi-step agent plans and more reliable tool calls. The K2.5 baseline was 76.8% on SWE-bench Verified and 85% on LiveCodeBench. Official K2.6 benchmarks are pending, but at $0.60/M input, it's priced to undercut.
Elephant-Alpha: 100B stealth model, free on OpenRouter, 256K context. OpenRouter quietly launched this from an unnamed lab. 100% hallucination prevention accuracy (ranked #1), 98% general knowledge, 78% reasoning. Free during preview. Worth testing for code completion and lightweight agent tasks at zero cost.
MiniMax M2.7: self-evolving 230B MoE model, open-sourced, matches GPT-5.3-Codex on SWE-Pro. MarkTechPost reports 56.22% on SWE-Pro and 57% on Terminal Bench 2. The key innovation: MiniMax tasked an internal model with optimizing its own scaffold, running 100+ autonomous rounds of self-improvement. Available on HuggingFace.
Latent Space April 2026 local model rankings. Latent Space puts Qwen 3.5 first overall, Qwen3-Coder-Next as the consensus coding pick, Gemma 4 31B at #3 globally among open models. Community-driven rankings reflecting what practitioners actually recommend, not just benchmark scores.
Vibe Coding
Anti-vibe-coding backlash is crystallizing. A 735-upvote post on r/LocalLLaMA titled "Please stop showcasing completely vibe coded projects" captures growing frustration with AI-generated project spam. The market is splitting: builders who use AI as a force multiplier on existing skills vs. those who use it as a substitute for understanding. Tools that enforce human oversight in the loop will come out of this backlash ahead.
Claude Code's invisible token overhead is real. Efficienist documented that v2.1.100 silently injects ~20K tokens per request via server-side expansion. Max subscribers report hitting quota in 19 minutes instead of 5 hours. The tokens also fill your actual context window, diluting CLAUDE.md instructions. Community workaround: downgrade to v2.1.98. I haven't verified this personally, but the reports are consistent enough to flag.
Windsurf Wave 13 ships parallel multi-agent sessions via Git worktrees. Windsurf Blog introduces isolated Cascade agents working in separate worktrees. Run 5 agents on 5 bugs simultaneously with no file conflicts. Also ships Arena Mode for side-by-side model comparison during real coding. The worktree pattern for agent isolation is becoming standard.
Hot Projects & OSS
CopilotKit AG-UI protocol hits 30K stars. GitHub shows adoption by Google, Microsoft, AWS, LangChain. AG-UI streams JSON events for the user-facing interaction layer, filling the gap between MCP (context) and A2A (agent coordination). Also shipped AIMock, a zero-dependency mock server covering LLMs, MCP, A2A, and vector DBs. Drift Detection runs daily against real APIs. If you're building agent frontends, AG-UI is the emerging standard.
OpenViking context database grows 5x to 22.2K stars. Volcengine's project treats agent context as a structured filesystem rather than flat key-value stores. Hierarchical context delivery and self-evolution for agent memory, resources, and skills.
browser-use reaches 87.7K stars as the standard browser automation layer for agents. GitHub growth from ~50K to 87K stars in recent weeks confirms browser automation as a core agent primitive alongside code execution and file access.
haft: engineering decisions engine with evidence decay. 1.3K stars. Decisions know when they're stale. Based on research showing 20-25% of AI-assisted architectural decisions have stale evidence within two months, with 86% discovered only during incidents. Assigns trust scores that degrade as evidence ages.
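I haven't read haft's source, so here's my own sketch of the general mechanism rather than the project's actual scoring code: a trust score that decays exponentially with the age of a decision's supporting evidence.

```python
# Illustrative evidence-decay scoring (my sketch of the mechanism, not haft's
# implementation): a decision's trust score halves as its evidence ages.
import math
from datetime import date

HALF_LIFE_DAYS = 60  # assumption: evidence loses half its weight in ~2 months

def trust_score(evidence_date: date, today: date, initial: float = 1.0) -> float:
    age_days = (today - evidence_date).days
    return initial * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

# e.g. a benchmark run from February that justified a library choice
score = trust_score(date(2026, 2, 1), date(2026, 4, 14))
print(f"trust score: {score:.2f}")  # ~0.43 -- below 0.5, time to re-verify the decision
```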
SaaS Disruption
$2 trillion in SaaS market cap gone. Individual stocks down 30-83% from highs. The selloff accelerated after Anthropic's Managed Agents launch commoditized capabilities that SaaS companies charged separately for. Infrastructure stocks got hit hardest in the same week: Akamai -16.6%, Cloudflare -13.5%, DigitalOcean -13.4%. The Fortune analysis argues AI isn't killing SaaS categories, it's redefining them. Companies with proprietary data and deep workflow integration keep their moats. Thin UI layers over commodity functionality don't.
AI-native SaaS retention is terrible: 40% GRR, 48% NRR. MRRSaver benchmarks show AI-native companies retaining less than half of revenue vs. the B2B median of 82% NRR. The silver lining: GRR improved from 27% to 40% between January and September 2025. The tourists are leaving, and whoever remains is stickier.
Microsoft wants agents to buy software licenses. Business Insider quotes EVP Rajesh Jha: a company with 10 humans, each running 5 agents, would buy 50 paid seats. The M365 E7 "Frontier Worker Suite" at $99/user/month bundles Agent 365 with Copilot. Microsoft's play to make AI adoption multiply its per-user revenue rather than cannibalize it.
Policy & Governance
Stanford AI Index: 73% of experts positive on AI jobs, only 23% of public agree. The gap is widening. Gen Z excitement collapsed from 36% to 22%, anger rose from 22% to 31%. US trust in government AI regulation is lowest globally at 31%. China narrowed the US model performance lead to just 1.7%, despite the US spending 23x more on private investment ($285.9B vs. $12.4B). The Foundation Model Transparency Index dropped from 58 to 40. AI incidents rose to 362.
Sam Altman attack suspect had manifesto targeting multiple AI CEOs. FBI raided Daniel Moreno-Gama's Texas home, finding incendiary devices and a document listing names and addresses of AI company board members and investors alongside anti-AI ideology. Two counts of attempted murder plus nine other charges. This isn't an isolated incident anymore; it's a pattern the industry's leadership needs to take seriously.
Jensen Huang plays both sides: Washington and Beijing in the same trip. NVIDIA Blog documents Huang meeting Trump for PCAST council (alongside Sergey Brin and Lisa Su), then flying to Beijing to discuss AI productivity with Chinese officials. His strategy: keep Chinese developers dependent on American tech while competing with Huawei. Whether that's diplomacy or a tightrope depends on your perspective.
Skills of the Day
- Set up a PreCompact hook in Claude Code v2.1.105 to protect critical context. Create a `.claude/hooks/PreCompact` script that exits with code 2 when you're mid-task. This blocks automatic compaction that would otherwise wipe your in-memory state during complex multi-step work. The difference between a productive 2-hour session and starting over from scratch. (A minimal sketch follows after this list.)
- Use SWE-bench Pro instead of Verified when evaluating coding agents. The nearly 37-point gap between contaminated Verified (93.9%) and uncontaminated Pro (57%) means every marketing claim based on Verified is inflated. Ask vendors for Pro scores specifically. If they deflect, you have your answer.
- Deploy Microsoft's Agent Governance Toolkit as your baseline agent security layer. It's MIT-licensed, covers all 10 OWASP Agentic AI risks with sub-millisecond enforcement, and ships in Python, TypeScript, Rust, Go, and .NET. Install the Agent OS package first since it intercepts every agent action against policy rules. This takes an afternoon, not a sprint.
- Run `gh stack` for AI-generated refactors instead of single massive PRs. GitHub's new native stacked PRs tool lets you break large AI-generated changes into reviewable, dependent chunks. Each PR gets its own review cycle while maintaining merge order. Especially useful when Claude Code generates 500+ line changes across multiple files.
- Track your Claude Code reads-per-edit ratio as a quality signal. Install claude-code-token-visualizer (PyPI) for real-time histograms of input/output tokens and cache hit rates. If you're spending 80% of tokens on reads and 20% on edits, your context engineering needs work. Target 40/60 or better.
- Audit your MCP servers for DNS rebinding exposure. Three CVEs confirm this is a protocol-level flaw, not implementation-specific. If you're running any MCP server on localhost with StreamableHTTP transport, you're vulnerable. Switch to stdio transport or upgrade to implementations with Host header validation (Apollo 1.7.0+).
- Use Chrome DevTools MCP v0.21.0 pageId routing for multi-agent browser testing. Assign each agent a specific browser page via the pageId parameter, letting parallel agents inspect and debug different pages simultaneously without stepping on each other. This unlocks real multi-agent QA workflows.
- Benchmark your AI features against HubSpot's $0.50/resolved and $1/qualified lead. If you're building SaaS with AI, these are now the market-validated price points for outcome-based AI. Model your unit economics against them. If your cost-per-outcome is 5x HubSpot's, you need a very good reason why.
- Add evidence decay to your AI-assisted architectural decisions using haft. Install from GitHub and assign expiry dates to decisions made with AI coding tools. Research shows 20-25% of these decisions have stale evidence within two months. Automated decay scoring catches them before production incidents do.
- Use Vercel opensrc to give your coding agent full dependency source access. Run `opensrc add <package>` for any npm, PyPI, or crate dependency. The full implementation source lands in `opensrc/` in a predictable structure. When your agent needs to understand library internals (not just types), this eliminates the hallucination problem of reasoning about code it hasn't seen.
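For the first skill above, here's a minimal sketch of what that hook could look like. It assumes, per today's release notes, that Claude Code runs the PreCompact hook as an executable and treats exit code 2 as "block compaction"; the sentinel-file convention is my own, not part of Claude Code.

```python
#!/usr/bin/env python3
# Minimal PreCompact hook sketch: block compaction while a task is in flight.
# The sentinel-file convention is an assumption of this sketch -- touch the file
# at the start of a long multi-step task and remove it when you're done.
import sys
from pathlib import Path

SENTINEL = Path(".claude/task-in-progress")

if SENTINEL.exists():
    print("PreCompact: task in progress, refusing to compact context", file=sys.stderr)
    sys.exit(2)  # exit code 2 blocks compaction
sys.exit(0)      # otherwise allow compaction to proceed
```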
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, change the numbers):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.