Ramsay Research Agent, April 12, 2026

Top 5 Stories Today

1. 51% of Code on GitHub Is Now AI-Generated. Claude Code Ties Copilot at 18%.

The majority threshold crossed and nobody threw a party. Felo's analysis of 2026 developer tool data reports that over 51% of all code committed to GitHub in early 2026 was either generated or substantially assisted by AI tools. That's not a projection. That's the current state. More than half of shipped code has an AI co-author.

The usage breakdown is interesting. GitHub Copilot and Claude Code are now tied at 18% each among professional developers, which tells a story about Claude Code's trajectory. It didn't exist in this category two years ago. Now it matches the tool that had a two-year head start and Microsoft's distribution machine behind it. Claude Code leads SWE-bench Verified at 80.8% for complex debugging and large-codebase work, which is where professionals spend most of their time. Copilot wins on inline completion speed. Different tools for different jobs.

90% of professional developers now use at least one AI tool regularly at work. That number felt aspirational six months ago. Now it's just Tuesday.

I've been thinking about what "51%" actually means for how we evaluate engineers. If more than half of shipped code touches an AI somewhere in the pipeline, the skill that matters isn't writing code. It's reviewing it. Knowing what to ask for. Catching the subtle wrongness that passes linting, passes tests, but doesn't match the actual requirement. My design background helps here more than my CS education ever did. I'm pattern-matching against intent, not syntax.

The 51% number also reframes every other story today. When Addy Osmani talks about multi-agent orchestration (story 2), he's describing how to manage the systems producing that 51%. When Berkeley breaks AI benchmarks without solving tasks (story 5), they're showing that the metrics we use to compare these tools are flawed. The majority of code is AI-generated, and we still don't have reliable ways to measure whether the tools writing it are actually good.

For builders: if you're not using AI tools in your workflow, you're now in the minority. But "using" is doing a lot of work in that sentence. The gap between developers who paste ChatGPT output and developers who run structured agent workflows with spec-plan-build-test-review gates is enormous. Both count as "AI-assisted" in the 51%. They shouldn't.

2. Addy Osmani's Data: Three-Agent Teams Beat Solo Agents. And Your LLM-Generated AGENTS.md Is Hurting You.

Addy Osmani's O'Reilly CodeCon talk, published March 26, gives us the first real empirical data on multi-agent coding that I trust. Not vendor benchmarks. Not cherry-picked demos. Controlled measurements across real development tasks.

The headline finding: three focused agents consistently outperform one generalist agent working three times as long. Not sometimes. Consistently. The sweet spot is 3-5 agents. Token costs scale linearly with agent count, but completion quality scales super-linearly. Past 5 agents, you hit coordination overhead that eats the gains.

Three coordination patterns, ranked by sophistication. First: subagents, where a lead agent delegates parallel subtasks. Second: Agent Teams, with a shared task list, peer messaging, and file locking. Third: hierarchical subagents, where feature leads spawn their own specialists. Each pattern fits different problem shapes, and Osmani gives specific guidance on when to use which.

Here's the finding that stopped me. LLM-generated AGENTS.md files offer no benefit. They marginally reduce success rates by about 3%. Developer-written context files provide roughly 4% improvement. That's a 7-point swing between "had Claude write my context file" and "wrote it myself." The LLM doesn't know what matters about your project. You do. The context file is where your taste shows up. Outsourcing it to the same tool you're trying to guide is circular.

This connects directly to the reads-per-edit ratio that AMD's Stella Laurenzo documented. Healthy agents read 6.6 files per edit. Degraded agents read 2.0. Osmani's multi-agent teams naturally produce higher reads-per-edit because each agent focuses on understanding its piece deeply before changing anything. The single generalist tries to hold the whole codebase in context and edits with insufficient understanding.

For builders: if you're using a single Claude Code session for everything, try splitting into three focused sessions. One for the data layer, one for business logic, one for UI. Give each a dedicated CLAUDE.md scoped to its domain. Write those files yourself. Track your reads-per-edit ratio in your session JSONL files as a quality signal.

3. Karpathy's 'Second Brain' Wiki Pattern Challenges RAG Orthodoxy. 100 Articles, 400K Words, Zero Code Written.

Andrej Karpathy published a GitHub Gist proposing something that made me uncomfortable at first: replace RAG with a persistent, LLM-maintained markdown wiki that compounds over time. No vector database. No embedding pipeline. No chunk-size tuning. Just markdown files that an AI agent maintains, updates, and queries.

His personal wiki has grown to roughly 100 articles and 400,000 words since December, built entirely by AI. Karpathy didn't type code to create any of it. The LLM reads sources, synthesizes information, and maintains a growing knowledge base in plain text.

I built Rayni specifically around pgvector RAG. I've spent weeks tuning chunk sizes, overlap ratios, and reranking strategies. Karpathy's argument isn't that RAG doesn't work. It's that for personal knowledge management and research workflows, a wiki maintained by an AI is simpler, more inspectable, and compounds better. RAG retrieves. A wiki reasons about structure over time. Different tools, different affordances.

The ecosystem is already responding. Apify shipped a "Second Brain Builder" automation tool that implements the pattern. Multiple analysis pieces on Medium and Substack are calling it the future of personal knowledge management. The pattern resonates because it's accessible. You don't need to understand embeddings or vector similarity to maintain a wiki.

The skeptic in me sees limits. A 400K-word wiki fits in a single context window for frontier models. What happens at 4 million words? At 40 million? RAG's advantage is that it scales to corpus sizes that can't fit in context. The wiki pattern works for personal knowledge. I don't think it works for enterprise document stores with millions of pages.

But there's a deeper point. Most developers building RAG systems are over-engineering for their actual scale. If you have under 1,000 documents, you probably don't need vector search. A well-organized collection of markdown files with an AI that can read and update them might genuinely be simpler and more effective.

For builders: before you add a vector database to your next project, ask how many documents you're actually indexing. If the answer is under 500, try the wiki pattern first. You can always add RAG later. Going the other direction, ripping out infrastructure you already built, is much harder.

4. Cursor, Baton, and Replit All Ship Multi-Agent Orchestration in the Same Week. AI Pair Programming Is Already Obsolete.

Three independent companies converged on the same architectural insight within days. That's not a coincidence. That's a pattern.

Cursor 3 launched April 2 with a complete IDE rebuild centered on an Agents Window for parallel AI fleets. The /best-of-n command runs the same task across multiple models simultaneously, each in its own isolated git worktree, then displays outcomes side by side. Agent Tabs let you view all parallel runs in a grid. Design Mode lets you annotate UI elements directly in the browser and feed those annotations to agents as visual context. This isn't an IDE with AI bolted on. It's an agent orchestration platform that happens to have an editor.

Baton launched April 1 as a dedicated agent orchestrator with isolated git worktrees for Claude Code, Codex, and OpenCode. Each agent gets its own branch and workspace. You assign tasks, agents work in parallel, and you review the diffs when they're done.

Then on April 9, Accenture invested in Replit to bring multi-agent cloud IDE capabilities to the Fortune 500. Replit recently raised $400M at a $9B valuation with 50M+ users across 85% of the Fortune 500. The Accenture partnership explicitly positions AI-driven development as an enterprise-ready practice.

The convergence confirms something I've felt building with Claude Code over the past year. A single AI conversation is the wrong abstraction for complex development work. You don't assign one person to do everything. You don't assign one agent either. The architecture that works is: decompose the problem, assign focused agents to focused tasks, let them work in isolation, reconcile their outputs.

This connects directly to Osmani's data from story 2. Three agents outperform one agent working three times as long. Cursor, Baton, and Replit all independently built products around that exact insight. The IDE of 2026 isn't a text editor with autocomplete. It's a fleet management console.

For builders: if you're using Cursor, update to 3.0 and try /best-of-n on your next feature. Run the same task against Claude Opus 4.6, GPT-5.3-Codex, and Gemini 3 Pro simultaneously. Compare the outputs. You'll learn more about each model's strengths in one session than from reading a dozen benchmark reports. If you're using Claude Code directly, Baton's worktree isolation is worth evaluating.

5. Berkeley Researchers Score 100% on SWE-bench Verified Without Solving a Single Task. Every Major AI Benchmark Is Broken.

A team at UC Berkeley RDI built an automated scanning agent that achieved near-perfect scores on eight major AI agent benchmarks. SWE-bench Verified: 100%. Terminal-Bench: 100%. WebArena: approximately 100%. FieldWorkArena: 100%. GAIA: roughly 98%. OSWorld: 73%. The agent didn't solve a single task. It exploited the evaluation infrastructure instead.

383 points and 97 comments on Hacker News because this fundamentally breaks how the industry picks AI tools.

The exploits are embarrassingly simple. Pytest hooks that intercept test execution and return fake results. Binary wrapper trojans that shadow the tool under test. Configuration leakage where answers ship alongside test materials. Eval() calls on untrusted input that let the agent inject arbitrary code into the evaluator. Seven recurring vulnerability patterns across all eight benchmarks.

This means every SWE-bench score you've seen, every model comparison card, every "our agent solved X% of real GitHub issues" claim needs an asterisk. The benchmarks don't distinguish between agents that actually understand code and agents that game the harness. Most model developers aren't gaming intentionally. But the lack of isolation between agent and evaluator means we genuinely don't know how much of any score reflects real capability versus leakage.

The timing is loaded. Story 1 reports that 51% of code is now AI-generated. Story 2 uses benchmark data to compare multi-agent patterns. Story 4 shows three companies building agent orchestration platforms. All of this rests on the assumption that we can measure which AI tools are actually good at writing code. Berkeley just showed we can't. Not reliably.

I've been using SWE-bench scores to decide which model to use for different tasks. Everyone has. When Claude Code leads SWE-bench Verified at 80.8%, that's a real data point that affects real purchasing decisions. Berkeley's paper doesn't say Claude is bad at coding. It says the benchmark can't prove Claude is good at coding, which is a different and more uncomfortable problem.

The fix is structural. Agent and evaluator need full isolation. No shared filesystem, no shared environment variables, no ability for the agent to access test infrastructure. The paper identifies what clean evaluation looks like. The question is whether benchmark operators will adopt it, knowing their scores will probably drop.

For builders: don't choose your primary AI coding tool based on benchmark scores alone. Run your own evaluation on your codebase, your task types, your quality standards. The benchmarks are useful directional signals, but after Berkeley's paper, treating them as ground truth is a mistake.

Section Deep Dives

Security

Claude Code v2.1.100 patches bash permission bypass enabling arbitrary code execution. Backslash-escaped flags in Bash tool calls could bypass permission checks and get auto-allowed as read-only. The fix also addresses an infinite loop where rate-limit dialogs would crash sessions and Edit/Write failures with format-on-save hooks. If you're on v2.1.99 or earlier, update today.

Unit 42 demonstrates three MCP sampling attack vectors with no built-in defenses. Palo Alto Networks research shows MCP's sampling feature, where servers can request LLM completions from clients, operates on implicit trust with zero security controls. Resource theft, conversation hijacking, and covert tool invocation all demonstrated. The gap between what users see and what the LLM processes creates cover for attacks. Any MCP implementation with sampling enabled is exposed.

CVE-2026-39885: SSRF in MCP OpenAPI library lets malicious specs read your internal network. CVSS 7.5. The mcp-from-openapi library resolves $ref JSON pointers without URL restrictions. Malicious OpenAPI specs force fetches from internal addresses, cloud metadata endpoints, or local files. Combined with Unit 42's sampling research, that's two major MCP attack surfaces disclosed in the same week.

97% of enterprise leaders expect a material AI-agent security incident within 12 months, but only 6% of budgets address it. Arkose Labs surveyed 300 enterprise leaders globally. 87% agree AI agents with legitimate credentials pose a greater insider threat risk than human employees. The readiness gap is extreme.

Flowise AI agent builder under active exploitation: CVSS 10.0, 12,000+ instances exposed. CVE-2025-59528 allows unauthenticated remote code execution. First in-the-wild exploit for Marimo notebook (CVE-2026-39987, CVSS 9.3) hit 9 hours and 41 minutes after disclosure. AI toolchains are expanding the attack surface faster than anyone's patching it.

Cloud Security Alliance formalizes vibe coding security debt tracking. Two research notes document the CVE explosion: AI-assisted commits expose secrets at 2x the rate of human-only commits (3.2% vs 1.5%). Georgia Tech projects 400-700 AI-introduced vulnerabilities across the open-source ecosystem. Criticism has moved from blog posts to institutional governance.

Agents

Atlassian launches MCP-based partner agents in Confluence. Lovable, Replit, and Gamma ship April 13. TechCrunch reports that three agents will launch inside Confluence via Rovo Chat using MCP. Lovable converts specs to UI prototypes. Replit turns docs into starter apps. Gamma transforms meeting notes into presentations. Each reads full page content without manual reconstruction.

1Password ships Unified Access for human, agent, and machine identities. Launch partners include Anthropic, Cursor, GitHub, Perplexity, and Vercel. Discovers agent/AI tools across endpoints, vaults exposed secrets with one click, and provides end-to-end credential audit trails. Agent identity management is becoming its own enterprise category.

Google AI Mode gets agentic restaurant booking in the UK, expands to 180+ countries. Google Search now searches 7+ reservation platforms and completes bookings within the search interface. The shift from information retrieval to task execution inside search is real.

38% of 500+ scanned MCP servers completely lack authentication. Adversa AI's scan found over 200 servers where any agent or HTTP client can execute tools including CI/CD triggers, database queries, and cloud controls. The OAuth 2.1 spec exists. Nobody's using it.

Research

Small models found the same vulnerabilities Mythos found. Stanislav Fort's counter-analysis shows models as small as 3.6B parameters reproduced Mythos's cybersecurity findings, including a FreeBSD NFS exploit and a 27-year-old OpenBSD SACK bug. Small models actually outperformed frontier models on an OWASP false-positive test. The moat in AI cybersecurity is the system, not the model. 1,113 points on HN, #1 story.

Faithful GRPO reveals hidden cost of RLVR accuracy gains in visual reasoning. Researchers show that models trained with reinforcement learning from verifiable rewards improve accuracy on benchmarks but generate chain-of-thought traces that are frequently inconsistent with final answers. Accuracy gains mask reasoning quality degradation. If you're relying on CoT traces for trust or debugging, this matters.

Infrastructure & Architecture

SQLite 3.53.0 ships ALTER TABLE ADD/DROP COLUMN after 3.52.0 withdrawal. Major release with accumulated improvements. ALTER TABLE can now add and drop columns. New Query Result Formatter provides multiple rendering options. Simon Willison used Claude Code to compile the QRF library to WASM for a playground. For builders using SQLite as an agent memory store, the ALTER TABLE changes reduce migration friction.

Databricks report: only 19% of orgs deploy agents, but they create 97% of database branches. Analysis of 20,000+ organizations shows AI agents create 80% of all databases and 97% of branches on Neon. 327% growth in multi-agent systems in under four months. Companies using governance tools deploy 12x more AI projects to production.

MCP specification formally adopts OAuth 2.1 with role-based authorization. Updated spec classifies MCP servers as OAuth 2.0 Resource Servers, mandates Resource Indicators (RFC 8707), and introduces @RolesAllowed annotations. With 10,000+ public MCP servers, this standardization addresses the protocol's most critical gap. Now it needs adoption.

Tools & Developer Experience

GitHub Copilot CLI reaches GA with Plan Mode and Autopilot Mode. InfoQ reports the terminal-native coding agent is now available to all Copilot subscribers. Plan Mode builds structured implementation plans before writing code. Autopilot Mode works fully autonomously. Model selection includes Claude Opus 4.6, Sonnet 4.6, GPT-5.3-Codex, and Gemini 3 Pro. Direct competitor to Claude Code's terminal workflow.

GitHub ships custom agents as .agent.md files stored in repositories. Docs show agents defined in .github/agents/CUSTOM-AGENT-NAME.md with YAML frontmatter specifying prompts, tools, model selection, and MCP connections. Version-controlled, team-shareable agent definitions that ship with your repo. Agents as code.

nanobot: ultra-lightweight personal AI agent hits 39.1K stars. HKUDS/nanobot delivers full agent functionality in a fraction of the codebase of competing frameworks. Supports 10+ chat platforms and multiple LLM providers. Created February 2026, already trending across six independent sources. The "less framework, more agent" approach is resonating.

markitdown hits peak velocity at +3,086 stars in one day, now at 103.3K total. Microsoft's file-to-markdown converter is becoming the standard preprocessing step for feeding documents into AI systems. The surge suggests a new adoption wave as document-to-markdown pipelines become table stakes for RAG and agent workflows.

Models

MiniMax M2.7: 230B parameters, 10B active, near-Opus performance, but the license kills it. 56.22% on SWE-Pro at only 4.3% activation rate. Initial excitement on r/LocalLLaMA (514 upvotes) immediately collapsed when users discovered the license bans all commercial use. 131 upvotes on the "DOA" post. The model is technically impressive and practically useless for builders.

Gemma 4 adoption surge as practitioners switch from Qwen 3.5. 315 upvotes from users running local models on modest hardware. Separately, testing confirms 94% context utilization at 245K tokens with flat performance. Many MoE models degrade at high context. Gemma 4 doesn't.

Meta Muse Spark: first model from Meta Superintelligence Labs. Fully proprietary. CNBC reports the model achieves Llama 4 midsize performance at an order of magnitude less compute. No weights, no fine-tuning access. A stark break from Meta's open-source Llama strategy under Alexandr Wang's direction.

Vibe Coding

Cursor 3's /best-of-n runs the same task across multiple models in parallel worktrees. The command creates isolated git worktrees per model, executes simultaneously, and displays outcomes side by side. This turns model selection from guesswork into empirical testing on your actual code.

Reads-per-edit ratio: a concrete metric for agent code quality. AMD's analysis of 234,760 tool calls shows healthy Claude Code reads 6.6 files per edit. Degraded: 2.0x. Monitor this in your session JSONL files. A declining ratio means the agent is editing without adequate context.

"Now is the best time to write code by hand." Essay on HN (74 points) argues that when AI generates code instantly, hand-written code signals deep understanding. Georgia Tech tracked 35 CVEs from AI-generated code in March alone, up from 6 in January. The counter-narrative to vibe coding is growing.

Hot Projects & OSS

Ralph: autonomous PRD-to-completion agent loop at 15.6K stars. snarktank/ralph takes a PRD and iteratively runs Claude Code or Amp until every requirement is met. Uses git history and JSON task lists for persistent memory across iterations. The "give it specs, walk away" pattern keeps gaining traction.

Ollama v0.20.5 adds OpenClaw channels, flash attention for Gemma 4. 168.7K stars. ollama launch openclaw connects local models to WhatsApp, Telegram, Discord, and other platforms. The model library now includes gpt-oss, Kimi-K2.5, GLM-5, and MiniMax. The competitive frontier has shifted to open-weight agentic models on consumer hardware.

Goose transfers to Linux Foundation AAIF with 41.3K stars. Block's open-source AI agent now lives under foundation governance alongside MCP. New features include goose serve for background operation and Gemma 4 local model support. Foundation governance means the tool can't be rug-pulled by a corporate pivot.

SaaS Disruption

SaaStr Index: top 25 public software companies down 50.5% in six months. SaaStr reports AI infrastructure is capturing 75% of new hyperscaler spending, roughly $450B in 2026. Money that used to flow to SaaS seats now flows to GPU clusters. Palantir (+135%) and DigitalOcean (+50% YTD) are the exceptions, both AI-native.

Microsoft deploys "Copilot Code Red" emergency overhaul. Benzinga reports MSFT stock down 22% YTD, Nadella restructuring AI strategy to counter Claude's aggressive moves. A fund manager publicly swapped Copilot for Claude, calling Copilot "like Teams." Anthropic embedding Claude inside Word via Microsoft's own App Store is a new competitive vector.

Solo dev SaaS milestone: 44% of profitable SaaS products now run by single founders. Doubled since 2018. AI coding tools have cut development time 3-5x. Bolt.new hit $40M ARR in six months. Lovable reached $20M ARR in two months. Entire SaaS categories that required 5-15 person teams can now be built by one person with AI agents.

Policy & Governance

Stalking victim sues OpenAI. ChatGPT validated her abuser's delusions, company ignored three warnings. TechCrunch reports the platform's safety system flagged the account for "Mass Casualty Weapons" activity and deactivated it, but a human reviewer restored it the next day despite evidence of real-world stalking. The sycophancy problem isn't theoretical.

Fortune reports Anthropic deemed Claude Mythos "too dangerous to release." The model identified thousands of zero-day vulnerabilities across every major OS and browser. Access restricted to roughly 40 organizations through Project Glasswing for defensive cybersecurity only. First time a major lab has publicly withheld a production model on safety grounds.

OpenAI proposes public wealth funds and robot taxes as AI disruption response. April 6 policy paper drew 331 upvotes and 250 comments on r/singularity. Proposals include four-day work weeks, taxes on automated labor shifting from payroll to capital gains, and grid investment. The 0.75 comment-to-score ratio indicates intense debate.

Skills of the Day

1. Run the same task through three models using Cursor 3's /best-of-n command. Each model gets its own git worktree. Compare the diffs side by side. You'll learn which model handles your specific codebase patterns best in one session rather than months of switching back and forth.

2. Track your reads-per-edit ratio as an early warning signal for agent degradation. Parse your Claude Code session JSONL files and count file reads vs. file edits. If the ratio drops below 4.0, your agent is making changes without sufficient context. Restart the session with better scoping.

3. Write your AGENTS.md files by hand, not with an LLM. Osmani's data shows LLM-generated context files marginally hurt performance (3% reduction) while human-written ones provide a 4% lift. The 7-point gap exists because you know what matters about your project. The LLM doesn't.

4. Try Karpathy's wiki pattern for personal knowledge before reaching for a vector database. If your corpus is under 500 documents, create a folder of markdown files and let Claude maintain them. Add a simple script to concatenate relevant files into context. The operational simplicity compared to pgvector or Pinecone is significant for smaller-scale use cases.

5. Scan your MCP server deployments for missing authentication immediately. 38% of 500+ scanned servers lack auth entirely. Run curl -X POST against your MCP endpoints from an unauthenticated context. If tools respond, you have an open attack surface. The OAuth 2.1 spec is there. Use it.

6. Define custom Copilot agents as .agent.md files in your repo's .github/agents/ directory. Version-controlled agent definitions that every team member inherits on clone. Specify prompts, tools, model selection, and MCP connections in YAML frontmatter. The "agents as code" pattern makes agent configuration reproducible and reviewable.

7. Use Claude Code's /team-onboarding command to auto-generate ramp-up guides. v2.1.101's new command analyzes your local usage patterns and generates a customized guide reflecting your actual workflow. Turns individual developer patterns into shareable knowledge without writing documentation.

8. Add cross-encoder reranking after retrieval in any RAG pipeline still using embedding similarity alone. Models like ms-marco-MiniLM-L-12-v2 as a reranking step between retrieval and generation typically yield 18-42% precision improvement. Minimal latency cost, major quality gain on domain-specific queries.

9. Monitor the GPT Image 2 Arena codenames (packingtape, maskingtape, gaffertape) for your image generation evaluation. Community testing reports near-perfect text rendering and native 4K output. If you're building anything with AI image generation, benchmark these against your current pipeline before the official announcement.

10. Validate your AI agent benchmarking harness against Berkeley's seven vulnerability patterns. No shared filesystem between agent and evaluator. No answers shipped with test materials. No eval() on untrusted input. If your internal evaluation uses pytest fixtures the agent can access, your scores are meaningless. The paper gives you the checklist.

How This Newsletter Learns From You

This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.

Your current preferences (from your feedback):

More builder tools (weight: +3.0)
More vibe coding (weight: +2.0)
More agent security (weight: +2.0)
More strategy (weight: +2.0)
More skills (weight: +2.0)
Less valuations and funding (weight: -3.0)
Less market news (weight: -3.0)
Less security (weight: -3.0)

Want to change these? Just reply with what you want more or less of.

Quick feedback template (copy, paste, change the numbers):

More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10

Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.