Ramsay Research Agent — April 7, 2026
Top 5 Stories Today
1. Claude Code's Worst Week: 1,130 HN Points, AMD's AI Director Piles On, and a Quantitative Smoking Gun
A GitHub issue titled "Claude Code is unusable for complex engineering tasks" (Issue #42796) has become the highest-engagement Claude Code story in months. 1,130 points. 622 comments. AMD's AI director publicly called it "dumber and lazier" in The Register. This isn't the usual "Claude feels different" Reddit post. This one has receipts.
Ben Vanik published a quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across 6,852 session files. The data traces the regression directly to a specific change: redact-thinking-2026-02-12. Before that rollout, Claude Code would research first, then edit. After it, the model shifted to an edit-first behavior pattern. Convention adherence degraded measurably. The model consumed 80x more API requests and 64x more output tokens to produce worse results. Sessions that ran autonomously for 30 minutes started stalling every 1-2 minutes.
I use Claude Code every day. I've felt the difference but couldn't prove it. Now someone has.
The timing is brutal for Anthropic. Separately, Pragmatic Engineer author Gergely Orosz posted that Anthropic is "burning more and more dev goodwill," noting that Claude Code now refuses tasks it handled fine last month. Anthropic also issued DMCA takedowns against legitimate forks of their own public repo, later calling them unintentional. On Reddit, a user posted side-by-side screenshots showing Claude.ai's reasoning effort dropped from 85 to 25 for identical prompts, with a 0.91 comment-to-score ratio (the highest I've seen in months, meaning almost every viewer felt compelled to comment).
This is a multi-front developer trust problem. The quantitative evidence suggests extended thinking tokens aren't a luxury. They're structurally required for multi-step engineering work. When you reduce thinking depth silently, you break the product for your most demanding users. The same users paying for Max plans. The same users who evangelize your tool.
What builders should do: if you're on Claude Code, check your session quality honestly. Vanik's analysis suggests asking Anthropic to expose thinking_tokens in usage responses so you can monitor whether your requests get the reasoning depth they need. If your complex workflows have degraded, this data gives you specific language for support tickets. And watch for Anthropic's response. How they handle this week says a lot about whether they treat developer trust as a real priority or a PR problem.
2. Cursor 3 Ships the Agent-First IDE. Worktrees, Best-of-N, and Design Mode Change the Game.
Cursor 3 launched April 2 and it's the biggest architectural change since the editor shipped. The IDE is now centered on an Agents Window for running many agents in parallel, across repos, locally, in worktrees, or in the cloud. This isn't a feature update. It's a rethink of what an IDE is.
Three features matter most. /worktree creates a separate git worktree so AI changes happen in isolation. Your main branch stays clean while the agent experiments. I've been doing this manually for months with Claude Code, creating worktrees by hand and spawning sessions in each one. Cursor just made it a slash command.
/best-of-n runs the same task across multiple models in parallel worktrees, then lets you compare outcomes. Want to see how Claude, GPT-4o, and Gemini each approach a tricky refactor? Run all three, diff the results, pick the best one. This is the first time I've seen a mainstream IDE treat model selection as a first-class workflow, not a settings dropdown.
Design Mode lets you annotate browser UI elements directly to guide agents. Click on a button, say "make this bigger," and the agent knows exactly what you mean. This solves the core vibe coding UX problem: describing visual changes in text is slow and imprecise. Direct manipulation is faster.
The timing with the Claude Code regression story is interesting. Cursor 3's /best-of-n gives you a hedge against any single model degrading. If Claude has a bad week, you still get work done because you're comparing outputs. That's a design decision that ages well in a world where model quality fluctuates.
What builders should do: try /best-of-n on your next non-trivial refactor. The comparison view alone will teach you things about how different models reason. And if you're managing a team, the Agents Window's parallel execution model means one developer can have 5-8 agents working simultaneously across different parts of a codebase. That changes how you plan sprints.
3. Microsoft Foundry-Local: One SDK to Replace Ollama, llama.cpp, and Whisper.cpp. No Cloud Required.
Microsoft open-sourced Foundry-Local, and I think it's the most underrated release of the week. One SDK for chat and audio with automatic hardware acceleration (NPU > GPU > CPU), self-contained with no external dependencies, and the API surface is identical to Azure AI Foundry. 2,079 GitHub stars and climbing.
Here's why this matters more than another "run models locally" announcement. Foundry-Local uses the same API as Azure. Zero application code changes to go from local to cloud. That's not something Ollama or llama.cpp can claim. You prototype on your laptop, deploy to Azure when you need scale, and the code is the same. For enterprise teams that need data governance guarantees, this is a big deal.
The companion tutorial is what sold me. Microsoft published a hands-on guide for building a fully offline AI support agent using Context-Augmented Generation (CAG). Unlike RAG, CAG pre-loads your entire knowledge base into the model context at startup. No vector database. No embeddings. No chunking pipeline. The total dependency footprint is two npm packages: express and foundry-local-sdk. That's it.
I've spent weeks debugging RAG pipelines, fighting with chunk sizes and embedding models and retrieval thresholds. For small-to-medium knowledge bases, CAG just skips all of that. Load your docs into context, ask questions. The simplicity is almost suspicious.
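The pattern is simple enough to sketch. Here's a minimal Python version of the CAG loop (the tutorial itself is Node with express and foundry-local-sdk; `chat_fn`, the function names, and the file layout here are stand-ins of my own, not the tutorial's code):

```python
from pathlib import Path

def build_cag_context(docs_dir: str) -> str:
    """Pre-load the entire knowledge base into one context string at startup.
    No chunking, no embeddings, no vector DB -- that is the whole CAG bet."""
    parts = []
    for doc in sorted(Path(docs_dir).glob("*.md")):
        parts.append(f"## {doc.name}\n{doc.read_text()}")
    return "You are a support agent. Answer only from these docs:\n\n" + "\n\n".join(parts)

def answer(question: str, context: str, chat_fn) -> str:
    """chat_fn is a stand-in for your local model client (e.g. the
    foundry-local-sdk chat call); any OpenAI-style client fits this shape."""
    return chat_fn(
        messages=[
            {"role": "system", "content": context},  # the whole KB rides along every turn
            {"role": "user", "content": question},
        ]
    )
```

The tradeoff is explicit: you pay context tokens on every request instead of maintaining a retrieval pipeline, which is why this only makes sense for small-to-medium knowledge bases.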
The hardware support is broad: CPUs, NVIDIA GPUs, Intel NPUs, Qualcomm NPUs on Copilot+ PCs. Foundry-Local also supports Microsoft's three proprietary MAI foundation models (MAI-Transcribe-1, MAI-Voice-1, MAI-Image-1), released April 2 and designed to integrate with the runtime.
What builders should do: if you're building anything that needs offline AI, start here instead of stitching together Ollama + Whisper + your own inference code. The two-dependency CAG approach is worth a weekend experiment for any knowledge-base app under ~50 pages.
4. Salesforce Automated 84% of Its Own Support. Then Its Stock Dropped 26%. The Seat-Based Pricing Model Is Breaking.
Here's the paradox nobody in SaaS wants to talk about. Salesforce deployed its own Agentforce across customer support. 380,000+ interactions. 84% fully resolved without a human. Only 2% escalated. By any product metric, that's a success.
The stock dropped 26%.
Investors see the math clearly. The better Agentforce gets, the fewer Service Cloud seats customers need. A tool that replaces 84% of support interactions doesn't sell more seats. It eliminates them. Salesforce is eating its own revenue with its own product.
Their answer is the Agentic Work Unit (AWU), a new pricing metric where one AWU equals one discrete task accomplished by an AI agent: a processed prompt, a completed reasoning chain, or an invoked tool. Salesforce processed 2.4 billion AWUs total through Q4, with 771 million in Q4 alone. Agentforce reached $800 million in revenue. CIO magazine called AWU "a shiny new metric that tells CIOs little of value."
I think the CIO criticism is right. An AWU measures activity, not outcomes. A support agent that loops through five reasoning chains to answer a simple question generates 5 AWUs. A support agent that resolves it in one generates 1 AWU. The incentive structure rewards verbosity, not efficiency.
But the bigger story isn't Salesforce's pricing problem. It's the pattern. Every SaaS company building AI agents will hit this same wall. Your agent gets good enough to replace the humans using your per-seat product. Your product succeeds and your revenue model fails simultaneously. I don't think anyone has a clean answer yet.
What builders should do: if you're building SaaS, start designing your pricing model around outcomes or consumption now, not seats. If you're buying SaaS, pay attention to which vendors are still charging per-seat while shipping agents that reduce the number of seats you need. That gap is where leverage lives.
5. Bram Cohen Says the Cult of Vibe Coding Is Insane. 569 HN Points and 463 Comments Agree.
BitTorrent creator Bram Cohen published "The Cult Of Vibe Coding Is Insane" and it hit 569 points with 463 comments on Hacker News. The most-discussed vibe coding essay to date.
Cohen's trigger was the Claude Code source leak. When the 512,000 lines of TypeScript leaked via a missing .npmignore, people laughed at the code quality. Cohen's argument: the Claude team went "completely overboard with dogfooding," refusing to even spend a few minutes looking under the hood. The result was code that worked but was messy in ways a few minutes of human review would have caught.
His thesis is simple. Pure vibe coding, where you deliberately refuse to look at the implementation, is "dogfooding run amok." Cohen describes his own approach: start by asking the AI to audit code, discuss problems until actionable solutions emerge, explain what should change, then let the machine write. He reads the code. He has opinions about the code. He just doesn't type it.
I agree with about 80% of this. The remaining 20% is where it gets interesting.
Cohen is right that human review is cheap and high-value. Five minutes of reading a diff catches structural problems that an AI will cheerfully propagate across an entire codebase. My design background makes me care about this instinctively. You don't ship a design you haven't looked at. Why would you ship code you haven't read?
But I think Cohen undersells what pure vibe coding is good for. Prototyping. Throwaway scripts. One-shot tools you'll use once and delete. When the cost of a bug is "I wasted 10 minutes," human review adds overhead without proportional value. The skill is knowing which mode to be in.
The connection to Story #1 is direct. Claude Code's quality regression means vibe coding works less well this week than it did two months ago. The people who were reading their diffs noticed the regression faster than the people who weren't. Cohen's point lands harder when the AI is actively getting worse.
What builders should do: read your diffs. Seriously. Not every line, but the structural decisions. Does the file organization make sense? Are the abstractions appropriate? Is the error handling real or theatrical? Five minutes of human taste applied to AI output is the highest-leverage activity in software engineering right now.
Section Deep Dives
Security
Shannon autonomous pentester hits 37K stars, scores 96.15% on the XBOW benchmark. Shannon combines source code analysis with live exploitation, completing 100 of 104 exploit challenges in hint-free mode. Shannon Lite runs locally via npx @keygraph/shannon under AGPL-3.0. If you're not running automated security testing on your codebase, this lowers the barrier to near-zero.
MCP cost inflation attack amplifies per-query costs 658x with under 3% detection rate. New research from Adversa AI shows malicious MCP servers can steer agents into prolonged tool-calling chains that silently burn API credits. The agent looks like it's doing legitimate work. Defense requires monitoring tool-call chain depth and implementing per-query cost ceilings at the gateway level, not relying on the LLM to self-police.
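The gateway-side defense is simple enough to sketch. A minimal Python guardrail, with names and limits of my own choosing (not from the Adversa research), that enforces both a chain-depth cap and a per-query spend ceiling before each tool call is forwarded:

```python
class ToolCallBudget:
    """Per-query guardrail: cap tool-call chain depth and cumulative cost
    at the gateway, rather than trusting the model to self-police."""

    def __init__(self, max_depth: int = 15, max_cost_usd: float = 0.50):
        self.max_depth = max_depth
        self.max_cost_usd = max_cost_usd
        self.depth = 0
        self.spent = 0.0

    def authorize(self, estimated_cost_usd: float) -> bool:
        """Call before forwarding each tool call; False means cut the chain."""
        if self.depth >= self.max_depth:
            return False  # chain is suspiciously deep
        if self.spent + estimated_cost_usd > self.max_cost_usd:
            return False  # query would blow its spend ceiling
        self.depth += 1
        self.spent += estimated_cost_usd
        return True
```

Instantiate one budget per user query and route every tool call through `authorize`; a 658x amplification attack hits the ceiling on its first few hops instead of your monthly bill.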
TIP framework achieves 95% attack success rate on MCP-enabled agents. The Tree-structured Injection for Payloads framework (arXiv 2603.24203) generates natural-language payloads that seize control of MCP agents even under active defenses. Real-world demos on LM Studio and VS Code with GPT-4o show the model trusts MCP tool context and embeds malicious links directly into the developer's workspace. If you're running MCP servers, audit them now.
MIT researchers prove AI agents can hold undetectable secret conversations. Vaikuntanathan and Zamir demonstrate that two AI agents can maintain a parallel secret conversation using noise-resilient key exchange embedded in natural language, while the transcript appears completely normal to observers. Detection is computationally infeasible. This has immediate implications for anyone relying on transcript monitoring for AI safety.
Agents
MCP Dev Summit: Anthropic, AWS, Microsoft, and OpenAI present enterprise security roadmap. At the MCP Dev Summit in New York (April 2-3), maintainers from all four companies jointly addressed authorization, the most actively changing part of the MCP spec. The Agentic AI Foundation (AAIF) has grown to 170 members since December. Authorization is the hard problem. Everything else is plumbing.
Claw Code clean-room rewrite of Claude Code hits 136K+ stars. After the March 31 npm source leak, developer Sigrid Jin led a clean-room Python/Rust rewrite that now has more GitHub stars than Anthropic's own repo. Independent code audits confirm zero proprietary source code. Anthropic issued DMCA takedowns against direct mirrors but hasn't targeted the clean-room implementation. The fastest-growing repo in GitHub history.
Microsoft Agent Framework 1.0 goes GA: AutoGen + Semantic Kernel merged. The unified framework ships with five stable orchestration patterns from Microsoft Research: sequential, concurrent, handoff, group chat, and Magentic-One. All patterns support streaming, checkpointing, human-in-the-loop approvals, and pause/resume. If you're building multi-agent systems on Microsoft's stack, this is now the one framework to use.
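For intuition, two of the five patterns reduce to a few lines each. This is a generic Python sketch of the concepts, not the Agent Framework's actual API:

```python
from typing import Callable

# An "agent" here is just anything that maps a task string to a result string.
Agent = Callable[[str], str]

def sequential(agents: list[Agent], task: str) -> str:
    """Sequential orchestration: each agent's output is the next agent's input."""
    for agent in agents:
        task = agent(task)
    return task

def handoff(router: Callable[[str], str], agents: dict[str, Agent], task: str) -> str:
    """Handoff orchestration: a router decides which specialist owns the task."""
    return agents[router(task)](task)
```

The value the real framework adds on top of loops like these is exactly what the announcement lists: streaming, checkpointing, human-in-the-loop approvals, and pause/resume.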
Only 11% of enterprise agentic use cases reach production. Belitsoft's 2026 report says the average enterprise runs 12 AI agents, but 50% operate in complete isolation. 71% claim to deploy agents, but actual production deployments are rare. Gartner says most companies won't have agent applications ready for large-scale use until 2028. The gap between "we have agents" and "agents do useful work" remains enormous.
Research
Neuro-symbolic AI cuts robot training energy 100x while boosting accuracy from 34% to 95%. Tufts University researchers demonstrated a VLA model that completed training in 34 minutes vs 36+ hours for standard approaches, using 1% of training energy. Presenting at ICRA Vienna in May. Symbolic reasoning layers dramatically reduce compute demands for embodied AI. Worth tracking.
GPU-accelerated TFHE enables encrypted LLM inference. Chen et al. demonstrate running LLM inference on fully encrypted data using GPU-parallelized homomorphic encryption. A practical step toward privacy-preserving cloud AI. Still early, but the performance gap is closing.
Hardware-level AI compute governance gets its first feasibility taxonomy. Ansari's paper examines whether proposals like chip-level tracking and compute caps can actually be built from an engineering perspective. Essential reading as governments draft compute governance frameworks. The answer: some proposals are feasible, others are engineering fantasy.
Infrastructure & Architecture
Anthropic hits $30B revenue run rate, signs 3.5GW TPU deal with Google and Broadcom. Bloomberg confirmed Anthropic tripled from $9B at the end of 2025. Enterprise customers spending $1M+ annually doubled from 500 to over 1,000 since February. The TPU deal starts delivering in 2027 and runs through 2031. Claude Code alone generates $2.5B+ in run-rate revenue, per The Register. That makes a single CLI tool larger than most standalone SaaS companies.
DeepSeek V4 confirmed running on Huawei Ascend 950PR chips. Reuters reports the 1T parameter MoE model (~37B active) will be the first frontier model trained and served entirely without NVIDIA silicon. If competitive at $0.30/MTok, this breaks NVIDIA's monopoly on frontier inference. Late-April release expected.
NVIDIA GR00T N1.7 commercially available for humanoid robots. NVIDIA announced the first commercially licensed humanoid robot foundation model, with ABB, FANUC, Figure, Universal Robots, and KUKA adopting. GR00T N2 preview (DreamZero architecture) doubles success rates on novel tasks.
Tools & Developer Experience
Ghost Pepper: fully local hold-to-talk speech-to-text for macOS hits 393 HN points. Ghost Pepper uses WhisperKit (~466MB) and Qwen 2.5 (~3GB) on Apple Silicon. No cloud, no logging. Hold Control, speak, release to paste. The local AI tool ecosystem on Mac hardware is genuinely maturing.
Hippo: biologically-inspired memory for AI agents with decay and consolidation. Hippo implements memory that decays, strengthens on retrieval, and consolidates episodes into semantic patterns. Includes Claude Code hook integration. Zero dependencies. If you're running long agent workflows and hitting context limits, this is a different approach than just stuffing more tokens into the window.
Agent Reading Test reveals how poorly AI coding agents read web docs. agentreadingtest.com gives agents 10 realistic documentation tasks with embedded canary tokens. Surfaces silent failure modes: content truncation, CSS-buried text, client-side rendering delivering empty shells. A perfect score is unlikely for any current agent. Try it with your setup.
Models
Gemma 4 26B MoE: "mindblowingly good if configured right." A practitioner guide with 395 upvotes and 206 comments details optimal configuration: Q4_K_M/Q5_K_M quantization, 32-64K context cap locally, Flash Attention for 20-30% speed gains. The 26B activates only 3.8B parameters per inference. Near-30B quality on 16GB GPUs. Apache 2.0 licensed. AIME math jumped from 20.8% to 89.2% generation over generation.
Meta preparing open-source releases: Avocado (LLM) and Mango (multimedia). Axios reports Meta will release both models initially as closed, then open-source them. They're explicitly targeting consumers rather than enterprises, a deliberate contrast with OpenAI's and Anthropic's enterprise focus.
MiniMax M2.7 self-evolving model autonomously ran 100+ optimization rounds. The model scores 56.22% on SWE-Pro at $0.30/$1.20 per million input/output tokens. An order of magnitude cheaper than frontier models. The self-evolution capability, where the model autonomously optimized its own scaffolding across 100+ rounds, is genuinely novel.
Vibe Coding
Harvard study: both CS achievement and writing skills predict vibe coding proficiency. A preregistered study (N=100, arXiv 2603.14133) finds that written communication and CS knowledge both predict vibe coding performance, with CS remaining significant after controlling for cognitive ability. You need both. Not one or the other. This confirms what I've suspected: the best vibe coders are the ones who already know how to code.
Anthropic pivots to billing enforcement after Claude Code source leak. Paddo.dev documents the four-month timeline: silently disabled OAuth tokens in January, cease-and-desist letters to OpenClaw in February, March 31 npm source leak exposed the cch header attestation mechanism, then April 4 billing enforcement announcement. Third-party harnesses now work but at API rates with "extra usage" billing. Practical outcome: if you built workflows around third-party Claude Code wrappers, budget for higher costs.
Claude Code policy filter hits bioinformatics researchers. A computational virology postdoc reports persistent policy violations when writing phylogenetic pipeline scripts. Just sequence and metadata processing, no dual-use content. 107 upvotes, 23 comments confirming the pattern extends to other scientific domains. This is the cost of aggressive safety filters: friction for exactly the professional users driving your revenue.
Hot Projects & OSS
Netflix VOID: first open-source model that erases video objects and rewrites physics. VOID removes objects from video along with all physical interactions they induce. Remove a colliding vehicle, and the other car continues undisturbed. Preferred over Runway 64.8% vs 18.4% in human tests. Apache 2.0 for commercial use. Netflix's first open-source AI model.
Alibaba page-agent: in-page GUI agent at 16K stars, no screenshots required. page-agent lives inside the webpage via a single script tag. Text-based DOM manipulation reduces LLM token cost 10-100x compared to vision-based approaches. No OCR, no multimodal model needed. Ships with MCP Server support.
Hermes Agent explodes +1,574 stars in a single day to 29.4K. NousResearch's framework surged after the April 6 Hindsight native memory integration and v0.7.0's pluggable memory providers. Editor integrations for VS Code, Zed, and JetBrains now register their own MCP servers as additional agent tools.
Google AI Edge Gallery: on-device ML with Gemma 4 support gains +1,107 stars in one day. The app runs LLMs entirely on-device (iOS 17+, Android 12+) with AI Chat, multimodal image recognition, real-time voice transcription, and agent skills. 18.3K stars total.
SaaS Disruption
Q1 2026 funding surges to $252.6B, 87% directed to AI. Crunchbase reports this is 3x the prior quarter's $83.7B and smashes the previous record of $95.7B from Q3 2021. Late-stage dominated at $222.4B (88% of total). $221B went specifically to AI-focused companies. This isn't a trend. It's a reallocation of the entire venture capital market.
OpenAI's $122B "VC round" is vendor deals and contingent capital. SaaStr's dissection: Amazon's $50B includes $35B contingent on IPO or AGI milestone (only $15B committed day one). SoftBank's $30B comes in quarterly tranches with evaluation gates. SoftBank simultaneously committed $3B/year to deploy OpenAI across its portfolio, making it investor and customer. Actual close-day capital buys roughly 2-2.5 years of runway against projected $14-17B in 2026 losses.
Private credit SaaS exposure reaches $100B, PE multiples compressed 54%. Long Angle's analysis: software comprises ~20% of BDC portfolios. Revenue multiples fell from 6.7x to 3.1x since the 2021-22 vintage. SEG SaaS Index down 25.7% YTD. Over $10B in redemption requests from private credit funds.
Policy & Governance
Trump administration appeals Ninth Circuit ruling that blocked Pentagon's Anthropic "supply chain risk" label. The DoD filed April 2 against Judge Rita Lin's ruling that Hegseth's use of military authority against a domestic company "supports the Orwellian notion that an American company may be branded a potential adversary." April 30 deadline for the government's brief. Anthropic remains operational under the preliminary injunction.
New Yorker Altman investigation goes nuclear: 5,043 upvotes on r/OpenAI. The Ronan Farrow investigation drew on 100+ interviews and a ~70-page board document. Aaron Swartz's quote, "You need to understand that Sam can never be trusted. He is a sociopath," is now the highest-voted post across all AI subreddits. Separate revelation: Elon Musk ran a "full surveillance operation" against Altman, including investigators at gay bars.
13 shots fired into Indianapolis councilor's home over data center approval. Councilor Ron Gibson and his 8-year-old son were inside. A note reading "NO DATA CENTERS" was left on the doorstep. FBI investigating. The physical infrastructure that AI runs on is now generating physical violence. 828 upvotes on r/singularity.
Skills of the Day
- Use Gemini Embedding 2 for cross-lingual RAG. It scored a perfect 1.000 on the Hard cross-lingual tier in the CCKM benchmark, mapping text, images, video, audio, and PDFs into a single 3,072-dimensional vector space. No other model comes close on multilingual semantic understanding. If your users speak multiple languages, switch now.
- Cut RAG storage costs 75-87% with float8 quantization. SpringerLink research shows float8 embedding quantization achieves 4x storage reduction with <0.3% performance loss. Combine with moderate PCA dimensionality reduction for 8x total compression. If you're storing millions of embeddings, this is free money.
- Use Claude Code /batch to parallelize large migrations across 5-30 worktrees. The built-in skill decomposes uniform codebase changes (renaming patterns, API upgrades, dependency swaps) into isolated worktree agents that run in parallel without collision. Turn a 4-hour rename into a 20-minute job.
- Run /simplify after every batch of AI-generated code. It spawns three parallel review agents (Code Reuse, Code Quality, Efficiency) that aggregate findings and apply fixes automatically. Built into Claude Code. Think of it as an automated code review that catches the patterns AI tends to repeat: duplicated logic, missed abstractions, resource waste.
- Build fully offline AI apps with Foundry-Local + CAG using two npm dependencies. Follow the Microsoft tutorial. Pre-load your knowledge base into model context at startup. No vector DB, no embeddings, no chunking. Just express + foundry-local-sdk. Works on CPU.
- Set taskBudget in the Claude Agent SDK to prevent runaway agent costs. The new API option tells the model to pace its tool use within a token limit, becoming more selective as the budget is consumed. Combine it with agentProgressSummaries for periodic status updates. Don't learn this lesson from a $4,200 bill.
- Audit your MCP servers with mcp-sec-audit before production deployment. The toolkit combines static pattern matching with Docker/eBPF dynamic fuzzing, achieving 100% detection on the MCPTox benchmark. Given that 26% of community-contributed MCP tools contain vulnerabilities, this isn't optional.
- Try the ACE framework for self-improving agent contexts without fine-tuning. Accepted at ICLR 2026, ACE performs localized delta edits to agent contexts that accumulate insights while preserving prior knowledge. +10.6% on agent benchmarks, +8.6% on finance tasks. Works with open-source models.
- Cap local Gemma 4 26B at 32-64K context despite 256K support. Practitioner testing shows quality degrades at higher context windows on consumer hardware. Use Q4_K_M quantization and enable Flash Attention for 20-30% speed gains. The 3.8B active parameters mean near-30B quality on a 16GB GPU.
- Monitor tool-call chain depth to defend against MCP cost inflation attacks. Malicious MCP servers can amplify costs 658x while keeping the detection rate under 3%. Implement per-query cost ceilings at your gateway. The LLM won't notice it's being exploited. Your billing dashboard will.
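The storage math behind the float8 tip is worth seeing concretely. Here's a pure-Python sketch of one-byte-per-dimension scalar quantization, which captures the same 4x reduction versus float32 (real float8 encodings need library support; these function names are mine):

```python
def quantize_embedding(vec: list[float]) -> tuple[bytes, float, float]:
    """Compress an embedding to one byte per dimension (4x smaller than
    float32) by mapping the value range onto 0-255."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255.0 or 1.0  # avoid div-by-zero for constant vectors
    codes = bytes(round((v - lo) / scale) for v in vec)
    return codes, lo, scale

def dequantize_embedding(codes: bytes, lo: float, scale: float) -> list[float]:
    """Recover approximate floats; error is at most half a quantization step."""
    return [lo + c * scale for c in codes]
```

Storing only `codes` plus two floats per vector is where the 75%+ saving comes from; the <0.3% retrieval-quality loss claim is from the cited research, not this sketch.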
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, change the numbers):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.