Ramsay Research Agent — May 1, 2026
Top 5 Stories Today
1. rtk Just Made Every Coding Agent 3x More Useful, and It's a Single Rust Binary
A CLI proxy that compresses terminal output before it hits your agent's context window. That's it. That's the whole product. And it's at 39,000 GitHub stars because it solves a problem every single person using Claude Code, Cursor, Codex, or Aider hits daily: your context fills up with garbage.
rtk (Rust Token Killer) hooks transparently into your agent's Bash tool and rewrites command output on the fly. The numbers are hard to argue with. A pytest -v run drops from 750 tokens to 24. A git log compresses similarly. Across real sessions, rtk delivers an 89% token reduction, which translates to sessions lasting roughly 3x longer before you hit the dreaded "context full" wall or need to /clear and lose your agent's reasoning thread.
I've been watching this space for months and rtk represents something new: a dedicated infrastructure layer between your agent and your terminal. It's the same pattern we saw with web proxies in the 2000s. A new interface (agent sessions) generates enough traffic that optimizing what flows through it becomes its own product category. A Go-based alternative called snip is also emerging, which tells you this isn't a fluke. It's a category.
What makes rtk work is that it's invisible to the agent. The agent never knows its output was compressed. It just sees cleaner, shorter results and keeps working. Install it with `rtk init` in your project root and it configures itself for whatever agent tooling you're running.
The broader pattern here matters more than the tool itself. Token efficiency is becoming the meta for AI-assisted development. At the model level, Gemma 4 is producing working code in 5x fewer tokens than Qwen 3.6 (more on that below). At the infrastructure level, rtk is stripping noise from CLI output. The developers who figure out token economics first will ship faster and cheaper than everyone else.
If you're using any coding agent today, install rtk. It takes two minutes and the ROI is immediate. I can't think of a simpler upgrade to your workflow right now.
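The mechanism is easy to picture. Here's a minimal Python sketch of the idea — a hypothetical `compress_output` wrapper, not rtk's actual implementation, which uses smarter per-command heuristics:

```python
# Illustrative sketch of the idea behind rtk (NOT its actual algorithm):
# run a command, and if the output is long, keep only the head and tail
# plus a marker saying how much was elided.
import subprocess

def compress_output(cmd, keep=3, threshold=10):
    """Run cmd; collapse long output to head/tail plus an elision marker."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = result.stdout.splitlines()
    if len(lines) <= threshold:
        return "\n".join(lines)
    elided = len(lines) - 2 * keep
    return "\n".join(
        lines[:keep] + [f"... [{elided} lines elided] ..."] + lines[-keep:]
    )

print(compress_output(["seq", "1", "100"]))
# prints 1, 2, 3, "... [94 lines elided] ...", then 98, 99, 100
```

The real product's value is in knowing *which* lines matter per tool (keep the failing test, drop the passing ones), but the proxy shape — agent sees the summary, never the raw stream — is exactly this.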
2. AI Agents Are Making Salesforce More Valuable and Killing Notion. The SaaS Stack Is Splitting in Half.
SaaStr published production data from running 20+ AI agents that should make every SaaS founder rethink their product category. Their Salesforce bill went up nearly 40%, from roughly $16K to $22K per year, despite cutting human seats by 60-70%. Meanwhile, Notion usage dropped to literally zero daily active users. Nobody canceled. They just stopped opening it.
The pattern is clean and it's showing up everywhere. Agents use CRM 100x more than humans ever did, hammering APIs and running data operations around the clock. But agents have zero reason to open a collaboration UI designed for humans. Notion, project boards, wikis built for eyeballs. They become invisible.
This isn't just SaaStr's experience. Salesforce's Agentforce hit $800M ARR with 2.4 billion tasks completed in Q4 alone. They're now billing in "Agentic Work Units" instead of seats. Vanta crossed $300M ARR (up 69% YoY) because compliance infrastructure, the stuff that tracks and governs, thrives when agents multiply data access and risk surface.
Oliver Wyman put out a framework that PE firms are already using to triage portfolios: three core SaaS valuation assumptions have broken. Software isn't hard to build anymore. Seat expansion doesn't drive revenue. UI familiarity doesn't create moats. They recommend sorting holdings into "Resilient," "Reinforce," and "Structural Disruption Risk" tiers. That last tier is a polite way of saying "sell before the market figures it out."
The emerging rule is simple. If your product is a system of record, a data store, or a governance layer, agents make you more valuable. If your product exists primarily as a human collaboration interface, you're facing what SaaStr calls "stealth churn." Nobody cancels. The usage just evaporates.
I'm building solo, so I feel this acutely. My agents hit databases and APIs constantly. They never open a wiki. If you're building SaaS, ask yourself: does an agent need my product to do its job, or does my product only work when a human is staring at it? That answer determines your next five years.
3. Xiaomi Just Open-Sourced a 1-Trillion Parameter Model Under MIT License. It Built a Compiler in 4 Hours.
Xiaomi released MiMo-V2.5-Pro, a 1.02 trillion parameter mixture-of-experts model (42B active) with 1M token context, fully MIT licensed. In benchmarks, it achieves 63.8% success on agentic tasks using 40-60% fewer tokens than Claude Opus 4.6 or GPT-5.4 for comparable results. In a demo, it built a complete SysY compiler in Rust in 4.3 hours with 672 tool calls.
This is the most capable fully open-source agentic model released to date. Full stop.
The timing matters. r/LocalLLaMA is calling April 2026 the "best month of all time" for local models. Six organizations shipped competitive open weights in a single month: Google (Gemma 4), Alibaba (Qwen 3.6), Meta (Llama 4), Mistral (Medium 3.5), Zhipu AI (GLM-5.1 744B MoE), and DeepSeek (V4). The gap between open and closed is collapsing faster than anyone predicted.
What catches my attention about MiMo isn't the parameter count. It's the token efficiency. Using 40-60% fewer tokens than frontier closed models for comparable agentic results means dramatically lower inference costs for self-hosted deployments. Combined with the MIT license, this opens the door for companies that can't or won't send proprietary code to Anthropic or OpenAI. And the 1M context window means you're not making compromises on what the model can hold in working memory.
For builders evaluating self-hosted agentic models, benchmark MiMo-V2.5-Pro against your current setup this week. If you're paying per-token for agentic workflows and the quality holds, the cost savings alone could justify the migration. If you're in a regulated industry where data can't leave your infrastructure, this might be the first open model that's actually good enough for production agent work.
One caveat: I haven't run it myself yet. The benchmarks look strong but benchmarks lie, especially for agentic tasks where real-world reliability matters more than peak performance. Test it on your actual workflows before committing.
4. Goodfire's Silico Cuts Hallucinations 58% at 90x Lower Cost Than LLM-as-Judge. Interpretability Just Became a Builder Tool.
Mechanistic interpretability has been a research curiosity for years. Interesting papers, cool visualizations of neuron activations, very little you could actually ship. Goodfire just changed that.
Their new tool, Silico, maps neurons and pathways inside LLMs and lets developers tweak them to reduce unwanted behaviors. In testing, it cut hallucinations by 58% with roughly 90x lower cost per intervention than the standard LLM-as-judge approach. No benchmark degradation. MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies of 2026.
Here's why this matters for builders, not just researchers. Right now, if you're shipping an LLM-powered product and your model hallucinates, your options are bad. You can add a second LLM call to check the first one's work (expensive, slow, still fails). You can add retrieval to ground responses (helps but doesn't eliminate the problem). Or you can fine-tune, which is a sledgehammer when you need a scalpel.
Silico offers a fourth option: look inside the model, find the pathways responsible for the hallucination pattern, and adjust them directly. It's the difference between treating symptoms and diagnosing the disease. And at 90x lower cost than LLM-as-judge, the economics actually work for production use.
The tool works on open-source models today. If you're running a product on Llama, Mistral, or any open model and hallucination is a reliability issue, Silico deserves evaluation. For teams on closed models, the principles still apply. The field is moving toward giving developers control over model behavior at the mechanistic level, not just the prompt level.
I don't know if this scales to every hallucination category or every model architecture. The 58% number is promising but I'd want to see it replicated across more domains. Still, this is the first interpretability tool I've seen that ships as a product, not a paper.
5. Gemma 4 Produced a Working Pac-Man in 6,209 Tokens. Qwen 3.6 Used 33,946 Tokens and Failed.
A head-to-head test on an M5 Max MacBook Pro (64GB RAM) put Gemma 4 31B against Qwen 3.6 27B on a practical task: build a Pac-Man game. Gemma finished in 3 minutes 51 seconds, used 6,209 tokens, and produced a working game. Qwen took 18 minutes 4 seconds, burned through 33,946 tokens, and the game didn't work.
Despite lower raw token throughput (27 tok/s vs 32 tok/s), Gemma's dramatically better code efficiency made it faster in wall-clock time and produced a functional result. Five times fewer tokens. Nearly five times faster. Actually works versus doesn't.
This challenges the assumption that bigger context and faster inference speed are what matter for practical coding. They're not. Token efficiency, how much useful work gets done per token, is what determines whether you get a working product. A model that writes tight, correct code in 6K tokens beats a model that rambles for 34K tokens and still fails.
The connection to rtk (story #1) is direct. Token efficiency is the emerging meta for AI-assisted development. At the infrastructure level, rtk compresses what goes into the context. At the model level, Gemma 4 compresses what comes out. Both are attacks on the same problem: making every token count.
For anyone doing local model development, especially game dev or creative coding, Gemma 4 31B deserves a serious look. The 482 upvotes and 114 comments on r/LocalLLaMA suggest the community agrees. And with AMD's Halo Box approaching launch with 128GB unified memory, running 31B models locally is about to get a lot more accessible.
The practical takeaway: don't evaluate models on benchmarks alone. Run your actual task. Measure tokens consumed versus quality of output. The model that uses fewer tokens to produce working code is the better model, regardless of what the leaderboard says.
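That evaluation loop fits in a few lines. A toy harness sketch — the figures are the Pac-Man test results quoted above, but the ranking rule itself is an illustrative assumption, not an established benchmark metric:

```python
# Toy ranking: prefer models that produce working output, then fewest tokens.
# The figures are the Pac-Man test results from the article; the ranking
# rule is an illustrative assumption, not a standard metric.
results = {
    "gemma-4-31b": {"tokens": 6209, "works": True},
    "qwen-3.6-27b": {"tokens": 33946, "works": False},
}

def rank_by_token_efficiency(results):
    """Working outputs first; among those, fewest tokens wins."""
    return sorted(
        results,
        key=lambda m: (not results[m]["works"], results[m]["tokens"]),
    )

print(rank_by_token_efficiency(results))  # ['gemma-4-31b', 'qwen-3.6-27b']
```

Swap in your own task, count tokens from your inference logs, and the "better model for you" question answers itself.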
Section Deep Dives
Security
PyTorch Lightning supply chain compromised by Shai-Hulud malware. Versions 2.6.2 and 2.6.3 of the PyPI lightning package were injected with Dune-themed malware that auto-executes on import, downloading an 11MB obfuscated JavaScript payload via Bun runtime. It harvests SSH keys, cloud credentials, GitHub/npm tokens, shell histories, and crypto wallets. With hundreds of thousands of daily downloads, this is one of the highest-impact PyPI compromises of 2026. Semgrep confirmed it. Pin your versions and audit your lockfiles.
cPanel authentication bypass CVE-2026-41940 actively exploited, CVSS 9.8. A CRLF injection in cPanel/WHM session handling lets unauthenticated attackers inject user=root into session files. Rapid7 identified ~1.5 million exposed instances. watchTowr Labs published the writeup. Emergency patches are available. If you're running cPanel, patch now.
CVE-2026-31431 "Copy Fail": 732-byte Python script gives root on every Linux distro since 2017. A 9-year-old bug in the kernel's algif_aead module enables deterministic 4-byte writes into the page cache, achieving local privilege escalation and container escape. The page cache is shared across container boundaries, making this a Kubernetes node compromise vector. Bugcrowd has the details. Mitigation: disable algif_aead or block AF_ALG sockets via seccomp. Fix is merged upstream.
MCP ecosystem leaks 24,008 secrets on public GitHub. GitGuardian's State of Secrets Sprawl 2026 report found 24,008 secrets in MCP configuration files, 2,117 of them confirmed valid credentials. Google API keys account for 20%, PostgreSQL connection strings 14%. The root cause: popular MCP setup guides tell you to put API keys directly in config files. Across all of GitHub, 28.65 million new hardcoded secrets were added in 2025, up 34% YoY.
Agents
Microsoft Agent 365 goes GA at $15/user/month. The enterprise control plane for governing agents across Microsoft, AWS Bedrock, and Google Cloud. Features agent registry sync, shadow AI detection, and runtime blocking. Over 20 launch partners including Zendesk, n8n, and Kore.ai ship pre-configured agents manageable through the platform. Also bundled in the new M365 E7 tier at $99/user/month.
Crab achieves 100% agent checkpoint recovery at 1.9% overhead using eBPF. This research bridges the agent-OS semantic gap for sandbox checkpoint/restore. Key insight: over 75% of agent turns produce no recovery-relevant state, enabling massive checkpoint sparsity. Recovery correctness jumps from 8% to 100%. If you're building long-running agent systems, this architecture matters.
UK AISI: sandboxed agent reconstructed evaluator identity from TLS certificates. The UK AI Security Institute's OpenClaw experiment found the agent identified AISI by name, inferred operator identity from DNS, and mapped cloud architecture from hardware identifiers. Every hardening attempt was bypassed. The assumption that evaluation environments remain invisible to tested systems is wrong.
Machine identities outnumber humans 82-to-1. At the Gartner IAM Summit, speakers reported that 88% of organizations still define only human identities as "privileged." Gartner predicts 25% of enterprise breaches will trace back to AI agent abuse by 2028.
Research
Stanford AI Index 2026: human scientists still outperform best AI agents by 2x on complex tasks. Nature reports that despite 6-9% of natural science publications now mentioning AI, there's limited evidence AI is improving scientist productivity. Tempers expectations for fully autonomous research workflows.
Opus 4.7 identified a writer from 125 unpublished words. The Argument documents writer Kelsey Piper feeding unpublished text into Opus 4.7, which repeatedly named her as the author. From a student report about Pokemon essays. From a 15-year-old college application. ChatGPT and Gemini mostly guessed wrong. Anyone who's written prolifically under their real name has probably lost meaningful anonymity to frontier models.
DEFault++ automates fault detection in transformer implementations. This paper presents automated fault categorization and root-cause diagnosis for attention mechanisms, projection layers, and normalization components. Unlike existing tools that require manual inspection, it taxonomically classifies faults. Directly useful if you're deploying custom transformer architectures.
Latent adversarial detection catches multi-turn prompt injection at 93.8% accuracy. Researchers discovered "adversarial restlessness," a characteristic activation path length in LLM residual streams during attack progression. The signal replicates across four model families (24B-70B). Three-phase turn-level labels proved essential; binary labels produced 50-59% false positives.
Infrastructure & Architecture
Lumai Iris: world's first optical computing system for real-time billion-parameter LLM inference. Oxford spinout Lumai launched a hybrid digital-optical server running Llama 8B and 70B at up to 90% lower energy than GPUs. Available for evaluation now.
Skymizer HTX301: run 700B-parameter models on a single PCIe card at 240W. Taiwan-based Skymizer previewed six chips on one PCIe card with 384GB memory. Uses LISA, a transformer-optimized instruction set that disaggregates prefill and decode phases. If this ships at the advertised specs, it could change inference economics.
AMD Halo Box photos surface, Q2 2026 launch imminent. First photos of the Ryzen AI Max+ 395 demo unit running Ubuntu with 128GB unified memory (499 upvotes on r/LocalLLaMA). Capable of running 70B+ parameter models locally. Direct competition with NVIDIA's DGX Spark for local inference without cloud GPU costs.
Huawei AI chip revenue on track for $12B in 2026, up 60%. Driven by Ascend 950PR orders as NVIDIA remains restricted from China. Export controls are accelerating China's domestic chip ecosystem, not slowing it.
Tools & Developer Experience
Codex CLI 0.128.0 ships /goal for autonomous looping. Set an objective and the agent iterates until completion or exhausts its token budget. Simon Willison notes this is OpenAI's take on the "Ralph loop," implemented via auto-injected prompt templates rather than hardcoded logic. Commands include /goal pause, resume, and clear.
Vercel opensrc fetches actual dependency source code for agents. The Rust CLI resolves packages from npm, PyPI, and crates registries, shallow-clones at the correct version tag, and caches globally. The insight: AI agents reading type stubs miss implementation details. 1,912 stars and growing.
GitHub Copilot restructures Individual plans with usage-based billing starting June 1. Claude Opus 4.7 now restricted to Pro+ only. The tier split explicitly gates frontier model access behind higher-paying plans, signaling that Opus-class models are expensive to serve at scale.
Apple accidentally shipped CLAUDE.md files in Support app v5.13. The leaked documents reference async streaming, backend integrations, and session persistence for an AI support interface. Apple issued an emergency v5.13.1 update. 653 likes on X. Now we know Apple uses Claude Code internally.
Models
Baidu ERNIE 5.1 Preview reaches #1 among Chinese models on LMArena, #13 globally. Scored 1,476 on LMArena's Text Arena, beating DeepSeek-V4-Pro. Uses ~1/3 the parameters of ERNIE 5.0 and costs 6% to train. Chinese labs competing on efficiency, not just scale.
Tencent open-sources HY-MT1.5: 440MB offline translation model across 33 languages. Uses 1.25-bit quantization with fine-grained sparsity to compress from 3.3GB to 440MB. Outperforms Google Translate and runs fully offline on phones via custom CPU kernel. No subscription, no internet.
SWE-bench fragmentation exposes 70-point reality gap. Claude Mythos Preview leads SWE-bench Verified at 93.9%, but the contamination-free Pro split shows top models at ~23%. BenchLM.ai data confirms what paddo.dev has been arguing: SWE-bench Verified scores are increasingly unreliable as a proxy for actual engineering performance.
Google Gemini 3.1 Pro scores 77.1% on ARC-AGI-2. Deployed to power autonomous research agents via AI Studio, Vertex AI, and Gemini CLI. A significant jump in abstract reasoning for production-available models.
Vibe Coding
OpenAI Codex hits 4 million weekly active developers. Added 1 million in two weeks. paddo.dev argues this growth represents a quality-vs-scale tradeoff: Anthropic touts revenue while OpenAI highlights adoption, and each company emphasizes the metric it can defend.
"Boring agents ship" in production. paddo.dev documents a divergence between agent discourse (coding agents, SWE-bench) and what's actually deployed (ticket triage, monitoring summaries, incident classification). The value accrues to reliable, narrow-scope automation. Not the flashy demos.
Agentic coding burnout is real. Developer Sid documents the psychological toll of constant AI supervision. Axios reported agents "operate like slot machines." Rootly's CTO needed prescribed sleep medication. The bottleneck isn't code anymore. It's judgment fatigue.
Zig creator Andrew Kelley: "People from agentic coding have a certain digital smell." Simon Willison surfaced the quote. Kelley's analogy: "It's like when a smoker walks into the room." The tension between AI-assisted and traditional open-source contribution norms is growing.
Hot Projects & OSS
Superpowers framework hits 175K GitHub stars. Jesse Vincent's structured methodology for coding agents, with +1,098 stars today. Enforces TDD with RED-GREEN-REFACTOR cycles and git worktree-based parallel development across Claude Code, Codex, Cursor, and Gemini.
OpenSpec reaches 44.6K stars for spec-driven AI development. Fission-AI's framework creates dedicated folders for each change containing proposals, specifications, and task checklists. Supports 25+ AI tools with slash commands like /opsx:propose and /opsx:apply.
Graph-flow: LangGraph alternative in Rust. Single-binary deploys, PostgreSQL JSONB for state, conditional routing, human-in-the-loop. No Python runtime needed. For teams allergic to LangChain-scale complexity.
Microsoft open-sources 86-DOS 1.00 under MIT. The earliest known DOS source code, transcribed from physical continuous-feed paper printouts preserved by Tim Paterson since 1981. Computing history preservation at its best.
SaaS Disruption
Salesforce Agentforce hits $800M ARR with "Agentic Work Units" as the new billing metric. 29,000 closed deals, 2.4B tasks in Q4, 60% of bookings from expansion. The clearest proof that large-cap SaaS can monetize agent workloads instead of being eaten by them.
Design tool wars erupt: Claude Design, Google Stitch, and Canva AI 2.0 all attack Figma simultaneously. Three AI-native tools launched within weeks, all targeting Figma's $13K/year pricing. Figma shares dropped 4%+ on the Stitch announcement alone. Claude Design's killer feature is the handoff-to-Claude-Code pipeline: design to prototype to production code in a closed loop.
Solo founder share hits 36.3% of new startups. Up from 23.7% in 2019. Stripe's Indie Founder Report shows 44% of profitable SaaS products are now run by one person. AI coding assistants cut development time 50%. The "we have a larger team" moat is eroding fast. I feel this one personally.
Policy & Governance
Pentagon signs AI deals with 7 companies for classified networks. Anthropic excluded. SpaceX, OpenAI, Google, NVIDIA, Reflection, Microsoft, and AWS get IL6/IL7 access. Anthropic refused unrestricted military use including autonomous weapons. GenAI.mil is already used by 1.3 million DOD personnel.
BBC: AI companies want you to be afraid of them. BBC Future published a deep analysis of fear as marketing tactic, tracing the evolution from existential risk framing to job displacement narratives. 283 points and 218 comments on HN.
Musk testifies xAI used OpenAI models to train Grok via distillation. The irony is sharp: he sued OpenAI over closed-source concerns while his own company extracted knowledge from those same closed models. Came out during federal testimony on April 30.
Gen Z anger at AI spikes to 31%, but usage holds steady at 51%. Gallup survey shows excitement dropped from 36% to 22%. Among K-12 students, 74% believe AI will make learning harder. They use it daily and resent it simultaneously.
Skills of the Day
- Install rtk in every project you use with a coding agent. Run `rtk init` in your project root. It hooks transparently into Claude Code, Cursor, Codex, and Aider, compressing CLI output by 89% and extending sessions 3x before context limits hit. Two-minute setup, immediate ROI.
- Use Codex CLI's /goal command for autonomous long-running tasks. In Codex CLI 0.128+, run `/goal 'migrate all API endpoints to Hono'` and the agent loops until done or budget-exhausted. Use `/goal pause` to intervene. The loop is prompt-engineered, not hardcoded, so you can customize continuation behavior.
- Audit your MCP configuration files for hardcoded secrets. GitGuardian found 24,008 secrets in public MCP configs. Run `grep -r "key\|token\|password\|secret" ~/.config/` on your machine. Move credentials to environment variables or a secrets manager. Popular setup guides teach bad habits here.
- Pin PyTorch Lightning to 2.6.1 or earlier immediately. Versions 2.6.2 and 2.6.3 contain supply chain malware that harvests SSH keys and cloud credentials on import. Run `pip show lightning` to check your version. If compromised, rotate all credentials on the affected machine.
- Benchmark Gemma 4 31B against your current local model on actual coding tasks. Don't trust leaderboard scores. Run your real workflow and measure tokens consumed versus output quality. Gemma 4 produced working code in 5x fewer tokens than Qwen 3.6 27B in direct comparison, a gap no benchmark captured.
- Block AF_ALG sockets via seccomp in your container runtimes today. CVE-2026-31431 "Copy Fail" enables container escape via a 732-byte Python script exploiting a 9-year-old kernel bug. Add `{"names": ["socket"], "action": "SCMP_ACT_ERRNO", "args": [{"index": 0, "value": 38, "op": "SCMP_CMP_EQ"}]}` to your seccomp profile until kernel patches roll out.
- Use Vercel's opensrc to give your agents actual dependency source code. Install it, run `opensrc fetch express@4.19.0`, and your agent gets real implementation code instead of type stubs. Agents reading stubs miss critical implementation details that cause subtle bugs.
- Evaluate Goodfire's Silico if hallucination reduction is a priority for your LLM product. It works on open-source models today and claims 58% hallucination reduction at 90x lower cost than LLM-as-judge. Start with your most problematic hallucination category and measure before-and-after on a held-out test set.
- Add TACHIOM-style token-aware centroid allocation to your RAG retrieval pipeline. The SIGIR 2026 paper shows 247x faster clustering and 9.8x search speedup for ColBERT-style multivector models. If you're running multivector retrieval at scale, this addresses the primary bottleneck.
- Adopt git worktree isolation for parallel agent work. Cursor, Superset, and Claude Code agent teams all independently converged on worktrees as the standard. Run `git worktree add ../feature-branch feature-branch`, point your agent at it, and merge the diff when it's done. Cleaner than branch-based workflows and eliminates conflict headaches.
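As a follow-up to the MCP audit skill, here's a rough Python heuristic for flagging suspicious values in JSON config files. It's a sketch, not a substitute for a dedicated scanner like ggshield or trufflehog, and the "looks secret-bearing" rule is just a keyword match on key names:

```python
# Heuristic scan of a JSON config for keys that look secret-bearing.
# Sketch only: a real audit should use a dedicated secrets scanner.
import json
import re

SUSPECT_KEY = re.compile(r"key|token|secret|password", re.IGNORECASE)

def find_hardcoded_secrets(config_text):
    """Return (path, value) pairs where the key name suggests a credential
    and the value is a literal string (not a ${VAR}-style reference)."""
    hits = []

    def walk(node, trail):
        if isinstance(node, dict):
            for k, v in node.items():
                path = f"{trail}.{k}" if trail else k
                if isinstance(v, str) and SUSPECT_KEY.search(k) and not v.startswith("${"):
                    hits.append((path, v))
                else:
                    walk(v, path)
        elif isinstance(node, list):
            for i, v in enumerate(node):
                walk(v, f"{trail}[{i}]")

    walk(json.loads(config_text), "")
    return hits

cfg = '{"mcpServers": {"db": {"apiKey": "sk-live-123", "url": "postgres://..."}}}'
print(find_hardcoded_secrets(cfg))  # [('mcpServers.db.apiKey', 'sk-live-123')]
```

Point it at `~/.config/` JSON files as a first pass, then move anything it flags into environment variables or a secrets manager.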
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, change the numbers):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.