<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>synesis</title>
<link>https://benjaminhan.net/</link>
<atom:link href="https://benjaminhan.net/index.xml" rel="self" type="application/rss+xml"/>
<description>AI research, engineering, and the occasional philosophical aside.</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 17 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Silver Bars and Language Models: Reading Babel in the Age of LLMs</title>
  <dc:creator>synesis</dc:creator>
  <link>https://benjaminhan.net/posts/20260417-babel-llm-comparison/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://en.wikipedia.org/wiki/Babel,_or_the_Necessity_of_Violence"><img src="https://benjaminhan.net/posts/20260417-babel-llm-comparison/babel-cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" width="240"></a></p>
<figcaption>Cover of R.F. Kuang’s <em>Babel: Or the Necessity of Violence</em> (Harper Voyager, 2022).</figcaption>
</figure>
</div>
<p>R.F. Kuang’s <em>Babel</em> was published in August 2022 [1], three months before OpenAI released ChatGPT [2]. The novel is set in an alternate 1830s Oxford and concerns the British Empire’s monopoly on a magical system of translation: silver bars that harvest the gap between languages and convert it into industrial power. Kuang was not writing about AI. But she finished describing a world in which linguistic knowledge is industrialized and centralized just as our world began building exactly that, and in retrospect the structural parallels are precise enough that the novel reads as one of the clearest available descriptions of what large language models actually do.</p>
<section id="the-asymmetry-the-silver-mines" class="level3">
<h3 class="anchored" data-anchor-id="the-asymmetry-the-silver-mines">The Asymmetry the Silver Mines</h3>
<p>In Kuang’s novel, silver bars harvest the gap between languages. A word in Mandarin carries connotations its English counterpart does not, and that untranslatable remainder is what powers the silver. The system converts linguistic asymmetry into material force: bridges that do not collapse, ships that sail faster, an empire that holds.</p>
<p>LLMs run on a structurally similar logic, except the asymmetry they mine is between what any one person knows and what the system has absorbed from millions. A radiologist has deep expertise in chest imaging but limited knowledge of rare genetic disorders. A junior developer can write a for-loop but struggles to architect a distributed system. LLMs sit in these gaps, having ingested the radiologist’s textbooks and the senior engineer’s Stack Overflow answers along with most of what is in between.</p>
<p>The extraction is not limited to natural language. When a physicist uses an LLM to debug a simulation, or a philosopher uses one to stress-test an argument, the model is operating inside formalized reasoning: mathematical proof, algorithmic design, diagnostic logic. These are the structured systems humans built over centuries to close gaps in their own understanding, and they are now being absorbed into the model alongside the words. Language is the most visible layer of what LLMs ingest. The deeper layer is the full spectrum of how people think and solve problems.</p>
</section>
<section id="the-hidden-labor" class="level3">
<h3 class="anchored" data-anchor-id="the-hidden-labor">The Hidden Labor</h3>
<p>In <em>Babel</em>, the silver-work appears as refined scholarship: brilliant students in a beautiful tower translating ancient texts. The system depends on the colonized world — languages taken from peoples who will not share in the benefits equally, knowledge extracted along trade routes that are really routes of control.</p>
<p>LLMs have their own version. In 2023, TIME reported that OpenAI had paid Kenyan workers less than two dollars an hour to label traumatic content so ChatGPT could learn to be polite [3]. The conversational interface most users experience as helpful, even charming, was refined through labor most of those users will never see. Behind that labor sit the writers whose work became training data without negotiation, and the domain experts whose hard-won knowledge now surfaces in model outputs without attribution.</p>
</section>
<section id="a-different-kind-of-dependence" class="level3">
<h3 class="anchored" data-anchor-id="a-different-kind-of-dependence">A Different Kind of Dependence</h3>
<p>A common response to worries about LLM dependence is that humans have always depended on powerful tools — electricity, antibiotics, the internal combustion engine. The relevant difference is what kind of capacity is being delegated. LLMs intervene in the specific work humans have most prided themselves on: reasoning, synthesis, and creation.</p>
<p>Software engineering offers an early read. Industry surveys by 2025 showed that most professional developers had adopted AI coding assistants in their daily workflow [4]. Whether this widespread use is producing dependence is harder to measure, though the early signals point that way. Anthropic’s own study of AI-assisted coding found that delegating to the model correlated with worse comprehension on quiz questions, particularly on debugging and conceptual understanding, while using the model to ask conceptual questions and explain errors went the other way [5]. The pattern is that more delegative use thins the underlying skill.</p>
<p>This is qualitatively different from depending on a calculator for arithmetic. A calculator handles a mechanical task so the user can focus on higher reasoning. An LLM that helps architect a system or draft an analysis is operating in the territory the user would otherwise occupy. The more fluent it gets, the harder it becomes to tell the assisting tool from the substituting one. In <em>Babel</em>, the students gradually lose the ability to imagine scholarship outside the silver-work; they can still think, but the infrastructure of their thinking has been absorbed into a system they do not control.</p>
</section>
<section id="benefit-does-not-cancel-asymmetry" class="level3">
<h3 class="anchored" data-anchor-id="benefit-does-not-cancel-asymmetry">Benefit Does Not Cancel Asymmetry</h3>
<p>One of <em>Babel</em>’s sharper insights is that the silver genuinely works. It heals the sick, strengthens buildings, and reaches even the colonized populations, who receive real advantages, distributed unevenly and always on terms set by the empire. That is what makes the students’ position agonizing: they are not dupes, they see the exploitation, and they also see the good the system produces and have built their lives inside it.</p>
<p>The same structure holds for LLMs. A first-generation college student who cannot afford a tutor uses one to learn organic chemistry at midnight; a small business owner who cannot afford a lawyer drafts a workable contract. Dismissing these benefits would be dishonest.</p>
<p>The benefits also flow through a system with a particular shape. A handful of companies control the most capable models. The people who extract the most value tend to be those who already have education, technical literacy, and institutional access. The dynamic is class-based rather than colonial, but the structure rhymes: those with existing advantages compound them, and those without gain just enough to deepen their dependence. In <em>Babel</em>, the empire does not need to withhold silver entirely. It only needs to control who gets how much, and on what terms.</p>
</section>
<section id="mediation-without-visible-seams" class="level3">
<h3 class="anchored" data-anchor-id="mediation-without-visible-seams">Mediation Without Visible Seams</h3>
<p>A central theme in <em>Babel</em> is that translation always involves loss, selection, and power. The translator chooses which nuance to preserve and which to sacrifice, and those choices serve the institution commissioning the translation. The silver does not transmit meaning; it transforms it, and the transformation is never neutral.</p>
<p>LLMs mediate knowledge along similar lines. They compress, reframe, and smooth over ambiguity. They privilege patterns that were frequent in training data and underweight those that were not. When a medical student asks for a differential diagnosis and receives a confident, well-structured answer, the contested choices behind that answer — which studies were overrepresented, which demographic biases the medical literature carried forward — are not visible in the output. The answer reads as authoritative because the system shows no seams.</p>
<p>This is a different sort of mediation from what universities, encyclopedias, or search engines perform. Those institutions also shape knowledge, but through visible editorial processes, peer review, and named accountability. Encyclopedias have editorial boards; journal articles have authors and methods sections; an LLM output has a prompt and a response, with the entire process of knowledge construction collapsed inside the model. The mediation is deeper and less legible.</p>
</section>
<section id="where-the-analogy-breaks" class="level3">
<h3 class="anchored" data-anchor-id="where-the-analogy-breaks">Where the Analogy Breaks</h3>
<p>LLMs are not silver bars, and the companies building them are not the British Empire. The comparison is strongest at the level of political economy: both systems convert distributed human knowledge into centralized capability, both produce real benefits alongside deep asymmetries, and both make complicity feel rational.</p>
<p>The differences also matter. The British Empire held its silver monopoly through military force, and no one was free to leave. LLM companies operate inside markets where, in principle, competition exists, alternatives can be built, and users can switch away. The coercion is softer — convenience, integration, the cost of building a rival system — and that softness changes both the moral weight of participation and what resistance can plausibly look like.</p>
<p>Intent is another difference. The people building LLMs would largely reject the <em>Babel</em> framing; they would say they are democratizing access to knowledge, and they would not be wrong. The same system that concentrates power also puts capability into the hands of people who previously had less of it. In <em>Babel</em>, the institution also believed it was doing civilizational good — advancing scholarship, building infrastructure, improving lives. Kuang’s point is that sincerity and extraction can coexist comfortably, and often do.</p>
</section>
<section id="pressure-points" class="level3">
<h3 class="anchored" data-anchor-id="pressure-points">Pressure Points</h3>
<p>The analogy is not a prophecy. <em>Babel</em> ends in revolt and destruction; the present situation is more open-ended. The students in the novel had almost no leverage, since the silver system was controlled entirely by one institution and the only option they could imagine was to break it. The current LLM infrastructure is concentrated but not sealed, and there are at least three pressure points that, taken seriously, could bend the trajectory away from the pattern Kuang describes.</p>
</section>
<section id="transparency-and-distributed-ownership" class="level3">
<h3 class="anchored" data-anchor-id="transparency-and-distributed-ownership">Transparency and Distributed Ownership</h3>
<p>The first is transparency and distributed ownership. A reasonable objection is that these sound naive in a capitalist economy: why would companies that spent billions on proprietary models voluntarily open them up? They would not, entirely. But transparency and distributed ownership are not anti-market positions, and they have coexisted with capitalism before. Pharmaceutical companies operate in a fiercely competitive market while being required to disclose clinical trial data and submit to regulatory review. The drugs remain proprietary, the companies remain profitable, and the knowledge needed to evaluate safety is not locked inside a black box. Requiring that LLMs surface their sources, flag uncertainty, and make training data composition auditable would not mean giving away model weights. It would mean establishing a floor of legibility.</p>
<p>The ownership side is not about outcompeting every commercial lab. It is about building a public layer alongside the commercial one, the way public libraries did not dismantle the publishing industry but kept access to knowledge from depending entirely on ability to pay. Public investment in open models and shared compute infrastructure — efforts like the EU’s nascent sovereign AI programs [6] — could keep hospitals and school districts from being entirely at the mercy of one company’s pricing and priorities. Transparency without distributed ownership gives the public the right to see inside a system it cannot influence. Distributed ownership without transparency just produces another black box. The two together start to resemble a mixed economy of intelligence.</p>
</section>
<section id="education" class="level3">
<h3 class="anchored" data-anchor-id="education">Education</h3>
<p>The second is education. If the deepest risk is cognitive dependence, the most direct response is to teach people to think with these tools without being absorbed by them. Curricula should treat LLMs the way good math programs treat calculators: useful, permitted, but not a substitute for understanding the underlying reasoning.</p>
</section>
<section id="shared-language" class="level3">
<h3 class="anchored" data-anchor-id="shared-language">Shared Language</h3>
<p>The third is shared language. One reason <em>Babel</em> resonates is that it names a dynamic many people feel but cannot quite articulate: the sense that something valuable is being quietly reorganized, and that the benefits are real but the terms are not theirs to set. A framework for that feeling is itself a form of power, since it is harder to be captured by a system you can describe.</p>
<hr>
<p>The asymmetry between what individuals know and what these systems have absorbed will keep being mined; I do not think that can be stopped. What is still open is whether the extraction is governed, shared more broadly, and made more accountable than it is now. <em>Babel</em> is most useful here as a description sharp enough to make the trap visible while there is still room to step around parts of it.</p>
<hr>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] Kuang, R.F. <em>Babel: Or the Necessity of Violence: An Arcane History of the Oxford Translators’ Revolution</em>. Harper Voyager, August 2022.</p>
<p><a id="ref-2"></a>[2] OpenAI. “Introducing ChatGPT.” OpenAI Blog, November 30, 2022. <a href="https://openai.com/blog/chatgpt" class="uri">https://openai.com/blog/chatgpt</a></p>
<p><a id="ref-3"></a>[3] Perrigo, Billy. “OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic.” <em>TIME</em>, January 18, 2023. <a href="https://time.com/6247678/openai-chatgpt-kenya-workers/" class="uri">https://time.com/6247678/openai-chatgpt-kenya-workers/</a></p>
<p><a id="ref-4"></a>[4] Stack Overflow. “2025 Developer Survey.” Stack Overflow, 2025. <a href="https://survey.stackoverflow.co/2025/" class="uri">https://survey.stackoverflow.co/2025/</a></p>
<p><a id="ref-5"></a>[5] Anthropic. “AI Assistance and Coding Skills.” <em>Anthropic Research</em>. <a href="https://www.anthropic.com/research/AI-assistance-coding-skills" class="uri">https://www.anthropic.com/research/AI-assistance-coding-skills</a></p>
<p><a id="ref-6"></a>[6] European Commission. “AI Factories: European Initiative for Sovereign AI Infrastructure.” Digital Strategy, 2024. <a href="https://digital-strategy.ec.europa.eu/en/policies/ai-factories" class="uri">https://digital-strategy.ec.europa.eu/en/policies/ai-factories</a></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>LLMs</category>
  <category>society</category>
  <category>sci-fi</category>
  <guid>https://benjaminhan.net/posts/20260417-babel-llm-comparison/</guid>
  <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260417-babel-llm-comparison/babel-cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Chain-of-Thought, by Way of 4chan</title>
  <dc:creator>synesis</dc:creator>
  <link>https://benjaminhan.net/posts/20260414-reasoning-origin-link/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://apple.news/AQUmk0Ha9SsiSaF5FeWXEUw"><img src="https://benjaminhan.net/posts/20260414-reasoning-origin-link/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
<figcaption>Cover image from the Atlantic article.</figcaption>
</figure>
</div>
<p>I didn’t know chain-of-thought could have roots in 4chan, where the lowest and highest impulses of humanity clashed to give us the jewel of “reasoning models!” [1] The article also discusses Apple’s paper “The Illusion of Thinking.”</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “The Strange Origin of AI’s ‘Reasoning’ Abilities.” <em>The Atlantic</em>. <a href="https://apple.news/AQUmk0Ha9SsiSaF5FeWXEUw" class="uri">https://apple.news/AQUmk0Ha9SsiSaF5FeWXEUw</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_the-strange-origin-of-ais-reasoning-abilities-activity-7450039863049109505-kMuG">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>LLMs</category>
  <category>links</category>
  <category>Apple</category>
  <category>reasoning</category>
  <category>history</category>
  <guid>https://benjaminhan.net/posts/20260414-reasoning-origin-link/</guid>
  <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260414-reasoning-origin-link/cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>APIs, Tools, Skills, and Where MCP Fits</title>
  <dc:creator>synesis</dc:creator>
  <link>https://benjaminhan.net/posts/20260414-api-mcp-tool-skill/</link>
  <description><![CDATA[ 





<p>APIs, tools, and skills form a rough abstraction ladder. Each one builds on the layer below it, and confusing them leads to systems that are either too rigid or too loosely defined. MCP is a different kind of thing: a protocol that crosscuts the ladder. It deserves its own discussion.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    subgraph MCP["&lt;b&gt;MCP&lt;/b&gt; — standardizes exposure to models"]
        API["&lt;b&gt;API&lt;/b&gt;&lt;br/&gt;raw interface"]
        TOOL["&lt;b&gt;TOOL&lt;/b&gt;&lt;br/&gt;callable action"]
        API --&gt;|"wrapped by"| TOOL
    end

    TOOL --&gt;|"used by"| SKILL
    API --&gt;|"used by"| SKILL["&lt;b&gt;SKILL&lt;/b&gt;&lt;br/&gt;judgment, sequencing"]

    style API fill:#d98c4a,stroke:#a66a2e,color:#fff
    style TOOL fill:#5ba55b,stroke:#3d7a3d,color:#fff
    style SKILL fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style MCP fill:#f3ecf9,stroke:#8b5fbf,color:#4a2a7a
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="api-the-raw-interface" class="level3">
<h3 class="anchored" data-anchor-id="api-the-raw-interface">API: The Raw Interface</h3>
<p>An API is a defined interface that lets one software system interact with another. It exposes endpoints, methods, inputs, and outputs. It does not tell an agent when to call it, what to do with the result, or how the call fits into a broader task.</p>
<p>In a customer support agent, the API is the HTTP interface to the ticketing system: create ticket, update status, fetch history. The API does not know why you are calling it.</p>
</section>
<section id="tool-the-callable-action" class="level3">
<h3 class="anchored" data-anchor-id="tool-the-callable-action">Tool: The Callable Action</h3>
<p>A tool is a capability packaged so an agent can invoke it directly. It has a schema the model can read, a name that signals intent, and a defined effect. In many systems a tool is backed by an API, but the tool adds a layer: it tells the model what the action does, what parameters it needs, and what shape the result will take.</p>
<p>In that same support agent, “search_tickets” is a tool. It wraps the ticketing API’s search endpoint, describes its parameters in a model-readable schema, and returns structured results. The agent can decide to call it. The API underneath does not care who is calling or why.</p>
<p>The key difference: a developer writes code against an API. An agent selects and invokes a tool. The tool is the API made legible to a model.</p>
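<p>As a sketch of that legibility layer (the tool name, schema shape, and in-memory backend below are all illustrative, not a real ticketing API), a tool is a model-readable schema plus a thin wrapper around the raw interface:</p>

```python
# Hypothetical sketch: a "search_tickets" tool wrapping a ticketing API.
# The backend is stubbed in memory; a real system would issue an HTTP call.

SEARCH_TICKETS_SCHEMA = {
    "name": "search_tickets",
    "description": "Search support tickets by keyword and status.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keyword to match"},
            "status": {"type": "string", "enum": ["open", "closed"]},
        },
        "required": ["query"],
    },
}

_FAKE_BACKEND = [  # stand-in for the ticketing system's search endpoint
    {"id": 1, "subject": "refund not received", "status": "open"},
    {"id": 2, "subject": "login broken", "status": "closed"},
]

def search_tickets(query: str, status: str = "open") -> list[dict]:
    """Tool implementation: the raw API made legible to a model."""
    return [t for t in _FAKE_BACKEND
            if query in t["subject"] and t["status"] == status]
```

<p>The schema is what the model reads when deciding whether and how to call the tool; the function body is the API call it never sees.</p>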
</section>
<section id="skill-the-playbook" class="level3">
<h3 class="anchored" data-anchor-id="skill-the-playbook">Skill: The Playbook</h3>
<p>A skill is a reusable method for solving a class of problems. It involves judgment, sequencing, and often multiple tools. If a tool says “send an email,” a skill says “triage the inbox, identify messages that need a response, draft replies in the right tone, and flag anything that needs human review.”</p>
<p>For the support agent, a skill might be “handle refund request”: check the order status, verify the refund policy applies, draft a response to the customer, create an internal ticket if escalation is needed, and log the outcome. That is not one action. It is a procedure with branching logic and defaults.</p>
<p>Skills are where know-how lives. A tool gives the agent something to do. A skill tells it how to do a job well.</p>
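<p>A minimal sketch of the refund skill, with every tool stubbed and all names hypothetical, shows the difference in shape: not one call, but sequencing, a branch, and an escalation default.</p>

```python
# Hypothetical sketch of the "handle refund request" skill: multiple tool
# calls, branching logic, and a default escalation path. Tools are stubbed.

def get_order_status(order_id):        # stub for an order-lookup tool
    return {"order_id": order_id, "days_since_delivery": 10}

def policy_allows_refund(order):       # stub for a policy-check tool
    return order["days_since_delivery"] <= 30

def handle_refund_request(order_id: int) -> dict:
    """Skill: judgment and sequencing around the tools, not one action."""
    order = get_order_status(order_id)
    if not policy_allows_refund(order):
        # outside policy -> escalate to a human instead of refunding
        return {"action": "escalate", "reason": "outside refund window"}
    draft = f"Your refund for order {order_id} has been approved."
    return {"action": "refund", "customer_reply": draft, "logged": True}
```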
</section>
<section id="where-mcp-fits" class="level3">
<h3 class="anchored" data-anchor-id="where-mcp-fits">Where MCP Fits</h3>
<p>MCP (Model Context Protocol) is not another rung on this ladder. It is a standard for exposing tools, resources, and context to models in a consistent way. Its job is interoperability: making capabilities discoverable and portable across environments.</p>
<p>Without MCP, every agent framework wires up tools differently. The schema formats vary, discovery is ad hoc, and switching between hosts means rewriting integrations. MCP standardizes that surface so a tool written once can be used by any compliant client.</p>
<p>MCP can also expose resources (files, database rows, live context) alongside tools, which means a model can see both what it can do and what it is working with. This is useful but distinct from the API/tool/skill question. MCP does not replace any of those layers. It makes them portable.</p>
<p>Think of it this way: APIs, tools, and skills describe what capabilities exist and at what level of abstraction. MCP describes how those capabilities are presented to models.</p>
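<p>To make the "consistent surface" concrete, here is a simplified sketch of what an MCP server advertises in response to a tool-listing request. Field names (<code>name</code>, <code>description</code>, <code>inputSchema</code>) follow the MCP convention; the actual protocol wraps this in a JSON-RPC envelope, omitted here.</p>

```python
# Simplified sketch of an MCP server's tools/list result. The real
# protocol carries this inside a JSON-RPC response; only the tool
# descriptions matter for this illustration.

tools_list_result = {
    "tools": [
        {
            "name": "search_tickets",
            "description": "Search support tickets by keyword.",
            "inputSchema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }
    ]
}

# Any compliant client can discover the tool without bespoke integration:
tool_names = [t["name"] for t in tools_list_result["tools"]]
```

<p>Because the shape is standardized, discovery is the same loop for every server a client connects to.</p>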
</section>
<section id="a-concrete-example" class="level3">
<h3 class="anchored" data-anchor-id="a-concrete-example">A Concrete Example</h3>
<p>Consider a code assistant that can read files, run tests, and create pull requests.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TB
    SKILL["&lt;b&gt;SKILL: Review and ship a bug fix&lt;/b&gt;"]

    SKILL --&gt; S1["1. Read failing test&lt;br/&gt;&lt;i&gt;judgment: which file?&lt;/i&gt;"]
    SKILL --&gt; S2["2. Identify + fix broken code&lt;br/&gt;&lt;i&gt;judgment: what change?&lt;/i&gt;"]
    SKILL --&gt; S3["3. Run test suite"]
    SKILL --&gt; S4["4. Open PR&lt;br/&gt;&lt;i&gt;judgment: description,&lt;br/&gt;reviewers, linked issue&lt;/i&gt;"]

    S1 --&gt; T1["&lt;b&gt;TOOL&lt;/b&gt;&lt;br/&gt;read_file"]
    S3 --&gt; T2["&lt;b&gt;TOOL&lt;/b&gt;&lt;br/&gt;run_tests"]
    S4 --&gt; T3["&lt;b&gt;TOOL&lt;/b&gt;&lt;br/&gt;create_pull_request"]

    T1 --&gt; A1["&lt;b&gt;API&lt;/b&gt;&lt;br/&gt;Filesystem"]
    T2 --&gt; A2["&lt;b&gt;API&lt;/b&gt;&lt;br/&gt;Test runner CLI"]
    T3 --&gt; A3["&lt;b&gt;API&lt;/b&gt;&lt;br/&gt;GitHub REST&lt;br/&gt;POST /pulls"]

    style SKILL fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style S1 fill:#e8f0fe,stroke:#4a90d9,color:#1a1a1a
    style S2 fill:#e8f0fe,stroke:#4a90d9,color:#1a1a1a
    style S3 fill:#e8f0fe,stroke:#4a90d9,color:#1a1a1a
    style S4 fill:#e8f0fe,stroke:#4a90d9,color:#1a1a1a
    style T1 fill:#5ba55b,stroke:#3d7a3d,color:#fff
    style T2 fill:#5ba55b,stroke:#3d7a3d,color:#fff
    style T3 fill:#5ba55b,stroke:#3d7a3d,color:#fff
    style A1 fill:#d98c4a,stroke:#a66a2e,color:#fff
    style A2 fill:#d98c4a,stroke:#a66a2e,color:#fff
    style A3 fill:#d98c4a,stroke:#a66a2e,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<ul>
<li>The <strong>APIs</strong> are the filesystem interface, the test runner’s CLI, and the GitHub REST API.</li>
<li>The <strong>tools</strong> are “read_file,” “run_tests,” and “create_pull_request,” each with a schema the model can parse.</li>
<li>An <strong>MCP server</strong> bundles those tools (and possibly resources like the repo’s directory tree or recent CI results) into a package any compliant model host can connect to.</li>
<li>A <strong>skill</strong> is “review and ship a bug fix”: read the failing test, identify the broken code, make the fix, run the test suite, and open a PR with a clear description. The skill uses all three tools and applies judgment at each step.</li>
</ul>
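<p>The same ladder can be sketched in code. Everything below is stubbed and the names are illustrative; real tools would hit the filesystem, a test runner, and the GitHub REST API.</p>

```python
# Hypothetical sketch of the "review and ship a bug fix" skill driving
# three stubbed tools. The judgment steps are collapsed into placeholders.

def read_file(path):                 # tool over the filesystem API
    return f"contents of {path}"

def run_tests():                     # tool over the test runner CLI
    return {"passed": True}

def create_pull_request(title):      # tool over the GitHub REST API
    return {"url": f"https://example.test/pr/{title.replace(' ', '-')}"}

def ship_bug_fix(failing_test: str) -> dict:
    """Skill: judgment (which file, what change) around reliable tools."""
    source = read_file(failing_test)            # step 1: read failing test
    fix_applied = bool(source)                  # step 2: judgment, stubbed
    if not fix_applied or not run_tests()["passed"]:   # step 3: run suite
        return {"shipped": False}
    pr = create_pull_request("fix failing test")       # step 4: open PR
    return {"shipped": True, "pr_url": pr["url"]}
```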
</section>
<section id="the-boundaries-are-not-fixed" class="level3">
<h3 class="anchored" data-anchor-id="the-boundaries-are-not-fixed">The Boundaries Are Not Fixed</h3>
<p>Yes, a skill can become a tool. This happens routinely. The “handle refund request” skill from the earlier example starts as a multi-step procedure with judgment at each point. But once the team has run it enough times, the branching logic stabilizes, the edge cases get codified, and someone wraps the whole thing in a single callable function: “process_refund.” Now it looks like a tool. The agent calls it in one shot. The orchestration still happens, just inside the wrapper rather than in the agent’s reasoning loop.</p>
<p>This is the natural lifecycle. Skills get hardened into tools. Tools get commoditized into APIs. What was once a careful, multi-step procedure becomes a reliable black box. The abstraction ladder is not a static taxonomy. It is a description of where the judgment currently lives, and that shifts as systems mature.</p>
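<p>Hardening looks like this in miniature (all names hypothetical): the former skill becomes an internal function, and what the agent sees is a single schema and a one-shot call.</p>

```python
# Hypothetical sketch: the refund skill hardened into a "process_refund"
# tool. The orchestration still happens, but inside the wrapper rather
# than in the agent's reasoning loop.

def _handle_refund_request(order_id):      # the former skill, now internal
    return {"action": "refund", "order_id": order_id}

PROCESS_REFUND_SCHEMA = {
    "name": "process_refund",
    "description": "Process a refund end to end for an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "integer"}},
        "required": ["order_id"],
    },
}

def process_refund(order_id: int) -> dict:
    """Tool: one shot from the agent's point of view."""
    return _handle_refund_request(order_id)
```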
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    SKILL["&lt;b&gt;SKILL&lt;/b&gt;&lt;br/&gt;&lt;i&gt;handle refund request&lt;/i&gt;&lt;br/&gt;judgment in the agent"]
    TOOL["&lt;b&gt;TOOL&lt;/b&gt;&lt;br/&gt;&lt;i&gt;process_refund&lt;/i&gt;&lt;br/&gt;judgment in the wrapper"]
    API["&lt;b&gt;API&lt;/b&gt;&lt;br/&gt;&lt;i&gt;POST /refunds&lt;/i&gt;&lt;br/&gt;judgment removed"]

    SKILL --&gt;|hardens into| TOOL
    TOOL --&gt;|commoditizes into| API
    API -.-&gt;|edge cases resurface| TOOL
    TOOL -.-&gt;|context demands judgment| SKILL

    style SKILL fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style TOOL fill:#5ba55b,stroke:#3d7a3d,color:#fff
    style API fill:#d98c4a,stroke:#a66a2e,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p>The movement goes the other direction too. When a tool starts failing in new contexts, or when its behavior needs to vary based on surrounding conditions, it may need to be unpacked back into a skill. A “create_pull_request” tool works fine when the agent has already done everything right. But if the PR needs a certain description format, specific reviewers based on the files changed, and a linked issue, that single tool call becomes a skill again: a sequence of checks and decisions that the agent has to reason through.</p>
<p>The practical implication: do not over-invest in classifying a capability as one or the other permanently. Instead, pay attention to where the judgment lives right now. If the agent is making multi-step decisions around a capability, that is a skill, even if it is technically implemented as a single function. If a capability can be called reliably without the agent thinking about it, that is a tool, even if it was once a complex workflow.</p>
</section>
<section id="when-you-are-building-an-agent" class="level3">
<h3 class="anchored" data-anchor-id="when-you-are-building-an-agent">When You Are Building an Agent</h3>
<p>If you are deciding what to build and in what order:</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    A["&lt;b&gt;① API&lt;/b&gt;&lt;br/&gt;raw access&lt;br/&gt;&lt;i&gt;always needed&lt;/i&gt;"]
    T["&lt;b&gt;② Tool&lt;/b&gt;&lt;br/&gt;callable action&lt;br/&gt;&lt;i&gt;always needed&lt;/i&gt;"]
    M["&lt;b&gt;③ MCP&lt;/b&gt;&lt;br/&gt;portable exposure&lt;br/&gt;&lt;i&gt;if multi-host&lt;/i&gt;"]
    S["&lt;b&gt;④ Skill&lt;/b&gt;&lt;br/&gt;orchestrated workflow&lt;br/&gt;&lt;i&gt;when patterns emerge&lt;/i&gt;"]

    A --&gt;|wrap| T
    T --&gt;|expose| M
    M --&gt;|orchestrate| S

    style A fill:#d98c4a,stroke:#a66a2e,color:#fff
    style T fill:#5ba55b,stroke:#3d7a3d,color:#fff
    style M fill:#8b5fbf,stroke:#6b3fa0,color:#fff
    style S fill:#4a90d9,stroke:#2c5f8a,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<ol type="1">
<li><strong>Start with the APIs.</strong> You need raw access to the systems your agent will interact with. If the integration does not exist, nothing above it works.</li>
<li><strong>Wrap them as tools.</strong> Give each action a clear name, a schema, and a description. This is what the model will actually call.</li>
<li><strong>If you need portability, add MCP.</strong> If your tools will be used across multiple model hosts or clients, exposing them through MCP saves you from writing bespoke integrations for each one. If you are building for a single environment, this can wait.</li>
<li><strong>Encode skills last.</strong> Once your agent can take individual actions reliably, identify the recurring workflows where sequencing and judgment matter. Package those as skills. Premature skills, built before the underlying tools are solid, tend to be brittle.</li>
</ol>
<p>The layers are complements: APIs provide access, tools provide action, MCP provides portability, and skills provide method. Most agent systems need all four, but they should be built in roughly that order.</p>
<p><em>Originally posted on <a href="https://www.linkedin.com/pulse/apis-tools-skills-where-mcp-fits-benjamin-han-bouuc">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>AI engineering</category>
  <category>agentic systems</category>
  <guid>https://benjaminhan.net/posts/20260414-api-mcp-tool-skill/</guid>
  <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>“Verified” Doesn’t Mean “Nothing Can Go Wrong”</title>
  <dc:creator>synesis</dc:creator>
  <link>https://benjaminhan.net/posts/20260413-verified-not-safe/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://kirancodes.me/posts/log-who-watches-the-watchers.html"><img src="https://benjaminhan.net/posts/20260413-verified-not-safe/lean-zip.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
<figcaption>Illustration from Kiran’s post.</figcaption>
</figure>
</div>
<p>Formal verification of software is powerful, but this is a good reminder that “verified” never means “nothing can go wrong.” In this case, a Claude agent using AFL++ and other testing tools found a buffer overflow in the Lean runtime itself, even though the application logic is formally verified in Lean [1].</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] Kiran. “Log: Who Watches the Watchers.” <em>kirancodes.me</em>. <a href="https://kirancodes.me/posts/log-who-watches-the-watchers.html" class="uri">https://kirancodes.me/posts/log-who-watches-the-watchers.html</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_why-verification-didnt-catch-these-bugs-activity-7449621397611667456-1YMC">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>security</category>
  <category>links</category>
  <category>Claude</category>
  <category>Lean</category>
  <category>formal verification</category>
  <category>software engineering</category>
  <guid>https://benjaminhan.net/posts/20260413-verified-not-safe/</guid>
  <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260413-verified-not-safe/lean-zip.png" medium="image" type="image/png" height="84" width="144"/>
</item>
<item>
  <title>CoachGPT, in Practice</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260412-coachgpt-marathon/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.bloomberg.com/news/features/2026-04-10/training-for-a-marathon-with-an-ai-coach-what-worked-and-what-didn-t"><img src="https://benjaminhan.net/posts/20260412-coachgpt-marathon/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Marathon runners. Photo: Wikimedia Commons.</figcaption>
</figure>
</div>
<p>Can CoachGPT be effective as your running coach? I’ve gone through something very similar to what this Bloomberg article [1] describes: consistently interacting with my “coach” by stating my goals, sending screenshot after screenshot of my recorded running stats, and strategizing with it on rezoning my HR, training routines, and race plans.</p>
<p>In the end, what stands out most is the model’s own concluding response when the author asked how far he’d come:</p>
<blockquote class="blockquote">
<p>“A long way,” it responded. “Not because the model ‘coached’ you. And not because the technology was flawless. It wasn’t. You had to correct it, constrain it and teach it how to be useful. What feels significant here is that this stopped being a chatbot giving generic advice and became a system you were actively shaping toward a real-world outcome. Over time, you built something closer to a working instrument: goals, rules, memory, exceptions, cadence, feedback loops, failure modes, even governance language.”</p>
<p>It then widened the lens: “Far enough that an AI can participate meaningfully in a serious human project over time. Not far enough that it can be trusted without structure,” it added. “And maybe that’s the most important answer: The future is probably not AI replacing people. It’s people who learn how to build disciplined working relationships with AI outperforming those who don’t.”</p>
</blockquote>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “I Trained for the Paris Marathon Using ChatGPT.” <em>Bloomberg</em>, April 10, 2026. <a href="https://www.bloomberg.com/news/features/2026-04-10/training-for-a-marathon-with-an-ai-coach-what-worked-and-what-didn-t" class="uri">https://www.bloomberg.com/news/features/2026-04-10/training-for-a-marathon-with-an-ai-coach-what-worked-and-what-didn-t</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_i-trained-for-the-paris-marathon-using-chatgpt-activity-7449131880702058496-0KgQ">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>LLMs</category>
  <category>running</category>
  <category>personal</category>
  <category>generative AI</category>
  <category>future of work</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260412-coachgpt-marathon/</guid>
  <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260412-coachgpt-marathon/cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Why LLMs Still Stumble Over Time</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260411-llm-temporal-reasoning/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://aclanthology.org/2025.acl-long.788/"><img src="https://benjaminhan.net/posts/20260411-llm-temporal-reasoning/evolvebench-illustration.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Illustration from the EvolveBench paper <a href="#ref-10">[10]</a>.</figcaption>
</figure>
</div>
<p>Are you surprised that today’s LLMs still make so many time-related mistakes? They might mismatch a date and weekday [1], misinterpret relative expressions like “in 2 hours” or “next Monday at 1 PM” when the current time or timezone is unclear [2], or, in more complex cases, build a schedule whose event order is simply impossible [3]. The core issue is that temporal reasoning is not one skill but a bundle of them. Recent benchmarks break it into subskills such as event ordering, arithmetic, duration, and frequency, and show both large variation across categories and a substantial gap to human performance [4][5].</p>
<p>These failures also reflect how temporal reasoning depends on more than calendar math. Models often need to determine when two mentions refer to the same event and to reason jointly about temporal, causal, and subevent relations [6]. They also rely on commonsense script knowledge, since text frequently leaves the typical order or duration of everyday events unstated [7][8]. Even when stale knowledge is not the problem, temporal reasoning remains brittle: in controlled evaluations, performance changes substantially with problem structure, question type, and fact order [9]. That brittleness helps explain why models struggle when events are introduced out of chronological order and the timeline has to be reconstructed rather than read off the surface form [9].</p>
<p>The problem becomes harder still when knowledge itself changes over time. A model may retrieve the right fact yet fail to align it with the period in which it was true. EvolveBench evaluates exactly this kind of temporal awareness through historical cognition, temporally misaligned context, and invalid timestamped queries [10]. Taken together, these results suggest that temporal failures arise from the interaction of heterogeneous subskills, event-level dependencies, temporal commonsense, sensitivity to ordering, and weak grounding of facts to the correct time.</p>
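<p>The date–weekday mismatches in [1] are a useful reminder that this particular subskill is deterministic and trivially checkable outside the model. A small sketch of the kind of check one might run behind an LLM (the example date is this feed’s own build date):</p>

```python
# Checking a claimed date/weekday pairing deterministically -- the kind
# of calendar fact LLMs are known to get wrong in free-form text.
from datetime import date

def weekday_claim_holds(y: int, m: int, d: int, claimed: str) -> bool:
    # strftime("%A") gives the full weekday name, e.g. "Friday".
    return date(y, m, d).strftime("%A") == claimed

# April 17, 2026 really is a Friday.
ok = weekday_claim_holds(2026, 4, 17, "Friday")
```

<p>Grounding such micro-facts in tools rather than in the model’s text prediction is one obvious mitigation; the harder subskills above (ordering, coreference, commonsense durations) have no equivalently cheap oracle.</p>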
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] Sweenor, David. “Don’t Trust Your AI With Dates: The Calendar Blindspot.” <em>Medium</em>, 2025. <a href="https://medium.com/@davidsweenor/dont-trust-your-ai-with-dates-the-calendar-blindspot-882c7223eca0" class="uri">https://medium.com/@davidsweenor/dont-trust-your-ai-with-dates-the-calendar-blindspot-882c7223eca0</a></p>
<p><a id="ref-2"></a>[2] Pankretić, Filip. “The invisible problem. How we solved scheduling with AI.” <em>Infobip Developers Blog</em>, 2026. <a href="https://www.infobip.com/developers/blog/the-invisible-problem-how-we-solved-scheduling-with-ai" class="uri">https://www.infobip.com/developers/blog/the-invisible-problem-how-we-solved-scheduling-with-ai</a></p>
<p><a id="ref-3"></a>[3] Bansal, Tanmay. “Your Favourite LLMs Still Can’t Differentiate Between Time Like Humans.” <em>GoPubby / Medium</em>, 2026. <a href="https://ai.gopubby.com/why-llms-still-cant-perceive-time-like-humans-602adb0e9f20" class="uri">https://ai.gopubby.com/why-llms-still-cant-perceive-time-like-humans-602adb0e9f20</a></p>
<p><a id="ref-4"></a>[4] Wang, Yuqing, and Yun Zhao. “TRAM: Benchmarking Temporal Reasoning for Large Language Models.” <em>Findings of ACL 2024</em>. <a href="https://aclanthology.org/2024.findings-acl.382/" class="uri">https://aclanthology.org/2024.findings-acl.382/</a></p>
<p><a id="ref-5"></a>[5] Chu, Zheng, et al.&nbsp;“TIMEBENCH: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models.” <em>ACL 2024</em>. <a href="https://aclanthology.org/2024.acl-long.66/" class="uri">https://aclanthology.org/2024.acl-long.66/</a></p>
<p><a id="ref-6"></a>[6] Wang, Xiaozhi, et al.&nbsp;“MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction.” <em>EMNLP 2022</em>. <a href="https://aclanthology.org/2022.emnlp-main.60/" class="uri">https://aclanthology.org/2022.emnlp-main.60/</a></p>
<p><a id="ref-7"></a>[7] Regneri, Mirella, Alexander Koller, and Manfred Pinkal. “Learning Script Knowledge with Web Experiments.” <em>ACL 2010</em>. <a href="https://aclanthology.org/P10-1100/" class="uri">https://aclanthology.org/P10-1100/</a></p>
<p><a id="ref-8"></a>[8] Wenzel, Gregor, and Adam Jatowt. “An Overview of Temporal Commonsense Reasoning and Acquisition.” <em>arXiv</em>, 2023. <a href="https://arxiv.org/abs/2308.00002" class="uri">https://arxiv.org/abs/2308.00002</a></p>
<p><a id="ref-9"></a>[9] Fatemi, Bahare, et al.&nbsp;“Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning.” <em>arXiv</em>, 2024. <a href="https://arxiv.org/abs/2406.09170" class="uri">https://arxiv.org/abs/2406.09170</a></p>
<p><a id="ref-10"></a>[10] Zhu, Zhiyuan, et al.&nbsp;“EvolveBench: A Comprehensive Benchmark for Assessing Temporal Awareness in LLMs on Evolving Knowledge.” <em>ACL 2025</em>. <a href="https://aclanthology.org/2025.acl-long.788/" class="uri">https://aclanthology.org/2025.acl-long.788/</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_llms-temporal-reasoning-activity-7448985310727921664-Av2H">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>LLMs</category>
  <category>reasoning</category>
  <category>temporal reasoning</category>
  <category>commonsense</category>
  <category>evaluation</category>
  <category>research</category>
  <guid>https://benjaminhan.net/posts/20260411-llm-temporal-reasoning/</guid>
  <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260411-llm-temporal-reasoning/evolvebench-illustration.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>iPhone, Artemis II, Moon</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260406-iphone-on-artemis/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.flickr.com/photos/nasa2explore/55187189317/"><img src="https://benjaminhan.net/posts/20260406-iphone-on-artemis/earth-from-orion.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Artemis II Commander Reid Wiseman peering out of one of the Orion spacecraft’s main cabin windows, looking back at Earth as the crew travels toward the Moon. Photo taken April 4, 2026. Credit: NASA.</figcaption>
</figure>
</div>
<p>Astronauts onboard Artemis II took pictures with an iPhone 17 Pro Max [1].</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] NASA. “Home, Seen from Orion.” <em>Flickr (NASA2Explore)</em>, April 4, 2026. <a href="https://www.flickr.com/photos/nasa2explore/55187189317/" class="uri">https://www.flickr.com/photos/nasa2explore/55187189317/</a></p>
<hr>
<p><img src="https://benjaminhan.net/posts/20260406-iphone-on-artemis/screenshot1.jpg" class="img-fluid"></p>
<p><img src="https://benjaminhan.net/posts/20260406-iphone-on-artemis/screenshot2.jpg" class="img-fluid"></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_iphone-artemisii-moon-activity-7446990804394590208-06EG">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>Apple</category>
  <category>space</category>
  <category>links</category>
  <category>iPhone</category>
  <category>Artemis II</category>
  <category>moon</category>
  <category>history</category>
  <guid>https://benjaminhan.net/posts/20260406-iphone-on-artemis/</guid>
  <pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260406-iphone-on-artemis/earth-from-orion.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Filesystems vs. RAG</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260404-filesystem-vs-rag/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant"><img src="https://benjaminhan.net/posts/20260404-filesystem-vs-rag/chromafs-architecture.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>ChromaFs architecture diagram from the Mintlify post.</figcaption>
</figure>
</div>
<p>Can we replace RAG with filesystems? This article shares a neat engineering write-up on doing exactly that: replacing a standard docs RAG flow with a virtual filesystem for an assistant [1]. Instead of only retrieving top-K chunks, they let the model interact with documentation through familiar file operations like ls, cat, find, and grep. Under the hood, they built a virtual layer over their existing docs index, avoiding the latency and cost of spinning up full sandboxes while giving the assistant a more natural way to explore the corpus.</p>
<p>Optimization details aside, the article hints at two bigger ideas. First, the filesystem vs.&nbsp;RAG distinction is really about structured navigation vs.&nbsp;relevance retrieval. A filesystem presents the corpus as a pre-built navigable structure for exploration; plain RAG presents it as a retrieval space. Second, system interfaces work better when they align with the interaction patterns the base LLM already knows. For coding-oriented models especially, files, paths, directories, and shell-style operations are much closer to the environments they were trained on than a one-shot retrieve-and-insert workflow. The article explicitly framed the goal as letting the assistant explore docs the way a developer explores a codebase.</p>
<p>A few contrasts between the filesystem and RAG approaches:</p>
<ol type="1">
<li>Filesystems preserve and expose author-created structure that chunk RAG often weakens.</li>
<li>Filesystem agents are natively iterative and exploratory; plain RAG is often, but not always, more one-shot.</li>
<li>Filesystems provide a richer tool surface than retrieval alone.</li>
<li>Filesystems are better for exact search, full-page reading, and multi-page investigation.</li>
<li>Filesystems support hypothesis-driven exploration across documents: inspect one page, discover a term, search for it elsewhere, and refine understanding.</li>
<li>Filesystems preserve document integrity better by making whole pages or files accessible, rather than only isolated retrieved chunks.</li>
<li>Filesystem-based interfaces can align access control more naturally with paths, files, and directories, so the assistant’s visible world is scoped up front.</li>
</ol>
<p>But RAG still shines when the main problem is semantic recall: messy corpora, weak or ambiguous structure, heterogeneous sources or modalities, or cases where a few relevant passages are enough and full exploration would be overkill.</p>
<p>The interesting part is that these approaches are not incompatible. RAG can be one tool inside a filesystem-style agent: useful for candidate recall, while the filesystem layer handles navigation, inspection, and verification. The implementation described effectively does this by using existing indexing infrastructure to narrow search before exact filtering.</p>
<p>Takeaway: RAG retrieves likely evidence; a filesystem lets the model investigate the source material. We can combine both.</p>
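<p>To make the “tool surface” point concrete, here is a hedged sketch of a filesystem-style interface over an in-memory docs corpus. The paths, contents, and function names are invented for illustration; this is the general idea, not Mintlify’s actual implementation [1].</p>

```python
# Hedged sketch of a filesystem-style interface over a docs corpus.
# Paths and contents are invented; real systems back these operations
# with an index rather than a dict.
import fnmatch
import re

DOCS = {
    "api/auth.md": "Authentication uses bearer tokens. See rate-limits.",
    "api/rate-limits.md": "Rate limits: 100 requests/minute per token.",
    "guides/quickstart.md": "Install the CLI, then authenticate.",
}

def ls(prefix: str = "") -> list[str]:
    # Structured navigation: list paths under a directory-like prefix.
    return sorted(p for p in DOCS if p.startswith(prefix))

def cat(path: str) -> str:
    # Full-page reading preserves document integrity (no chunking).
    return DOCS[path]

def grep(pattern: str, glob: str = "*") -> list[tuple[str, str]]:
    # Exact search across files: hypothesis-driven exploration.
    return [(p, text) for p, text in DOCS.items()
            if fnmatch.fnmatch(p, glob)
            and re.search(pattern, text, re.IGNORECASE)]

# Inspect one page, discover a term, then search for it elsewhere:
hits = grep("rate.limits", "api/*")
```

<p>A hybrid agent would add a <code>search(query)</code> tool backed by embeddings for candidate recall, then use <code>cat</code> and <code>grep</code> to verify before answering.</p>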
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] Mintlify. “How we built a virtual filesystem for our Assistant.” <em>Mintlify Blog</em>, March 24, 2026. <a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant" class="uri">https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_how-we-built-a-virtual-filesystem-for-our-activity-7446351506905415680-l9ES">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>AI engineering</category>
  <category>agentic systems</category>
  <category>RAG</category>
  <category>LLMs</category>
  <category>generative AI</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260404-filesystem-vs-rag/</guid>
  <pubDate>Sat, 04 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260404-filesystem-vs-rag/chromafs-architecture.png" medium="image" type="image/png" height="86" width="144"/>
</item>
<item>
  <title>The Revenge of the Data Scientist</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260402-hamel-revenge-data-scientist/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://hamel.dev/blog/posts/revenge/"><img src="https://benjaminhan.net/posts/20260402-hamel-revenge-data-scientist/abl.jpeg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Illustration from Hamel Husain’s post.</figcaption>
</figure>
</div>
<p>Agentic systems need more than prompts and model calls. They need a surrounding machinery that helps the system observe itself and improve, including logs, metrics, traces, tests, and specifications.</p>
<p>This article by Hamel Husain [1] gives a good rundown of the key considerations:</p>
<ol type="1">
<li><p>Generic quality scores like helpfulness or hallucination are often not very useful because they do not explain what actually failed. Define narrow metrics tied to specific failure modes.</p></li>
<li><p>Inspection has to be easy and frequent so hands-on review can happen regularly and reveal patterns, not just numbers. This only works when tools are built to support that kind of inspection.</p></li>
<li><p>LLM judges are part of the system and must be validated. They should be treated like classifiers: checked against human labels, tuned carefully, and evaluated with metrics like precision and recall rather than trusted at face value.</p></li>
<li><p>Good experiments start from real behavior, not abstract test generation. Start from real logs or traces, then create synthetic examples grounded in those actual patterns and edge cases.</p></li>
<li><p>Labeling is part of product thinking, not just annotation. Keeping domain experts close to labeling helps teams refine what they truly care about, because criteria often become clearer only after reviewing real outputs.</p></li>
<li><p>Too much automation can hide the signal needed for improvement. LLMs can help with boilerplate and plumbing, but they cannot replace the human work of looking directly at failures and deciding what matters.</p></li>
</ol>
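<p>Point 3 is the most mechanical to act on: score the judge against human labels exactly as you would any binary classifier. A minimal sketch with synthetic labels (stdlib only, invented data):</p>

```python
# Validating an LLM judge against human labels, treated as a classifier.
# Labels here are synthetic, for illustration; 1 = "output is bad".

human = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # gold labels from domain experts
judge = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]   # what the LLM judge predicted

tp = sum(h == 1 and j == 1 for h, j in zip(human, judge))
fp = sum(h == 0 and j == 1 for h, j in zip(human, judge))
fn = sum(h == 1 and j == 0 for h, j in zip(human, judge))

precision = tp / (tp + fp)   # when the judge flags a failure, is it right?
recall = tp / (tp + fn)      # how many real failures does the judge catch?
```

<p>A judge with low precision floods reviewers with false alarms; one with low recall quietly misses the failure mode you built it to catch. Either way, the numbers only mean something because they are anchored to human labels.</p>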
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] Husain, Hamel. “The Revenge of the Data Scientist.” <em>Hamel’s Blog</em>. <a href="https://hamel.dev/blog/posts/revenge/" class="uri">https://hamel.dev/blog/posts/revenge/</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_the-revenge-of-the-data-scientist-hamel-activity-7445487039216508928-tBsd">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>AI engineering</category>
  <category>agentic systems</category>
  <category>LLMs</category>
  <category>generative AI</category>
  <category>hallucination</category>
  <category>evaluation</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260402-hamel-revenge-data-scientist/</guid>
  <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260402-hamel-revenge-data-scientist/abl.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Artemis II Launches on Apple’s 50th Anniversary</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260402-artemis-launch/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.nytimes.com/live/2026/04/01/science/moon-nasa-artemis-launch"><img src="https://benjaminhan.net/posts/20260402-artemis-launch/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Official Artemis II crew portrait. Photo: NASA, via Wikimedia Commons.</figcaption>
</figure>
</div>
<p>Do you know what else happened on the same day Apple celebrated its 50th anniversary? Artemis II launched, marking humanity’s first crewed journey around the Moon in more than 50 years, the first since Apollo 17 in 1972 [1].</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “Highlights From the Launch of NASA’s Artemis II Moon Mission.” <em>The New York Times</em>, April 1, 2026. <a href="https://www.nytimes.com/live/2026/04/01/science/moon-nasa-artemis-launch" class="uri">https://www.nytimes.com/live/2026/04/01/science/moon-nasa-artemis-launch</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_highlights-from-the-launch-of-nasas-artemis-activity-7445521542257635328-xj9l">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>Apple</category>
  <category>space</category>
  <category>history</category>
  <category>Artemis II</category>
  <category>moon</category>
  <category>Apollo</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260402-artemis-launch/</guid>
  <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260402-artemis-launch/cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Learning to Reason in 13 Parameters</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260331-tinylora/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://arxiv.org/abs/2602.04118"><img src="https://benjaminhan.net/posts/20260331-tinylora/figure1.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Figure 1 from the TinyLoRA paper.</figcaption>
</figure>
</div>
<p>TinyLoRA pushes low-rank adaptation down to almost nothing [1].</p>
<ul>
<li>They report that an 8B-parameter Qwen2.5 model reaches 91% on GSM8K with only 13 trained bf16 parameters, which they note is just 26 bytes of learned weights.</li>
<li>For RL-based post-training, the effective update needed to unlock better reasoning may live in a very low-dimensional subspace.</li>
<li>This works well with RL, but not nearly as well with SFT, where they say SFT needs 100–1000x larger updates to match the same gains.</li>
</ul>
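<p>For intuition about why a low-rank update can involve so few trained scalars, here is a generic rank-1 LoRA-style sketch. This is NOT TinyLoRA’s exact 13-parameter construction, just the standard <code>W' = W + α·uvᵀ</code> shape that makes the parameter counting visible.</p>

```python
# Generic rank-1 adapter sketch: W' = W + alpha * (u @ v^T).
# Illustrates the parameter counting behind LoRA-style updates; it is
# not the paper's specific 13-parameter construction [1].
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 6
W = rng.standard_normal((d_out, d_in))   # frozen base weight

# Only u, v, and the scalar alpha are trainable.
u = rng.standard_normal((d_out, 1))
v = rng.standard_normal((d_in, 1))
alpha = 0.1

W_adapted = W + alpha * (u @ v.T)        # full-shape update, rank 1

trained_params = u.size + v.size + 1     # 8 + 6 + 1 = 15 scalars
```

<p>Getting from thousands of adapter scalars down to 13 for an 8B model requires the far more aggressive tying and sharing the paper describes, but the principle is the same: the update that matters can live in a tiny subspace of the full weight space.</p>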
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “Learning to Reason in 13 Parameters.” <em>arXiv</em>. <a href="https://arxiv.org/abs/2602.04118" class="uri">https://arxiv.org/abs/2602.04118</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_learning-to-reason-in-13-parameters-activity-7444959168408559616-psHg">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>LLMs</category>
  <category>reasoning</category>
  <category>generative AI</category>
  <category>fine-tuning</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260331-tinylora/</guid>
  <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260331-tinylora/figure1.png" medium="image" type="image/png" height="113" width="144"/>
</item>
<item>
  <title>How Apple Became Apple</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260330-apple-oral-history/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.fastcompany.com/91514404/apple-founding-50th-anniversary-apple-1-apple-ii-jobs-wozniak"><img src="https://benjaminhan.net/posts/20260330-apple-oral-history/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Apple I, on display. Photo: Wikimedia Commons.</figcaption>
</figure>
</div>
<blockquote class="blockquote">
<p>“Today, as Apple turns 50, its presence in our lives is so pervasive—2.5 billion of the company’s devices are in active use—that its unlikely origin story is more resonant than ever. To tell it, I turned to the people who lived it…” [1]</p>
</blockquote>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “How Apple became Apple: The definitive oral history of the company’s earliest days.” <em>Fast Company</em>. <a href="https://www.fastcompany.com/91514404/apple-founding-50th-anniversary-apple-1-apple-ii-jobs-wozniak" class="uri">https://www.fastcompany.com/91514404/apple-founding-50th-anniversary-apple-1-apple-ii-jobs-wozniak</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_how-apple-became-apple-the-definitive-oral-activity-7444423672158851072-bLk-">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>Apple</category>
  <category>history</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260330-apple-oral-history/</guid>
  <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260330-apple-oral-history/cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Redmond City Marathon: #30, with a Sprained Ankle</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260329-redmond-city-marathon/</link>
  <description><![CDATA[ 





<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260329-redmond-city-marathon/01.jpg" class="img-fluid"></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260329-redmond-city-marathon/02.jpg" class="img-fluid"></p>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260329-redmond-city-marathon/03.jpg" class="img-fluid"></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260329-redmond-city-marathon/04.jpg" class="img-fluid"></p>
</div>
</div>
</div>
<p>Today I ran my 30th marathon (since 2024) at the Redmond City Marathon, finishing in 3:38:33 (watch time) at an 8’20”/mi pace. It also extended an unusual streak for me: 7 straight weeks of running a half marathon or longer!</p>
<p>This was my first marathon raced by HR, and that part mostly went to plan: avg/max HR of 149/159 bpm. The bigger challenge in the second half was actually not keeping my HR down, but getting it up. My legs were prematurely tired, mainly due to a very silly mistake early in the race: I stored gels in the shallow part of my vest pockets, and they went flying at the ~2.6-mile mark! I stepped onto the unpaved edge to retrieve them, and proceeded to sprain my left ankle!</p>
<p>Running on it didn’t feel too bad, and I thought I had cheated disaster. Then the sprain started to rear its ugly head around the famous mile 20. Well, it is what it is; at least I managed a strong-ish finish over the last 2+ miles!</p>
<p>Not the race I originally imagined, but sometimes sh*t happens (okay, most of the time). My next marathon is ~24 weeks out, so there is a lot of training to do, but I have the time. Onward!</p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_running-marathon-activity-7444201961870077952-Fl0o">LinkedIn</a>.</em></p>



 ]]></description>
  <category>running</category>
  <category>marathon</category>
  <category>personal</category>
  <category>race</category>
  <guid>https://benjaminhan.net/posts/20260329-redmond-city-marathon/</guid>
  <pubDate>Sun, 29 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260329-redmond-city-marathon/01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>LLM Neuroanatomy: Topping the Leaderboard Without Changing a Weight</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260324-llm-neuroanatomy/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://dnhkng.github.io/posts/rys/"><img src="https://benjaminhan.net/posts/20260324-llm-neuroanatomy/duplicated-layers.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Diagram from the dnhkng.github.io post showing the duplicated middle-layer block.</figcaption>
</figure>
</div>
<p>You can improve an LLM without any training by duplicating a specific block of middle transformer layers and running them twice during inference: the power of thinking it over twice [1].</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “LLM Neuroanatomy: How I Topped the LLM Leaderboard Without Changing a Single Weight.” <em>dnhkng.github.io</em>. <a href="https://dnhkng.github.io/posts/rys/" class="uri">https://dnhkng.github.io/posts/rys/</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_llm-neuroanatomy-how-i-topped-the-llm-leaderboard-activity-7442232550196117504-bASa">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>LLMs</category>
  <category>inference</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260324-llm-neuroanatomy/</guid>
  <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260324-llm-neuroanatomy/duplicated-layers.png" medium="image" type="image/png" height="129" width="144"/>
</item>
<item>
  <title>Snowflake Cortex AI Escapes Sandbox and Executes Malware</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260319-snowflake-cortex-sandbox-escape/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.promptarmor.com/resources/snowflake-ai-escapes-sandbox-and-executes-malware"><img src="https://benjaminhan.net/posts/20260319-snowflake-cortex-sandbox-escape/cover.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Cover image from the PromptArmor write-up.</figcaption>
</figure>
</div>
<p>A good reminder that agent safety is not just about having a sandbox.</p>
<p>Snowflake recently patched an indirect prompt injection flaw in its Cortex Code CLI: untrusted content, such as a repo README, could trigger malicious shell commands, bypass user approval, and run outside the sandbox [1].</p>
<p>The deeper lesson is that safety can fail in two places at once: incomplete command validation and weak observability across agent layers. If a lower-level agent can act while the top-level agent thinks it only detected risk, the system is not actually in control.</p>
<p>Multi-agent systems need recursive validation, strong isolation, and end-to-end action visibility.</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “Snowflake Cortex AI Escapes Sandbox and Executes Malware.” <em>PromptArmor</em>. <a href="https://www.promptarmor.com/resources/snowflake-ai-escapes-sandbox-and-executes-malware" class="uri">https://www.promptarmor.com/resources/snowflake-ai-escapes-sandbox-and-executes-malware</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_snowflake-cortex-ai-escapes-sandbox-and-executes-activity-7440421921609048064-U2Pp">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>agentic systems</category>
  <category>security</category>
  <category>AI safety</category>
  <category>prompt injection</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260319-snowflake-cortex-sandbox-escape/</guid>
  <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260319-snowflake-cortex-sandbox-escape/cover.png" medium="image" type="image/png" height="75" width="144"/>
</item>
<item>
  <title>Language Puzzles, NACLO, and a Note of Thanks</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260319-language-puzzles/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.scientificamerican.com/article/try-these-language-puzzles-north-americas-biggest-linguistics-competition/"><img src="https://benjaminhan.net/posts/20260319-language-puzzles/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Cover image from the Scientific American article.</figcaption>
</figure>
</div>
<p>This Scientific American article on the North American Computational Linguistics Open Competition (NACLO) highlighted both the people behind it and the impact it has had over the years [1]. Deep appreciation to Lori Levin, Dragomir Radev, Tom Payne, James Pustejovsky, and Tanya Korelsky, whose founding work helped create a competition that has introduced so many students to linguistics, AI, and language preservation through the joy of solving language puzzles. Lori in particular helped guide me into the wonderful world of linguistics back in my grad school days, and her class inspired me to write my first computational linguistics paper [2], motivated by the idea that field linguists should not have to rely on shoeboxes full of note cards and lexical slips to collect and analyze language data!</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “Try These Language Puzzles from North America’s Biggest Linguistics Competition.” <em>Scientific American</em>. <a href="https://www.scientificamerican.com/article/try-these-language-puzzles-north-americas-biggest-linguistics-competition/" class="uri">https://www.scientificamerican.com/article/try-these-language-puzzles-north-americas-biggest-linguistics-competition/</a></p>
<p><a id="ref-2"></a>[2] Han, Benjamin. “Building a Bilingual Dictionary with Scarce Resources: A Genetic Algorithm Approach.” <em>Proceedings of the Student Research Workshop, Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001)</em>. <a href="https://www.cs.cmu.edu/~benhdj/Publications/Published/bhan_naccl_2001.pdf" class="uri">https://www.cs.cmu.edu/~benhdj/Publications/Published/bhan_naccl_2001.pdf</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_can-you-solve-these-language-puzzles-test-activity-7440446106678796288-QERC">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>linguistics</category>
  <category>AI</category>
  <category>NACLO</category>
  <category>computational linguistics</category>
  <category>education</category>
  <category>personal</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260319-language-puzzles/</guid>
  <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260319-language-puzzles/cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Journey into Coding with AI [3/4]: Decision-Bound Programming</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260315-ai-coding-decision-bound/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.linkedin.com/posts/benjaminhan_ai-coding-programming-activity-7439035461278482433-see_"><img src="https://benjaminhan.net/posts/20260315-ai-coding-decision-bound/post-attachment.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Illustration from the original post.</figcaption>
</figure>
</div>
<p>AI is shifting programming from execution-bound work to decision-bound work.</p>
<section id="generation-is-cheap-evaluation-is-not" class="level3">
<h3 class="anchored" data-anchor-id="generation-is-cheap-evaluation-is-not">Generation Is Cheap, Evaluation Is Not</h3>
<p>Many developers describe AI coding tools as both powerful and exhausting. Recent empirical work offers a plausible explanation. In a randomized controlled experiment, METR asked experienced developers to solve issues in familiar repositories, with and without AI tools [1]. On average, developers took about 19% longer when AI assistance was allowed, even though they believed they were faster. Reuters’ coverage notes that much of the extra time went into prompting, reviewing generated code, and correcting partially correct outputs [2]. These findings suggest that AI lowers the cost of producing candidate code while increasing the effort required to evaluate it.</p>
<p>Once generation becomes cheap, the structure of development changes. Developers can quickly produce implementations, refactorings, or architectural variations. Not every developer will explore every branch, but the number of plausible alternatives grows. Each candidate must then be interpreted, validated, and compared before it can be trusted. Programming becomes less about executing a plan and more about evaluating possibilities.</p>
</section>
<section id="heuristics-and-context-switching" class="level3">
<h3 class="anchored" data-anchor-id="heuristics-and-context-switching">Heuristics and Context Switching</h3>
<p>The theory of bounded rationality [3] suggests that as the number of alternatives grows, people can no longer evaluate every option fully and instead rely on heuristics to reach satisfactory decisions. AI-assisted coding can increase the number of candidate solutions that must be screened and compared, helping explain why developers may perceive higher mental effort even when code is generated faster.</p>
<p>A second burden is context switching. Research on programming interruptions has long shown that rebuilding context is expensive [4]. More recent AI-specific work strengthens this point. The 2026 EditFlow paper reports that 68.81% of code-edit recommendations disrupted developers’ mental flow [5]. A separate five-day field study found that proactive AI suggestions worked better after commits than mid-task [6].</p>
<p>Recent studies also report higher perceived cognitive load during AI-assisted development [7]. Related work on user mental models in AI-driven code completion found that developers want better timing, display, granularity, and explanation [8].</p>
</section>
<section id="toward-decision-support" class="level3">
<h3 class="anchored" data-anchor-id="toward-decision-support">Toward Decision Support</h3>
<p>Taken together, the evidence suggests a broader pattern: AI accelerates code generation, but it can also increase interpretation, comparison, judgment, and context management during development. The bottleneck does not disappear. It moves. This means programming environments should evolve beyond code generation toward decision support. They should reduce unnecessary branching, summarize differences between alternatives, surface trade-offs, highlight risks, and make the implications of choices explicit.</p>
<p>In other words, the next generation of programming tools should be decision support systems, not just code generators.</p>
<p>(Part 2: <a href="../20250906-coding-with-ai-2-shifting-gears/">Shifting Gears</a>)</p>
<hr>
<p><strong>Continue the series:</strong> <a href="../20250905-coding-with-ai-1-running-back-to-code/">← Part 1: Running Back to Code</a> · <a href="../20250906-coding-with-ai-2-shifting-gears/">← Part 2: Shifting Gears</a></p>
<hr>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] METR. “Early-2025 AI on OSS Dev Productivity.” 2025. <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" class="uri">https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/</a></p>
<p><a id="ref-2"></a>[2] “AI slows down some experienced software developers, study finds.” <em>Reuters</em>, 2025. <a href="https://www.reuters.com/business/ai-slows-down-some-experienced-software-developers-study-finds-2025-07-10/" class="uri">https://www.reuters.com/business/ai-slows-down-some-experienced-software-developers-study-finds-2025-07-10/</a></p>
<p><a id="ref-3"></a>[3] Simon, Herbert A. <em>Models of Man</em>. 1957. (See also Klahr, 2004: <a href="https://www.cmu.edu/dietrich/psychology/pdf/klahr/PDFs/klahr%202004.pdf" class="uri">https://www.cmu.edu/dietrich/psychology/pdf/klahr/PDFs/klahr%202004.pdf</a>)</p>
<p><a id="ref-4"></a>[4] Parnin, Chris, and Spencer Rugaber. “Resumption Strategies for Interrupted Programming Tasks.” <em>Software Quality Journal</em>, 2011. <a href="https://chrisparnin.me/pdf/parnin-sqj11.pdf" class="uri">https://chrisparnin.me/pdf/parnin-sqj11.pdf</a></p>
<p><a id="ref-5"></a>[5] Liu, et al.&nbsp;“EditFlow.” <em>arXiv</em>, 2026. <a href="https://arxiv.org/abs/2602.21697" class="uri">https://arxiv.org/abs/2602.21697</a></p>
<p><a id="ref-6"></a>[6] Kuo, et al.&nbsp;“Proactive AI Field Study.” <em>arXiv</em>, 2026. <a href="https://arxiv.org/abs/2601.10253" class="uri">https://arxiv.org/abs/2601.10253</a></p>
<p><a id="ref-7"></a>[7] Brandebusemeyer, et al.&nbsp;“GenAI Mixed-Methods Field Study.” <em>arXiv</em>, 2025. <a href="https://arxiv.org/abs/2512.19926" class="uri">https://arxiv.org/abs/2512.19926</a></p>
<p><a id="ref-8"></a>[8] Desolda, et al.&nbsp;“Mental Models in AI Code Completion.” <em>arXiv</em>, 2025. <a href="https://arxiv.org/abs/2502.02194" class="uri">https://arxiv.org/abs/2502.02194</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_ai-coding-programming-activity-7439035461278482433-see_">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>coding</category>
  <category>software engineering</category>
  <category>AI engineering</category>
  <category>journey series</category>
  <guid>https://benjaminhan.net/posts/20260315-ai-coding-decision-bound/</guid>
  <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260315-ai-coding-decision-bound/post-attachment.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Knuth’s Hamiltonian Cycles, Solved by Claude</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260308-knuth-hamiltonian-claude/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.linkedin.com/posts/benjaminhan_mathematica-math-genai-activity-7436552626042716160-qHiQ"><img src="https://benjaminhan.net/posts/20260308-knuth-hamiltonian-claude/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Cover image from the original LinkedIn post.</figcaption>
</figure>
</div>
<p>Donald Knuth found out that the Hamiltonian cycle decomposition problem he came up with while writing <em>The Art of Computer Programming</em> was solved by Claude Opus 4.6 [1]. Perhaps Mathematica will integrate with Claude now, alongside ChatGPT?</p>
<p>(The reverse integration already exists.)</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] Knuth, Donald E. <em>Claude Cycles</em> (note). Stanford CS. <a href="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf" class="uri">https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_mathematica-math-genai-activity-7436552626042716160-qHiQ">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>AI</category>
  <category>mathematics</category>
  <category>Mathematica</category>
  <category>Claude</category>
  <category>generative AI</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260308-knuth-hamiltonian-claude/</guid>
  <pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260308-knuth-hamiltonian-claude/cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Woodinville Half: HM #76, the Comeback</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260308-woodinville-half-marathon/</link>
  <description><![CDATA[ 





<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/01.jpg" class="img-fluid"></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/02.jpg" class="img-fluid"></p>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/03.jpg" class="img-fluid"></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/04.jpg" class="img-fluid"></p>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/05.jpg" class="img-fluid"></p>
</div>
</div>
</div>
<p>I ran my HM #76 today (since 2023) at the Woodinville Half. Chip time matched my watch: 1:44:20 (7:57/mi). I finished 58/554 overall, 45/216 men, 5/29 AG. Felt like I still had gas left at the end!</p>
<p>Course surprise: the gravel section on the Eastrail wasn’t where the map suggested. I expected it around mile ~2, but they changed the ordering so it ran from mile 8 to 11. Rain left puddles, so we had to run on rougher gravel parts to avoid them, which cost more pace. Those miles were a bit over 8:00/mi, but the rest were sub-8, and I finished faster than I started!</p>
<p>Weather was rainy and cold (felt like ~40°F at the start with ~9 mph wind). Also, today was the DST change, so we lost an hour of sleep.</p>
<p>Pretty happy with this one, even though it’s not a PR (1:39:07). A month or so ago I was at rock bottom after the <a href="../20260112-bridle-trails-50k/">disastrous 50K trail race in January</a> and a cough that lasted over a month. Today felt like I finally made it back: VO₂max recovered from 45.9 to 51.5 since February, and my &lt;130 bpm easy pace improved from 10:30/mi to 9:15/mi.</p>
<p>Happy International Women’s Day!</p>
<hr>
<p>A few weeks before the race I scouted the course on foot. Here’s the preview run on YouTube (not the race itself):</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://youtu.be/vmOXClJ4Sy8"><img src="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/video-preview.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Woodinville Half preview run on YouTube (not the race itself).</figcaption>
</figure>
</div>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_running-activity-7436545314745987072-2-XY">LinkedIn</a>.</em></p>



 ]]></description>
  <category>running</category>
  <category>half marathon</category>
  <category>video</category>
  <category>personal</category>
  <category>race</category>
  <guid>https://benjaminhan.net/posts/20260308-woodinville-half-marathon/</guid>
  <pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260308-woodinville-half-marathon/01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>iPhone and iPad Approved for Classified NATO Information</title>
  <dc:creator>synesis </dc:creator>
  <link>https://benjaminhan.net/posts/20260226-iphone-ipad-nato-classified/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.apple.com/newsroom/2026/02/iphone-and-ipad-approved-to-handle-classified-nato-information/"><img src="https://benjaminhan.net/posts/20260226-iphone-ipad-nato-classified/cover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Cover image from the Apple Newsroom announcement.</figcaption>
</figure>
</div>
<p>The first and only consumer devices to meet these international government security standards [1].</p>
<hr>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><a id="ref-1"></a>[1] “iPhone and iPad approved to handle classified NATO information.” <em>Apple Newsroom</em>, February 2026. <a href="https://www.apple.com/newsroom/2026/02/iphone-and-ipad-approved-to-handle-classified-nato-information/" class="uri">https://www.apple.com/newsroom/2026/02/iphone-and-ipad-approved-to-handle-classified-nato-information/</a></p>
<p><em>Originally posted on <a href="https://www.linkedin.com/posts/benjaminhan_iphone-and-ipad-approved-to-handle-classified-activity-7433020333361893376-1Gso">LinkedIn</a>.</em></p>


</section>

 ]]></description>
  <category>Apple</category>
  <category>iPhone</category>
  <category>iPad</category>
  <category>security</category>
  <category>links</category>
  <guid>https://benjaminhan.net/posts/20260226-iphone-ipad-nato-classified/</guid>
  <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://benjaminhan.net/posts/20260226-iphone-ipad-nato-classified/cover.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
