The Hidden Tax on Your AI: Context Windows, Skills, and the Cost of Attention

Every skill you load, every message you send, every document you attach quietly eats into the same fixed budget. Here's what happens when that budget runs out — and how to spend it wisely.

Published: May 26, 2026
Author: Hrushiekesh Kanjula Reddy
Read time: ~7 min
Category: essay

#ai #context-window #tokenization #cost-optimization #skills #productivity #developer-tools

Yesterday I added sixteen new skills to my AI workflow in a single afternoon. Deep research, email drafting, blog writing, resume tailoring — each packaged into its own set of instructions, loaded automatically whenever I start a session. By the time I closed my laptop, I had more automated capability than I'd had the entire previous year combined.

And then I started thinking about what I'd actually done.

I used to believe in a clean division of labor: humans handle the small, everyday things; AI handles the heavy lifting. But somewhere between automated job applications and hands-free portfolio updates — all without switching between my IDE, GitHub, and Vercel — that line moved. I let AI take the small things too, and I found that I liked it. What I hadn't fully reckoned with was the cost of all those instructions sitting in memory, waiting to be used.

Abstract digital art: a glowing context window filling with token streams

What You're Actually Paying For

Before anything else: every piece of text an AI processes — your message, its response, the system instructions, the skill definitions, the attached file, the conversation history — gets broken into tokens. One token is roughly four characters, or about three-quarters of a word. A thousand words costs roughly 1,300 tokens. Not a lot, on its own.

The context window is the total number of tokens the model can hold at once — its working memory for the conversation. Modern flagship models hold anywhere from 128,000 to 1,000,000 tokens. That sounds enormous until you start filling it.

And it fills faster than you'd think. Claude's own base system prompt consumes around 23,000 tokens before the conversation begins — over 11 percent of a typical context window, gone before you've typed a word. Every skill you add contributes between 2,000 and 4,400 tokens on top of that. Sixteen skills at an average of 3,000 tokens each is 48,000 tokens of overhead sitting in every single session, regardless of whether those skills ever get used.

The Skills Paradox

There's a version of this that goes badly in two directions at once.

Load fifty skills into your context and you've essentially pre-filled half your working memory with instructions. The model is still capable — but you're burning tokens on every API call just to carry the weight of capabilities you haven't touched. If you're on a paid API tier, that cost compounds with every turn of the conversation. A researcher studying LLM skill efficiency found that over 60 percent of the content in publicly available skills is non-actionable text — preamble, explanation, caveats — that costs tokens without contributing anything to the output.

Load zero skills and you lose consistency. Every session starts from scratch. The AI doesn't know how you prefer your emails written or what format your research reports should take. You compensate by explaining things repeatedly, which costs tokens anyway, just less efficiently.

The right answer is neither extreme. It's intentional curation: load the skills you'll actually invoke in a given session, not every skill you might theoretically need someday. Think of it like a browser with too many tabs — each one you're not using is still consuming memory, slowing down the ones you are.

The Invisible Quality Drain

Here's what most people never notice: even if you stay within the token limit, quality degrades long before you hit the wall.

Research from Stanford's "Lost in the Middle" paper found that models systematically under-attend to content positioned in the middle of a long context. Information at the beginning and end of a conversation gets disproportionate attention; everything in between fades. Accuracy for mid-context content dropped by 15 to 20 percentage points compared to content at the edges — not because the model forgot it, technically, but because it stopped weighing it properly.

This produces what engineers now call context rot: the slow degradation in response quality as a session grows long. The signs are subtle at first. The AI re-introduces a pattern you moved away from three messages ago. It asks you something you already answered. It starts hedging in ways it wasn't hedging before, growing verbose where it used to be precise. Left unchecked, it begins contradicting its own earlier decisions without seeming to notice.

Most users don't recognize context rot when it happens. They assume the model is having a bad moment, tweak their prompt, and push forward — which makes it worse.

Abstract art: a glowing signal degrading into noise as context fills — warm amber fading to dim red

What Developers Should Know About the Bill

For casual users, context and skills are a quality problem. For developers calling the API, they are also a money problem — and the math is unforgiving.

Token pricing varies by nearly two orders of magnitude across providers. Gemini 2.0 Flash sits at $0.08 per million input tokens. Claude Sonnet runs $3.00. Claude Opus climbs to $15.00 for input, $75.00 for output. GPT-4o lands at $2.50 input, $10.00 output. The model you pick and the context size you maintain interact multiplicatively: a naive agent that re-serializes its entire conversation history on every turn sees costs grow quadratically — not linearly — as the session extends.

The fixes are well-established, but underused.

Prompt caching is the most immediately impactful. When the static portion of your context — system instructions, skills, common reference documents — is cached, the effective per-call cost drops 50 to 90 percent. Most major providers support it as of 2026. If you're not using it, you're paying full price on content the model has seen before.

Retrieval-augmented generation beats stuffing. For large codebases, a well-tuned RAG pipeline pulling 5 to 10 relevant chunks typically outperforms shoving the entire repo into context — even when the model could technically fit it. You get better signal-to-noise, lower cost, and faster inference.

Model routing is cheap leverage. Use your flagship model for reasoning-heavy tasks, drop to a smaller model for classification, summarization, or boilerplate. A 40 to 70 percent cost reduction with no meaningful quality loss on appropriate tasks.

Safety Measures Worth Keeping

Whether you're a developer or a daily user, a few habits change the outcome significantly.

Start task-scoped sessions. A new conversation for each distinct task — not one marathon session for everything — keeps context clean and focused. The gains compound: cleaner context means less context rot means better responses means fewer follow-up messages to fix things.

Use your platform's compaction tools. In Claude, /compact summarizes the conversation history rather than carrying it verbatim, freeing window space without losing the thread. /clear resets entirely when you've switched tasks.

Keep a persistent reference file. Rather than re-explaining your preferences at the start of every session, maintain a short document — what you're working on, how you like things structured, what the AI should assume. Start each session by referencing it. You get continuity without carrying the full history.

For long sessions, test periodically. Early in a conversation, state a specific constraint. Thirty minutes later, ask something that should invoke it. If the model misses it, context rot has already started. Starting fresh with a brief summary of where you are almost always gets things back on track faster than trying to wrestle a congested window into shape.

Abstract art: two glowing paths diverging — one ordered and efficient, one tangled and degrading

The Shift I Didn't See Coming

I added sixteen skills yesterday because each one, individually, makes me more effective. I don't regret any of them. But I also now think differently about what's sitting in that context window every time I start a session — the overhead I'm carrying before I've asked a single question.

The mental model I've landed on: the context window is a finite budget, and every element in it is a spending decision. Skills, conversation history, attached documents, system instructions — all of it competes for the same attention. The goal isn't to maximize what you load. It's to load only what earns its place.

Spend wisely. Start fresh often. And pay attention to when the quality starts to drift — because by the time you notice it clearly, it's been happening for a while.

If you want to see how these ideas play out in practice — automating tasks, managing AI-assisted workflows, building tools without context bloat — the Assembly Hub project lives here. It's the most concrete place I've applied these lessons so far.

← All posts