Extended Thinking as a Design System Debugger
Anthropic markets it as a reasoning feature. I use it as a debugger. When Claude misnames a token or mislabels a component, the scratchpad tells you exactly why.
| Criterion | What you see in thinking Pattern in the trace | Where to fix it One change at a time |
|---|---|---|
| Misread input | Claude named the wrong component | User prompt or image quality |
| Misapplied rule | Rule used on wrong element | Tighten rule scope in system prompt |
| Missing rule | Reasoned correctly under given rules | Add the missing rule |
| Wrong analysis order | Looked at design before tokens | Add explicit analysis_order block |
| Hedging | Right answer, talked itself out | Relax tone instructions |
What this guide covers
The Anthropic team showed extended thinking as a feature. I use it as a debugger. Same tool, different reason.
When Claude picks the wrong token, names a component badly, or misjudges a design, you have two choices. You can guess at why and rewrite the prompt. Or you can turn on extended thinking, read the scratchpad, find the actual misreasoning, and patch the system prompt with surgical precision.
This guide shows the second option.
What extended thinking is
Extended thinking is a mode on Claude 4 family models where Claude generates a visible reasoning trace before producing the final answer. The trace is wrapped in <thinking> tags. You can read it. You can copy from it. You can use it to figure out what Claude believes about your problem, your data, and your rules.
It is not magic. It is just Claude thinking out loud, on purpose, in a way you can read.
The debugger move
Here is the loop I run when something is wrong.
1. Capture a failure
Pick one specific case where Claude produced the wrong output. Not a vague “the audits are kind of off.” A specific bad output. Save the input, save the bad output, write down what the correct output would have been.
2. Re-run with extended thinking on
Use the same system prompt. Same user prompt. Same input. The only change is extended thinking turned on. The output now includes a reasoning trace before the final answer.
3. Read the scratchpad slowly
Look for one of these patterns:
- Misread input. Claude looked at the screenshot and identified the wrong component. The fix is upstream: better image quality, better description, or a clearer user prompt.
- Misapplied rule. Claude knew the rule but applied it to the wrong element. The fix is in the rules section: more precise scope (“apply this rule only to interactive elements”).
- Missing rule. Claude reasoned correctly under the rules you gave it, but the rules you gave it did not cover this case. The fix is to add the missing rule.
- Wrong analysis order. Claude looked at the design before reading your tokens, so it described what it saw in plain language instead of token names. The fix is to add an explicit
<analysis_order>block telling Claude what to read first. - Hedging. Claude knew the right answer but talked itself out of it because of an over-cautious tone instruction. The fix is to relax the tone block.
4. Patch one thing
Make the smallest possible change to the system prompt that addresses what you saw in the scratchpad. Do not rewrite the whole thing. One change per failure.
5. Re-run
Verify the fix on the same input. Then run on a wider sample to check you did not introduce a regression.
A worked example
I had a system prompt that asked Claude to suggest semantic tokens for a design. On one screen it picked color.text.danger for a label that was just stylistically red, not actually an error.
Without extended thinking, my first instinct was “Claude is bad at colors.” Wrong instinct.
With extended thinking on, the scratchpad said something like:
The label is red. Red is associated with danger in the system. The closest token is color.text.danger.
There is no token for red as a stylistic accent.
I will use color.text.danger.
That is not a Claude problem. That is a design system problem. There was no token for “red as accent” because the system was missing it. Claude reasoned correctly under the rules I gave it.
The fix was not in the prompt. The fix was in the design system: add a token for stylistic red, then update the system prompt with the new token.
I would not have known that without reading the scratchpad.
When extended thinking does not help
- Inputs you cannot reproduce. If the failure is intermittent, you cannot read the scratchpad for the specific bad case. Capture and pin a reproducer first.
- Output formatting issues. If Claude is producing the right answer but in the wrong shape, that is a prefill or output format problem, not a reasoning problem. Skip thinking and fix the format block instead.
- Hallucinated facts. If Claude is inventing tokens or components that do not exist, it is usually because your system prompt did not list them. Thinking will sometimes show this, but the faster fix is to inspect what you actually gave Claude.
Cost and when to turn it off
Thinking tokens count toward output cost. For debugging this is fine because you are debugging on one input at a time. For production runs (audits across hundreds of screens, token suggestions on every commit) you usually turn thinking off once the prompt is stable. The whole point of the debugging loop is to make a prompt that does not need thinking to work.
Treat thinking like console.log. You add it when something is broken. You take it out when it is shipped.
Why this changes how I debug prompts
Before extended thinking, prompt engineering felt like guessing. You tweaked, you re-ran, you tweaked again. The feedback was the final output, which is the worst possible signal because it does not tell you why.
With extended thinking, the feedback is the reasoning. You can see the model’s working. You can find the exact step where it went sideways. You can fix that step.
This is the difference between fixing a bug by reading a stack trace and fixing a bug by changing random lines until the test passes.
Source
Extended thinking as a debugging tool was raised at the end of Anthropic’s Prompting 101 session by Christian from the Applied AI team. The exact framing he used: “you can use extended thinking as a crutch for your prompt engineering. Basically you can enable this to make sure that Claude actually has time to think. It adds his thinking tags and the scratch pad. And the beauty of that is you can actually analyze that transcript to understand how Claude is going about that data.”
That single line reframed the feature for me. It is not a reasoning upgrade. It is a debugger you turn on when you need it and turn off when you do not.
Finished this lesson?
Mark it complete to track your progress through "Claude Code".
The guides alone saved me a full day of work every sprint.
- All guides, prompts, and templates
- Starter kits and templates
- New content every week
- Priority support