Does extended thinking work in Claude.ai?

Yes, on Claude 4 family models when you enable it in the conversation settings. It also works on the API where you set the thinking parameter explicitly.

Does extended thinking cost more?

Yes. Thinking tokens count toward output. For debugging it is worth it because the scratchpad is what you are buying. For production runs, you usually turn it off once the prompt is stable.

Why do you call it a debugger?

Because that is what it does. When the output is wrong, the scratchpad tells you why the output is wrong. That is the same job a debugger does for code.

Extended Thinking as a Design System Debugger

What the scratchpad tells you and where to fix it

Criterion	What you see in thinking Pattern in the trace	Where to fix it One change at a time
Misread input	Claude named the wrong component	User prompt or image quality
Misapplied rule	Rule used on wrong element	Tighten rule scope in system prompt
Missing rule	Reasoned correctly under given rules	Add the missing rule
Wrong analysis order	Looked at design before tokens	Add explicit analysis_order block
Hedging	Right answer, talked itself out	Relax tone instructions

What this guide covers

The Anthropic team showed extended thinking as a feature. I use it as a debugger. Same tool, different reason.

When Claude picks the wrong token, names a component badly, or misjudges a design, you have two choices. You can guess at why and rewrite the prompt. Or you can turn on extended thinking, read the scratchpad, find the actual misreasoning, and patch the system prompt with surgical precision.

This guide shows the second option.

What extended thinking is

Extended thinking is a mode on Claude 4 family models where Claude generates a visible reasoning trace before producing the final answer. The trace is wrapped in <thinking> tags. You can read it. You can copy from it. You can use it to figure out what Claude believes about your problem, your data, and your rules.

It is not magic. It is just Claude thinking out loud, on purpose, in a way you can read.

The debugger move

Here is the loop I run when something is wrong.

1. Capture a failure

Pick one specific case where Claude produced the wrong output. Not a vague “the audits are kind of off.” A specific bad output. Save the input, save the bad output, write down what the correct output would have been.

2. Re-run with extended thinking on

Use the same system prompt. Same user prompt. Same input. The only change is extended thinking turned on. The output now includes a reasoning trace before the final answer.

3. Read the scratchpad slowly

Look for one of these patterns:

Misread input. Claude looked at the screenshot and identified the wrong component. The fix is upstream: better image quality, better description, or a clearer user prompt.
Misapplied rule. Claude knew the rule but applied it to the wrong element. The fix is in the rules section: more precise scope (“apply this rule only to interactive elements”).
Missing rule. Claude reasoned correctly under the rules you gave it, but the rules you gave it did not cover this case. The fix is to add the missing rule.
Wrong analysis order. Claude looked at the design before reading your tokens, so it described what it saw in plain language instead of token names. The fix is to add an explicit <analysis_order> block telling Claude what to read first.
Hedging. Claude knew the right answer but talked itself out of it because of an over-cautious tone instruction. The fix is to relax the tone block.

4. Patch one thing

Make the smallest possible change to the system prompt that addresses what you saw in the scratchpad. Do not rewrite the whole thing. One change per failure.

5. Re-run

Verify the fix on the same input. Then run on a wider sample to check you did not introduce a regression.

A worked example

I had a system prompt that asked Claude to suggest semantic tokens for a design. On one screen it picked color.text.danger for a label that was just stylistically red, not actually an error.

Without extended thinking, my first instinct was “Claude is bad at colors.” Wrong instinct.

With extended thinking on, the scratchpad said something like:

The label is red. Red is associated with danger in the system. The closest token is color.text.danger.
There is no token for red as a stylistic accent.
I will use color.text.danger.

That is not a Claude problem. That is a design system problem. There was no token for “red as accent” because the system was missing it. Claude reasoned correctly under the rules I gave it.

The fix was not in the prompt. The fix was in the design system: add a token for stylistic red, then update the system prompt with the new token.

I would not have known that without reading the scratchpad.

When extended thinking does not help

Inputs you cannot reproduce. If the failure is intermittent, you cannot read the scratchpad for the specific bad case. Capture and pin a reproducer first.
Output formatting issues. If Claude is producing the right answer but in the wrong shape, that is a prefill or output format problem, not a reasoning problem. Skip thinking and fix the format block instead.
Hallucinated facts. If Claude is inventing tokens or components that do not exist, it is usually because your system prompt did not list them. Thinking will sometimes show this, but the faster fix is to inspect what you actually gave Claude.

Cost and when to turn it off

Thinking tokens count toward output cost. For debugging this is fine because you are debugging on one input at a time. For production runs (audits across hundreds of screens, token suggestions on every commit) you usually turn thinking off once the prompt is stable. The whole point of the debugging loop is to make a prompt that does not need thinking to work.

Treat thinking like console.log. You add it when something is broken. You take it out when it is shipped.

Why this changes how I debug prompts

Before extended thinking, prompt engineering felt like guessing. You tweaked, you re-ran, you tweaked again. The feedback was the final output, which is the worst possible signal because it does not tell you why.

With extended thinking, the feedback is the reasoning. You can see the model’s working. You can find the exact step where it went sideways. You can fix that step.

This is the difference between fixing a bug by reading a stack trace and fixing a bug by changing random lines until the test passes.

Source

Extended thinking as a debugging tool was raised at the end of Anthropic’s Prompting 101 session by Christian from the Applied AI team. The exact framing he used: “you can use extended thinking as a crutch for your prompt engineering. Basically you can enable this to make sure that Claude actually has time to think. It adds his thinking tags and the scratch pad. And the beauty of that is you can actually analyze that transcript to understand how Claude is going about that data.”

That single line reframed the feature for me. It is not a reasoning upgrade. It is a debugger you turn on when you need it and turn off when you do not.

Finished this lesson?

Mark it complete to track your progress through "Claude Code".

Lesson

9 / 12

Progress

75%

Extended Thinking as a Design System Debugger

What this guide covers

What extended thinking is

The debugger move

1. Capture a failure

2. Re-run with extended thinking on

3. Read the scratchpad slowly

4. Patch one thing

5. Re-run

A worked example

When extended thinking does not help

Cost and when to turn it off

Why this changes how I debug prompts

Source

Finished this lesson?

How to Prompt Claude to Audit a Figma Design

Codex Subagents for Design Review

What this guide covers

What extended thinking is

The debugger move

1. Capture a failure

2. Re-run with extended thinking on

3. Read the scratchpad slowly

4. Patch one thing

5. Re-run

A worked example

When extended thinking does not help

Cost and when to turn it off

Why this changes how I debug prompts

Source

Finished this lesson?

Create an account to continue

Read this next

Few-Shot Examples for Component Naming

How to Prompt Claude to Audit a Figma Design

Codex Subagents for Design Review