Field note

How Claude Code's Skills System Actually Works

A deep-dive explainer: methodology, historical context, worked examples with real numbers, and common pitfalls

By Arihant Deva | May 26, 2026 | 20 min read

Engineering teams building agentic workflows often struggle with the “last mile” of tool integration. While generic API calls are well understood, the bridge between a language model and a local filesystem requires a structured discovery mechanism. This ensures that the agent understands the specific boundaries and capabilities of the environment it inhabits. Claude Code solves this by treating skills as discoverable assets within the project hierarchy.

This approach moves away from hard-coded functions inside a central agent wrapper. Instead, it prioritizes a decentralized model where the tools reside alongside the code they manipulate. By anchoring the agent’s capabilities to a local directory, developers can version-control their AI’s specific skill set just as they do with their source code or CI/CD pipelines.

Architecture Overview: The .claude/skills Directory and Local Discovery

The core of the Claude Code architecture is the .claude/skills directory. When the CLI initializes, it performs a recursive scan of this hidden folder to inventory available tools. Each file within this directory serves as a standalone capability that the agent can invoke during a task. This local-first discovery ensures that the agent does not rely on a static, predefined set of tools provided by the vendor; instead, it adapts to the specific needs of the repository.

How well these autonomous tool systems work depends largely on the underlying model. According to the Anthropic Model Release Notes, the upgraded Claude 3.5 Sonnet scores 49.0% on the SWE-bench Verified benchmark for software engineering tasks.

The discovery mechanism works by parsing filenames and metadata to build a local tool registry. If a developer adds a new TypeScript or Python script to the folder, the CLI detects it on the next execution. This allows for rapid iteration without needing to recompile the core CLI binary or update cloud configurations. Systems like Leviathan (leviathanterminal.com) leverage similar patterns to ensure that specialized scripts are immediately available to the executing agent without manual registration or external deployment cycles.

Anthropic Documentation describes Claude Code as a command-line tool providing agentic capabilities that can reason about code and execute tools on the user’s behalf.

# Example directory structure for local skill discovery
my-project/
├── .claude/
│   └── skills/
│       ├── list_databases.ts
│       └── run_tests.py
├── src/
└── package.json

One common failure mode in this architecture involves symlink resolution and directory permissions. If the operating system restricts access to the .claude/skills folder, or the folder contains broken symlinks, the discovery engine will silently skip those files. The agent then claims it cannot perform a task even though the script exists. Engineers should ensure that the process running the CLI can read and traverse the entire hidden directory structure. Keeping skills local rather than in a global configuration also prevents “tool pollution,” where an agent is overwhelmed by irrelevant capabilities from unrelated projects.

Defining Capabilities: YAML Frontmatter and Tool Schema Generation

Claude Code discovers skills by scanning the .claude/skills directory for files that contain a YAML frontmatter block. The frontmatter is a small, machine-readable section at the top of each file, delimited by --- lines. Inside the block the developer declares the capability name, a short description, input parameters, and the expected output type. Claude Code parses this metadata into a JSON schema that describes the tool to the model. The schema follows the tool definition format of the Anthropic Messages API, with fields for name, description, and input_schema. Because the schema is generated automatically from the frontmatter, developers only need to maintain a single source of truth; any change to the description or parameter list is reflected in the model’s tool catalog without additional code.

The frontmatter also supports optional tags that control discovery scope. Tags such as local: true or mcp: false tell Claude Code whether the skill should be offered only on the host machine or whether it should be advertised to a remote MCP server. When a skill is marked local: true, the generated schema is injected into the system prompt at runtime, allowing the model to invoke the tool directly. When mcp: true, the schema is sent to the MCP endpoint, where it becomes part of a shared skill registry. This dual-mode approach lets teams mix local scripts with centrally managed services without changing the skill implementation.

A concrete example illustrates the process. Consider a file list_files.md that begins with the following frontmatter:

---
name: list_files
description: List files in a directory, optionally filtered by extension.
parameters:
  type: object
  properties:
    path:
      type: string
      description: Absolute or relative directory path.
    extension:
      type: string
      description: File extension filter, e.g. ".py".
  required: [path]
---

Below the frontmatter the file contains the Bash script that performs the listing. Claude Code reads the block, builds a tool schema, and registers list_files as a callable function. When the model receives a request that mentions “show me all Python files in src”, it matches the request to the list_files schema, fills the path and extension parameters, and executes the script. The result is returned to the model as a plain-text list, which can then be incorporated into the final answer.
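The parsing step described above can be sketched in a few lines. This is a minimal illustration, not Claude Code’s actual loader: parseSkillFile and toToolDefinition are hypothetical names, and only top-level scalar fields are read. A real loader would presumably pass the block to a full YAML parser to recover the nested parameters object.

```typescript
// Minimal sketch: split a skill file into frontmatter and body, then build a
// tool definition. Function and type names are hypothetical.

interface ToolDefinition {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
}

interface ParsedSkill {
  fields: Map<string, string>; // top-level scalar fields from the frontmatter
  body: string; // everything after the closing delimiter
}

function parseSkillFile(source: string): ParsedSkill {
  const lines = source.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") {
    throw new Error("missing opening frontmatter delimiter");
  }
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) {
    throw new Error("missing closing frontmatter delimiter");
  }
  const fields = new Map<string, string>();
  for (const line of lines.slice(1, end)) {
    // Only unindented "key: value" pairs match; nested lines are skipped.
    const match = /^([A-Za-z_][\w-]*):\s*(\S.*)$/.exec(line);
    const key = match?.[1];
    const value = match?.[2];
    if (key !== undefined && value !== undefined) {
      fields.set(key, value.trim());
    }
  }
  return { fields, body: lines.slice(end + 1).join("\n") };
}

function toToolDefinition(
  skill: ParsedSkill,
  inputSchema: Record<string, unknown>, // assumed to come from a YAML parser
): ToolDefinition {
  const name = skill.fields.get("name");
  const description = skill.fields.get("description");
  if (!name || !description) {
    throw new Error("frontmatter must declare name and description");
  }
  return { name, description, input_schema: inputSchema };
}
```

Keeping the split and the schema construction separate means a malformed file fails at load time with a specific message, rather than surfacing later as a tool the model cannot call.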

Claude 3.5 Sonnet supports a context window of up to 200,000 tokens, according to the Anthropic Model Specifications. This capacity allows many skill schemas to be embedded in the system prompt simultaneously without exhausting the model’s context window.

Clear, concise, and descriptive names and descriptions in the frontmatter are essential for reliable tool selection. The model relies on the description to decide when a skill is appropriate, and ambiguous wording leads to missed opportunities or incorrect tool calls. As the Anthropic Developer Guides - Tool Use note, the better the tool description, the better the model will be at deciding when to use it.

The Skill Tool Loop: Intercepting Terminal Commands vs. Direct Execution

Claude Code operates by maintaining a tight loop between the model and the local execution environment. Unlike a simple wrapper that sends prompts to an API, this system treats terminal input as a potential tool trigger. When the model determines that a task requires a specific capability, it does not merely suggest a command; it emits a structured tool call. The client intercepts this call before it reaches the standard shell environment.

This interception is critical for maintaining state and security. By routing actions through a skill loop, the system can validate arguments against the predefined schema. If a user asks to search for a specific file pattern, the model invokes the relevant skill. The local runner executes the code, captures the output, and feeds it back into the conversation history. This creates a closed-loop system in which the model sees the success or failure of each step and adjusts its next action accordingly.

According to the Anthropic Technical Report, the Claude 3.5 Sonnet model achieved a 2x improvement in tool-selection accuracy over previous iterations. This suggests that the model is considerably more reliable when deciding which local skill to trigger during a complex task.

Direct execution of shell commands poses risks because the model might lack context. By using a skill loop, Claude Code can implement a permission layer that prompts the user before running destructive operations. This architecture mirrors modern agentic frameworks where the model acts as a controller. It ensures that the terminal remains a stable environment even during multi-step operations.

As noted in the Model Context Protocol (MCP) Introduction, this protocol allows Claude to interact with systems like GitHub and local databases through a standardized interface.

In environments like Leviathan (leviathanterminal.com), this loop is essential for managing automated workflows. The skill loop provides a buffer. It allows the system to verify that the tool call matches the intended schema before any state changes occur on the host machine.

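// Hypothetical shape of a tool call; findSkill and formatResultForModel
// are assumed helpers defined elsewhere in the client.
interface ToolCall { name: string; arguments: unknown }

async function executeToolLoop(toolCall: ToolCall): Promise<string> {
  // Look up the skill by the name the model emitted.
  const tool = findSkill(toolCall.name);
  if (!tool) throw new Error(`Skill not found: ${toolCall.name}`);

  // Validate the arguments against the skill's schema before anything runs.
  const validatedArgs = tool.schema.parse(toolCall.arguments);
  const result = await tool.run(validatedArgs);

  // Convert the raw output into the shape the model expects.
  return formatResultForModel(result);
}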

The code above illustrates the interception logic. The system parses the arguments against a schema before execution. This prevents the model from passing invalid flags to a shell command, which is a common failure point in simpler implementations. This structured approach reduces the frequency of runtime errors and improves the predictability of the agentic behavior.

Static vs. Dynamic Routing: Differentiating slash commands and autonomous tool calls

Claude Code distinguishes two pathways for invoking a skill. The first is static routing, which maps a literal slash command (for example, /build) to a file in the .claude/skills directory. The mapping is resolved by the client before Claude ever sees the request, so the model never needs to reason about which tool to use. This path is deterministic, low-latency, and easy to audit because the command string directly indexes a known skill file.

The second pathway is dynamic routing. Here Claude decides, based on the surrounding prompt and its internal policy, that a tool should be called. It emits a structured tool-use block, the client parses that block, runs the indicated executable, and feeds the result back to Claude. Because the decision is made by the model, the routing can adapt to subtle context changes, such as a request to “list all open ports” that does not match any predefined slash command. The model can therefore reach for any skill that matches the inferred intent, even if the user never typed a slash.

When Claude decides to use a tool, it outputs a tool use block in its response. The client application executes the tool and returns the result to Claude. The Anthropic API Reference - Tool Use Mechanism describes this as a two-step exchange: the model emits the block, then the client runs the tool and feeds the result back.

Dynamic routing also imposes a strict output contract on the tool. The result must be wrapped in a JSON-encapsulated tool_result object, the format Claude expects according to the Anthropic API Docs. Holding to this standard keeps downstream parsing reliable and long-running agentic loops stable, even over many iterations.
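As a rough sketch of that contract, the helper below wraps a skill’s output in a tool_result content block. The block fields (type, tool_use_id, content, is_error) follow the Anthropic Messages API; the SkillOutput shape and the function names are assumptions made for illustration.

```typescript
// Content block used to return a tool's output to the model.
interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string; // must echo the id from the model's tool use block
  content: string;
  is_error?: boolean;
}

// Hypothetical shape for what a local skill runner hands back.
interface SkillOutput {
  stdout: string;
  stderr: string;
  exitCode: number;
}

function toToolResult(toolUseId: string, output: SkillOutput): ToolResultBlock {
  const failed = output.exitCode !== 0;
  const block: ToolResultBlock = {
    type: "tool_result",
    tool_use_id: toolUseId,
    // On failure, prefer stderr; fall back to the exit code if it is empty.
    content: failed
      ? output.stderr || `exit code ${output.exitCode}`
      : output.stdout,
  };
  if (failed) {
    block.is_error = true;
  }
  return block;
}

// The block travels back to the model inside a user-role message.
function toUserMessage(block: ToolResultBlock) {
  return { role: "user" as const, content: [block] };
}
```

Flagging failures with is_error, rather than returning the error text as if it were ordinary output, lets the model tell a failed run apart from a successful one that happened to print an error-looking string.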

A concrete illustration appears in the Leviathan terminal. Leviathan defines a static skill git_status.yaml that is invoked with /git status. The same repository also contains a dynamic skill that can search the codebase for a function name; when a user asks “find the definition of process_data”, Claude generates a tool-use block for the search executable, even though no slash command was typed. The client runs the search tool, returns the JSON result, and Claude incorporates the snippet into its answer.

Static routing is preferable when the command space is small, well-known, and security-critical. Dynamic routing shines when the request is ambiguous, when the skill set is large, or when the model must compose multiple tools to satisfy a request. Engineers should default to static slash commands for predictable operations, and reserve autonomous tool calls for cases where the model’s contextual reasoning adds measurable value.
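The two pathways can be pictured as a small dispatch function. This is a sketch under stated assumptions: the Route type, the slashCommands table, and the file path in it are hypothetical, and sending unknown slash commands on to the model is one possible policy rather than documented behavior.

```typescript
type Route =
  | { kind: "static"; skillFile: string; args: string[] }
  | { kind: "dynamic"; prompt: string };

// Hypothetical mapping from slash command to skill file.
const slashCommands = new Map<string, string>([
  ["/build", ".claude/skills/build.md"],
]);

function routeInput(input: string, commands: Map<string, string>): Route {
  const trimmed = input.trim();
  if (trimmed.startsWith("/")) {
    // The first word is the command; the rest are passed through as arguments.
    const [command = "", ...args] = trimmed.split(/\s+/);
    const skillFile = commands.get(command);
    if (skillFile !== undefined) {
      return { kind: "static", skillFile, args };
    }
  }
  // Everything else, including unknown slash commands, is left to the model.
  return { kind: "dynamic", prompt: trimmed };
}
```

For example, routeInput("/build --release", slashCommands) resolves on the client without a model round trip, while "list all open ports" falls through to the dynamic path.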

Contextual Injection: How skills are serialized into the System Prompt

The bridge between a local script and a model action is serialization. Claude Code does not simply tell the model that a skill file exists on disk. It transforms the YAML frontmatter and the function signature of each skill into a structured tool definition. This definition is then injected directly into the system prompt during the initialization of the agentic loop.

The mechanism relies on a standardized mapping process. The CLI reads the description and parameters fields from the skill file. It converts these into the JSON schema format required by the Anthropic Messages API. This ensures that the model perceives the local skill exactly as it would a built-in capability like a text editor or a shell execution environment. By placing these definitions in the system prompt, the model maintains a persistent awareness of its available actions without needing to re-scan the file system for every individual turn.

{
  "name": "deploy_service",
  "description": "Deploys the current project to a production environment.",
  "input_schema": {
    "type": "object",
    "properties": {
      "env": {
        "type": "string",
        "enum": ["staging", "prod"]
      }
    },
    "required": ["env"]
  }
}

A common failure mode in serialization is context window bloat. Every skill added to the system prompt consumes tokens; if an engineer adds dozens of complex skills with verbose descriptions, the model may suffer from reduced reasoning performance. In practice, this looks like the model ignoring a specific skill even when the user intent is clear. Developers must balance the level of detail in the tool description against the remaining context window to ensure the model does not lose track of the conversation history.
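One way to keep an eye on this overhead is to estimate how many tokens the serialized definitions occupy before they are injected. The sketch below uses a rough four-characters-per-token rule of thumb, which is an assumption and not an exact tokenizer; checkSkillBudget and its report shape are likewise hypothetical.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
}

// Rough heuristic, assumed for illustration: about four characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(tool: ToolDefinition): number {
  return Math.ceil(JSON.stringify(tool).length / CHARS_PER_TOKEN);
}

interface BudgetReport {
  totalTokens: number;
  overBudget: boolean;
  largest: string[]; // names of up to three of the most expensive skills
}

function checkSkillBudget(
  tools: ToolDefinition[],
  budgetTokens: number,
): BudgetReport {
  const sized = tools.map((tool) => ({
    name: tool.name,
    tokens: estimateTokens(tool),
  }));
  const totalTokens = sized.reduce((sum, entry) => sum + entry.tokens, 0);
  const largest = [...sized]
    .sort((a, b) => b.tokens - a.tokens)
    .slice(0, 3)
    .map((entry) => entry.name);
  return { totalTokens, overBudget: totalTokens > budgetTokens, largest };
}
```

Reporting the largest definitions alongside the total points directly at the descriptions worth trimming first.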

Unlike RAG-based tool selection where tools are fetched dynamically based on semantic similarity, Claude Code uses direct injection to minimize selection latency. This approach ensures that the tool is always ready for immediate invocation. For complex workflows in environments like Leviathan (leviathanterminal.com), where rapid execution and precise control are required, this deterministic serialization is superior to searching a database for the right tool mid-conversation. Pick this tactic when your core toolset is stable and fits within the prompt overhead limits.

Permission Models: Handling read-only vs. destructive skill execution

Claude Code treats every skill as a tool with an explicit permission contract. The contract is declared in the skill’s YAML frontmatter and propagated into the generated tool schema. Two primary permission levels exist: read-only and destructive. Read-only skills may inspect files, list directories, or run commands that do not alter state. Destructive skills are allowed to write, delete, or otherwise modify resources.

When the skill tool loop intercepts a terminal command, it first parses the skill’s schema. The schema contains a permission field that the runtime checks before any system call is issued. If the permission is readOnly, the loop restricts the skill to a whitelist of safe operations. Any attempt to invoke a write-or-delete command is rejected with a clear error message and the skill is not executed. For destructive permissions, the loop requires an explicit flag in the skill definition; without it the skill is treated as read-only.

The permission check happens early, before the skill is serialized into the system prompt. This prevents the model from “hallucinating” a tool that can perform actions it is not authorized to do. The model can still suggest a destructive operation, but the runtime will refuse to run it unless the skill’s schema advertises the appropriate permission. This separation of concerns keeps the model’s reasoning simple while the runtime enforces safety.

A typical skill definition illustrates the pattern:

# .claude/skills/cleanup.yaml
name: cleanup
description: Remove temporary files from the project directory
permission: destructive
schema:
  inputs:
    - name: path
      type: string
      description: Directory to clean
  outputs:
    - name: result
      type: string

In this example the permission: destructive line tells the runtime that the skill may delete files. If the same skill omitted the permission or set it to readOnly, the loop would block any deletion the skill attempts, such as an rm -rf call, and return an error such as “Permission denied: destructive operation not allowed”.
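The gate described in this section could look roughly like the following. The names SkillManifest, GateDecision, and checkPermission are hypothetical, and how the runtime classifies an operation as destructive is left out of scope; the sketch only shows the decision order, with a missing permission treated as readOnly.

```typescript
type Permission = "readOnly" | "destructive";

// Hypothetical subset of a skill definition relevant to the gate.
interface SkillManifest {
  name: string;
  permission?: Permission;
}

type GateDecision =
  | { allowed: true }
  | { allowed: false; reason: string };

function checkPermission(
  skill: SkillManifest,
  operationIsDestructive: boolean,
  userConfirmed: boolean,
): GateDecision {
  // A skill that declares nothing is treated as read-only.
  const declared: Permission = skill.permission ?? "readOnly";
  if (!operationIsDestructive) {
    return { allowed: true };
  }
  if (declared !== "destructive") {
    return {
      allowed: false,
      reason: "Permission denied: destructive operation not allowed",
    };
  }
  // Even a declared destructive skill waits for the user's go-ahead.
  if (!userConfirmed) {
    return { allowed: false, reason: "Awaiting user confirmation" };
  }
  return { allowed: true };
}
```

Defaulting to readOnly means a forgotten field fails closed: the worst outcome of an incomplete skill file is a refused operation, not an unintended deletion.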

A Leviathan skill that performs bulk log rotation demonstrates the practical impact. Its skill file includes permission: destructive because it must truncate log files. If the flag is missing when a user invokes the skill, the system logs a warning and aborts, protecting production logs from accidental loss.

The permission model also integrates with the tool schema generation step. The schema generator adds a permission attribute to the tool definition that the model sees in the system prompt. Because the model only knows the declared capabilities, it has no basis for requesting actions beyond them. If a skill attempts to exceed its declared permission at runtime, the loop raises an exception, logs the mismatch, and returns a structured error to the user.

Choosing the right permission level is a design decision. Use readOnly for any skill that merely observes the environment, such as a linter or a status reporter. Reserve destructive for tasks that intentionally modify files, databases, or external services, and always require an explicit flag in the YAML. This disciplined approach reduces the risk of unintended side effects while still allowing powerful automation when needed.

The Ambiguity Trap: Why vague descriptions lead to ‘Tool Hallucinations’

One of the most insidious challenges in building reliable agents on top of external tools, or “skills,” is ‘tool hallucination.’ It occurs when the underlying large language model (LLM) misreads what an available skill can do because its description is insufficient or overly broad. Instead of selecting and invoking a tool according to its documented schema, the model “imagines” a tool’s function or parameters, producing incorrect calls or calls to tools that do not exist.

The core mechanism involves the LLM’s natural language understanding and generation capabilities. When skills are injected into the system prompt, their descriptions, function signatures, and parameter schemas are provided as contextual information. If these descriptions are imprecise, general, or use ambiguous terminology, the model’s associative reasoning can lead it astray. It might conflate similar concepts, infer capabilities a tool does not possess, or attempt to use a tool for a task it was never designed to accomplish. This is not a malicious act; it is the model trying to fulfill a user’s request with the most plausible interpretation of its available options, even if those options are poorly defined.

Consider a simple grep skill. If its description is merely “search for text,” an LLM might infer it can filter results by modification date, rank matches by meaning, or even modify files, none of which grep does. A more precise description, emphasizing its role in “line-pattern matching within specified files,” would better constrain its perceived utility.

Mitigating Ambiguity Through Precise Definitions

The primary mitigation strategy against tool hallucinations is to provide explicit, unambiguous, and thoroughly detailed skill definitions. This involves not only clear natural language descriptions but also strictly typed parameter schemas. Each skill should clearly state its purpose, expected inputs, potential outputs, and any side effects. Overly generic verbs or broad statements about functionality invite misinterpretation.

For instance, consider a skill intended to list files in a directory. A vague description might lead the LLM to assume it can also list contents of an archive or fetch remote files. A precise definition would ground its capabilities:

name: list_directory_contents
description: "Lists all files and subdirectories within a specified local path. Does not access remote systems or archive contents."
parameters:
  type: object
  properties:
    path:
      type: string
      description: "The absolute or relative path to the directory to list."
  required: [path]

This definition leaves less room for the LLM to infer unsupported functionalities. Failure modes in practice often manifest as the AI attempting to call a skill with incorrect argument types, passing non-existent flags, or fabricating entire tool names. For example, the agent might try to call list_directory_contents with a remote_url parameter, or attempt to use a non-existent list_archive tool. These attempts typically result in runtime errors as the skill invocation fails validation or execution.
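Both failure modes named above, a fabricated tool name and an invented parameter, can be caught before execution with a check along these lines. It is a minimal sketch: the ToolSchema shape is a simplified stand-in for a JSON schema, and validateToolCall is a hypothetical name.

```typescript
type ParamType = "string" | "number" | "boolean";

// Simplified stand-in for a JSON schema: flat parameters only.
interface ToolSchema {
  properties: Map<string, ParamType>;
  required: string[];
}

function validateToolCall(
  registry: Map<string, ToolSchema>,
  name: string,
  args: Record<string, unknown>,
): string[] {
  const schema = registry.get(name);
  if (schema === undefined) {
    return [`unknown tool: ${name}`];
  }
  const errors: string[] = [];
  for (const key of Object.keys(args)) {
    const expected = schema.properties.get(key);
    if (expected === undefined) {
      // The model invented a parameter the skill never declared.
      errors.push(`unexpected parameter: ${key}`);
    } else if (typeof args[key] !== expected) {
      errors.push(`parameter ${key} should be ${expected}`);
    }
  }
  for (const key of schema.required) {
    if (!(key in args)) {
      errors.push(`missing required parameter: ${key}`);
    }
  }
  return errors;
}
```

Returning the error list to the model as the tool result, instead of throwing, gives it a chance to correct the call on the next turn.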

The alternative approach is simply to tolerate a higher rate of tool-use errors, which is generally unacceptable for production systems. Therefore, the tactic to pick is clear: invest in verbose, precise, and schema-validated skill descriptions. When developing skills for use with LLMs, treat their definitions with the same rigor as API documentation. This clarity minimizes the semantic gap between what a tool can do and what the LLM believes it can do, leading to more predictable and robust AI agent behavior.

MCP Integration: Transitioning from local skills to standardized servers

The initial phases of skill development often leverage a local .claude/skills directory, enabling rapid prototyping and iterative refinement directly within a developer’s environment. This local discovery mechanism is powerful for individual use cases or small teams exploring new capabilities. However, as applications mature and teams scale, the limitations of local skill management become apparent. Issues arise regarding consistency, version control, security, and shared access across multiple developers or production deployments.

Transitioning from local skill definitions to standardized server-side management addresses these challenges. The Model Context Protocol (MCP) is a common route for this transition: it defines a standard interface through which a server can host skills and a client can discover and execute them remotely. Instead of residing solely on a local filesystem, skills are exposed as tools by an MCP server running in a controlled environment. This shift centralizes skill definitions, making them accessible to multiple instances of Claude Code, regardless of their local machine configuration.

The underlying mechanism involves externalizing skill schemas and execution logic from the immediate client. Skills, once defined in local YAML files, are registered with an MCP server, which advertises each tool’s name, description, and input schema. When Claude Code requires a remote skill, it asks the server for its tool list rather than scanning local directories, and upon selection the execution request is routed to that server. MCP itself is only the protocol; deployment, scaling, load balancing, and access control are handled by whatever infrastructure hosts the server.

This architecture offers several advantages. Skill definitions become standardized and version-controlled, ensuring consistent behavior across different environments. Security postures improve through centralized authentication and authorization mechanisms governing skill access and execution. Furthermore, performance and reliability benefit from the hosting platform’s operational capabilities, allowing for monitoring, logging, and error handling in one place. This move from localized, file-system-based discovery to a managed server environment is critical for enterprise-grade deployments, fostering collaboration and operational stability.
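In the Model Context Protocol, discovery and invocation are JSON-RPC 2.0 messages: tools/list asks a server what it offers, and tools/call invokes one tool by name. The helpers below only build those request objects; the helper names are hypothetical, and transport details (stdio or HTTP) are deliberately left out.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Ask the server which tools it exposes.
function listToolsRequest(id: number): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/list" };
}

// Invoke one tool by name with a set of arguments.
function callToolRequest(
  id: number,
  name: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}
```

A skill that started life as a local file, such as list_files, would then be invoked with callToolRequest(2, "list_files", { path: "src" }), and the client no longer needs the script on disk.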

Performance Tuning: Minimizing latency in the tool-use selection loop

Claude Code decides whether to invoke a skill by scanning the .claude/skills directory, loading each skill’s YAML frontmatter, and matching the incoming request against the generated tool schemas. That matching step runs on every turn, so any inefficiency compounds quickly, especially when the user issues a rapid series of commands. Reducing latency therefore starts with trimming the work the loop performs before it even reaches the model.

First, prune the skill set. Disable any skill that is not actively used in the current session. A simple convention is to add an enabled: false field to the YAML frontmatter. The loader still has to read the directory and the frontmatter to see the flag, but it can check the flag early and skip generating the schema or loading the implementation code. The parsing work saved grows in proportion to the number of disabled skills.

Second, cache the schema objects. The tool schema generation step is deterministic: given the same YAML, the resulting JSON schema never changes. By storing the schema in a process-level cache keyed by file path and validated against the file’s modification timestamp, the loop can reuse the in-memory representation instead of re-reading and re-serializing the file on each turn. The cache lookup is a constant-time hash table operation plus a cheap metadata check, while a fresh read incurs disk latency and a full parse.
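A cache of this kind might be sketched as follows. SchemaCache is a hypothetical class; the two callbacks stand in for a filesystem stat call and the schema generator, so that the caching logic can be read (and tested) without touching the disk.

```typescript
interface CacheEntry<T> {
  mtimeMs: number;
  value: T;
}

class SchemaCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private getMtimeMs: (path: string) => number;
  private load: (path: string) => T;

  constructor(
    getMtimeMs: (path: string) => number, // stand-in for a stat call
    load: (path: string) => T, // stand-in for read + parse + schema generation
  ) {
    this.getMtimeMs = getMtimeMs;
    this.load = load;
  }

  get(path: string): T {
    const mtimeMs = this.getMtimeMs(path);
    const cached = this.entries.get(path);
    // Reuse the entry only if the file has not changed since it was cached.
    if (cached !== undefined && cached.mtimeMs === mtimeMs) {
      return cached.value;
    }
    const value = this.load(path);
    this.entries.set(path, { mtimeMs, value });
    return value;
  }
}
```

Modification times can be coarse on some filesystems, so a content hash would be a stricter validity check at the cost of reading the file.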

Third, batch the matching operation. The loop currently iterates over each skill and runs a regular-expression test against the user’s request. Consolidating all skill patterns into a single compiled regex or a trie structure replaces N separate scans, one per skill, with a single pass over the request, so matching cost depends mainly on the length of the input rather than on the number of skills. Building the combined matcher once during initialization and updating it only when the skill set changes keeps the overhead low.

Fourth, limit the depth of recursive skill calls. Some skills invoke other skills, creating a chain of selections that the loop must resolve each time. Enforcing a maximum recursion depth (for example, three levels) prevents pathological cases where a skill repeatedly calls a similar variant, inflating latency without adding value. When the limit is reached, the system can fall back to a direct execution path or return a clear error to the user.

Finally, profile the loop in production. Insert lightweight timers around discovery, schema loading, and matching phases, then aggregate the results in a rolling average. If any phase exceeds a configurable threshold (e.g., 50 ms), trigger a warning and automatically recompute the cache or suggest skill consolidation. Continuous profiling ensures that latency stays within acceptable bounds as the skill library grows.
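The profiling step might be sketched like this. PhaseProfiler and timePhase are hypothetical names; the clock is passed in as a function so the sketch does not depend on any particular runtime’s timer API, and the 50 ms threshold from the text is just a constructor argument.

```typescript
class PhaseProfiler {
  private samples = new Map<string, number[]>();
  private windowSize: number;
  private thresholdMs: number;

  constructor(windowSize: number, thresholdMs: number) {
    this.windowSize = windowSize;
    this.thresholdMs = thresholdMs;
  }

  record(phase: string, durationMs: number): void {
    const window = this.samples.get(phase) ?? [];
    window.push(durationMs);
    // Keep only the most recent samples so the average "rolls".
    if (window.length > this.windowSize) {
      window.shift();
    }
    this.samples.set(phase, window);
  }

  average(phase: string): number {
    const window = this.samples.get(phase);
    if (window === undefined || window.length === 0) {
      return 0;
    }
    return window.reduce((sum, ms) => sum + ms, 0) / window.length;
  }

  // Phases whose rolling average exceeds the configured threshold.
  slowPhases(): string[] {
    return [...this.samples.keys()].filter(
      (phase) => this.average(phase) > this.thresholdMs,
    );
  }
}

// Time one phase of the loop; `now` returns the current time in milliseconds.
function timePhase<T>(
  profiler: PhaseProfiler,
  phase: string,
  now: () => number,
  work: () => T,
): T {
  const start = now();
  try {
    return work();
  } finally {
    profiler.record(phase, now() - start);
  }
}
```

Recording in a finally block means a phase that throws still contributes its duration, which is often exactly the slow case worth seeing.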

By pruning unused skills, caching schemas, batching pattern matching, capping recursion, and monitoring performance, engineers can keep the tool-use selection loop fast enough for interactive use without sacrificing the flexibility that Claude Code’s skill system provides.

Observability and Evals: Debugging skill failures with detailed trace logs

When a skill does not behave as expected, the first question is “what did the model actually try to do?” Claude Code records every step of the skill-selection loop in a structured trace that can be inspected after the fact. The trace includes the raw user request, the system prompt that contained the serialized skill definitions, the model’s tool-selection decision, the exact arguments passed to the skill, and the skill’s stdout, stderr, and exit code. By stitching these pieces together, engineers can see whether a failure originated in the model’s reasoning, in the skill’s implementation, or in the environment that the skill interacts with.

The trace format is JSON lines, one entry per logical event. A typical sequence looks like this:

1. User request: the original prompt that triggered the skill.
2. Prompt assembly: the system prompt assembled by Claude Code, showing the skill schema that was injected.
3. Tool selection: the model’s JSON output indicating which skill to invoke and with what parameters.
4. Skill execution: the command line that was run, captured stdout/stderr, and the process exit status.
5. Result integration: the model’s follow-up response that incorporates the skill output.

Because each entry is timestamped and includes a unique request ID, entries from concurrent sessions can always be told apart, even when they interleave in the same file. Engineers can filter on the request ID to reconstruct the full call stack for a single interaction. The logs are written to the .claude/skills/logs directory by default, but the location can be overridden with the CLAUDE_SKILL_LOG_PATH environment variable. This makes it easy to ship logs to a centralized observability platform such as Loki or Splunk for long-term analysis.
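Filtering by request ID could be sketched as below. The field names requestId, timestamp, and event are assumptions about the trace layout, not a documented schema, and the ordering relies on timestamps being ISO 8601 strings in UTC so that they sort lexicographically.

```typescript
// Hypothetical trace entry; field names are assumed for illustration.
interface TraceEvent {
  requestId: string;
  timestamp: string; // assumed ISO 8601 in UTC
  event: string;
  [key: string]: unknown;
}

// Parse a JSON Lines document, ignoring blank lines.
function parseTrace(jsonl: string): TraceEvent[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TraceEvent);
}

// Reconstruct one interaction in chronological order.
function eventsForRequest(
  events: TraceEvent[],
  requestId: string,
): TraceEvent[] {
  return events
    .filter((entry) => entry.requestId === requestId)
    .sort((a, b) =>
      a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
    );
}
```

Sorting after filtering keeps the reconstruction correct even when a log shipper delivers entries slightly out of order.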

Evaluations (or “evals”) are built on top of this trace data. An eval script parses the logs, extracts the model’s tool-selection decisions, and compares them against a ground-truth specification. The comparison yields precision and recall metrics for skill usage, as well as latency statistics for each step. Because the trace includes the exact command line that was executed, an eval can also check that each skill ran only the commands expected of it, which helps surface unintended side effects.
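The precision and recall computation can be made concrete with a small scorer. The definitions used here are one reasonable choice, stated as an assumption: a true positive is a request where the model called exactly the expected tool, precision divides by the number of tool calls the model made, and recall divides by the number of tool calls the ground truth expected. ToolDecision and scoreSelections are hypothetical names.

```typescript
// One decision per request; null means "no tool call".
interface ToolDecision {
  requestId: string;
  tool: string | null;
}

interface SelectionScore {
  precision: number;
  recall: number;
}

function scoreSelections(
  predicted: ToolDecision[],
  expected: ToolDecision[],
): SelectionScore {
  const expectedByRequest = new Map<string, string | null>();
  for (const decision of expected) {
    expectedByRequest.set(decision.requestId, decision.tool);
  }

  let truePositives = 0;
  let predictedCalls = 0;
  for (const decision of predicted) {
    if (decision.tool === null) continue;
    predictedCalls += 1;
    if (expectedByRequest.get(decision.requestId) === decision.tool) {
      truePositives += 1;
    }
  }
  const expectedCalls = expected.filter((d) => d.tool !== null).length;

  // Convention chosen here: a metric with nothing to score is reported as 0.
  return {
    precision: predictedCalls === 0 ? 0 : truePositives / predictedCalls,
    recall: expectedCalls === 0 ? 0 : truePositives / expectedCalls,
  };
}
```

Low precision with high recall points at over-eager tool use, often a sign of overly broad descriptions; the reverse suggests descriptions too narrow for the model to recognize when a skill applies.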

Common failure patterns become obvious when viewed through the trace. A “tool hallucination” appears as a selection event where the model names a skill that is not present in the .claude/skills directory; the subsequent execution step will be missing, and the trace will show a “skill not found” error. Permission errors surface as non-zero exit codes together with stderr messages that mention denied operations. Latency spikes are visible as unusually long intervals between the tool-selection and skill-execution entries, often indicating network delays when a skill proxies to an external service.

To make debugging efficient, developers should enable verbose logging during development and switch to a lower-volume mode in production. Verbose mode adds the full system prompt and the model’s intermediate token stream to the trace, which is invaluable when the model misinterprets a skill schema. In production, logging only the high-level events keeps storage costs low while still providing enough information to spot permission mismatches and execution failures.

By treating the trace as a first-class artifact, teams can automate regression tests, set alerts on abnormal error rates, and iteratively improve both the skill definitions and the prompting strategy. Observability therefore turns what would be opaque model behavior into a repeatable engineering workflow.