
Progressive Disclosure in MCP Servers: Keeping LLM Context Lean

MCP (Model Context Protocol) servers give LLMs access to external systems, but naive implementations waste thousands of tokens on tool schemas and verbose responses. This post covers patterns for context-efficient MCP design.

The Problem: Schema and Response Bloat

A typical API wrapper exposes one MCP tool per endpoint. Each tool carries its own schema into the LLM's context window. An integration with 8 operations might consume ~480 tokens in schema alone; a gateway pattern can reduce this to ~120 tokens—a 75% reduction.

The tax compounds across every enabled MCP server. A workspace with 5-6 servers can burn 2,000+ tokens before the conversation starts.

Response verbosity adds a second tax. APIs return everything; LLMs need only what's relevant to the current task.
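To make the tax concrete, here is a rough estimate using the common ~4-characters-per-token heuristic. The payloads are hypothetical but mirror the shapes discussed in this post:

```python
import json

def estimate_tokens(obj) -> int:
    # Rough heuristic: ~4 characters per token for JSON-ish English text.
    return len(json.dumps(obj)) // 4

verbose = {
    "id": "abc123",
    "created": "2025-01-05T14:32:05.882Z",
    "author": {"id": "u789", "email": "jdoe@example.com",
               "displayName": "Doe, John", "avatar": "https://example.com/a.png"},
    "content": "Hello",
    "metadata": {"thread": "t1", "edited": False},
}
compact = {"ts": "2025-01-05T14:32", "from": "jdoe", "content": "Hello"}

print(estimate_tokens(verbose), estimate_tokens(compact))
```

Multiply the difference by every message in a list response and the savings dominate the conversation budget.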

Pattern 1: Gateway Tool with Action Dispatch

Consolidate operations into a single tool with an action discriminator:

from typing import Literal

@mcp.tool()
async def gateway(
    action: Literal["discover", "list", "get", "search", "export"],
    id: str | None = None,
    query: str | None = None,
    # ... other params with defaults
) -> dict:
    """Service gateway. Actions: discover, list, get, search, export."""

    if action == "discover":
        return {...}  # Capability manifest
    if action == "list":
        return {...}
    # ... dispatch remaining actions to their implementations

The LLM sees one tool schema. The docstring hints at available actions. Full documentation lives behind discover.
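The dispatch itself can be as simple as a handler table. A minimal sketch with plain-Python stand-ins (the handler bodies and manifest fields are hypothetical, and MCP registration is omitted for brevity):

```python
from typing import Callable

# Hypothetical handlers standing in for real service calls.
def _discover(**params) -> dict:
    return {"actions": sorted(_HANDLERS), "filters": {}}

def _list(**params) -> dict:
    return {"items": []}

_HANDLERS: dict[str, Callable[..., dict]] = {
    "discover": _discover,
    "list": _list,
}

def gateway(action: str, **params) -> dict:
    """Single entry point: one schema in context, N operations behind it."""
    handler = _HANDLERS.get(action)
    if handler is None:
        # Fail softly with guidance instead of raising: the LLM can self-correct.
        return {"error": f"unknown action {action!r}",
                "hint": "call gateway(action='discover') for capabilities"}
    return handler(**params)
```

Returning a structured error with a hint, rather than raising, keeps even failure responses useful to the model.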

Pattern 2: On-Demand Capability Discovery

Instead of cramming usage details into the schema, expose a discover action that returns structured guidance only when needed:

if action == "discover":
    return {
        "CRITICAL_RULES": [
            "Hard constraints the LLM must never violate",
            "Filesystem or permission boundaries", 
            "Token hygiene requirements"
        ],
        "ACTION_SELECTION": {
            "search": {"purpose": "...", "use_when": "..."},
            "export": {"purpose": "...", "use_when": "..."}
        },
        "capabilities": [...],
        "filters": {...}
    }

Simple requests skip discovery entirely. Complex workflows pay the documentation tax once per session.

Pattern 3: Response-Embedded Guidance

External "skills" or system prompts drift out of sync with server behavior. Embed guidance directly in responses, scoped to the data just returned:

response = {"data": results}

if results_contain_sensitive_pattern:
    response["GUIDANCE"] = {
        "note": "Results contain [pattern] requiring special handling",
        "workflow": [
            "Step 1: ...",
            "Step 2: ...",
            "NEVER do X"
        ]
    }
return response

The LLM receives instructions at the moment of decision, tied to specific data. No stale documentation. No wasted tokens when the guidance doesn't apply.
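A sketch of the pattern, assuming an SSN-like regex as the sensitive pattern (the pattern and workflow text are illustrative, not a real server's rules):

```python
import re

# Assumed sensitive pattern for illustration: US SSN-like strings.
SENSITIVE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def with_guidance(results: list[str]) -> dict:
    """Attach handling instructions only when the returned data needs them."""
    response: dict = {"data": results}
    if any(SENSITIVE.search(item) for item in results):
        response["GUIDANCE"] = {
            "note": "Results contain SSN-like values requiring special handling",
            "workflow": ["Redact before displaying to the user",
                         "NEVER write unredacted values to files"],
        }
    return response
```

Clean results carry zero guidance overhead; flagged results carry exactly the instructions that apply.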

Pattern 4: Default-On Response Filtering

APIs return complete objects. LLMs usually need a subset. Filter and compact responses by default:

# API returns
{"id": "abc123", "created": "2025-01-05T14:32:05.882Z", 
 "author": {"id": "u789", "email": "jdoe@example.com", "displayName": "Doe, John", "avatar": "..."},
 "content": "Hello", "metadata": {...}}

# Server returns (compact=True, default)
{"ts": "2025-01-05T14:32", "from": "jdoe", "content": "Hello"}

# Server returns (compact=False, when IDs needed for follow-up)
{"id": "abc123", "created": "2025-01-05T14:32:05.882Z", "author": {...}, "content": "Hello"}
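A filtering function for the example above might look like this (the field projections are illustrative; real servers will choose their own):

```python
def compact_message(raw: dict, compact: bool = True) -> dict:
    """Project a full API object down to task-relevant fields by default."""
    if not compact:
        return raw  # full object when IDs are needed for follow-up calls
    return {
        "ts": raw["created"][:16],                     # minute precision is enough
        "from": raw["author"]["email"].split("@")[0],  # short handle, not full record
        "content": raw["content"],
    }

raw = {"id": "abc123", "created": "2025-01-05T14:32:05.882Z",
       "author": {"id": "u789", "email": "jdoe@example.com",
                  "displayName": "Doe, John", "avatar": "..."},
       "content": "Hello", "metadata": {}}

print(compact_message(raw))
# → {'ts': '2025-01-05T14:32', 'from': 'jdoe', 'content': 'Hello'}
```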

Design principles:

- Compact by default: the common case pays the minimum token cost.
- Keep an escape hatch: compact=False returns full objects when IDs are needed for follow-up calls.
- Shorten fields rather than drop meaning: a truncated timestamp and a short handle still carry what the task needs.

Pattern 5: Proactive Warnings

When response data could cause downstream problems, warn explicitly:

if len(results) > 100:
    response["WARNING"] = "Large result set. Consider narrower filters."

if total_chars > 50000:
    response["TOKEN_WARNING"] = f"~{total_chars // 4} tokens. Consider pagination or smaller scope."

if results_written_to_external_filesystem:
    response["NOTE"] = "File written to user's machine. You cannot access it."

This steers the LLM toward self-correction before problems manifest.
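Collected into one helper, the checks above become testable (thresholds mirror the snippet; the function name is a hypothetical):

```python
def attach_warnings(response: dict, results: list,
                    wrote_external_file: bool = False) -> dict:
    """Annotate a response in place with warnings before problems manifest."""
    if len(results) > 100:
        response["WARNING"] = "Large result set. Consider narrower filters."
    total_chars = sum(len(str(item)) for item in results)
    if total_chars > 50_000:
        response["TOKEN_WARNING"] = (
            f"~{total_chars // 4} tokens. Consider pagination or smaller scope."
        )
    if wrote_external_file:
        response["NOTE"] = "File written to user's machine. You cannot access it."
    return response
```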

Implementation Checklist

A context-efficient MCP server:

- exposes a single gateway tool with an action discriminator instead of one tool per endpoint
- keeps full documentation behind a discover action, paid once per session
- embeds guidance in responses, scoped to the data just returned
- filters and compacts responses by default, with a full mode when IDs are needed for follow-up calls
- warns proactively about large result sets, token-heavy responses, and files the LLM cannot access

Appendix: Embedded Skill Template

For MCP authors who want to provide skill-like guidance to consuming LLMs, structure your discover response (or documentation) like this:

## Critical Rules
- [Hard constraints - filesystem boundaries, token limits, permission scope]

## Action Selection
| Action | Returns | Use When |
|--------|---------|----------|
| list | Collection summary | Need to browse/select |
| get | Single item detail | Have specific ID |
| search | Filtered results | Need subset matching criteria |
| export | File path (external) | User wants local copy |

## Parameter Reference  
| Param | Default | Notes |
|-------|---------|-------|
| compact | true | false when IDs needed for follow-up calls |
| limit | 50 | Cap results; use pagination for more |

## Common Workflows
**[Workflow name]**: step 1 → step 2 → completion criteria

## Output Modes
- compact (default): [what's included], use for [scenarios]
- full: [what's added], use for [scenarios]

This gives LLMs enough information to use the tool correctly without external prompts that drift from actual behavior.