The teams that get the most out of AI coding tools aren’t the ones who write the best prompts. They’re the ones who’ve done the most thinking before they open the chat window.
I’ve spent the past several months integrating Claude Code into my engineering workflow on enterprise banking systems: production code, a regulated environment, real compliance requirements. This isn’t a productivity article. It’s an operational account of what actually happens when you introduce AI tooling into a serious engineering context.
## The productivity illusion
The first thing most engineers notice when they start using AI coding tools is a dramatic increase in output speed. Boilerplate writes itself. CRUD endpoints materialize. Test fixtures appear. This is real and it matters. But it creates a misleading mental model about where value is actually being generated.
The speed gains are real. The question is what you’re building faster.
In my experience, the risk of AI coding tools in enterprise environments is not that they generate incorrect code. It’s that they generate technically correct code that doesn’t belong in your system — code that passes type checking, passes tests, and violates the architectural constraints that aren’t encoded anywhere a machine can read.
A tool that writes code faster without changing what you think about will make bad decisions faster. Spec-Driven Development is the practice that converts AI speed into compounding engineering leverage rather than compounding technical debt.
## What spec-driven development actually means
A spec is not a requirements document. Requirements describe what a system should do from a user perspective. A spec describes what an implementation should look like from an engineering perspective — the constraints that precede and bound the implementation.
A production-grade spec contains:
- Data models with types and invariants: not just the shape, but what’s always true about the data (see the sketch after this list)
- API contracts: inputs, outputs, all error cases, not just the happy path
- Behavioral invariants: properties that must hold in all states, not just initial ones
- Explicit out-of-scope declarations: what this component deliberately does not handle
- Decision log: the options that were considered and the reasoning for the choice made
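To make the first item concrete, here is a minimal sketch of a data model whose invariant lives next to its shape, in the Zod style the platform already uses for form schemas. The SessionProgress name and its fields are illustrative, not from the OneBank codebase:

```typescript
import { z } from "zod";

// Illustrative model: the shape says "two integers"; the invariant says
// what must always be true about them. Both belong in the spec.
const SessionProgress = z
  .object({
    completedSteps: z.number().int().min(0),
    totalSteps: z.number().int().positive(),
  })
  // Invariant: a session can never have completed more steps than exist.
  .refine((p) => p.completedSteps <= p.totalSteps, {
    message: "completedSteps must not exceed totalSteps",
  });

type SessionProgress = z.infer<typeof SessionProgress>;
```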
Of those five sections, the last two (the out-of-scope declarations and the decision log) are the most important and the most skipped.
Out-of-scope declarations prevent scope creep during AI-assisted implementation. If you don’t tell Claude Code what a service is not responsible for, it will often add handling for edge cases that should belong to a different service — creating hidden coupling that you’ll discover six months later during a refactor.
The decision log is the mechanism for architectural governance. It encodes the context that exists at design time but disappears by implementation time. “Why did we use a discriminated union here instead of a class hierarchy?” is a question that should be answerable from the spec, not from a Slack search.
## The Claude Code orchestration model
Claude Code operates on a fundamentally different model from a chat assistant. It has tool access — it reads files, runs commands, writes code, runs tests. Sessions accumulate context within a conversation but don’t automatically persist between them. This creates a specific operational challenge for enterprise codebases: context management is engineering work.
```mermaid
flowchart LR
    SPEC --> SESSION
    CLAUDE_MD --> SESSION
    SESSION --> IMPL
    IMPL --> REVIEW
    REVIEW -->|"needs revision"| SESSION
    REVIEW -->|"approved"| MERGE
```
The CLAUDE.md file is the most underused feature in the Claude Code workflow. It’s a persistent context file that Claude Code reads at the start of every session — architectural context, conventions, constraints, package boundaries, and explicitly forbidden patterns. Every enterprise codebase I’ve worked on has implicit constraints that are understood by experienced engineers and invisible to new ones. CLAUDE.md makes them explicit and machine-readable.
Our CLAUDE.md for the OneBank platform includes:
```markdown
## Architecture constraints

- @onebank/forms must not import from @onebank/components or @onebank/primitives.
  The forms layer owns business logic, not visual concerns.
- resolverConfigFormulario must remain a pure function.
  No hooks. No side effects. No React imports. Testable in Node.js.
- SolicitanteSchema changes require a compliance review label on the PR.
  Tag the compliance owner (see CODEOWNERS) before requesting merge.
- EventoFormulario schema is append-only. New fields may be added.
  Existing fields may not be removed or have their types narrowed.

## Forbidden patterns

- No direct calls to Core Banking API from product apps.
  All Core Banking access goes through the BFF.
- No new shared Postgres tables. Product-specific schemas only.
  (See ADR-007 for the reasoning behind this constraint.)
```
These constraints are things that experienced engineers on the team know. They’re not in the type system. They’re not enforced by the compiler. Without CLAUDE.md, AI-assisted implementation would violate them regularly — not because the AI is careless, but because it doesn’t have access to the organizational history that shaped them.
## Anatomy of a production-grade spec
Here’s a real spec from the OneBank platform — a service for handling form step completion events:
```markdown
# StepCompletionService

## Responsibility

Records the completion of an onboarding form step and updates session progress.
Does NOT handle form submission — that's ApplicationSubmissionService.
Does NOT handle validation — validation is handled before this service is called.

## Input

StepCompletionEvent: {
  sessionId: string (UUID, active session required)
  productCode: ProductCode (from PRODUCT_CODE enum)
  stepId: string (from FORM_STEPS enum for this product)
  completedAt: Date (server-assigned, not client-provided)
  formDataSnapshot: Partial<SolicitanteSchema> (validated slice)
}

## Invariants

- A step cannot be marked complete if a later step is already complete.
  Steps must complete in order. Out-of-order completion is an error state.
- completedAt is always server-assigned. Client-provided timestamps are ignored.
- formDataSnapshot is stored for audit purposes. It must not be modified
  after storage. The snapshot represents state at the moment of step completion.

## Output

StepCompletionResult: {
  sessionId: string
  stepId: string
  nextStep: FormStep | null (null if this was the final step)
  sessionProgress: number (0-100)
}

## Error cases

- SESSION_NOT_FOUND: sessionId doesn't exist or has expired
- STEP_ALREADY_COMPLETED: idempotent — return success with existing completion data
- OUT_OF_ORDER_COMPLETION: throw, do not silently handle
- INVALID_STEP_FOR_PRODUCT: stepId is not valid for the given productCode

## Out of scope

- Session creation or expiry — that's SessionManagementService
- Form validation — validation happens before this service
- Sending notifications — handled by event consumers downstream
- Calculating product-specific progress weights — all steps count equally

## Decision log

- Why store formDataSnapshot? Regulatory requirement. We need point-in-time
  records of what the customer submitted at each step, not just the final state.
- Why throw on OUT_OF_ORDER_COMPLETION instead of silently reordering?
  We considered silently accepting and reordering. Rejected: silent handling
  masks client-side bugs that indicate session state corruption. Fail loud.
- Why server-assigned completedAt? Client clocks are unreliable and
  manipulable. Regulatory audit requirements need server-side timestamps.
```
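Before handing a spec like this to Claude Code, it can help to pin the contracts down as types in the target language. A minimal sketch; ProductCode, FormStep, and SolicitanteSchema are stand-ins, since the real OneBank types aren’t reproduced in this article:

```typescript
// Hypothetical TypeScript translation of the spec's contracts.
type ProductCode = string;
type FormStep = string;
type SolicitanteSchema = Record<string, unknown>;

interface StepCompletionEvent {
  sessionId: string;                            // UUID, active session required
  productCode: ProductCode;                     // from the PRODUCT_CODE enum
  stepId: string;                               // from FORM_STEPS for this product
  completedAt: Date;                            // server-assigned, never client-provided
  formDataSnapshot: Partial<SolicitanteSchema>; // validated slice, stored for audit
}

interface StepCompletionResult {
  sessionId: string;
  stepId: string;
  nextStep: FormStep | null; // null if this was the final step
  sessionProgress: number;   // 0-100
}

// The error taxonomy as a discriminated union keeps the spec's failure modes
// in the type system. STEP_ALREADY_COMPLETED is deliberately absent: the spec
// treats it as idempotent success, not as an error.
type StepCompletionError =
  | { code: "SESSION_NOT_FOUND"; sessionId: string }
  | { code: "OUT_OF_ORDER_COMPLETION"; stepId: string }
  | { code: "INVALID_STEP_FOR_PRODUCT"; stepId: string; productCode: ProductCode };
```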
This spec took approximately 45 minutes to write. It saved approximately 3 hours of back-and-forth clarification during implementation, a compliance review cycle, and one post-implementation refactor when we discovered an out-of-scope concern had been implemented inside the service.
## The specification pipeline
The workflow runs in five stages:
```mermaid
flowchart LR
    REQ --> SPEC_WRITE
    SPEC_WRITE --> SPEC_REVIEW
    SPEC_REVIEW -->|"approved"| IMPL
    SPEC_REVIEW -->|"revise"| SPEC_WRITE
    IMPL --> HUMAN_REVIEW
    HUMAN_REVIEW -->|"approved"| MERGE
    HUMAN_REVIEW -->|"revise spec"| SPEC_WRITE
```
The spec review stage is not optional. In a regulated environment, a spec that hasn’t been reviewed by a compliance owner is an unreviewed compliance decision. The review creates an artifact: a dated approval that a compliance-relevant implementation decision was reviewed before implementation began. This is auditable. The alternative — making compliance decisions in code review — creates an auditing problem, because code review is not a compliance review process.
The “revise spec” feedback loop from human review back to spec writing is where most of the value is recovered. When an implementation doesn’t feel right in code review, the issue is almost always in the spec, not the implementation. Working backwards from code to spec to requirement surfaces the actual decision that was wrong. Fixing the spec and reimplementing is faster than patching the code.
## Architecture governance: what AI cannot decide
There are decisions that AI tooling should not make in enterprise systems, and they’re not primarily about correctness. They’re about authority and accountability.
Boundary decisions: Which service owns which concern is a governance decision, not a technical one. AI will generate technically functional code that crosses boundaries it has no authority to cross. If resolverConfigFormulario needs to call a compliance service, Claude Code will implement the call. That implementation might be correct. The decision to make the call is an architectural decision that requires human authority.
Regulatory interpretation: When a regulatory requirement is ambiguous, AI will make an interpretation. That interpretation will appear confident. It may be wrong. Regulatory interpretations in banking require legal review — not because the technical implementation is hard, but because the interpretation has compliance implications that extend beyond the codebase.
Deprecation and removal decisions: AI will suggest removing unused code. Removing code from a regulated system requires understanding whether the code is used for compliance archiving, whether removing it affects audit trails, and whether the removal creates a gap in regulatory coverage. These are not things AI can determine from the codebase alone.
ADR decisions: AI can generate the structure of an ADR (context, options, consequences), but it cannot decide which option to choose. An ADR records who made a technical choice and why, and that authority must remain human.
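For reference, the skeleton AI can usefully generate looks something like the following. This is a generic template in the widely used Nygard style, not OneBank’s verbatim format:

```markdown
# ADR-NNN: <decision title>

## Status
Proposed | Accepted | Superseded by ADR-MMM

## Context
The technical and organizational forces behind the decision.

## Options considered
Each alternative, with its trade-offs.

## Decision
What was chosen, and who made the call.

## Consequences
What becomes easier and what becomes harder as a result.
```

Everything except the Decision section is structure that AI can draft. The Decision section is the part that must be written by the human who owns it.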
The practical implication: Claude Code works best as an implementation tool with a well-bounded specification, not as an architectural decision tool. The spec is where architectural decisions live. The implementation is where they’re executed.
## Scaling across teams
The natural question once spec-driven development works for one engineer is whether it scales to a team. The answer is yes, but the bottleneck shifts.
Individual adoption is fast. A developer who understands the spec → implementation → review cycle can operate independently at high leverage. Team adoption requires shared spec templates and shared CLAUDE.md conventions — which requires someone to own those artifacts.
The spec template library is the highest-leverage team artifact. When a team has templates for “event handler service,” “form validation schema,” “BFF route handler,” and “audit log schema,” the spec-writing time drops from 45-90 minutes to 15-30 minutes, and the spec quality becomes consistent. Inconsistent specs are a larger problem at team scale than at individual scale — they create inconsistent implementations that require different mental models to reason about.
On the OneBank platform, we maintain templates for:
- Domain service specs (business logic, no framework dependencies)
- BFF route handler specs (auth, rate limiting, orchestration)
- Form schema specs (Zod schemas with regulatory context)
- Audit event specs (EventoFormulario extensions)
Each template includes the sections that are required for a spec in that category, with annotations explaining why each section exists. New team members learn the spec format by reading the templates and the annotations — which doubles as onboarding documentation for the architecture itself.
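As an illustration, a trimmed-down domain service template might look like this. The bracketed annotations are the onboarding half of the artifact; this is a simplified sketch, not the verbatim OneBank template:

```markdown
# <ServiceName>

## Responsibility
<One sentence. If it needs two, the service is doing too much.>
<Explicit "Does NOT handle" lines, naming the services that do.>

## Input / Output
<Types with invariants, not just shapes. Every field gets a constraint.>

## Invariants
<Properties that must hold in all states, not just initial ones.>

## Error cases
<Every failure mode, with intended handling: throw, return, or absorb.>

## Out of scope
<Required. An empty section means boundaries haven't been considered.>

## Decision log
<Required. Options considered and why the chosen one won.>
```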
## The limits that matter in enterprise contexts
Context window as a constraint: Claude Code has a context window limit. In large enterprise codebases, a single component might depend on patterns scattered across dozens of files. The CLAUDE.md file helps, but it doesn’t replace reading the actual implementation. Complex refactors that require understanding the full dependency graph of a large system hit context limits in ways that smaller codebases don’t.
The practical mitigation: scope sessions aggressively. A session should have a clearly bounded objective — implement this one service according to this spec — not “refactor the forms layer.” The bounded objective keeps the context focused on the files that matter.
Domain knowledge gaps: AI has no knowledge of your organization’s history. It doesn’t know about the compliance incident in month eight that led to the sealed ContextoRegulatoria type. It doesn’t know about the BFF refactor that moved rate limiting from application-level to gateway-level. It will suggest implementations that would have been correct before those decisions were made. CLAUDE.md is the mechanism for encoding this history, but it requires someone to maintain it as the codebase evolves.
Regulatory nuance: AI can apply regulatory rules that are explicitly in the spec. It cannot identify regulatory rules that are missing from the spec. If a spec omits a field that’s legally required, the implementation will omit the field, the code will pass tests, and the gap will surface in a compliance review — or in a regulatory audit. Spec completeness in regulated domains requires compliance expertise, not just technical expertise.
The confident wrongness problem: AI code generation expresses confidence that isn’t calibrated to correctness. A hallucinated API that doesn’t exist will be generated with the same apparent confidence as a correct one. In enterprise codebases with internal APIs, this is a specific risk — the internal API documentation may not be well-represented in training data. The mitigation is test-driven implementation: run the tests early, before assuming the implementation is correct.
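In practice this means writing the invariant tests before trusting a generated implementation. A minimal sketch against the StepCompletionService spec above; completeStep and seedSession are hypothetical names for the service module and test fixtures, and Vitest is one runner choice among several:

```typescript
import { describe, expect, it } from "vitest";
// Hypothetical imports: the real module and fixture names aren't shown here.
import { completeStep } from "./step-completion-service";
import { seedSession } from "./test-fixtures";

describe("StepCompletionService invariants", () => {
  it("fails loud on out-of-order completion, per the spec", async () => {
    // Corrupted state: a later step is complete while an earlier one is not.
    const session = await seedSession({ completedSteps: ["step-2"] });
    await expect(
      completeStep({ sessionId: session.id, productCode: "demo", stepId: "step-1" })
    ).rejects.toMatchObject({ code: "OUT_OF_ORDER_COMPLETION" });
  });

  it("treats re-completion of a finished step as idempotent success", async () => {
    const session = await seedSession({ completedSteps: ["step-1"] });
    // Per the spec, STEP_ALREADY_COMPLETED returns the existing completion
    // data rather than throwing.
    const result = await completeStep({
      sessionId: session.id,
      productCode: "demo",
      stepId: "step-1",
    });
    expect(result.stepId).toBe("step-1");
  });
});
```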
## Operational safeguards we’ve implemented
After several months of operation, these are the safeguards that have prevented production incidents:
Compliance change detection in CI: A script that detects changes to files in the forms package (specifically the schema and resolver files) and requires a compliance review label before staging promotion. This catches compliance-relevant changes that happen during implementation iteration — when a developer extends a schema to handle an edge case without realizing the extension has compliance implications.
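A hedged sketch of what such a check can look like. The paths, the label name, and the PR_LABELS environment variable are illustrative; the real pipeline is more involved:

```typescript
// ci/check-compliance-label.ts — illustrative sketch, not the production script.
// Fails the build when compliance-relevant files changed but the PR lacks
// the compliance review label.
import { execSync } from "node:child_process";

const COMPLIANCE_PATHS = [
  "packages/forms/src/schemas/",
  "packages/forms/src/resolver/",
];
const REQUIRED_LABEL = "compliance-reviewed";

// Changed files relative to the merge base with the main branch.
const changed = execSync("git diff --name-only origin/main...HEAD", {
  encoding: "utf8",
})
  .split("\n")
  .filter(Boolean);

const complianceFiles = changed.filter((file) =>
  COMPLIANCE_PATHS.some((prefix) => file.startsWith(prefix))
);

// PR labels injected by the CI system as a comma-separated env var.
const labels = (process.env.PR_LABELS ?? "").split(",").map((l) => l.trim());

if (complianceFiles.length > 0 && !labels.includes(REQUIRED_LABEL)) {
  console.error(
    `Compliance-relevant files changed without the "${REQUIRED_LABEL}" label:\n` +
      complianceFiles.join("\n")
  );
  process.exit(1);
}
```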
ADR requirement for architectural decisions: If a PR introduces a new pattern not represented in existing code, CI requires a linked ADR. This is enforced by a label check — no merge without either an ADR link or an explicit “no ADR required” label from an architect. This creates a paper trail for decisions made during AI-assisted development.
Test coverage gate on business logic: 95% branch coverage required on resolverConfigFormulario and schema validators. This is not a global coverage requirement — it’s scoped to the logic that has compliance implications. Global coverage requirements produce the wrong incentives; scoped requirements produce the right ones.
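A sketch of what a scoped gate looks like in a Vitest config, assuming hypothetical package paths; the point is that the 95% branch requirement applies only to the compliance-relevant logic, never globally:

```typescript
// vitest.config.ts — illustrative sketch of a scoped coverage gate.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        // No global gate: global thresholds reward padding tests onto easy code.
        // Glob-scoped thresholds target the logic with compliance implications.
        "packages/forms/src/resolver/**": { branches: 95 },
        "packages/forms/src/schemas/**": { branches: 95 },
      },
    },
  },
});
```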
Spec review SLA: Specs must be reviewed within 48 hours by a peer and the compliance owner. This prevents the spec stage from becoming a bottleneck while preserving the governance property. If review takes longer, the developer may start implementing anyway, accepting the risk that the spec will change and the understanding that spec changes after implementation begins create rework.
## What this changes for engineering organizations
The operational conclusion after several months is not that AI makes individual engineers dramatically more productive, though it does. It’s that AI makes the quality of engineering decisions dramatically more visible.
When implementation is fast, the constraint becomes decision quality. Organizations that were previously bottlenecked on implementation velocity are now bottlenecked on architectural clarity. Every ambiguity in a spec becomes a latent risk in an AI-generated implementation. Every unclear boundary becomes a place where the implementation will make a choice you didn’t make.
This is useful information. The organizations that treat spec quality as the primary engineering metric — not story points, not velocity, not deployment frequency — are the ones that get compounding value from AI tooling. The organizations that treat AI as a faster way to implement underspecified work get faster technical debt.
Spec-Driven Development isn’t a methodology for using AI. It’s a methodology for making engineering decisions explicit before they’re implemented. AI tooling makes the payoff from that explicitness higher — and makes the cost of skipping it more immediately visible.
The specification is not overhead. It’s the work. Everything after it is execution.