From Vibe Coding to AI Engineering
A practical shift from vibe coding to disciplined AI engineering.
Like many, I started with vibe coding. Prompt the model, watch code appear, tweak until it works, move on. It was fast, fun and undeniably impressive. It was also fragile.
TL;DR
- Technical leaders and software builders: A clear workflow from Project -> Proposal -> Design -> Tasks that keeps AI output consistent and changes traceable.
- Underwater photographers: An application that assists with identification and enhancement while keeping the photographer in charge of the final calls.
My Background and Motivation
I’m a semi-retired technical architect/software engineer. Stepping back from full-time work gave me space to explore, but it didn’t remove the itch to build systems that have real constraints, tradeoffs and consequences. Over the past year, that curiosity has turned into focused experimentation with AI-assisted software engineering.
That experimentation produced quick, working prototypes. As soon as I tried to evolve them into something durable, the cracks showed. The same shortcuts that made early demos feel magical started to surface as drift, ambiguity and rework. That’s where the shift from vibe coding to AI engineering practices began.
From Vibe Coding to AI Engineering Practices
Vibe coding was the natural starting point. The framing popularized by Andrej Karpathy legitimized fast, exploratory loops where the developer steers and the model fills in gaps. It’s an effective way to learn what’s possible and to move quickly when the cost of mistakes is low.
The problems appear when systems live longer than a single session. Writers like Simon Willison consistently highlight the same lesson from real usage. Models are useful, but unreliable by default. Without constraints, artifacts and verification, AI-assisted workflows degrade into guesswork.
That’s the transition in practice: vibe coding teaches speed, but AI engineering practices are what let you keep the speed without losing control.
A Step Change, Not a Shortcut
LLMs are not just a productivity boost. They represent a step change in how software can be constructed: a discontinuous jump in capability, not a marginal gain. That change only materializes when models are treated as engineering components, not magic.
The pattern is clear: tools change, principles don’t. The leverage is real, but only when it is framed by explicit intent, clear boundaries and tight feedback loops. Without those guardrails, systems drift, costs spike and confidence erodes. Treating models as components means specifying inputs and outputs, enforcing contracts and instrumenting behavior the same way you would any other critical dependency.
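To make "treating models as components" concrete, here is a minimal TypeScript sketch of what such a boundary can look like: typed inputs, typed outputs and the versioning and cost signals carried alongside them. The names and fields are illustrative assumptions, not Critterflow's actual interfaces.

```typescript
// Illustrative sketch only: the model is treated like any other critical dependency,
// with explicit inputs, outputs and operational metadata at the boundary.

export interface IdentifyRequest {
  imageUrl: string;      // minimal, explicit input
  regionHint?: string;   // optional context, e.g. "Indo-Pacific"
  maxCandidates: number; // bounded output size
}

export interface SpeciesCandidate {
  scientificName: string;
  commonName?: string;
  confidence: number;    // 0..1, as reported by the model
}

export interface IdentifyResponse {
  candidates: SpeciesCandidate[];
  modelVersion: string;  // versioned so behavior changes are traceable
  latencyMs: number;
  costUsd?: number;
}

export interface ModelAdapter {
  identify(req: IdentifyRequest): Promise<IdentifyResponse>;
}
```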
That realization pushed me toward:
- AI Engineering 12-factor principles
- Context engineering as a first-class design concern
- Spec-driven development using OpenSpec
A Manifesto for AI Engineering
AI-assisted software engineering is not about better prompts. It’s about building systems that remain trustworthy as models change. Large language models are powerful, but they are probabilistic, opaque and constantly evolving. Treating them as deterministic components—or as creative collaborators that “figure things out” for us—is a category error. What’s required is engineering discipline: explicit intent, bounded behavior and observable outcomes.
That discipline starts with a simple change of priorities: specs before prompts. Prompts are an implementation detail; specs define responsibility. If inputs, outputs, acceptance criteria and failure modes are not written down, the system will drift. Prompt tuning may hide that drift temporarily, but it cannot correct it. Determinism is not about identical tokens; it’s about predictable system behavior at the boundaries that matter.
Context is architecture. Models reason only over what they are shown, and context that merely accumulates produces behavior that cannot be debugged or trusted. Designed context—minimal, intentional, versioned and owned by the system—turns probabilistic reasoning into something that can be inspected, replayed and improved. In the same way, AI outputs are interfaces, not prose: anything consumed downstream must be structured, validated and contractually enforced.
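Because outputs are interfaces, nothing the model returns should reach downstream code without being parsed against a schema. Zod is already in the stack for request validation; the schema and helper below are an illustrative sketch of applying the same idea to model output, not the shipped code.

```typescript
import { z } from "zod";

// Illustrative schema: the model must return this shape or the call fails loudly.
const IdentificationOutput = z.object({
  candidates: z
    .array(
      z.object({
        scientificName: z.string().min(1),
        confidence: z.number().min(0).max(1),
      })
    )
    .max(5), // bounded, never open-ended prose
});

type IdentificationOutput = z.infer<typeof IdentificationOutput>;

export function parseModelOutput(raw: string): IdentificationOutput {
  // JSON.parse throws on non-JSON text; that is itself a contract violation worth surfacing.
  const parsed = IdentificationOutput.safeParse(JSON.parse(raw));
  if (!parsed.success) {
    throw new Error(`Model output violated contract: ${parsed.error.message}`);
  }
  return parsed.data;
}
```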
Observability is non-negotiable. If you cannot replay what the model saw and why it responded, you cannot operate the system. Inputs, outputs, versions, cost and latency are baseline signals, not optimizations. These ideas align closely with the emerging discipline of AI Engineering. I have pulled them into a working set of principles, Dan’s AI Engineering 12 Factors, in Appendix A.
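As a sketch of what "baseline signals" can mean in code, here is one way to shape a per-call trace record and a wrapper that emits it for every model call. The field names and the console sink are assumptions for illustration.

```typescript
// Illustrative shape for the per-call trace record; field names are assumptions.
interface AiCallTrace {
  traceId: string;
  capability: string;   // e.g. "species-identify"
  modelVersion: string;
  promptVersion: string;
  inputDigest: string;  // hash of the compacted context that was sent
  outputDigest: string;
  latencyMs: number;
  costUsd?: number;
  timestamp: string;    // ISO 8601
}

// One wrapper so every call emits the same record; replay starts from these logs.
async function withTrace<T>(
  meta: Omit<AiCallTrace, "latencyMs" | "outputDigest" | "timestamp">,
  call: () => Promise<{ value: T; outputDigest: string }>
): Promise<T> {
  const start = Date.now();
  const { value, outputDigest } = await call();
  const trace: AiCallTrace = {
    ...meta,
    outputDigest,
    latencyMs: Date.now() - start,
    timestamp: new Date().toISOString(),
  };
  console.log(JSON.stringify(trace)); // stand-in for the real log sink
  return value;
}
```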
A Real-World Project, Not a Toy Example
Why This Problem Domain?
I chose underwater photography because it’s a real workflow that I already live with. A single trip can produce thousands of images under constantly changing conditions. Many of my subjects are small and often ambiguous, and identification is rarely a clean yes-or-no decision.
It’s also a domain where AI fits naturally as an assistant, not an authority. AI can help narrow candidates, suggest identifications or clean up images, but the final call still belongs to the photographer. If an AI-assisted system can support this kind of workflow without getting in the way, it’s a good signal that the underlying engineering approach is sound.
Introducing Critterflow
That is where Critterflow comes in. It is the product testbed for this workflow, built around a real underwater photo library and the AI-assisted identification and enhancement loops that go with it. The landing page below is the front door to that experience, and it frames the goals of the system before we dive into architecture.
Architecture Overview
The goal wasn’t to let AI build the app. It was to make AI a reliable part of the workflow with clear limits. I wanted stable outputs and earlier signals when things drifted. That also makes change safer because expectations are explicit and verifiable.
At a high level, the system is split into four layers. They cover what the user interacts with, how the system runs requests, how integrations are handled and how data is managed. This keeps each part focused and easier to maintain.
critterflow.com - Architecture Overview
User Experience
- Image Library Management - The main dashboard for browsing uploads, tracking identification status, editing metadata and organizing images into collections and tags
- Portfolio Management - Users curate public-facing photos, set visibility and control what appears on their public portfolio
- AI-Driven Species Identification - AI suggests species candidates with confidence, and users can confirm or correct results
- AI Image Enhancement - AI-assisted improvements such as clarity, color or detail, saved alongside the original
- Profile - User settings, preferences and public profile details
- Admin - Model catalog, prompts, user management and system health tools
API & Orchestration
- Image & Portfolio - Upload, list, update and publish photo records and collections
- Species Identification - Enqueue identification jobs, poll status and store results
- Image Enhancement - Run enhancement jobs and persist outputs
- User Profile & Subscription - Profile settings, preferences and usage metadata
- Admin & Observability - Admin APIs, audit logs and system metrics
Integration
- LLM Model Adapters - Provider-agnostic interfaces for model calls and configuration
- Integration Adapters - OAuth connections and external service integrations
Data & State Management
- Image - Source files, thumbnails and metadata snapshots
- Identification - Results, confidence and correction history
- Collections - User-defined groupings and curated sets
- User Profiles - Preferences, onboarding status and public profile data
See Appendix B for the infrastructure tech stack and Appendix C for frameworks and versions.
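To make the API & Orchestration layer a little more concrete, here is a hedged sketch of an identification job being enqueued with the AWS SDK v3 SQS client from the stack. The message shape and attribute names are illustrative, not the actual Critterflow contract.

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// Illustrative job contract; the real message shape may differ.
interface IdentifyJob {
  jobId: string;
  userId: string;
  imageKey: string;   // S3 object key for the uploaded image
  requestedAt: string; // ISO 8601
}

const sqs = new SQSClient({});

// Enqueue an identification job; the Lambda worker consumes the same typed shape.
export async function enqueueIdentifyJob(queueUrl: string, job: IdentifyJob): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(job),
      MessageAttributes: {
        jobType: { DataType: "String", StringValue: "identify" },
      },
    })
  );
}
```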
Why Determinism Matters in AI-Assisted Systems
Determinism here does not mean identical outputs from probabilistic models. It means predictable system behavior: contracts are enforced, failure modes are explicit, costs are bounded and downstream consumers can rely on stable guarantees. Without this, tests lose meaning, regressions become invisible and confidence in change erodes.
Determinism is what lets AI-assisted systems scale beyond a single developer or session. It makes specs enforceable, observability actionable and iteration safe. Without it, AI remains a powerful but unreliable assistant. With it, AI becomes a component that can be trusted, integrated and evolved.
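One way to make "predictable behavior at the boundaries" tangible is to force every identification result through a confidence threshold that can return an explicit no-result instead of a weak guess. The threshold value and names below are assumptions, shown only to illustrate the pattern.

```typescript
// Illustrative boundary: downstream code only ever sees one of two explicit shapes.
const MIN_CONFIDENCE = 0.6; // assumed threshold, not a documented Critterflow value

interface Candidate {
  scientificName: string;
  confidence: number; // 0..1
}

type BoundedIdentification =
  | { kind: "identified"; scientificName: string; confidence: number }
  | { kind: "no_result"; reason: "low_confidence" };

export function boundIdentification(candidates: Candidate[]): BoundedIdentification {
  const best = [...candidates].sort((a, b) => b.confidence - a.confidence)[0];
  if (!best || best.confidence < MIN_CONFIDENCE) {
    // Explicit "no result" beats a confident-sounding wrong answer.
    return { kind: "no_result", reason: "low_confidence" };
  }
  return {
    kind: "identified",
    scientificName: best.scientificName,
    confidence: best.confidence,
  };
}
```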
Context Engineering (Staying Out of the Dumb Zone)
Determinism depends on what the model sees. Even with a fixed spec and prompt, small shifts in context can change outcomes, so context has to be engineered with the same discipline. This section shows how context is trimmed and matched to each phase of work, and it sets up the OpenSpec workflow that follows, where specs, proposals and tasks define the guardrails that keep context stable. Working this way is also where I ran into what I started calling the “dumb zone.”
The dumb zone happens when a model is given too much context. Not bad context—just too much of it. Signals compete, focus blurs and outputs get longer while confidence and clarity drop. Compaction is the practical countermeasure: trim to only what the current step needs, and make that bundle explicit.
If adding context makes the model talk more but decide less, you’ve crossed from reasoning into noise—compact the context before proceeding.
This maps to factors 2 (design context deliberately), 4 (isolate non-determinism) and 6 (build observability). Context compaction is how those factors get enforced in day-to-day work.
In practice, the dumb zone shows up as longer answers, weaker commitments and drift from the spec even when nothing looks wrong. The fix is matching context to the phase of work—broad for exploration, structured for planning and minimal for execution. Compaction is how you enforce those profiles, and the table below summarizes them.
| Phase | Context Strategy | Goal |
|---|---|---|
| Research | Broad, exploratory context | Discover ideas and possibilities |
| Plan | Narrowed, structured context | Decide intent, constraints and success |
| Implement | Aggressively compacted context | Produce deterministic, reviewable output |
Rule of thumb:
Broad context for research. Narrow context for planning. Minimal context for implementation.
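Those phase profiles can be written down as data rather than left as habit. The sketch below is one hypothetical way to encode them; the token budgets and include/exclude lists are illustrative, not measured values.

```typescript
// Illustrative context profiles per phase; limits are assumptions, not measured values.
type Phase = "research" | "plan" | "implement";

interface ContextProfile {
  maxTokens: number;   // hard budget for the assembled context
  include: string[];   // what this phase is allowed to see
  exclude: string[];   // what gets compacted away
}

const CONTEXT_PROFILES: Record<Phase, ContextProfile> = {
  research: {
    maxTokens: 60_000,
    include: ["project.md", "specs/**", "docs/**", "prior discussion"],
    exclude: [],
  },
  plan: {
    maxTokens: 20_000,
    include: ["project.md", "the active proposal.md", "relevant spec deltas"],
    exclude: ["unrelated capabilities", "old chat history"],
  },
  implement: {
    maxTokens: 8_000,
    include: ["current tasks.md phase", "the files being changed", "API contracts"],
    exclude: ["research notes", "superseded designs", "everything else"],
  },
};
```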
Context engineering defines what the model sees, and observability (factor 6) lets you replay what it saw and did. The remaining failure mode is intent drift across phases, which reintroduces noisy context. That is where OpenSpec became the backbone.
How OpenSpec Changed My Workflow
OpenSpec changed my workflow by making the Project -> Proposal -> Design -> Tasks path explicit. It forces me to name intent early, lock scope before solutioning and keep the reasoning visible as the work moves forward. Each step captures a different type of intent so I can move from idea to implementation without guessing or losing context, and it leaves a trace that makes reviews and handoffs straightforward.
Here is the workflow at a glance:
| Step | Purpose | Description |
|---|---|---|
| Project | Define purpose and constraints | States what it is, who it serves and the non-negotiable constraints. |
| Proposal | Describe the change | Defines scope, why it matters and acceptance criteria. |
| Design | Decide how it will work | Captures tradeoffs, interfaces and solution shape. |
| Tasks | Plan the work | Breaks work into steps, tests and completion evidence. |
A Basic OpenSpec Project Layout
openspec/
├── project.md
├── specs/
│ └── <capability>/
│ ├── spec.md
│ └── design.md
└── changes/
├── <change-id>/
│ ├── proposal.md
│ ├── tasks.md
│ ├── design.md # optional
│ └── specs/
│ └── <capability>/
│ └── spec.md # delta requirements
└── archive/
└── YYYY-MM-DD-<change-id>/
└── ...
This structure separates:
- Project intent (project.md)
- Steady-state requirements (specs/)
- Change proposals and tasks (changes/)
- Design decisions (design.md)
Example Artifacts
openspec/project.md
# Project: Critterflow
## Purpose
Manage an underwater photo library with AI-assisted identification and enhancement.
## Constraints
- Spec-first AI workflows
- Provider-agnostic model adapters
- Full observability of AI calls
openspec/changes/add-lightroom-batch-agent/specs/lightroom-batch-agent/spec.md
## Requirement: Batch agent configuration
The system SHALL let a user configure an album, schedule, enabled state and a per-run cap.
## Requirement: Manual and scheduled runs
The system SHALL support both "Run now" and scheduled runs from the same config.
## Requirement: Album-scoped identification
The system SHALL enqueue identify jobs for new assets in the configured album only.
The system SHALL NOT write metadata back to Lightroom in this capability.
openspec/changes/add-lightroom-batch-agent/proposal.md
# Change: Lightroom Batch Agent (album-scoped identify)
## Why
Lightroom users want a hands-off way to identify new photos in a dedicated album.
## What Changes
- Add a batch agent config with album, schedule, enabled and max assets per run
- Support scheduled runs and a manual "Run now" trigger
- Detect new album assets and enqueue identification jobs
- Surface agent status and run history in Profile > Integrations
openspec/changes/add-lightroom-batch-agent/design.md
## Goals / Non-Goals
- Support scheduled and manual runs for a single album
- Persist run status and counts for UI display
- No Lightroom metadata write-back
## Context / References
- docs/openapi.yaml
- data_models.md
## API Contracts
- docs/openapi.yaml is the baseline source of truth
- New endpoints are first modeled there, then implemented
- Proposed changes include new agent config/run routes with examples
- Frontend API client stays in sync with the updated spec
## Data model details
- AgentConfig: userId, configId, albumId, schedule, enabled, maxAssetsPerRun, lastRunAt
- AgentRun: runId, configId, status, counts, startedAt, endedAt, errorSummary
- AgentAsset: configId, assetId, processedAt to prevent duplicate processing
## Decisions
- AgentConfig stores albumId, schedule, enabled, maxAssetsPerRun because these are the stable knobs users control
- AgentRun stores runId, status, counts, startedAt/endedAt to make runs traceable and UI-friendly
- Scheduler triggers runs; manual uses the same worker path to avoid divergent logic
## Risks
- Album listings can be slow or rate-limited, which can delay runs
- Large albums can enqueue too many jobs without a strict cap
- Partial failures can leave run status ambiguous without clear error summaries
openspec/changes/add-lightroom-batch-agent/tasks.md
## Phase 0: Context setup
- [ ] Define context compaction rules (source docs, exclusions, max length)
- [ ] Run lint and build for the baseline services
- [ ] Compact context for Phase 1
## Phase 1: Data model
- [ ] Define AgentConfig (userId, albumId, schedule, enabled, maxAssetsPerRun, lastRunAt)
- [ ] Define AgentRun (runId, configId, status, counts: queued/processed/succeeded/failed, startedAt, endedAt, errorSummary)
- [ ] Define AgentAsset (configId, assetId, processedAt) to prevent reprocessing
- [ ] Run lint and build for the data model changes
- [ ] Compact context for Phase 2
## Phase 2: API surface
- [ ] Add API route: GET /integrations/lightroom/batch-agent (config + lastRun summary)
- [ ] Add API route: PUT /integrations/lightroom/batch-agent (body: albumId, schedule, enabled, maxAssetsPerRun)
- [ ] Add API route: POST /integrations/lightroom/batch-agent/run (body: runType=manual, optional maxAssetsOverride)
- [ ] Add API route: GET /integrations/lightroom/batch-agent/runs (query: limit, status, since)
- [ ] Run lint and build for API updates
- [ ] Compact context for Phase 3
## Phase 3: Orchestration
- [ ] Implement scheduler: EventBridge cron -> dispatcher that scans enabled AgentConfig records and enqueues run jobs
- [ ] Implement worker: load AgentConfig, diff album assets vs AgentAsset, enqueue identify jobs, persist AgentRun counts/status
- [ ] Run lint and build for orchestration changes
- [ ] Compact context for Phase 4
## Phase 4: UI
- [ ] Add Profile > Integrations UI: album selector, schedule picker, enabled toggle, max-assets cap, run history table and "Run now"
- [ ] Run lint and build for UI changes
- [ ] Compact context for Phase 5
## Phase 5: Tests and verification
- [ ] Add unit tests for happy path flows (config save, manual run, scheduled run, run status)
- [ ] Add unit tests for edge cases (duplicate run, empty album, maxAssetsPerRun cap, retryable failures)
- [ ] Reach at least 75% unit test coverage for batch agent modules
- [ ] Run unit test suite and verify logs/metrics for run execution paths
- [ ] Run regression tests for identification and Lightroom integration workflows
- [ ] Manual verification: save config (album, schedule, enabled, cap) and confirm persistence
- [ ] Manual verification: trigger "Run now" and verify runId and status updates
- [ ] Manual verification: check run history for counts, timestamps and status transitions
- [ ] Manual verification: simulate failure and confirm error summary and user-facing messaging
- [ ] Run lint and build after test updates
- [ ] Compact context for Phase 6
## Phase 6: Spec sync
- [ ] Update OpenAPI docs and verify frontend API client stays in sync
- [ ] Run lint and build for spec/client updates
- [ ] Compact context for release
Why This Changed Everything
Once this structure existed, the workflow stopped being “prompt → code → hope” and became “spec → design → change → implementation → validation.” You can see the handoffs in the examples above: the project file locks in purpose and constraints, the proposal pins down the why and the scope, the design turns that into concrete decisions and the tasks spell out exactly what changes and tests get executed.
The effect is that every change has a paper trail. The OpenAPI spec is updated first, the routes and schemas are explicit, and the UI and worker changes follow a known plan instead of a guess. That sequence makes reviews faster, regressions easier to spot, and AI outputs more predictable.
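To show how a Phase 2 task from the plan above turns into reviewable code, here is a hedged sketch of the GET config route using Fastify and Zod from the stack. The response schema, handler and placeholder lookup are assumptions for illustration, not the shipped implementation.

```typescript
import Fastify from "fastify";
import { z } from "zod";

// Illustrative response contract for GET /integrations/lightroom/batch-agent.
const BatchAgentConfigResponse = z.object({
  albumId: z.string(),
  schedule: z.string(), // e.g. a cron expression
  enabled: z.boolean(),
  maxAssetsPerRun: z.number().int().positive(),
  lastRunAt: z.string().datetime().nullable(),
});

const app = Fastify();

app.get("/v1/integrations/lightroom/batch-agent", async (_request, reply) => {
  // Hypothetical store lookup; the real data access layer is not shown here.
  const config = await loadBatchAgentConfig(/* userId from auth */);
  if (!config) {
    return reply.code(404).send({ error: "No batch agent configured" });
  }
  // Validate on the way out so the OpenAPI contract and the runtime stay in sync.
  return BatchAgentConfigResponse.parse(config);
});

// Placeholder for the illustrative lookup used above.
async function loadBatchAgentConfig(): Promise<Record<string, unknown> | null> {
  return null;
}
```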
Lessons Learned
- Intent disappears quickly if it isn’t written down. Specs captured intent before it leaked into prompts and code. Recovering intent later was far more expensive than starting with imperfect specs.
- AI accelerates structure—or chaos. When boundaries were clear, AI sped things up. When they weren’t, it sped up drift. Model quality mattered less than having contracts and validation around AI behavior.
- Context matters more than prompts. Most variability came from context, not models. Compacting context, scoping it by phase (research, plan, implement), and treating it like a build artifact reduced variance more than any prompt tweak.
- Observability replaces guesswork. Logging inputs, outputs, versions, cost and latency turned “why did this change?” into a trace instead of a debate. Retrofitting this later was painful.
- Failure beats false confidence. Allowing explicit “no result” outcomes prevented more damage than trying to always be helpful.
- Small AI responsibilities scale. Narrow, well-defined AI tasks stayed testable and understandable. Broad interactions hid bugs and made reviews harder.
The takeaway: AI didn’t remove the need for engineering discipline. It made the cost of skipping it obvious much faster.
Closing Thoughts
The industry is converging on the same conclusion: agents do not earn trust by being clever. They earn it by being measurable. The next phase for me is about system proof, not model hype: compare coding agents under the same spec-first rules, expand Critterflow’s agent workflow and build the evaluation harness that keeps AI honest. I’ll be field-testing this on upcoming Philippines dive trips, and I’m opening a small alpha for underwater photographers who want early access.
Appendix A: Dan’s AI Engineering 12 Factors
1. Make intent explicit. Specs define responsibilities, failure modes and boundaries so teams and agents share the same target. That turns intent into something testable and reviewable. When intent is ambiguous, models optimize for style or speed instead of correctness. Clear specs also make disagreements visible early.
2. Design context deliberately. Treat context as a build artifact that is minimal, owned and versioned. This keeps reasoning stable and makes decisions reproducible. Uncontrolled context inflates cost and invites drift. Deliberate context also makes it easier to audit what influenced a decision.
3. Treat outputs as interfaces. AI outputs should be structured, validated and contractually enforced like APIs. Downstream systems should depend on shape and meaning, not prose. Prefer schemas, enums and explicit contracts over free-form text. This reduces downstream parsing errors and makes regressions detectable.
4. Isolate non-determinism. Keep probabilistic behavior inside deterministic workflows with caps, retries and guardrails. That contains variance so it is observable and debuggable. Make variability explicit with thresholds, sampling controls or deterministic fallbacks. The goal is not zero variance but controlled variance.
5. Version everything. Specs, prompts, context and models must be traceable. Versioning enables comparisons, rollbacks and safe iteration. Without versions, you cannot explain why behavior changed. Versioned artifacts also enable A/B comparisons and safe rollback paths.
6. Build observability. Log inputs, outputs, cost, latency and model versions for every run. Replayability is the only way to understand drift. Capture enough detail to reconstruct any answer end to end. Observability should be baseline, not a post-incident add-on.
7. Systematize evaluation. Use golden sets, regression gates and scorecards to turn quality into an engineering signal. Run them continuously so changes fail fast. Evaluation should map to product goals, not just model scores. Automate it so teams do not rely on heroics before release.
8. Design for replacement. Abstract providers behind adapters and keep tests tied to contracts. This makes swapping models or vendors routine, not risky. Avoid coupling prompts and logic to a single vendor’s quirks. If a provider changes terms or quality, migration should be a planned operation.
9. Allow failure. “No result” is acceptable when confidence is low and the alternative is a wrong answer. Provide fallbacks and escalation paths so users stay safe. Explicit failure modes prevent silent hallucinations. Users trust systems that say “I don’t know” when they should.
10. Keep responsibilities small. Narrow tasks are easier to validate and easier to improve. Small scopes also limit blast radius when something drifts. Smaller tasks allow clearer inputs, outputs and evaluation criteria. They also enable parallel development without cascading risk.
11. Keep humans accountable. AI proposes, humans decide and approve. Explicit ownership preserves trust and keeps responsibility clear. Decision points should be visible in logs and UI. Accountability also means documenting who approved changes and why.
12. Optimize for change. Assume models, prompts and tools will evolve. Use versioning, migrations and rollout gates so updates do not break trust. Design migrations as first-class work with tests and rollback plans. Trust grows when users see improvements arrive without surprises.
These factors synthesize recurring themes from established AI engineering leaders: Simon Willison on constraining agents after untrusted input and strict interfaces (prompt injection design patterns), Anthropic on long-running agent harnesses, structured artifacts and tests (effective harnesses), OpenAI on evaluation as a guardrail against regressions (OpenAI Evals) and structured tool interfaces (function calling cookbook). NVIDIA frames guardrails as programmable controls for safer systems (NeMo Guardrails), Hamel Husain emphasizes continuous evals and trace logging (Your AI Product Needs Evals) and Andrej Karpathy highlights reproducibility and validation checkpoints in model training (nanoGPT).
Appendix B: Infrastructure Stack
User Experience
| AWS service | What it’s used for |
|---|---|
| CloudFront | CDN for the Next.js frontend (SSR + static assets) and image optimization |
| Route 53 | DNS for frontend domains |
| ACM | TLS certificates for CloudFront distributions |
API and Orchestration
| AWS service | What it’s used for |
|---|---|
| API Gateway | Front door for the Fastify API under /v1 |
| Lambda | API handler + background workers (identify/export/lightroom/enhancements) |
| SQS | Async job queues + DLQs for identification/export/enhancement/lightroom workflows |
| EventBridge | Scheduled jobs (Lightroom batch dispatch, session archive) |
| CloudWatch Logs | Lambda log collection and tailing |
Integration
| AWS service | What it’s used for |
|---|---|
| Cognito | User authentication (User Pool, Hosted UI, admin user ops) |
| Bedrock | Vision model inference for identifications |
| KMS | Encrypt/decrypt integration tokens (Lightroom) |
Data and State Management
| AWS service | What it’s used for |
|---|---|
| S3 | Store uploads, exports, session archives, frontend assets |
| DynamoDB | Primary data store (single-table + session log + job tables) |
| SSM Parameter Store | Store frontend config + revalidate token values |
| IAM | Roles/policies granting access to AWS resources |
| CloudFormation | Stack deployment and output discovery |
Appendix C: Frameworks and Tools
| Area | Framework / Tool | Version | Notes |
|---|---|---|---|
| Runtime | Node.js | 22 | Backend/infra runtime. |
| Backend | TypeScript | 5.4.0 | Backend compiler. |
| Backend | Fastify | 4.29.1 | API framework. |
| Backend | AWS SDK v3 | 3.94x | Core AWS clients used in the API/worker. |
| Backend | DynamoDB Toolbox | 0.9.5 | Entity/data modeling. |
| Backend | Zod | 3.25.76 | Request validation. |
| Frontend | Next.js | 16.0.10 | App Router frontend. |
| Frontend | React | 19.2.1 | UI framework. |
| Frontend | TypeScript | 5.x | Frontend compiler. |
| Frontend | Tailwind CSS | 4.x | Styling system. |
| Frontend | MUI | 7.3.6 | UI component library. |
| Frontend | OpenNext | 3.1.3 | Deploy/SSR packaging. |
| Infra | AWS CDK | 2.152.0 | Infrastructure as code. |


