Agentic Pilot-to-Production
From whiteboard idea to a real agent in production behind your IAM, with eval coverage, observability, and structured governance documentation. 90 days, fixed scope.
The Problem
74% of GCC enterprises plan to deploy agents in 2026. Only 21% have the governance maturity to run them. The pilot-to-production gap kills almost every internal agent project: the demo works in a notebook but never gets behind SSO, never gets evals, never ships to a real user. The blocker isn't the model. It's everything around it.
The Outcome
One agent in production behind your identity provider, with a working eval harness, real observability, and a governance package structured around ISO 42001 and EU AI Act expectations. Your team learns the operating model along the way, so the second agent costs half as much.
In Scope
Discovery
Weeks 1–2
- Use-case selection and impact sizing
- Data inventory, sensitivity classification, and access mapping
- Eval harness design — golden set, synthetic edge cases, red-team prompts (see the sketch after this list)
- Architecture decision: model choice, RAG vs tool-use, deployment surface
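To make the eval-harness deliverable concrete, here is a minimal TypeScript sketch of the shape we mean. The EvalCase type, the stubbed agent, and the 90% pass threshold are illustrative assumptions for this sketch, not the contracted design.

```typescript
// Illustrative eval harness. EvalCase, the agent stub, and the
// pass threshold are assumptions, not the shipped design.
type EvalCase = {
  id: string;
  input: string;
  expected: string; // reference answer from the golden set
  kind: "golden" | "synthetic" | "red-team";
};

// Exact-match scoring is deliberately naive; real graders are
// usually rubric- or model-based, and differ per case kind.
async function runEvals(
  cases: EvalCase[],
  agent: (input: string) => Promise<string>,
  passThreshold = 0.9,
): Promise<boolean> {
  let passed = 0;
  for (const c of cases) {
    const output = await agent(c.input);
    if (output.trim() === c.expected.trim()) passed += 1;
    else console.warn(`FAIL [${c.kind}] ${c.id}`);
  }
  const rate = passed / cases.length;
  console.log(`pass rate: ${(rate * 100).toFixed(1)}%`);
  return rate >= passThreshold;
}

// Usage with a stubbed agent; in practice this calls the
// deployed endpoint behind your IAM.
runEvals(
  [{ id: "g-001", input: "ping", expected: "pong", kind: "golden" }],
  async () => "pong",
);
```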
Build
Weeks 3–8
- Agent implementation on Claude / GPT / Gemini via Vercel AI Gateway (see the call sketch after this list)
- MCP integration layer with audit logging and SSO-bound permissions
- RAG corpus indexing if applicable, with retrieval evals
- Tool-use surface area — internal APIs, knowledge bases, action endpoints
- Observability stack: traces, evals, cost telemetry, anomaly alerts
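A sketch of what the gateway call and the cost-telemetry hook look like together, assuming the Vercel AI SDK's generateText with AI Gateway model strings; the function name and the logging target are placeholders for this sketch.

```typescript
import { generateText } from "ai";

// Assumes the Vercel AI SDK with AI Gateway as the provider, so the
// model is addressed by a plain "provider/model" string.
export async function answer(prompt: string): Promise<string> {
  const started = Date.now();
  const { text, usage } = await generateText({
    model: "anthropic/claude-sonnet-4.6", // routed through the gateway
    prompt,
  });
  // One trace record per call; in production this feeds the
  // observability stack rather than stdout.
  console.log(
    JSON.stringify({
      latencyMs: Date.now() - started,
      inputTokens: usage.inputTokens,
      outputTokens: usage.outputTokens,
    }),
  );
  return text;
}
```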
Hardening
Weeks 9–10
- Eval pass against golden set and red-team set
- Security review, penetration test on agent endpoints
- Cost ceiling, rate limits, runaway protection (sketched after this list)
- Human-in-the-loop escalation paths for ambiguous cases
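Runaway protection is mostly plumbing. A compressed sketch of the idea, with made-up limits — the real ceilings come from the SoW:

```typescript
// Illustrative guard: a daily cost ceiling plus a per-minute call
// budget. The numbers and the estimated-cost input are assumptions.
const DAILY_COST_CEILING_USD = 50;
const MAX_CALLS_PER_MINUTE = 30;

let spentTodayUsd = 0; // reset by a daily job in the real stack
let callsThisMinute = 0;
setInterval(() => { callsThisMinute = 0; }, 60_000);

export function guardCall(estimatedCostUsd: number): void {
  if (spentTodayUsd + estimatedCostUsd > DAILY_COST_CEILING_USD) {
    // Ceiling hit: fail closed and escalate to a human.
    throw new Error("cost ceiling reached; agent paused for review");
  }
  if (++callsThisMinute > MAX_CALLS_PER_MINUTE) {
    throw new Error("rate limit exceeded; request rejected");
  }
  spentTodayUsd += estimatedCostUsd;
}
```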
Production
Weeks 11–12
- Production deploy behind your IAM and audit logging
- Governance documentation: AI inventory entry, risk classification, monitoring plan, retirement policy
- Runbook and incident response procedures
- Knowledge transfer to your engineering team
How We Engage
01. Scoping call (60 minutes) — we review your use case, map dependencies, and write back with a yes/no on fit within 48 hours.
02. Fixed Statement of Work — written scope, milestones, named team, fixed price. No hourly billing surprises.
03. Kickoff in ≤14 days — we start the discovery sprint within two weeks of contract signing.
Why Codenovai
We're an operator-first agency — we build AI products of our own (Hisabi.ai) using the same patterns we deploy for clients. The architecture and eval-harness recommendations we ship are ones we'd run ourselves.
FAQ
Is the model choice locked in, or can we swap providers later?
We default to routing through Vercel AI Gateway, which means the agent calls 'anthropic/claude-sonnet-4.6' as a string, and you can swap to OpenAI, Google, or any future provider with a config change. Your business logic, prompts, and evals stay portable. We avoid SDK-locking patterns that would tie you to one foundation model.
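As a concrete illustration (the env var name and the alternative model id are hypothetical), the swap is a one-line config change rather than an SDK migration:

```typescript
// Model id lives in config, not in an SDK import.
const MODEL = process.env.AGENT_MODEL ?? "anthropic/claude-sonnet-4.6";
// Setting AGENT_MODEL to, say, "openai/gpt-5" re-routes inference
// through the gateway; prompts, tools, and evals are untouched.
```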
Do you keep working on the agent after the 90 days?
Most clients move to a Fractional AI Team retainer after the pilot ships — typically a Pod tier (3 specialists) for 6–9 months while the second and third agents go through the same playbook. The retainer is separate from the pilot SoW, so you're not locked in. About 20% of clients take ownership in-house immediately after handover; we support either path.
What if the use case doesn't pan out during discovery?
If discovery surfaces that the use case won't work — data quality is too poor, the LLM can't hit the accuracy bar, the ROI math collapses — we kill the project at the end of week 2 and refund 80% of the contract value. We've never had to invoke it because we screen hard at the scoping call, but the option is in the SoW.
Where is the agent hosted and where does data go?
Default deployment is on AWS me-south-1 (Bahrain) for UAE/GCC data residency, with model inference routed through Vercel AI Gateway. For regulated industries we offer on-premise deployment via the Sovereign AI offer. Data leaves your environment only when the agent calls a foundation model API; we can scope an air-gapped variant if that isn't acceptable.
How do you measure whether it's working in production?
Three numbers, reviewed weekly: task success rate against the golden set, average cost per resolved task, and human escalation rate. We instrument those from day one of production. If any of them drifts beyond the SoW thresholds, we pause and root-cause before continuing the rollout.
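In sketch form — the field names and threshold values here are illustrative; the real ones live in the SoW:

```typescript
// Weekly health check against SoW thresholds. Names and numbers
// are assumptions for this sketch.
type WeeklyMetrics = {
  taskSuccessRate: number;        // vs. golden set, 0-1
  costPerResolvedTaskUsd: number; // average, USD
  humanEscalationRate: number;    // 0-1
};

const THRESHOLDS: WeeklyMetrics = {
  taskSuccessRate: 0.85,        // must stay at or above
  costPerResolvedTaskUsd: 0.4,  // must stay at or below
  humanEscalationRate: 0.1,     // must stay at or below
};

export function rolloutHealthy(m: WeeklyMetrics): boolean {
  return (
    m.taskSuccessRate >= THRESHOLDS.taskSuccessRate &&
    m.costPerResolvedTaskUsd <= THRESHOLDS.costPerResolvedTaskUsd &&
    m.humanEscalationRate <= THRESHOLDS.humanEscalationRate
  );
}
```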