Kinori LabsRequest a slot

Work · Case study

How we shipped a production AI feature in eight weeks.

An anonymised look at a recent AI implementation for a Q3 founder partner. Eval first, RAG layer, hand tuned prompts, latency budget under one second p95, cost ceiling that held through launch. Patterns you can lift into your own product.

The partner and the problem

The partner is a B2B SaaS team with a few thousand active accounts and a product that lives inside a workflow tool their customers open every day. The problem they brought us was familiar: they had been building an AI feature for six months, the demo looked great in a sandbox, and every attempt to ship it into production hit a quality wall. Specific complaints from their internal pilot users: answers that drifted off topic, citations that pointed to the wrong document, and an inference bill that was climbing past four figures per week with under fifty active testers.

They had two engineers on the feature, neither with prior production LLM experience. Their CTO was running the rest of the platform and could not own the AI work on top of that. We came in with a fixed eight week scope to either ship the feature with measurable quality, or document why it could not be shipped in that window.

The constraints

Four constraints, written into the scope on day one. Latency: sub one second p95 to first streaming token, sub three seconds p95 to full response. Cost: under twelve cents per query at the median, under forty cents at p99, hard ceiling. Quality: a measurable eval bar with at least 80 percent task completion against a held out test set. Data: every retrieval had to be auditable, with a citation that pointed to a real source document and a confidence score.

We added a fifth constraint that was not in their original spec: no model lock in. We would not let the feature depend on a single provider, because the model market in 2026 moves too fast to bet the product on one vendor.

The approach: eval first, then everything else

Week one was spent building an eval set. We pulled two hundred real queries from the existing pilot logs, hand labelled the ideal answer for each, and tagged them by difficulty. Eighty percent of the set became training observation, twenty percent was held out for final scoring. We wrote an LLM as judge harness with a small claude-3.5-sonnet model evaluating answer relevance and citation accuracy on a four point scale, calibrated against ten human labelled examples to keep the judge honest.

With the eval set in place we ran the existing prototype through it. Baseline score: 41 percent task completion. The team had spent six months pushing prompts around without ever measuring whether the changes helped. The first hour of running the eval changed how they talked about the feature for the rest of the engagement.

The architecture we shipped

Retrieval layer: documents chunked at 512 tokens with a 64 token overlap, embedded through text-embedding-3-large, stored in pgvector on Postgres alongside the application data so we did not introduce a second database. Hybrid search combining vector similarity with a Postgres tsvector full text match, reranked through a small cross encoder model running on a single GPU node. Top eight chunks passed to the generation layer, with metadata tags preserved through the pipeline so citations could be reconstructed.

Generation layer: gpt-4o-mini for the first pass on every query, with the prompt structured to emit a confidence score in a JSON envelope. If confidence dropped below a tuned threshold of 0.72, the query was escalated to gpt-4o with a richer prompt and the same retrieved context. The escalation rate settled at 14 percent of traffic after the first week of tuning. The result: a 73 percent reduction in inference cost compared to a single model gpt-4o baseline, with no measurable quality drop on the held out eval set.

Caching: deterministic prompts cached at the edge with a 24 hour TTL on the embedding lookups and a 5 minute TTL on full responses, keyed by a hash of the user query plus their account scope. Cache hit rate after two weeks: 31 percent. Observability: every query logged with prompt, retrieved chunks, model used, latency breakdown, and cost. We can pull any production query and replay it locally inside thirty seconds.

What shipped

By week eight the feature was live for all paying customers, gated behind a single feature flag. Final eval score: 84 percent task completion against the held out set, above the 80 percent bar. Production latency: 740ms p95 to first token, 2.4 seconds p95 to full response. Median query cost: 7.8 cents. P99 query cost: 31 cents. Inference bill in the first month post launch: under three thousand dollars with roughly four hundred active users, a number their CFO actually circulated approvingly.

The partner team kept ownership. We pair programmed every component, wrote a runbook for the on call rotation, and stayed on for two weeks of post launch monitoring. The two internal engineers who had been stuck on the feature for six months are now leading its next iteration.

What we learned

Eval first is not a slogan. The single highest leverage hour of the engagement was the hour we ran the existing prototype through a real eval set. Everything downstream became measurable. Cascading models work, but only with a calibrated confidence signal. Without a confidence score you cannot route, and without routing you pay full price for the easy queries. Latency is mostly a retrieval problem, not a generation problem: shaving 300ms off retrieval moved p95 more than swapping models did. Cost is a feature, not a footnote. A feature that costs forty cents per query will get killed in the next planning cycle no matter how good it is.

If you want this kind of work for your product, the studio runs a parallel Fractional AI Officer engagement. It pairs cleanly with a SaaS product development slot when AI is core to the build, or with a Fractional CTO engagement when AI is one of many open questions.

FAQ

Common questions

Why anonymise the partner?+

We do not publish partner names without explicit agreement. The technical patterns generalise, and the specific industry adds little value to a reader trying to apply them.

Could you have shipped faster?+

Not without skipping the eval set, and skipping the eval set means the feature regresses inside a quarter. Eight weeks was the floor for a production grade, observable, cost capped feature.

Is the 73 percent cost reduction real?+

Yes. Baseline was a single model call to gpt-4o per query. Final architecture used gpt-4o-mini for the first pass with escalation to gpt-4o only when the mini result fell below a confidence threshold. The cost delta was measured against a held out traffic sample.

What was the latency budget?+

Sub one second p95 for the streaming first token, sub three seconds p95 for full response. We hit both with caching, parallel retrieval, and a careful choice of context window.

Can you replicate this for our product?+

Yes, with caveats. The pattern fits most retrieval and copilot workloads. We will not promise the same numbers without seeing your data, your latency requirements, and your existing infrastructure.

Related

Keep reading

Next step

Want this for your product?

Tell us what AI feature you are trying to ship and where it is currently stuck. We reply from a real inbox within two business days.