
Building Stockxie — when AI earns its place in the loop

Hard-won notes on prompt design, latency budgets, and not letting the model lie to your users.

Sarthak Giri · published 2026-01-19 · 10 min read

Stockxie is an AI-powered stock-battle and discovery app. Two tickers go in, a verdict comes out, and the model explains why. It would be easy to wrap a GPT call and ship — but financial UIs are unforgiving about hallucination, and recruiters are unforgiving about latency.

Latency is the product

The hardest constraint isn't model quality — it's perceived latency. We budget:

  • < 200ms before any pixel changes on submit (skeleton appears)
  • < 800ms before the first verdict token streams
  • < 4s for the full reasoning paragraph

Anything slower and the user closes the tab. To hit this:

  • Stream from the model the instant we have any output.
  • Compute the deterministic parts (price diff, sector match, P/E gap) on the server in parallel with the LLM call.
  • Render the deterministic parts first, then attach the AI rationale to them as it arrives.
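The three steps above can be sketched with asyncio. This is a minimal illustration, not Stockxie's actual server code: `fake_llm_stream`, `deterministic_facts`, `battle`, and the dictionary field names are all hypothetical, and a real streaming model client would replace the stand-in generator.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Tuple


def deterministic_facts(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Plain arithmetic on data we already hold; no model involved."""
    return {
        "price_diff": round(a["price"] - b["price"], 2),
        "sector_match": a["sector"] == b["sector"],
        "pe_gap": round(a["pe"] - b["pe"], 2),
    }


async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model client."""
    for token in ("MSFT ", "slightly ", "favored."):
        await asyncio.sleep(0.01)  # simulated network delay per token
        yield token


async def battle(
    a: Dict[str, Any], b: Dict[str, Any], llm_stream=fake_llm_stream
) -> AsyncIterator[Tuple[str, Any]]:
    """Yield ("facts", dict) first, then ("token", str) events as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump() -> None:
        # Forward model tokens into the queue; always signal completion.
        try:
            async for token in llm_stream(f"Compare {a['symbol']} and {b['symbol']}"):
                await queue.put(("token", token))
        finally:
            await queue.put(("done", None))

    # Start the model call first so its latency overlaps with the local work.
    model_task = asyncio.create_task(pump())

    # Deterministic parts run in a worker thread while the model is busy.
    facts = await asyncio.to_thread(deterministic_facts, a, b)
    yield ("facts", facts)

    # Then relay tokens in arrival order.
    while True:
        kind, payload = await queue.get()
        if kind == "done":
            break
        yield (kind, payload)

    await model_task  # re-raises any error from the model stream
```

The client can paint the comparison card as soon as the `facts` event lands and append each `token` event to the rationale as it arrives, so the model's time-to-first-token never blocks the first meaningful render.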

Prompts are code: version them

I treat the prompt like any other source file:

  • Committed in the repo.
  • Carries a version number, and every model output is tagged with the version that produced it.
  • Evaluated against a golden set of 20 ticker pairs before any change ships.

When the verdict on AAPL vs MSFT shifts from "MSFT slightly favored" to "AAPL strongly favored" between two prompt revisions and nothing else changed, that's a regression. The golden set catches it before users do.
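A golden-set check along these lines can run before a prompt change merges. It is a sketch under assumptions: `PROMPT_VERSION`, `GoldenCase`, `find_regressions`, and the `judge` callable (whatever wraps the prompt plus the model call) are hypothetical names, and the verdict is reduced to a (winner, strength) pair for comparison.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

PROMPT_VERSION = "v7"  # hypothetical; bumped on every prompt edit

Verdict = Tuple[str, str]  # (winner, strength), e.g. ("MSFT", "slightly")


@dataclass(frozen=True)
class GoldenCase:
    ticker_a: str
    ticker_b: str
    expected: Verdict


def find_regressions(
    cases: Iterable[GoldenCase], judge: Callable[[str, str], Verdict]
) -> List[str]:
    """Run every golden pair through `judge` and describe each mismatch."""
    failures = []
    for case in cases:
        got = judge(case.ticker_a, case.ticker_b)
        if got != case.expected:
            failures.append(
                f"[prompt {PROMPT_VERSION}] {case.ticker_a} vs {case.ticker_b}: "
                f"expected {case.expected}, got {got}"
            )
    return failures
```

Failing the build whenever `find_regressions` returns a non-empty list turns a drifting verdict into a blocked merge instead of a user report.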

Don't let the model hallucinate prices

The model never sees raw market data. It sees only:

  • A pre-computed delta (price_change_24h).
  • A pre-computed sentiment score.
  • A short fact sheet generated server-side.

The model's job is to narrate, not retrieve. This is the single biggest trust win — when the model writes "AAPL is up 2.4% in the last 24 hours," that number came from our database, not the model's guess.
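One way to picture the narrate-only contract is below. Everything except the `price_change_24h` field name is an assumption: the `TickerFacts` shape, the sentiment range, the prompt wording, and the `unsupported_percentages` guard, which is an extra check I am adding for illustration rather than something the pipeline above is stated to do.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TickerFacts:
    symbol: str
    price_change_24h: float  # percent, read from our database
    sentiment: float  # assumed to lie in [-1, 1]


def fact_line(f: TickerFacts) -> str:
    return f"{f.symbol}: price_change_24h={f.price_change_24h:+.1f}%, sentiment={f.sentiment:+.2f}"


def build_prompt(a: TickerFacts, b: TickerFacts) -> str:
    """The model only ever sees this pre-computed fact sheet."""
    return "\n".join(
        [
            "Compare the two stocks using ONLY the facts below.",
            "Do not state any number that is not listed.",
            "FACTS:",
            fact_line(a),
            fact_line(b),
        ]
    )


def unsupported_percentages(text: str, facts: Sequence[TickerFacts]) -> List[str]:
    """Return percentages in the model's text that are not in the fact sheet."""
    allowed = {f"{abs(f.price_change_24h):.1f}" for f in facts}
    quoted = re.findall(r"(\d+(?:\.\d+)?)\s*%", text)
    return [q for q in quoted if q not in allowed]
```

If `unsupported_percentages` returns anything, the response can be rejected or regenerated before it reaches the user. The check is deliberately strict: a correct figure formatted differently (say "2.40%") would also be flagged.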

Failure modes

Some I've learned the hard way:

  • Long-tail tickers with thin data make the model speculate. Detect this server-side and fail loudly instead of generating fiction.
  • Rate-limit errors during streaming need to be surfaced mid-message, not silently dropped — users will assume the bad verdict was real.
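  • Injection through the ticker field is a real concern, whether it is SQL-style (AAPL'); DROP TABLE--) or text aimed at the model. Validate the symbol against a strict pattern or a known-ticker list before interpolating it into a query or a prompt.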
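The three guards above might look roughly like this. It is a sketch with assumed specifics: the ticker pattern (one to five letters plus an optional class suffix such as BRK.B), the `MIN_DATA_POINTS` threshold, the marker text, and all function names are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumed symbol shape; a lookup against a known-ticker list would be stricter.
TICKER_RE = re.compile(r"[A-Z]{1,5}(?:\.[A-Z]{1,2})?")
MIN_DATA_POINTS = 30  # hypothetical threshold for "enough data to narrate"
INTERRUPTED_MARKER = "\n[Verdict interrupted: the response was cut off. Please retry.]"


class ThinDataError(Exception):
    """Raised instead of letting the model speculate about a thin ticker."""


def clean_ticker(raw: str) -> str:
    """Normalise user input and reject anything that is not ticker-shaped."""
    symbol = raw.strip().upper()
    if not TICKER_RE.fullmatch(symbol):
        raise ValueError(f"not a valid ticker: {raw!r}")
    return symbol


def require_enough_data(symbol: str, data_points: int) -> None:
    """Fail loudly before any model call when the data is too sparse."""
    if data_points < MIN_DATA_POINTS:
        raise ThinDataError(f"{symbol}: only {data_points} data points")


def stream_with_error_marker(tokens: Iterable[str]) -> Iterator[str]:
    """Relay tokens; if the stream dies mid-message, say so in the message."""
    try:
        yield from tokens
    except Exception:  # broad on purpose in this sketch, e.g. a rate-limit error
        yield INTERRUPTED_MARKER
```

Running `clean_ticker` and `require_enough_data` before the prompt is built means a bad request never reaches the model, and the marker makes a truncated verdict visibly incomplete instead of quietly wrong.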

What I'd still change

  • A small fine-tuned model for the boring deterministic narration would beat GPT on latency and cost. Saving for v2.
  • Better eval coverage — I have 20 ticker pairs; I want 200.
  • Caching by (ticker_a, ticker_b, day) would cut 60% of model calls.
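The caching idea in the last bullet could start as small as this. It is an in-memory sketch with hypothetical names (`VerdictCache`, `get_or_compute`); a deployed version would more likely sit in a shared store with expiry. It keeps the pair order as given, so "A vs B" and "B vs A" are cached separately.

```python
from datetime import date
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, date]


class VerdictCache:
    """Caches one verdict per (ticker_a, ticker_b, day)."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, str] = {}

    def get_or_compute(
        self, ticker_a: str, ticker_b: str, day: date, compute: Callable[[str, str], str]
    ) -> str:
        key = (ticker_a, ticker_b, day)
        if key not in self._store:
            # Only a cache miss pays for a model call.
            self._store[key] = compute(ticker_a, ticker_b)
        return self._store[key]
```

Because the day is part of the key, yesterday's verdict is never served against today's prices.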

Shipping AI features feels easy until the user hits edge case #3. The product work is in those edges.