Back to work
Production2025–2026Founder & Full-Stack Engineer80% complete

ShinobiData

Bloomberg-grade equity-research platform — 10k+ tickers, 200+ screener fields, sub-50ms filters, NL queries in 5 languages.

Summary

ShinobiData is an equity-research platform unifying SEC EDGAR, Alpha Vantage, and analyst data into a single Bloomberg-style surface — search any of 10,000+ listed instruments, drill into 11 financial subsections (chart, P&L, balance sheet, cash flow, ratios, valuation, peers, estimates), and run a 200+-field screener across the full universe in sub-50ms. A custom field-registry codegen makes a single TypeScript schema the source of truth for GraphQL types, Prisma selects, screener columns, fieldmaps, Redis preload selectors, AI prompt context, and the compact index-mapped format — adding a new metric is one line of schema. Backed by a 5-tier GitHub Actions ETL DAG running on a self-hosted EC2 runner, a 200+-column denormalized screening snapshot, three layers of caching (Redis + Cloudflare + ISR), Firebase auth with role-based tiers, an NL → DSL screener with a 3-provider AI fallback chain, and a model-context-protocol server so Claude and OpenAI Apps can call the same data plane.

Target user

Retail investors, fundamental-analysis hobbyists, and finance-curious developers who want Koyfin/Bloomberg-style company deep-dives and a real high-cardinality screener without paying $499/month — plus AI agents that can pull the same data via MCP.

§ 01Stack
01Primary
Next.jsReactTypeScriptPostgreSQLPrisma
02Infrastructure
DockerAWS ECS Fargate + Fargate SpotAWS ECRAWS CodeBuild + S3 build cacheAWS RDS PostgresAWS ElastiCache RedisEC2 self-hosted GitHub Actions runnerGitHub Actions 5-tier ETL DAGCloudflare edge cache + programmatic purge
03Integrations
Alpha Vantage APISEC EDGAR XBRL company-factsAnthropic Claude SonnetOpenAI GPT-4oDeepSeekFirebase Auth + firebase-adminZoho IMAPBillionMailModel Context Protocol
04UI / Frontend
Tailwind CSSshadcn/uiRadix UIRechartsFuse.jsMotionlucide-reactnext-intlxlsx-js-style
§ 02Key features
  1. 01

    Shipped a 200+-column denormalized `CompanyScreeningSnapshot` materialized from 5+ source tables and composite-indexed across market cap / P/E / growth / upside / consensus rating, achieving sub-50ms screener filters across the full 10k+ ticker universe via an in-memory Redis snapshot.

  2. 02

    Built a field-registry codegen that makes a single TypeScript schema the source of truth for 8 downstream artifacts — GraphQL types, TS interfaces, Prisma selects, screener columns, fieldmaps, Redis preload selectors, AI prompt context, and the compact index-mapped format — so adding a new metric is one schema line with zero drift.

  3. 03

    Designed a 5-tier GitHub Actions ETL DAG (instrument listings → 10 Alpha Vantage pipelines → 9 derived precompute pipelines → snapshot rebuild → Redis preload) running on a self-hosted EC2 runner — free compute, no 6-hour timeout, direct VPC access to RDS, systemd-managed with auto-restart.

  4. 04

    Implemented an NL → screener-DSL pipeline that builds prompt context from the field registry (so the model can only reference real fields), then tries DeepSeek → Claude Sonnet 4 → GPT-4o, returning a DSL string the existing screener already knows how to execute.

  5. 05

    Built a 3-layer caching stack — Redis preload with 7-day TTL, Cloudflare edge cache with programmatic purge, and Next.js ISR with top-20 prebuilt company pages — with graceful degradation: Redis miss falls through to Prisma without a user-visible failure.

  6. 06

    Shipped a multi-tier MCP server (`/api/mcp`) that exposes the same screener, search, company, portfolio, and market tools to Claude Desktop, Claude Code, and OpenAI Apps SDK via a single Apollo data layer — with OAuth, per-token rate limiting (60/min, 5000/day), an append-only audit log, and abuse-detection alerts.

  7. 07

    Hardened the screener and search read path against N+1 by reading the denormalized snapshot for screener queries and a Redis-cached top-N for search ranking (Fuse.js fuzzy + Cloudflare edge), with per-resolver timing harnesses and a query-performance benchmark script as part of the build.

  8. 08

    Generalized the platform to 5 locales (English, Japanese, Korean, Vietnamese, Gujarati) via `next-intl` with locale-prefixed App Router pages, so every company page, screener config, and AI prompt is fully translated.

§ 03Hardest problems
  1. The naive shape — one row per company joining income / balance / cash flow / ratios / analyst tables on every filter — was a non-starter at 10k tickers. Solved by precomputing a single denormalized `CompanyScreeningSnapshot` row per `(company_id, period_type, fiscal_year, fiscal_quarter)` via a Tier-3 GitHub Actions workflow, composite-indexing the columns the screener actually filters on, then loading the entire snapshot into Redis and filtering in-memory at request time. Cold path: Prisma. Hot path: ~5–20ms across the full universe. The compact-screener pipeline then reduces the 30MB snapshot to a 500KB index-mapped array for client delivery — market cap collapses to millions, field names disappear in favor of position-based indexing.

  2. Without codegen, adding `peg_ratio` would require coordinated edits across the Prisma schema, GraphQL types, TS interfaces, screener column config, screener allowed-fields whitelist, Redis preload selector, AI prompt context, and the compact-screener index map — and one missed edit means the AI silently invents a field the screener rejects, or the snapshot ships a column nothing reads. Solved by writing a custom `field-codegen.ts` that takes a single TypeScript field-registry schema and regenerates all 8 artifacts in one command. Adding a new metric is one line of schema, run `npm run fields:codegen`, and every layer stays in sync. The AI prompt context is built from the same registry, so the model is structurally incapable of hallucinating a field that doesn't exist.

  3. ECS Fargate would burn cash on the long-running precompute, and the free GitHub Actions runner caps out at 6 hours with no VPC access to RDS. Solved by standing up a self-hosted GitHub Actions runner on EC2 (`scripts/ec2/setup-github-runner.sh`) — systemd-managed with auto-restart on OOM, direct VPC access to RDS for low-latency Prisma reads, no compute bill for the 20 scheduled workflows, and no timeout. The heavy snapshot rebuild itself runs on Fargate Spot — CPU-heavy, can tolerate preemption, and is ~70% cheaper than on-demand. The app deploy stays on regular Fargate so user-facing requests aren't subject to spot interruption.

§ 04What I learned
  • L01
    Codegen pays for itself the first time you avoid a multi-file refactor.

    The field-registry started as a YAGNI smell — 'why generate when you can just write the eight things?' — but the eight things drifted within two weeks. The codegen now means a junior engineer (or me at 2am) can ship a new screener metric without understanding the data plane below the field schema. The Redis preload selector and AI prompt context being downstream of the same source means there's no class of bug where the AI knows about a field that the snapshot doesn't have.

  • L02
    Three caching layers with graceful degradation beats one perfect layer.

    Redis preload gives sub-50ms hot path; Cloudflare gives geo-distributed reads with programmatic purge for fresh data; ISR gives prebuilt company pages that survive a Redis outage. The non-obvious win is that any single layer can fail without the user noticing — a Redis blip falls through to Prisma, a Cloudflare hiccup falls through to the origin, an ISR cache miss revalidates on-demand. Building all three was 2x the work; the resulting availability surface is 10x more forgiving.

  • L03
    Self-hosted runners are the right answer for long-running, VPC-bound ETL — even though they sound like an ops burden.

    The setup script is ~50 lines, systemd handles restart-on-failure, and the runner has been up for months without intervention. The latency win from direct RDS access (vs. tunneling from a hosted GitHub runner) cut several precompute pipelines from ~40min to <15min, which compounds across the daily DAG.

§ 05By the numbers
source files
649
lines of code
~97700
prisma models
36
prisma migrations
34
github actions workflows
20
alpha vantage pipelines
10
precompute pipelines
9
api routes
13
tickers covered
~10000
screener fields
~200
locales
5
ai providers in chain
3
commits
586
development span months
6