ShinobiData
Bloomberg-grade equity-research platform — 10k+ tickers, 200+ screener fields, sub-50ms filters, NL queries in 5 languages.
Summary
ShinobiData is an equity-research platform unifying SEC EDGAR, Alpha Vantage, and analyst data into a single Bloomberg-style surface — search any of 10,000+ listed instruments, drill into 11 financial subsections (chart, P&L, balance sheet, cash flow, ratios, valuation, peers, estimates), and run a 200+-field screener across the full universe in sub-50ms. A custom field-registry codegen makes a single TypeScript schema the source of truth for GraphQL types, Prisma selects, screener columns, fieldmaps, Redis preload selectors, AI prompt context, and the compact index-mapped format — adding a new metric is one line of schema. Backed by a 5-tier GitHub Actions ETL DAG running on a self-hosted EC2 runner, a 200+-column denormalized screening snapshot, three layers of caching (Redis + Cloudflare + ISR), Firebase auth with role-based tiers, an NL → DSL screener with a 3-provider AI fallback chain, and a model-context-protocol server so Claude and OpenAI Apps can call the same data plane.
Target user
Retail investors, fundamental-analysis hobbyists, and finance-curious developers who want Koyfin/Bloomberg-style company deep-dives and a real high-cardinality screener without paying $499/month — plus AI agents that can pull the same data via MCP.
- 01
Shipped a 200+-column denormalized `CompanyScreeningSnapshot` materialized from 5+ source tables and composite-indexed across market cap / P/E / growth / upside / consensus rating, achieving sub-50ms screener filters across the full 10k+ ticker universe via an in-memory Redis snapshot.
- 02
Built a field-registry codegen that makes a single TypeScript schema the source of truth for 8 downstream artifacts — GraphQL types, TS interfaces, Prisma selects, screener columns, fieldmaps, Redis preload selectors, AI prompt context, and the compact index-mapped format — so adding a new metric is one schema line with zero drift.
- 03
Designed a 5-tier GitHub Actions ETL DAG (instrument listings → 10 Alpha Vantage pipelines → 9 derived precompute pipelines → snapshot rebuild → Redis preload) running on a self-hosted EC2 runner — free compute, no 6-hour timeout, direct VPC access to RDS, systemd-managed with auto-restart.
- 04
Implemented an NL → screener-DSL pipeline that builds prompt context from the field registry (so the model can only reference real fields), then tries DeepSeek → Claude Sonnet 4 → GPT-4o, returning a DSL string the existing screener already knows how to execute.
- 05
Built a 3-layer caching stack — Redis preload with 7-day TTL, Cloudflare edge cache with programmatic purge, and Next.js ISR with top-20 prebuilt company pages — with graceful degradation: Redis miss falls through to Prisma without a user-visible failure.
- 06
Shipped a multi-tier MCP server (`/api/mcp`) that exposes the same screener, search, company, portfolio, and market tools to Claude Desktop, Claude Code, and OpenAI Apps SDK via a single Apollo data layer — with OAuth, per-token rate limiting (60/min, 5000/day), an append-only audit log, and abuse-detection alerts.
- 07
Hardened the screener and search read path against N+1 by reading the denormalized snapshot for screener queries and a Redis-cached top-N for search ranking (Fuse.js fuzzy + Cloudflare edge), with per-resolver timing harnesses and a query-performance benchmark script as part of the build.
- 08
Generalized the platform to 5 locales (English, Japanese, Korean, Vietnamese, Gujarati) via `next-intl` with locale-prefixed App Router pages, so every company page, screener config, and AI prompt is fully translated.
The naive shape — one row per company joining income / balance / cash flow / ratios / analyst tables on every filter — was a non-starter at 10k tickers. Solved by precomputing a single denormalized `CompanyScreeningSnapshot` row per `(company_id, period_type, fiscal_year, fiscal_quarter)` via a Tier-3 GitHub Actions workflow, composite-indexing the columns the screener actually filters on, then loading the entire snapshot into Redis and filtering in-memory at request time. Cold path: Prisma. Hot path: ~5–20ms across the full universe. The compact-screener pipeline then reduces the 30MB snapshot to a 500KB index-mapped array for client delivery — market cap collapses to millions, field names disappear in favor of position-based indexing.
Without codegen, adding `peg_ratio` would require coordinated edits across the Prisma schema, GraphQL types, TS interfaces, screener column config, screener allowed-fields whitelist, Redis preload selector, AI prompt context, and the compact-screener index map — and one missed edit means the AI silently invents a field the screener rejects, or the snapshot ships a column nothing reads. Solved by writing a custom `field-codegen.ts` that takes a single TypeScript field-registry schema and regenerates all 8 artifacts in one command. Adding a new metric is one line of schema, run `npm run fields:codegen`, and every layer stays in sync. The AI prompt context is built from the same registry, so the model is structurally incapable of hallucinating a field that doesn't exist.
ECS Fargate would burn cash on the long-running precompute, and the free GitHub Actions runner caps out at 6 hours with no VPC access to RDS. Solved by standing up a self-hosted GitHub Actions runner on EC2 (`scripts/ec2/setup-github-runner.sh`) — systemd-managed with auto-restart on OOM, direct VPC access to RDS for low-latency Prisma reads, no compute bill for the 20 scheduled workflows, and no timeout. The heavy snapshot rebuild itself runs on Fargate Spot — CPU-heavy, can tolerate preemption, and is ~70% cheaper than on-demand. The app deploy stays on regular Fargate so user-facing requests aren't subject to spot interruption.
- L01
Codegen pays for itself the first time you avoid a multi-file refactor.
The field-registry started as a YAGNI smell — 'why generate when you can just write the eight things?' — but the eight things drifted within two weeks. The codegen now means a junior engineer (or me at 2am) can ship a new screener metric without understanding the data plane below the field schema. The Redis preload selector and AI prompt context being downstream of the same source means there's no class of bug where the AI knows about a field that the snapshot doesn't have.
- L02
Three caching layers with graceful degradation beats one perfect layer.
Redis preload gives sub-50ms hot path; Cloudflare gives geo-distributed reads with programmatic purge for fresh data; ISR gives prebuilt company pages that survive a Redis outage. The non-obvious win is that any single layer can fail without the user noticing — a Redis blip falls through to Prisma, a Cloudflare hiccup falls through to the origin, an ISR cache miss revalidates on-demand. Building all three was 2x the work; the resulting availability surface is 10x more forgiving.
- L03
Self-hosted runners are the right answer for long-running, VPC-bound ETL — even though they sound like an ops burden.
The setup script is ~50 lines, systemd handles restart-on-failure, and the runner has been up for months without intervention. The latency win from direct RDS access (vs. tunneling from a hosted GitHub runner) cut several precompute pipelines from ~40min to <15min, which compounds across the daily DAG.
- lines of code
- ~97700
- prisma models
- 36
- prisma migrations
- 34
- github actions workflows
- 20
- alpha vantage pipelines
- 10
- precompute pipelines
- 9
- api routes
- 13
- tickers covered
- ~10000
- screener fields
- ~200
- locales
- 5
- ai providers in chain
- 3
- commits
- 586
- development span months
- 6