Production (migration complete; cleanup pending) · 2026 · Infrastructure Architect & Migration Engineer · 90% complete

AWS Infrastructure Consolidation

Collapsed 3 AWS accounts and 4 EC2 instances into a single r6a.large running ECS + Docker Compose — and cut the monthly cloud bill from ~$290 to ~$72 (75% reduction, ~$2.6k saved/year) with zero DNS changes and ~5 seconds of total downtime.

Summary

An AWS migration that consolidated three production accounts (niyamvora@hotmail.com, samayvora@gmail.com, incognitostocks@gmail.com) and the four EC2 instances running across them into a single r6a.large in ap-south-1. Every domain is served by one Nginx reverse proxy.

Workloads are split across two complementary deployment surfaces:

1. ECS tasks for 5 first-party services (SimpliDeliver, ShinobiData, Dashboard, GitHub Runner, Nginx Proxy), deployed via CodeBuild → ECR → `ecs update-service`.
2. An 8-container Docker Compose stack (DarkHorse frontend/backend/admin/mutualfund-api/mutualfund-ui/llm-sql/unsubscribe-go/unsubscribe-react) plus a 2-container Excalidraw stack, deployed via AWS SSM → `docker compose pull && docker compose up -d`.

The cutover used a single Elastic IP swap (~2–5 seconds of measured downtime across all 10+ domains, zero DNS changes). SSH-based deploys were retired in favor of SSM. Log access was unified behind a single `logs` CLI fronting CloudWatch and `docker logs`. RDS was independently right-sized from db.t3.medium → db.t3.small.

The steady-state bill dropped from ~$290/mo (AWS + Vercel + managed MySQL) to ~$86/mo on-demand, and to ~$72/mo with a 1-year No-Upfront EC2 Savings Plan.
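The two deployment surfaces differ mainly in the command that rolls out a new image. The sketch below assumes Python 3.9+ and the AWS CLI on the caller's PATH. The cluster name, service name, and instance ID are hypothetical placeholders, not the project's real values. It only builds the argument lists, so the calls can be inspected, or passed to `subprocess.run`, without touching AWS.

```python
import json
import shlex

# Hypothetical identifiers for illustration only; not the project's real values.
CLUSTER = "consolidated-cluster"
INSTANCE_ID = "i-0123456789abcdef0"


def ecs_deploy_cmd(cluster: str, service: str) -> list[str]:
    """ECS surface: ask the service scheduler to roll tasks onto the latest image."""
    return [
        "aws", "ecs", "update-service",
        "--cluster", cluster,
        "--service", service,
        "--force-new-deployment",
    ]


def compose_deploy_cmd(instance_id: str, compose_dir: str) -> list[str]:
    """Compose surface: run `pull` then `up -d` on the host through SSM Run Command."""
    script = (
        f"cd {shlex.quote(compose_dir)} && "
        "docker compose pull && docker compose up -d"
    )
    return [
        "aws", "ssm", "send-command",
        "--instance-ids", instance_id,
        "--document-name", "AWS-RunShellScript",
        "--parameters", json.dumps({"commands": [script]}),
    ]


if __name__ == "__main__":
    print(shlex.join(ecs_deploy_cmd(CLUSTER, "simplideliver")))
    print(shlex.join(compose_deploy_cmd(INSTANCE_ID, "/ecs-data/darkhorse")))
```

The ECS path relies on the service scheduler's rolling deployment and health checks. The Compose path is a one-shot shell script, which is why `restart: unless-stopped` carries the crash-recovery burden on that side.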

Target user

Hiring managers and infra leads evaluating whether the candidate can own a real-world cost + reliability migration end-to-end — billing forensics, target architecture, container packaging, CI/CD redesign, DNS cutover, and post-mortem documentation — with no scheduled downtime window.

§ 01 Stack
01 Primary
AWS ECS, Docker, Docker Compose, Nginx, Bash automation
02 Infrastructure
EC2 r6a.large, Amazon ECR, AWS CodeBuild + S3 build cache, AWS Systems Manager, AWS CloudWatch Logs, Elastic IP, RDS for SQL Server Express, AWS Lambda, AWS Organizations consolidated billing, Let's Encrypt + Cloudflare Origin certs
03 Integrations
GitHub Actions, CodeBuild GitHub source hooks, Docker Hub, ECR Public mirror for `node:*` base images, Cloudflare
04 UI / Frontend
AWS Console, custom `logs` CLI
§ 02 Key features
  1. Consolidated 3 AWS accounts and 4 EC2 instances (t3.large + t3.medium + t3.xlarge + t2.medium across ap-south-1) onto a single host running 5 ECS tasks + 10 Docker Compose containers. A second pass then right-sized that host from t3.xlarge → r6a.large for an additional $78/mo saving.

  2. Cut steady-state monthly cost from ~$290 to ~$86 on-demand (a ~70% reduction). With a 1-year No-Upfront Savings Plan, the forecast is ~$72 (a 75% reduction: ~$218/mo, ~$2,616/year). Figures were verified by comparing AWS billing exports for Jan–Mar 2026 against the post-migration baseline.

  3. Executed the cutover with a single Elastic IP swap (13.127.181.125, eipalloc-094c3e1a9f07b12c6). It required zero DNS changes across 10+ production domains, with ~2–5 seconds of measured downtime during EIP re-association. The old t3.xlarge was kept stopped (not terminated) for one week as a rollback safety net.

  4. Retired SSH-based deploys end-to-end in favor of AWS SSM Run Command. This meant pushing an updated `build-push-deploy.yaml` to the `simplidelivernext`, `finder_AV`, and `dhs_dashboard_nextjs` repos (commits `43dda0d`, `82b3711`, `04a135a`). In each repo's GitHub Secrets, `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` were rotated and `ECS_INSTANCE_ID` was updated.

  5. Designed dual deployment surfaces on the same host:
     - ECS task definitions for the 5 first-party services, with CPU/memory soft and hard limits, the CloudWatch Logs driver, and automatic restart on crash.
     - Docker Compose with `restart: unless-stopped` for third-party and sister-team images pulled from Docker Hub.

     Owned repos get CodeBuild's speed, and the 8-container DarkHorse stack gets SSM-driven flexibility.

  6. Migrated the entire DarkHorse stack onto the same host with no code changes. Its 8 containers are `darkhorse-repo`, `dhs-backend-node`, `agency-home`, `mutualfund-api-backend`, `mutualfund-frontend`, `llm-sql-flask`, `unsubscribe-golang`, and `unsubscribe-react`. Excalidraw (2 containers) moved too, with its SQLite DB and JWT/CSRF secrets copied to `/ecs-data/excalidash/prisma/`. Port assignments were rewritten only in the compose files and task definitions, never in app code.

  7. Independently right-sized RDS for SQL Server Express from db.t3.medium ($73/mo) → db.t3.small ($22/mo), an additional $51/mo saving. This was safe because SQL Server Express caps its buffer pool at roughly 1 GB (1,410 MB on SQL Server 2016 and later) regardless of instance class. The larger classes (4 GB on db.t3.medium, 8 GB on db.t3.large) were therefore paying for memory the engine cannot use.

  8. Shipped a `logs <service> [duration|-f]` zsh tool. It wraps `aws logs tail /ecs/<service>` for ECS-managed containers and `aws ssm send-command ... docker logs` for the Docker Compose containers. One grammar now covers both log surfaces, replacing the previous SSH + per-host `tail` workflow.
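The `logs` grammar above can be approximated in a few lines. This is a Python re-sketch, not the author's zsh tool, and it makes several assumptions:

- ECS services log to `/ecs/<service>`, as described.
- Only the DarkHorse container names listed above are Compose-managed; the Excalidraw containers would be added the same way.
- The default duration of `1h` and the instance ID are hypothetical.
- SSM Run Command is request/response rather than streaming, so this sketch supports `-f` only on the CloudWatch side.

```python
import json
import shlex

# Compose-managed containers (DarkHorse names from the write-up); anything else is treated as ECS.
COMPOSE_CONTAINERS = {
    "darkhorse-repo", "dhs-backend-node", "agency-home", "mutualfund-api-backend",
    "mutualfund-frontend", "llm-sql-flask", "unsubscribe-golang", "unsubscribe-react",
}
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical placeholder


def logs_cmd(service: str, arg: str = "1h") -> list[str]:
    """Map `logs <service> [duration|-f]` onto the matching AWS CLI command."""
    follow = arg == "-f"
    if service in COMPOSE_CONTAINERS:
        if follow:
            # send-command runs once and returns; it cannot stream like `docker logs -f`.
            raise ValueError("follow mode is not supported over SSM Run Command in this sketch")
        script = f"docker logs --since {shlex.quote(arg)} {shlex.quote(service)}"
        return [
            "aws", "ssm", "send-command",
            "--instance-ids", INSTANCE_ID,
            "--document-name", "AWS-RunShellScript",
            "--parameters", json.dumps({"commands": [script]}),
        ]
    # ECS tasks ship to CloudWatch Logs, so `aws logs tail` handles both modes.
    cmd = ["aws", "logs", "tail", f"/ecs/{service}"]
    return cmd + (["--follow"] if follow else ["--since", arg])
```

For the Compose branch, `send-command` returns a command ID. The actual output is then fetched with `aws ssm get-command-invocation --command-id <id> --instance-id <instance>`.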

§ 03 Hardest problems
  1. The natural approach — point each domain's A-record (or Cloudflare origin) to the new instance — would have required updating records across Cloudflare (shinobidata, simplideliver, draw, darkhorsestocks subdomains) and a direct A-record provider for the unproxied ones, fanned out over multiple TTLs, with a real outage window per domain. Solved by routing every domain through a single Elastic IP (13.127.181.125) and Nginx hostname-based vhosts on the host: the migration becomes one `aws ec2 disassociate-address` + `associate-address` pair against the new instance ID, taking ~2–5 seconds total across all domains simultaneously, with zero DNS propagation involved. The same EIP that had pointed at the old t3.xlarge now points at the r6a.large; Cloudflare, registrars, and end-users see no change.

  2. First-party Next.js apps benefit from ECS's task lifecycle: CodeBuild → ECR → `ecs update-service --force-new-deployment` gives an atomic image swap with a health-checked rollout. The 8-container DarkHorse stack, however, is owned by a sister team that ships via GitHub Actions → Docker Hub and needs `docker compose` semantics for selective per-service restarts. Running everything under ECS would have meant wrapping every DarkHorse image in a task definition and rebuilding that team's CI. Running everything under Docker Compose would have meant losing CodeBuild's S3 layer cache and ECS's restart, health-check, and resource-limit guarantees for the first-party apps.

     The shipped design keeps ECS tasks and Docker Compose containers on the same EC2 host, which runs the ECS-optimized AMI:
     - ECS owns 5 task definitions, including the Nginx proxy task.
     - Docker Compose owns `/ecs-data/{darkhorse,excalidash}/docker-compose.yml`, with `restart: unless-stopped`.
     - Logs for both are reachable from the same AWS account: the CloudWatch Logs driver for ECS tasks, and `docker logs` over SSM for compose containers.
     - Both restart automatically on crash.

     The r6a.large (2 AMD vCPUs, 16 GB) has ample headroom: steady state across all containers combined is ~1 GB RAM and 0.4% CPU.

  3. SQL Server Express on `mutualfunddb-2024-12-v1` (db.t3.medium) was the single most expensive line item across all three accounts ($82/mo), and it pinned Account 3 open. The obvious answer was a cross-account RDS snapshot copy + restore into Account 1. That would have required a cross-account snapshot share, cross-account KMS key access, and a Cloudflare-routed connection-string flip for both `shinobidata.com` and `dashboard.darkhorsestocks.in`.

     The shipped sequencing down-sized in place first (db.t3.medium → db.t3.small, $73 → $22). That was safe because SQL Server Express caps its buffer pool at roughly 1 GB (1,410 MB on SQL Server 2016 and later) regardless of instance class. Account 3's actual CPU (6% average, 17% peak over a 7-day CloudWatch window) confirmed the CPU headroom. The down-size captured the $51/mo Tier-3 saving immediately, and the cross-account move was deferred behind a `Deferred — works cross-account for now` flag in the post-migration reference.

     The lesson: in a multi-tier optimization plan, the move that unblocks account closure (cross-account migration) and the move that captures the bulk of the saving (right-sizing) are often *not* the same move. Decoupling them preserved most of the value at a fraction of the risk.
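The ECS side of the second problem leans on per-container limits and the CloudWatch Logs driver. This is a minimal sketch of a container definition in the shape `aws ecs register-task-definition --cli-input-json` accepts. The service name, image URI, and CPU/memory numbers used with it are hypothetical. Restart-on-crash here comes from the ECS service scheduler replacing a stopped task, not from any field in this JSON.

```python
import json


def container_definition(name: str, image: str, cpu_units: int,
                         soft_mib: int, hard_mib: int,
                         region: str = "ap-south-1") -> dict:
    """One ECS container definition with soft/hard memory limits and awslogs output."""
    if soft_mib > hard_mib:
        raise ValueError("memoryReservation (soft) must not exceed memory (hard)")
    return {
        "name": name,
        "image": image,
        "essential": True,              # if this container exits, the task stops and the service replaces it
        "cpu": cpu_units,               # 1024 CPU units = 1 vCPU
        "memoryReservation": soft_mib,  # soft limit (MiB): what the scheduler reserves on the host
        "memory": hard_mib,             # hard limit (MiB): the container is killed if it exceeds this
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": f"/ecs/{name}",  # same /ecs/<service> naming the `logs` tool expects
                "awslogs-region": region,
                "awslogs-stream-prefix": "ecs",
            },
        },
    }


def task_definition(family: str, containers: list[dict]) -> str:
    """Serialize a task definition for `aws ecs register-task-definition --cli-input-json`."""
    return json.dumps({
        "family": family,
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": containers,
    }, indent=2)
```

Setting a soft reservation below the hard cap lets several small Next.js containers share the r6a.large's memory. Each container still keeps a ceiling that protects its neighbors.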

§ 04 What I learned
  • L01
    Elastic IP + Nginx vhosts is the right cutover primitive for multi-domain consolidation.

    The instinct on a migration this size is to plan a DNS change window, write a rollback runbook for each TTL, and accept some user-visible outage per domain. Routing every domain through a single EIP, and letting Nginx route by hostname on the new instance, collapses N domain cutovers into one EIP re-association. The only DNS-aware piece was the Cloudflare origin IPs, and they already pointed at the EIP, so they didn't change either. It's worth designing the steady state around this primitive from day one, even before a migration is on the table.

  • L02
    Separate `right-size` from `relocate` whenever you can.

    The first plan conflated the Tier-3 RDS downsize and the cross-account RDS move, and the two would have blocked each other:
    - the snapshot would have been the old size;
    - the new size would have re-snapshotted;
    - the cross-account share would have re-keyed mid-flight.

    Splitting them (down-size first, defer the relocation) captured $51/mo of the $82/mo RDS line item with a ~5-minute reboot. The harder move was parked behind a flag for a quieter week.

  • L03
    SSM Run Command quietly retires the SSH-key blast radius.

    The pre-migration topology had three SSH key pairs, tied to the `niyamvora_ai`, `niyam_ai`, and `mahdib` IAM users, with corresponding `.pem` files in `credentials/.key/`. Several repos also held `SSH_PRIVATE_KEY` + `SSH_HOST` secrets per environment. Switching CI/CD to SSM means deploys authenticate via AWS IAM and never touch a key file. The rotation surface shrinks from 'every repo that has ever deployed' to 'one IAM user', and SSH access can be locked down to a single admin path on the host. That alone justifies the workflow rewrite, even ignoring the cost story.

  • L04
    Measured actuals beat projected billing math when you're showing the work.

    The first internal projection put the post-migration bill at ~$220/mo (tier-4 with Reserved Instances baked in). The actual May 2026 invoice came in at $85.73 — better than projected — because the r6a.large right-size and the RDS down-size compounded with the Vercel + managed MySQL drops in a way the original tier-by-tier plan in [AWS_COST_OPTIMIZATION.md](https://github.com/) was too conservative to predict. Documenting actuals against projections in the post-migration reference makes the saved-dollars number defensible in a way a forward-looking plan cannot be.
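The single-EIP cutover behind learning L01 (and the first hardest problem) takes two EC2 API calls, and rollback reuses the same primitive. This is a minimal sketch that only builds the AWS CLI argument lists. The allocation ID is the one quoted in the write-up; the association ID and instance IDs are placeholders the caller supplies (in practice the association ID comes from `aws ec2 describe-addresses`). Two details are assumptions based on the stopped-not-terminated safety net described above: stopping the old host right after cutover, and the exact rollback sequence.

```python
# Allocation ID quoted in the write-up; association and instance IDs are caller-supplied placeholders.
EIP_ALLOCATION = "eipalloc-094c3e1a9f07b12c6"


def associate_cmd(instance_id: str, allow_reassociation: bool = False) -> list[str]:
    """Attach the Elastic IP to instance_id, optionally taking it from its current holder."""
    cmd = ["aws", "ec2", "associate-address",
           "--allocation-id", EIP_ALLOCATION, "--instance-id", instance_id]
    if allow_reassociation:
        cmd.append("--allow-reassociation")
    return cmd


def cutover_cmds(current_association_id: str, old_instance: str, new_instance: str) -> list[list[str]]:
    """Detach from the old host, attach to the new one, then stop (not terminate) the old host."""
    return [
        ["aws", "ec2", "disassociate-address", "--association-id", current_association_id],
        associate_cmd(new_instance),
        ["aws", "ec2", "stop-instances", "--instance-ids", old_instance],
    ]


def rollback_cmds(old_instance: str) -> list[list[str]]:
    """Start the stopped host, wait until it is running, then move the EIP back in one call."""
    return [
        ["aws", "ec2", "start-instances", "--instance-ids", old_instance],
        ["aws", "ec2", "wait", "instance-running", "--instance-ids", old_instance],
        associate_cmd(old_instance, allow_reassociation=True),
    ]
```

In this sketch, the outage window is the gap between the `disassociate-address` and `associate-address` calls. Alternatively, a single `associate-address` call with `--allow-reassociation` (as the rollback uses) skips the explicit detach.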

§ 05 By the numbers

| Metric | Value |
|---|---|
| Monthly cost before (USD) | 290 |
| Monthly cost after, on-demand (USD) | 86 |
| Monthly cost after, Savings Plan (USD) | 72 |
| Monthly savings, on-demand (USD) | 204 |
| Annual savings, Savings Plan (USD) | 2,616 |
| Cost reduction, Savings Plan (%) | 75 |
| AWS accounts before | 3 |
| AWS accounts after | 1 |
| EC2 instances before | 4 |
| EC2 instances after | 1 |
| ECS task definitions | 5 |
| Docker Compose containers (DarkHorse) | 8 |
| Docker Compose containers (Excalidash) | 2 |
| Domains routed through the single EIP | 10+ |
| Measured cutover downtime (seconds) | ~2–5 |
| DNS records changed | 0 |
| SSH keys retired | 3 |
| CloudWatch log groups | 8 |
| Docs pages written | 7 |
| Migration window (days) | 2 |
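The headline cost figures are simple arithmetic on the monthly totals. This sketch recomputes them so the on-demand and Savings Plan numbers don't get mixed. The inputs are the approximate monthly figures quoted above. Rounding to whole dollars and whole percent is an assumption about how the published numbers were produced.

```python
BEFORE = 290       # ~USD/month: AWS + Vercel + managed MySQL
ON_DEMAND = 86     # ~USD/month after migration, on-demand
SAVINGS_PLAN = 72  # ~USD/month forecast with a 1-year No-Upfront Savings Plan


def savings(before: int, after: int) -> dict:
    """Monthly and annual savings plus the rounded percent reduction."""
    monthly = before - after
    return {
        "monthly": monthly,
        "annual": monthly * 12,
        "pct": round(100 * monthly / before),
    }


if __name__ == "__main__":
    for label, after in (("on-demand", ON_DEMAND), ("savings plan", SAVINGS_PLAN)):
        print(label, savings(BEFORE, after))
```

This yields $204/mo and ~70% on-demand, versus $218/mo, $2,616/year, and ~75% at the Savings Plan rate. So the $204 monthly figure and the $2,616 annual figure describe different pricing assumptions.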