AWS Infrastructure Consolidation
Collapsed 3 AWS accounts and 4 EC2 instances into a single r6a.large running ECS + Docker Compose, and cut the monthly cloud bill from ~$290 to ~$72 (forecast with a 1-year Savings Plan; ~$86 on-demand), a 75% reduction (~$2.6k saved/year), with zero DNS changes and ~5 seconds of total downtime.
Summary
An AWS migration that consolidated three production accounts (niyamvora@hotmail.com, samayvora@gmail.com, incognitostocks@gmail.com) and the four EC2 instances spread across them into a single r6a.large in ap-south-1. One Nginx reverse proxy serves everything, and workloads are split across two complementary deployment surfaces. ECS tasks run the 5 first-party services (SimpliDeliver, ShinobiData, Dashboard, GitHub Runner, Nginx Proxy), deployed via CodeBuild → ECR → `ecs update-service`. Docker Compose runs an 8-container DarkHorse stack (frontend, backend, admin, mutualfund-api, mutualfund-ui, llm-sql, unsubscribe-go, unsubscribe-react) and a 2-container Excalidraw stack, deployed via AWS SSM → `docker compose pull && docker compose up -d`. The cutover used a single Elastic IP swap (~2–5 seconds of measured downtime across all 10+ domains, zero DNS changes). SSH-based deploys were retired in favor of SSM, log access was unified behind a single `logs` CLI fronting CloudWatch and `docker logs`, and RDS was independently right-sized from db.t3.medium → db.t3.small. The steady-state bill dropped from ~$290/mo (AWS + Vercel + managed MySQL) to ~$86/mo on-demand, or ~$72/mo with a 1-year No-Upfront EC2 Savings Plan.
Target user
Hiring managers and infra leads evaluating whether the candidate can own a real-world cost + reliability migration end-to-end — billing forensics, target architecture, container packaging, CI/CD redesign, DNS cutover, and post-mortem documentation — with no scheduled downtime window.
- 01
Consolidated 3 AWS accounts and 4 EC2 instances (t3.large + t3.medium + t3.xlarge + t2.medium, all in ap-south-1) onto a single host running 5 ECS tasks + 10 Docker Compose containers, then right-sized that host from t3.xlarge → r6a.large in a second pass for an additional $78/mo saving.
- 02
Cut steady-state monthly cost from ~$290 to ~$86 (on-demand) and a forecast ~$72 (1-year No-Upfront Savings Plan) — a 75% reduction (~$2,616/year) verified against AWS billing exports for Jan–Mar 2026 vs the post-migration baseline.
- 03
Executed the cutover with a single Elastic IP swap (13.127.181.125, eipalloc-094c3e1a9f07b12c6) — zero DNS changes across 10+ production domains, measured downtime ~2–5 seconds during EIP re-association, with the old t3.xlarge kept stopped (not terminated) as a 1-week rollback safety net.
- 04
Retired SSH-based deploys end-to-end in favor of AWS SSM Run Command — pushed updated `build-push-deploy.yaml` to `simplidelivernext`, `finder_AV`, and `dhs_dashboard_nextjs` repos (commits `43dda0d`, `82b3711`, `04a135a`); rotated `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `ECS_INSTANCE_ID` in GitHub Secrets per repo.
- 05
Designed dual deployment surfaces on the same host: ECS task definitions (CPU/memory soft+hard limits, CloudWatch Logs driver, auto-restart on crash) for first-party Next.js apps, and Docker Compose with `restart: unless-stopped` for third-party / sister-team images pulled from Docker Hub — giving CodeBuild speed for owned repos and SSM-driven flexibility for the 8-container DarkHorse stack.
- 06
Migrated the entire DarkHorse stack (8 containers — `darkhorse-repo`, `dhs-backend-node`, `agency-home`, `mutualfund-api-backend`, `mutualfund-frontend`, `llm-sql-flask`, `unsubscribe-golang`, `unsubscribe-react`) plus Excalidraw (2 containers, SQLite DB and JWT/CSRF secrets copied to `/ecs-data/excalidash/prisma/`) onto the same host with no code changes — port assignments rewritten only in compose / task definitions, never in app code.
- 07
Independently right-sized RDS for SQL Server Express from db.t3.medium ($73/mo) → db.t3.small ($22/mo), an additional $51/mo saving — safe because SQL Server Express is capped at 1 GB RAM regardless of host class, so the 4 GB / 8 GB tiers were paying for unusable memory.
- 08
Shipped a `logs <service> [duration|-f]` zsh tool that wraps `aws logs tail /ecs/<service>` for ECS-managed containers and `aws ssm send-command ... docker logs` for the Docker Compose containers — a single grammar across two log surfaces, replacing the previous SSH + tail-per-host workflow.
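The routing behind the `logs` tool in item 08 can be sketched as follows. This is a minimal Python illustration, not the shipped zsh script: the service inventory, the instance ID, and the choice to map `-f` to a fixed tail for compose containers are all assumptions made for the example.

```python
import json
import shlex

# Hypothetical inventory: the real ECS service and container names may differ.
ECS_SERVICES = {"simplideliver", "shinobidata", "dashboard", "github-runner", "nginx-proxy"}
COMPOSE_CONTAINERS = {"dhs-backend-node", "llm-sql-flask", "unsubscribe-golang"}
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, not the real instance ID


def build_logs_command(service: str, arg: str = "1h") -> list[str]:
    """Return the aws CLI argv for `logs <service> [duration|-f]`."""
    follow = arg == "-f"
    if service in ECS_SERVICES:
        # ECS tasks log to CloudWatch, so `aws logs tail` reads them directly.
        cmd = ["aws", "logs", "tail", f"/ecs/{service}"]
        cmd += ["--follow"] if follow else ["--since", arg]
        return cmd
    if service in COMPOSE_CONTAINERS:
        # SSM Run Command does not stream output, so this sketch maps -f
        # to "last 200 lines" (an assumption, not the original behavior).
        docker = ["docker", "logs"]
        docker += ["--tail", "200"] if follow else ["--since", arg]
        docker.append(service)
        return [
            "aws", "ssm", "send-command",
            "--instance-ids", INSTANCE_ID,
            "--document-name", "AWS-RunShellScript",
            "--parameters", json.dumps({"commands": [shlex.join(docker)]}),
        ]
    raise ValueError(f"unknown service: {service}")


if __name__ == "__main__":
    print(shlex.join(build_logs_command("dashboard", "30m")))
    print(shlex.join(build_logs_command("dhs-backend-node", "1h")))
```

Building the argv as a list keeps the two log surfaces behind one function, so the caller never needs to know whether a service lives in ECS or in Docker Compose.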
The natural approach — point each domain's A-record (or Cloudflare origin) to the new instance — would have required updating records across Cloudflare (shinobidata, simplideliver, draw, darkhorsestocks subdomains) and a direct A-record provider for the unproxied ones, fanned out over multiple TTLs, with a real outage window per domain. Solved by routing every domain through a single Elastic IP (13.127.181.125) and Nginx hostname-based vhosts on the host: the migration becomes one `aws ec2 disassociate-address` + `associate-address` pair against the new instance ID, taking ~2–5 seconds total across all domains simultaneously, with zero DNS propagation involved. The same EIP that had pointed at the old t3.xlarge now points at the r6a.large; Cloudflare, registrars, and end-users see no change.
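The Elastic IP swap described above amounts to two AWS CLI calls. The sketch below only builds and (by default) prints them; the allocation ID and region come from this write-up, while the association ID and instance ID passed in are placeholders that would need to be looked up before a real run.

```python
import subprocess

REGION = "ap-south-1"
ALLOCATION_ID = "eipalloc-094c3e1a9f07b12c6"  # the EIP named in this write-up


def build_eip_swap(association_id: str, new_instance_id: str) -> list[list[str]]:
    """Return the disassociate + associate pair that moves the EIP."""
    return [
        ["aws", "ec2", "disassociate-address",
         "--region", REGION,
         "--association-id", association_id],
        ["aws", "ec2", "associate-address",
         "--region", REGION,
         "--allocation-id", ALLOCATION_ID,
         "--instance-id", new_instance_id],
    ]


def run_swap(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Both IDs below are hypothetical placeholders.
    run_swap(build_eip_swap("eipassoc-0123456789abcdef0", "i-0123456789abcdef0"))
```

Rolling back would be the same pair of calls with the old instance's ID, which is why keeping the old t3.xlarge stopped rather than terminated makes the swap reversible.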
First-party Next.js apps benefit from ECS's task lifecycle (CodeBuild → ECR → `ecs update-service --force-new-deployment` for atomic image swap with health-checked rollout), but the 8-container DarkHorse stack is owned by a sister team that ships via GitHub Actions → Docker Hub and needs `docker compose` semantics for selective per-service restart. Running both under ECS would have required wrapping every DarkHorse image in a task definition and rebuilding their CI. Running both under Docker Compose would have meant losing CodeBuild's S3 layer cache and ECS's restart/health/limits guarantees for the first-party apps. The shipped design keeps ECS tasks and Docker Compose containers on the same EC2 host running the ECS-optimized AMI. ECS owns 5 task definitions, including the Nginx proxy task. Docker Compose owns `/ecs-data/{darkhorse,excalidash}/docker-compose.yml` with `restart: unless-stopped`. Logs for both are read from the same AWS account (natively through CloudWatch Logs for ECS tasks, through `docker logs` over SSM for compose), and both restart automatically on crash. The r6a.large (16 GB / 2 vCPU AMD) leaves ample CPU and RAM headroom: steady-state usage is ~1 GB RAM and 0.4% CPU across all containers combined.
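To make the two deployment surfaces concrete, here is a hedged sketch of the command each one ends in. The cluster name, service name, and instance ID are hypothetical; the compose directories follow the `/ecs-data/{darkhorse,excalidash}` layout mentioned above.

```python
import json

COMPOSE_STACKS = ("darkhorse", "excalidash")


def ecs_deploy_command(cluster: str, service: str) -> list[str]:
    """Final step of the CodeBuild -> ECR -> ECS path: roll the service."""
    return ["aws", "ecs", "update-service",
            "--cluster", cluster,
            "--service", service,
            "--force-new-deployment"]


def compose_deploy_command(instance_id: str, stack: str) -> list[str]:
    """SSM Run Command that pulls new images and recreates changed containers."""
    if stack not in COMPOSE_STACKS:
        raise ValueError(f"unknown compose stack: {stack}")
    remote = f"cd /ecs-data/{stack} && docker compose pull && docker compose up -d"
    return ["aws", "ssm", "send-command",
            "--instance-ids", instance_id,
            "--document-name", "AWS-RunShellScript",
            "--parameters", json.dumps({"commands": [remote]})]


if __name__ == "__main__":
    # "main-cluster", "dashboard" and the instance ID are placeholders.
    print(" ".join(ecs_deploy_command("main-cluster", "dashboard")))
    print(" ".join(compose_deploy_command("i-0123456789abcdef0", "darkhorse")))
```

Neither path needs an SSH key: the first authenticates to the ECS API and the second to SSM, both through IAM.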
SQL Server Express on `mutualfunddb-2024-12-v1` (db.t3.medium) was the single most expensive line item across all three accounts ($82/mo) and pinned Account 3 open. A cross-account RDS snapshot copy + restore on Account 1 was the obvious answer but would have required IAM cross-account share, KMS key cross-account access, and a Cloudflare-routed connection-string flip for both `shinobidata.com` and `dashboard.darkhorsestocks.in`. The shipped sequencing — down-size in place first (db.t3.medium → db.t3.small, $73 → $22, safe because SQL Server Express is capped at 1 GB RAM regardless of host class, validated by Account 3's actual 6% avg / 17% peak CPU over a 7-day CloudWatch window) — captured the $51/mo Tier-3 saving immediately, with the cross-account move deferred behind a `Deferred — works cross-account for now` flag in the post-migration reference. The lesson surfaced: in a multi-tier optimization plan, the move that unblocks account closure (cross-account migration) and the move that captures the bulk of the saving (right-sizing) are often *not* the same move, and decoupling them preserved most of the value at a fraction of the risk.
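The right-size-first decision can be written down as a small check plus the resize command. This is an illustrative sketch: the 50% CPU ceiling is an assumed threshold, the 1 GB engine cap is the figure quoted in this write-up, and the 2 GB value is the memory of a db.t3.small.

```python
def rds_resize_command(db_instance_id: str, target_class: str) -> list[str]:
    """In-place instance class change via the AWS CLI."""
    return ["aws", "rds", "modify-db-instance",
            "--db-instance-identifier", db_instance_id,
            "--db-instance-class", target_class,
            "--apply-immediately"]


def downsize_looks_safe(peak_cpu_pct: float,
                        engine_memory_cap_gb: float,
                        target_memory_gb: float,
                        cpu_ceiling_pct: float = 50.0) -> bool:
    """True when peak CPU is under the ceiling and the target class still
    has at least as much RAM as the database engine is able to use."""
    return peak_cpu_pct < cpu_ceiling_pct and engine_memory_cap_gb <= target_memory_gb


def monthly_saving(before_usd: float, after_usd: float) -> float:
    return before_usd - after_usd


if __name__ == "__main__":
    # 17% peak CPU and the 1 GB cap are the figures quoted in the text.
    if downsize_looks_safe(peak_cpu_pct=17.0, engine_memory_cap_gb=1.0, target_memory_gb=2.0):
        print(" ".join(rds_resize_command("mutualfunddb-2024-12-v1", "db.t3.small")))
        print(f"monthly saving: ${monthly_saving(73, 22):.0f}")
```

The cross-account relocation does not appear here at all, which is the point of the sequencing: the saving is captured by one API call that touches neither IAM sharing nor KMS keys.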
- L01
Elastic IP + Nginx vhosts is the right cutover primitive for multi-domain consolidation.
The instinct on a migration this size is to plan a DNS change window, write a rollback runbook for each TTL, and accept some user-visible outage per domain. Routing every domain through a single EIP and letting Nginx host-route on the new instance collapses N domain cutovers into one pair of EC2 API calls. The only DNS-aware moving pieces were the Cloudflare origin IPs, which were already set to the EIP, so they didn't change either. It is worth designing the steady state around this primitive from day one, even before a migration is on the table.
- L02
Separate `right-size` from `relocate` whenever you can.
The Tier-3 RDS downsize and the cross-account RDS move were conflated in the first plan and would have blocked each other (the snapshot would have been taken at the old size; the resize would have forced a fresh snapshot; the cross-account share would have been re-keyed mid-flight). Splitting them (down-size first, defer the relocation) captured $51/mo of the $82/mo line item in a roughly 5-minute reboot, with the harder move parked behind a flag for a quieter week.
- L03
SSM Run Command quietly retires the SSH-key blast radius.
The pre-migration topology had three SSH key pairs (for the `niyamvora_ai`, `niyam_ai`, and `mahdib` IAM users, with corresponding `.pem` files in `credentials/.key/`) and several repos with `SSH_PRIVATE_KEY` + `SSH_HOST` secrets per environment. Switching CI/CD to SSM means deploys authenticate via AWS IAM and never touch a key file. The rotation surface shrinks from 'every repo that has ever deployed' to 'one IAM user', and SSH access can be locked down to a single admin path on the host. Worth the workflow rewrite even ignoring the cost story.
- L04
Honest billing math beats optimistic billing math when you're showing the work.
The first internal projection put the post-migration bill at ~$220/mo (tier-4 with Reserved Instances baked in). The actual May 2026 invoice came in at $85.73 — better than projected — because the r6a.large right-size and the RDS down-size compounded with the Vercel + managed MySQL drops in a way the original tier-by-tier plan in [AWS_COST_OPTIMIZATION.md](https://github.com/) was too conservative to predict. Documenting actuals against projections in the post-migration reference makes the saved-dollars number defensible in a way a forward-looking plan cannot be.
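The headline numbers can be reproduced from the three rounded monthly figures quoted in this write-up (~$290 before, ~$86 on-demand, ~$72 with the Savings Plan). The helper below is only a worked check of that arithmetic; the function name is illustrative.

```python
BASELINE_USD = 290      # pre-migration monthly bill (AWS + Vercel + managed MySQL)
ON_DEMAND_USD = 86      # post-migration, on-demand
SAVINGS_PLAN_USD = 72   # post-migration, forecast with 1-year No-Upfront Savings Plan


def savings(before: int, after: int) -> dict[str, int]:
    """Monthly saving, annual saving, and percent reduction (rounded)."""
    monthly = before - after
    return {
        "monthly": monthly,
        "annual": monthly * 12,
        "percent": round(100 * monthly / before),
    }


if __name__ == "__main__":
    print("on-demand:   ", savings(BASELINE_USD, ON_DEMAND_USD))
    print("savings plan:", savings(BASELINE_USD, SAVINGS_PLAN_USD))
```

Run against these inputs, the $204 monthly saving corresponds to the on-demand bill, while the 75% and $2,616/year figures correspond to the Savings Plan forecast ($218/mo).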
- monthly cost after, on-demand (USD): 86
- monthly cost after, Savings Plan (USD): 72
- monthly savings (USD): 204
- annual savings (USD): 2616
- percent reduction: 75
- AWS accounts before: 3
- AWS accounts after: 1
- EC2 instances before: 4
- EC2 instances after: 1
- ECS task definitions: 5
- Docker Compose containers (DarkHorse): 8
- Docker Compose containers (Excalidash): 2
- domains routed through single EIP: 10
- measured cutover downtime (seconds): 5
- DNS records changed: 0
- SSH keys retired: 3
- CloudWatch log groups: 8
- docs pages written: 7
- migration window (days): 2