From 3-Minute Cold Starts to 60 Seconds: Self-Hosted Whisper on AWS Lambda
Part 3 of my series on building a low-cost personal AI stack on AWS. Part 1 — Squeezing my $1k/month API bill to $20/month with AWS Credits · Part 2 — Drop-in Perplexity Sonar replacement with AWS Bedrock Nova Grounding
TL;DR
I built a self-hosted speech-to-text API on AWS Lambda using faster-whisper. After trying Amazon Transcribe, SageMaker Serverless, and Lambda with a bundled model, I landed on a Lambda + EFS + S3 architecture that achieves ~60-second cold starts for ~$0.21/month in storage costs. Once warm, specifying the language drops response time to ~10s.
Open source: gabrielkoo/aws-lambda-whisper-adaptor
The Problem
I wanted to automatically transcribe Telegram voice messages. The requirements were simple:
- Accuracy: Good enough for Cantonese and Mandarin
- Cost: Pay-per-use, scales to zero when idle
- Latency: Cold start under 60 seconds
Simple enough. Except it took four attempts to get there.
What I Tried (and Why It Didn’t Work)
Option 1: Amazon Transcribe
The obvious first choice — fully managed, pay-per-use, native AWS integration.
Why I rejected it before even trying:
Amazon Transcribe supports zh-CN and zh-TW, but not yue (Cantonese). Whisper large-v3-turbo handles Cantonese significantly better, and accuracy matters more than convenience here.
If you’re transcribing standard Mandarin and don’t need Cantonese support, Transcribe is probably fine. For my use case, it was a non-starter.
Option 2: SageMaker Serverless Inference
SageMaker Serverless scales to zero and handles model serving — sounds perfect.
What happened:
I deployed a SageMaker Serverless endpoint with faster-whisper. The first invocation after idle:
- Container provisioning: ~30s
- Model loading: ~45-60s
- Total cold start: 60-90 seconds
For a voice message that’s 5-10 seconds long, waiting 90 seconds is a terrible experience.
The 6GB memory wall:
SageMaker Serverless maxes out at 6144 MB (6 GB) RAM. Here’s why that’s a problem for Whisper:
- whisper-large-v3-turbo (INT8): ~780MB model + ~2GB Python/runtime overhead ≈ 2.8GB minimum
- whisper-large-v3 (FP16): ~3GB model alone — barely fits, zero headroom for audio processing
- Any concurrent requests? You’re OOM.
Lambda goes up to 10,240 MB. That headroom matters.
Cost comparison:
SageMaker Serverless bills per GB-second of inference time. For sporadic voice message transcription (~10s per request, a few times a day), Lambda’s per-invocation pricing is significantly cheaper. My Lambda setup costs ~$0.21/month in storage — the compute is essentially free at this volume.
I deleted the endpoint after testing.
Option 2b: Bedrock Marketplace
AWS Bedrock Marketplace does list Whisper Large V3 Turbo — but it deploys on a dedicated endpoint instance. Auto-scaling is available (including scale-to-zero), but that creates a different problem:
- Keep minimum 1 instance: always paying for idle time, even at 3am
- Scale to zero: cold starts when traffic resumes — SageMaker cold starts are measured in minutes, not seconds
- Not token/usage-based pricing either way
For a Telegram bot that gets a few voice messages a day, you’re either burning money on idle instances or waiting minutes for the first message to transcribe. Lambda’s ~60s cold start looks great by comparison.
Option 3: Lambda with Bundled Model
Next idea: bundle the model directly into the Docker image. No external dependencies, simple architecture.
What happened:
# Download model during build
# Note: SYSTRAN doesn't publish an official CT2 model for large-v3-turbo (see Gotchas)
# Using the community FP16 model instead
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('mobiuslabsgmbh/faster-whisper-large-v3-turbo')"
- Docker image size: ~10GB
- ECR push time: 5+ minutes
- Lambda cold start: 2 minutes 51 seconds
The cold start is dominated by Lambda pulling the 10GB image from ECR. AWS Lambda caches images, but any cold start after the cache expires hits this wall.
Why it didn’t work:
- 3-minute cold start is unusable for interactive transcription
- Every code change requires rebuilding and pushing a 10GB image
- ECR storage: ~$1/month just for the image
Option 4: Lambda + S3 (No EFS)
What if Lambda downloads the model from S3 on cold start, storing it in /tmp?
The problem:
Lambda’s /tmp is ephemeral. Every cold start re-downloads the model from S3:
- S3 download for 1.6GB FP16 model: 30-60 seconds
- S3 download for 780MB INT8 model: 15-30 seconds
This is better than the bundled model approach, but there’s a bigger issue: no caching between Lambda instances. If you have 3 concurrent invocations, all 3 download the model independently. You’re paying for S3 transfer on every cold start.
What Actually Worked: Lambda + EFS + S3
The solution: use EFS as a persistent model cache, bootstrapped from S3. I’ve used EFS for persistent Streamlit state on ECS before — same pattern, different compute layer.
Request → Lambda Function URL
↓
Lambda (VPC)
↓ first cold start only: S3 → EFS
EFS (model cached here permanently)
flowchart TD
A([Request]) --> B[Lambda]
B --> C{EFS marker\nexists?}
C -->|Yes| D["Load from EFS\n~60s cold start"]
C -->|"No — first run"| E["Download S3 → EFS\n~55s"]
E --> F[Write marker file]
F --> D
D --> G[faster-whisper]
G --> H([Transcript])
S3[(S3 model store)] -.->|one-time bootstrap| E
How it works:
- First cold start: Lambda checks for a marker file on EFS. If missing, downloads model from S3 to EFS (~55s for INT8). Writes marker file.
- Subsequent cold starts: Marker file exists → load model directly from EFS (~60s for INT8).
- Warm invocations: Model already in memory → transcription-only time (~10-22s depending on audio length and whether language is specified).
import os

import boto3
from faster_whisper import WhisperModel

MODEL_SLUG = os.environ['HF_MODEL_REPO'].replace('/', '--')
EFS_MODEL_DIR = f'/mnt/whisper-models/{MODEL_SLUG}'
MODEL_MARKER = f'/mnt/whisper-models/.ready-{MODEL_SLUG}'

def bootstrap_model():
    if os.path.exists(MODEL_MARKER):
        # Model already synced to EFS on a previous cold start
        return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')
    # First run: sync model from S3 to EFS
    s3 = boto3.client('s3')
    prefix = f'models/{MODEL_SLUG}/'
    os.makedirs(EFS_MODEL_DIR, exist_ok=True)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=os.environ['MODEL_S3_BUCKET'], Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_path = os.path.join(EFS_MODEL_DIR, key[len(prefix):])
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(os.environ['MODEL_S3_BUCKET'], key, local_path)
    open(MODEL_MARKER, 'w').close()  # Mark as ready
    return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')

MODEL = bootstrap_model()  # Runs at Lambda init time, cached for warm invocations
Why EFS works:
- EFS persists across Lambda instances — model is downloaded once, reused forever
- EFS is mounted at /mnt/whisper-models — Lambda reads it like a local filesystem
- S3 VPC Gateway Endpoint is free — no NAT Gateway needed (saves ~$32/month)
- EFS storage: ~$0.19/month for the 780MB INT8 model
INT8 vs FP16: The Model Size Trade-off
Note: SYSTRAN (the faster-whisper maintainers) don’t publish an official CTranslate2 conversion of large-v3-turbo — only large-v3 and distil-large-v3. For turbo, you’ll need a community-converted model.
I tested two community CT2 models:
| Model | Size | First Bootstrap | EFS Cold Start |
|---|---|---|---|
| mobiuslabsgmbh/faster-whisper-large-v3-turbo (FP16) | 1.6 GB | ~126s | ~82s |
| Zoont/faster-whisper-large-v3-turbo-int8-ct2 (INT8) | 780 MB | ~55s | ~60s |
The INT8 model is significantly faster on cold start with minimal accuracy loss. For voice message transcription, the quality difference is imperceptible in practice.
I’m running the INT8 model in production.
Cost Breakdown
| Resource | Monthly Cost |
|---|---|
| EFS storage (780MB INT8) | ~$0.19 |
| S3 storage (780MB) | ~$0.02 |
| Lambda compute | ~$0.00167/warm invocation* |
| S3 VPC Gateway Endpoint | Free |
| NAT Gateway | Not needed ($0) |
| Total (storage only) | ~$0.21/month |
*10GB × 10s = 100 GB-seconds per warm invocation. The Lambda free tier covers 400,000 GB-seconds/month — roughly 4,000 warm invocations. For a personal bot, compute cost is effectively $0. Storage dominates.
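The free-tier math above is simple enough to check as a sketch (the memory size, warm duration, and free-tier figure are the ones quoted in this post):

```python
# Back-of-envelope check of the Lambda free-tier math above.
# Assumptions: 10 GB memory, ~10 s per warm invocation,
# and AWS's 400,000 GB-seconds/month free tier.
MEMORY_GB = 10
WARM_SECONDS = 10
FREE_TIER_GB_SECONDS = 400_000

gb_seconds_per_invocation = MEMORY_GB * WARM_SECONDS
free_invocations = FREE_TIER_GB_SECONDS // gb_seconds_per_invocation

print(gb_seconds_per_invocation)  # 100
print(free_invocations)           # 4000
```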
Compare to SageMaker Serverless: minimum ~$5-10/month for similar workloads, plus the 60-90s cold start penalty.
Why not Provisioned Concurrency? PC keeps Lambda permanently warm (no cold starts), but costs ~$0.0000097222/GB-second. For a 10GB function running 24/7: ~$252/month. Even a minimal 4GB setup runs ~$100/month — roughly 500x more than the $0.21 storage approach. For a personal bot with a few voice messages a day, the occasional ~60s cold start is a fine trade-off.
vs. OpenAI Whisper API
OpenAI’s Whisper API costs $0.006/minute. Here’s how it compares for a bot averaging 15s voice messages:
| Volume | OpenAI Whisper API | Self-hosted Lambda |
|---|---|---|
| 50 msgs/month | $0.08 | $0.21 (storage only) |
| 140 msgs/month | $0.21 | $0.21 ← break-even |
| 500 msgs/month | $0.75 | $0.21 (storage only) |
| 1,000 msgs/month | $1.50 | $0.21 (storage only) |
| 4,000 msgs/month | $6.00 | $0.21 (storage only) |
Lambda compute is free within the free tier (~4,000 warm invocations/month). Beyond that, it’s $0.00167/invocation — but that’s a high volume for a personal bot.
Break-even: ~140 messages/month. Above that, Lambda wins on cost.
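The break-even point falls out of two numbers, OpenAI's per-minute price and the fixed storage cost (figures as quoted above, assuming 15-second average messages):

```python
# Break-even between the OpenAI Whisper API and the Lambda setup.
# Assumptions: $0.006/min OpenAI pricing, 15 s average voice message,
# ~$0.21/month fixed storage cost for the Lambda stack.
OPENAI_PER_MINUTE = 0.006
AVG_MESSAGE_SECONDS = 15
LAMBDA_FIXED_MONTHLY = 0.21

cost_per_message = OPENAI_PER_MINUTE * (AVG_MESSAGE_SECONDS / 60)  # $0.0015
break_even_messages = LAMBDA_FIXED_MONTHLY / cost_per_message

print(round(break_even_messages))  # 140
```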
But cost isn’t the only reason to self-host:
- Geographic availability: OpenAI’s API is not available in Hong Kong — HK falls under China’s regional restriction. Azure OpenAI does offer Whisper, but support typically lags behind the official API by months. If you’re in HK (or other restricted regions), self-hosting isn’t just cheaper — it’s the only option.
- Cantonese accuracy: language=yue with Whisper large-v3-turbo is noticeably better than the managed API for Cantonese
- Privacy: audio never leaves your infrastructure
- No rate limits: Lambda scales independently
Architecture
Telegram voice message
↓
OpenClaw (gateway)
↓
Lambda Function URL (auth via token)
↓
Lambda (VPC, 10GB RAM, 900s timeout)
↓
EFS /mnt/whisper-models/{model-slug}
↓
faster-whisper (CTranslate2, INT8)
↓
Transcript
Lambda configuration:
- Memory: 10,240 MB — actual usage is ~2.2GB (INT8 model), but Lambda allocates CPU proportional to memory. 10GB gives ~6 vCPUs vs ~2.3 vCPUs at 4GB, cutting warm transcription from ~16s to ~10s. You’re paying for CPU, not RAM.
- Timeout: 900s (handles long audio files)
- VPC: Default VPC (no NAT Gateway)
- EFS: Mounted at /mnt/whisper-models
Memory vs. cost trade-off (tested, 3 runs each):
| Config | Cold Start | Warm (2.5s audio) | GB-seconds/invocation |
|---|---|---|---|
| 4,096 MB | ~60s | ~21s | 84 (~$0.00140) |
| 6,144 MB | ~60s | ~16s | 96 (~$0.00160) |
| 8,192 MB | ~60s | ~18s | 144 (~$0.00240) |
| 10,240 MB | ~60s | ~15s | 150 (~$0.00250) |
Cold start is ~60s across all configs — it’s EFS I/O bound, not CPU bound, so more memory doesn’t help here. Warm inference time does scale with memory (more vCPUs = faster CTranslate2 decoding). Interestingly, 4GB is the cheapest per invocation — the warm time savings at higher memory don’t offset the extra GB-seconds. Within the free tier, cost differences are negligible regardless.
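The GB-seconds column above can be reproduced from memory size and warm time alone (assuming the standard x86 Lambda rate of $0.0000166667 per GB-second):

```python
# Reproduce the GB-seconds column in the memory/cost table.
# Assumption: x86 Lambda pricing of $0.0000166667 per GB-second.
PRICE_PER_GB_SECOND = 0.0000166667

configs = {4096: 21, 6144: 16, 8192: 18, 10240: 15}  # memory MB -> warm seconds
for memory_mb, warm_s in configs.items():
    gb_seconds = (memory_mb / 1024) * warm_s
    cost = round(gb_seconds * PRICE_PER_GB_SECOND, 5)
    print(f'{memory_mb} MB: {gb_seconds:.0f} GB-s, ~${cost}')
```

Running this gives 84, 96, 144, and 150 GB-seconds respectively, matching the table: the 4GB config really is cheapest per invocation despite being the slowest warm.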
API Compatibility
The adaptor exposes two endpoints so it works as a drop-in replacement for existing integrations:
OpenAI compatible (/v1/audio/transcriptions):
curl -X POST https://<function-url>/v1/audio/transcriptions \
-H "Authorization: Token <secret>" \
-F "file=@audio.ogg" \
-F "language=zh"
{"text": "transcript here"}
Deepgram compatible (/v1/listen):
curl -X POST https://<function-url>/v1/listen?language=yue \
-H "Authorization: Token <secret>" \
-H "Content-Type: audio/ogg" \
--data-binary @audio.ogg
Performance Tip: Always Specify Language
When no language is specified, Whisper runs language detection on the first audio chunk — adding noticeable overhead. For a 2.5s voice message:
| Request | Response Time |
|---|---|
| No language (auto-detect) | ~22s |
| language=zh (Mandarin) | ~14s |
| language=yue (Cantonese) | ~10s |
That’s a 2x speedup just from passing a language hint. Two ways to do it:
Option A — per-request query param (recommended, keeps Lambda language-agnostic):
# Deepgram endpoint
curl -X POST https://<function-url>/v1/listen?language=yue \
-H "Authorization: Token <secret>" \
-H "Content-Type: audio/ogg" \
--data-binary @audio.ogg
# OpenAI endpoint
curl -X POST https://<function-url>/v1/audio/transcriptions \
-H "Authorization: Token <secret>" \
-F "file=@audio.ogg" \
-F "language=yue"
Option B — Lambda env var (simpler if you only ever transcribe one language):
WHISPER_LANGUAGE=yue
I use Option A — the language is set in my OpenClaw config (language: "yue" in the audio model), which passes it as ?language=yue to the Lambda on every request.
Real-time Factor
Once warm, the Lambda transcribes faster than real-time for typical voice messages:
| Audio Duration | Warm Response Time | Real-time Factor |
|---|---|---|
| 2.5s | ~10s | 4x |
| 33s | ~23s | 0.68x ✅ faster than real-time |
The 2.5s result looks slow (4x), but Whisper processes audio in 30-second chunks — the overhead is fixed regardless of audio length. For longer messages, the real-time factor drops well below 1x.
Open Source
The project is open source at gabrielkoo/aws-lambda-whisper-adaptor.
Key features:
- Any faster-whisper model via HF_MODEL_REPO env var
- GitHub Actions workflow to sync models from HuggingFace → S3
- Pre-built Docker image: ghcr.io/gabrielkoo/aws-lambda-whisper-adaptor:latest
- Configurable language detection via WHISPER_LANGUAGE env var or per-request parameter
Gotchas (Things I Learned the Hard Way)
float16 doesn’t work on CPU Lambda
CTranslate2 requires GPU for efficient float16 computation. If you set WHISPER_COMPUTE_TYPE=float16, you’ll get:
ValueError: Requested float16 compute type, but the target device or backend
do not support efficient float16 computation.
Use int8 instead. CTranslate2 quantizes the FP16 model weights at load time — same model, faster CPU inference.
Not all “faster-whisper” models are compatible
I tested the official SYSTRAN repos (Systran/faster-whisper-large-v3 and Systran/faster-distil-whisper-large-v3). Results were not great:
- faster-distil-whisper-large-v3: loaded, but returned English for Cantonese audio
- faster-whisper-large-v3: failed outright — RuntimeError: Unable to open file 'model.bin'
The community-converted models (Zoont/..., mobiuslabsgmbh/...) work reliably. Stick to those until the SYSTRAN format issues are resolved.
/tmp is too small for model downloads
Lambda’s /tmp defaults to 512MB (configurable up to 10GB, but costs extra). More practically: when syncing models locally before uploading to S3, your local /tmp may also be too small — we hit this downloading a 1.5GB model on a Raspberry Pi where /tmp only had ~856MB free. Use a directory with more space (e.g. your home directory) for local model prep.
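A cheap guard before any local model download avoids a half-written cache. This helper is illustrative (not part of the project); the 1.5 GB threshold matches the FP16 model that bit us:

```python
# Check free disk space before downloading a model to a local directory.
# The threshold (~1.5 GB, the FP16 model size) is an illustrative assumption.
import shutil

def has_free_space(path: str, needed_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= needed_bytes

MODEL_SIZE = int(1.5 * 1024**3)
if not has_free_space('/tmp', MODEL_SIZE):
    print('Not enough space in /tmp, use another directory (e.g. your home dir)')
```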
Changing HF_MODEL_REPO forces a new EFS bootstrap
When you update the HF_MODEL_REPO env var, Lambda spins up a fresh container. The new container checks for a marker file at /mnt/whisper-models/.ready-{model-slug} — which doesn’t exist yet — and triggers a full S3→EFS bootstrap (~55s for INT8). Plan model switches accordingly.
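The mechanism is just string derivation, so it's easy to see why a repo change invalidates the cache (paths as in the bootstrap code above):

```python
# The EFS marker path is derived from HF_MODEL_REPO, so a new repo value
# produces a new slug, the marker check misses, and a fresh S3 bootstrap runs.
def marker_path(hf_model_repo: str) -> str:
    slug = hf_model_repo.replace('/', '--')
    return f'/mnt/whisper-models/.ready-{slug}'

print(marker_path('Zoont/faster-whisper-large-v3-turbo-int8-ct2'))
# /mnt/whisper-models/.ready-Zoont--faster-whisper-large-v3-turbo-int8-ct2
```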
Race condition on EFS marker (low risk)
If two Lambda instances cold-start simultaneously while EFS is empty, both will attempt to download from S3. This is safe — the downloads are idempotent (same files, same paths), and the marker file write is the last step. Worst case: two containers download the same model in parallel, both succeed, and subsequent invocations use the cached EFS copy. No data corruption risk.
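If the duplicate download ever mattered (say, a much larger model), an advisory file lock would serialize the bootstrap. This is a sketch, not something the project does — EFS supports POSIX locks, and the same pattern is shown here against a local temp directory:

```python
# Sketch: serialize the first-run bootstrap with an advisory lock so only
# one instance downloads. Not part of the project; bootstrap_once and the
# lock/marker filenames are illustrative.
import fcntl
import os
import tempfile

def bootstrap_once(cache_dir: str, download) -> None:
    """Run `download()` exactly once per cache_dir, even under concurrency."""
    marker = os.path.join(cache_dir, '.ready')
    lock_path = os.path.join(cache_dir, '.bootstrap.lock')
    with open(lock_path, 'w') as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is held
        try:
            if not os.path.exists(marker):
                download()
                open(marker, 'w').close()  # marker written last, as in the real code
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

calls = []
with tempfile.TemporaryDirectory() as d:
    bootstrap_once(d, lambda: calls.append(1))
    bootstrap_once(d, lambda: calls.append(1))  # marker exists, download skipped
print(len(calls))  # 1
```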
Pre-warming
Cold starts happen when Lambda hasn’t been invoked recently. For predictable usage patterns (e.g. a morning standup bot), pre-warm the Lambda before you need it:
#!/bin/bash
# prewarm.sh — trigger Lambda init before expected usage
curl -s -o /dev/null \
-X POST "$WHISPER_LAMBDA_URL/v1/listen?language=yue" \
-H "Authorization: Token $WHISPER_API_SECRET" \
-H "Content-Type: audio/ogg" \
--data-binary @sample.ogg
echo "Lambda pre-warmed"
Schedule with cron: 0 8 * * * /path/to/prewarm.sh (runs at 8am daily).
Alternatively, use an EventBridge rule to ping the Lambda every few minutes — though at that frequency, Provisioned Concurrency starts making more sense cost-wise.
Conclusion
The Lambda + EFS + S3 architecture achieves:
- ~60s cold start (INT8 model, after first bootstrap); warm invocations with language=yue run in ~10s
- ~$0.21/month storage cost
- Zero idle cost (scales to zero)
- Deepgram and OpenAI compatible APIs
The key insight: EFS is the missing piece. It provides persistent, fast storage that Lambda can access without a NAT Gateway (using the free S3 VPC Gateway Endpoint for bootstrapping).
I couldn’t find any existing write-up of Whisper on Lambda using EFS for persistent model caching — most approaches either bundle the model in Docker (3-minute cold starts) or re-download from S3 on every cold start (no caching between instances). If you’ve seen this done before, I’d love to know.
Two things worth knowing before you deploy:
- SYSTRAN doesn’t publish an official CT2 model for large-v3-turbo — use a community-converted one like Zoont/faster-whisper-large-v3-turbo-int8-ct2
- Always pass a language parameter if you know it — cuts response time roughly in half
If you’re building voice transcription on AWS and want Whisper-quality accuracy without the SageMaker complexity, give it a try.