From 3-Minute Cold Starts to 60 Seconds: Self-Hosted Whisper on AWS Lambda
Part 3 of my series on building a low-cost personal AI stack on AWS. Part 1 — Squeezing my $1k/month API bill to $20/month with AWS Credits · Part 2 — Drop-in Perplexity Sonar replacement with AWS Bedrock Nova Grounding
TL;DR
I built a self-hosted speech-to-text API on AWS Lambda using faster-whisper. After trying Amazon Transcribe, SageMaker Serverless, and Lambda with a bundled model, I landed on a Lambda + EFS + S3 architecture that achieves ~60-second cold starts for ~$0.21/month in storage costs. Once warm, specifying the language drops response time to ~10s.
Open source: gabrielkoo/aws-lambda-whisper-adaptor
The Problem
I wanted to automatically transcribe Telegram voice messages. The requirements were simple:
- Accuracy: Good enough for Cantonese and Mandarin
- Cost: Pay-per-use, scales to zero when idle
- Latency: Cold start under 60 seconds
Simple enough. Except it took four attempts to get there.
What I Tried (and Why It Didn’t Work)
Option 1: Amazon Transcribe
The obvious first choice — fully managed, pay-per-use, native AWS integration.
Why I rejected it before even trying:
Amazon Transcribe supports zh-CN and zh-TW, but not yue (Cantonese). Whisper large-v3-turbo handles Cantonese significantly better, and accuracy matters more than convenience here.
If you’re transcribing standard Mandarin and don’t need Cantonese support, Transcribe is probably fine. For my use case, it was a non-starter.
Option 2: SageMaker Serverless Inference
SageMaker Serverless scales to zero and handles model serving — sounds perfect.
What happened:
I deployed a SageMaker Serverless endpoint with faster-whisper. The first invocation after idle:
- Container provisioning: ~30s
- Model loading: ~45-60s
- Total cold start: 60-90 seconds
For a voice message that’s 5-10 seconds long, waiting 90 seconds is a terrible experience.
The 6GB memory wall:
SageMaker Serverless maxes out at 6144 MB (6 GB) RAM. Here’s why that’s a problem for Whisper:
- whisper-large-v3-turbo (INT8): ~780MB model + ~2GB Python/runtime overhead ≈ 2.8GB minimum
- whisper-large-v3 (FP16): ~3GB model alone — barely fits, zero headroom for audio processing
- Any concurrent requests? You’re OOM.
Lambda goes up to 10,240 MB. That headroom matters.
Cost comparison:
SageMaker Serverless bills per GB-second of inference time. For sporadic voice message transcription (~10s per request, a few times a day), Lambda’s per-invocation pricing is significantly cheaper. My Lambda setup costs ~$0.21/month in storage — the compute is essentially free at this volume.
I deleted the endpoint after testing.
Option 2b: Bedrock Marketplace
AWS Bedrock Marketplace does list Whisper Large V3 Turbo — but it deploys on a dedicated endpoint instance. Auto-scaling is available (including scale-to-zero), but that creates a different problem:
- Keep minimum 1 instance: always paying for idle time, even at 3am
- Scale to zero: cold starts when traffic resumes — SageMaker cold starts are measured in minutes, not seconds
- Not token/usage-based pricing either way
For a Telegram bot that gets a few voice messages a day, you’re either burning money on idle instances or waiting minutes for the first message to transcribe. Lambda’s ~60s cold start looks great by comparison.
Option 3: Lambda with Bundled Model
Next idea: bundle the model directly into the Docker image. No external dependencies, simple architecture.
What happened:
# Download model during build
# Note: SYSTRAN doesn't publish an official CT2 model for large-v3-turbo (see Gotchas)
# Using the community FP16 model instead
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('mobiuslabsgmbh/faster-whisper-large-v3-turbo')"
- Docker image size: ~10GB
- ECR push time: 5+ minutes
- Lambda cold start: 2 minutes 51 seconds
The cold start is dominated by Lambda pulling the 10GB image from ECR. AWS Lambda caches images, but any cold start after the cache expires hits this wall.
Why it didn’t work:
- 3-minute cold start is unusable for interactive transcription
- Every code change requires rebuilding and pushing a 10GB image
- ECR storage: ~$1/month just for the image
Option 4: Lambda + S3 (No EFS)
What if Lambda downloads the model from S3 on cold start, storing it in /tmp?
The problem:
Lambda’s /tmp is ephemeral. Every cold start re-downloads the model from S3:
- S3 download for 1.6GB FP16 model: 30-60 seconds
- S3 download for 780MB INT8 model: 15-30 seconds
This is better than the bundled model approach, but there’s a bigger issue: no caching between Lambda instances. If you have 3 concurrent invocations, all 3 download the model independently. You’re paying for S3 transfer on every cold start.
What Actually Worked: Lambda + EFS + S3
The solution: use EFS as a persistent model cache, bootstrapped from S3. I’ve used EFS for persistent Streamlit state on ECS before — same pattern, different compute layer.
Request → Lambda Function URL
↓
Lambda (VPC)
↓ first cold start only: S3 → EFS
EFS (model cached here permanently)
flowchart TD
A([Request]) --> B[Lambda]
B --> C{EFS marker\nexists?}
C -->|Yes| D["Load from EFS\n~60s cold start"]
C -->|"No — first run"| E["Download S3 → EFS\n~55s"]
E --> F[Write marker file]
F --> D
D --> G[faster-whisper]
G --> H([Transcript])
S3[(S3 model store)] -.->|one-time bootstrap| E
How it works:
- First cold start: Lambda checks for a marker file on EFS. If missing, downloads model from S3 to EFS (~55s for INT8). Writes marker file.
- Subsequent cold starts: Marker file exists → load model directly from EFS (~60s for INT8).
- Warm invocations: Model already in memory → transcription-only time (~10-22s depending on audio length and whether language is specified).
import os

import boto3
from faster_whisper import WhisperModel

MODEL_SLUG = os.environ['HF_MODEL_REPO'].replace('/', '--')
EFS_MODEL_DIR = f'/mnt/whisper-models/{MODEL_SLUG}'
MODEL_MARKER = f'/mnt/whisper-models/.ready-{MODEL_SLUG}'

def bootstrap_model():
    if os.path.exists(MODEL_MARKER):
        # Model already synced to EFS on a previous cold start
        return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')
    # First run: sync model from S3 to EFS
    s3 = boto3.client('s3')
    prefix = f'models/{MODEL_SLUG}/'
    os.makedirs(EFS_MODEL_DIR, exist_ok=True)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=os.environ['MODEL_S3_BUCKET'], Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_path = os.path.join(EFS_MODEL_DIR, key[len(prefix):])
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(os.environ['MODEL_S3_BUCKET'], key, local_path)
    open(MODEL_MARKER, 'w').close()  # Mark as ready
    return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')

MODEL = bootstrap_model()  # Runs at Lambda init time, cached for warm invocations
Why EFS works:
- EFS persists across Lambda instances — model is downloaded once, reused forever
- EFS is mounted at /mnt/whisper-models — Lambda reads it like a local filesystem
- S3 VPC Gateway Endpoint is free — no NAT Gateway needed (saves ~$32/month)
- EFS storage: ~$0.19/month for the 780MB INT8 model
INT8 vs FP16: The Model Size Trade-off
Note: SYSTRAN (the faster-whisper maintainers) don’t publish an official CTranslate2 conversion of large-v3-turbo — only large-v3 and distil-large-v3. For turbo, you’ll need a community-converted model.
I tested two community CT2 models:
| Model | Size | First Bootstrap | EFS Cold Start |
|---|---|---|---|
| mobiuslabsgmbh/faster-whisper-large-v3-turbo (FP16) | 1.6 GB | ~126s | ~82s |
| Zoont/faster-whisper-large-v3-turbo-int8-ct2 (INT8) | 780 MB | ~55s | ~60s |
The INT8 model is significantly faster on cold start with minimal accuracy loss. For voice message transcription, the quality difference is imperceptible in practice.
I’m running the INT8 model in production.
Cost Breakdown
| Resource | Monthly Cost |
|---|---|
| EFS storage (780MB INT8) | ~$0.19 |
| S3 storage (780MB) | ~$0.02 |
| Lambda compute | ~$0.00167/warm invocation* |
| S3 VPC Gateway Endpoint | Free |
| NAT Gateway | Not needed ($0) |
| Total (storage only) | ~$0.21/month |
*10GB × 10s = 100 GB-seconds per warm invocation. The Lambda free tier covers 400,000 GB-seconds/month — roughly 4,000 warm invocations. For a personal bot, compute cost is effectively $0. Storage dominates.
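The free-tier math above is simple enough to check as a sketch (the memory size, warm duration, and free-tier figure are the ones quoted in this post):

```python
# Back-of-envelope check of the Lambda free-tier math above.
# Assumptions: 10 GB memory, ~10 s per warm invocation,
# and AWS's 400,000 GB-seconds/month free tier.
MEMORY_GB = 10
WARM_SECONDS = 10
FREE_TIER_GB_SECONDS = 400_000

gb_seconds_per_invocation = MEMORY_GB * WARM_SECONDS
free_invocations = FREE_TIER_GB_SECONDS // gb_seconds_per_invocation

print(gb_seconds_per_invocation)  # 100
print(free_invocations)           # 4000
```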
Compare to SageMaker Serverless: minimum ~$5-10/month for similar workloads, plus the 60-90s cold start penalty.
Why not Provisioned Concurrency? PC keeps Lambda permanently warm (no cold starts), but costs ~$0.0000097222/GB-second. For a 10GB function running 24/7: ~$252/month. Even a minimal 4GB setup runs ~$100/month — roughly 500x more than the $0.21 storage approach. For a personal bot with a few voice messages a day, the occasional ~60s cold start is a fine trade-off.
vs. OpenAI Whisper API
OpenAI’s Whisper API costs $0.006/minute. Here’s how it compares for a bot averaging 15s voice messages:
| Volume | OpenAI Whisper API | Self-hosted Lambda |
|---|---|---|
| 50 msgs/month | $0.08 | $0.21 (storage only) |
| 140 msgs/month | $0.21 | $0.21 ← break-even |
| 500 msgs/month | $0.75 | $0.21 (storage only) |
| 1,000 msgs/month | $1.50 | $0.21 (storage only) |
| 4,000 msgs/month | $6.00 | $0.21 (storage only) |
Lambda compute is free within the free tier (~4,000 warm invocations/month). Beyond that, it’s $0.00167/invocation — but that’s a high volume for a personal bot.
Break-even: ~140 messages/month. Above that, Lambda wins on cost.
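The break-even point falls out of two numbers, OpenAI's per-minute price and the fixed storage cost (figures as quoted above, assuming 15-second average messages):

```python
# Break-even between the OpenAI Whisper API and the Lambda setup.
# Assumptions: $0.006/min OpenAI pricing, 15 s average voice message,
# ~$0.21/month fixed storage cost for the Lambda stack.
OPENAI_PER_MINUTE = 0.006
AVG_MESSAGE_SECONDS = 15
LAMBDA_FIXED_MONTHLY = 0.21

cost_per_message = OPENAI_PER_MINUTE * (AVG_MESSAGE_SECONDS / 60)  # $0.0015
break_even_messages = LAMBDA_FIXED_MONTHLY / cost_per_message

print(round(break_even_messages))  # 140
```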
But cost isn’t the only reason to self-host:
- Geographic availability: OpenAI’s API is not available in Hong Kong — HK falls under China’s regional restriction. Azure OpenAI does offer Whisper, but support typically lags behind the official API by months. If you’re in HK (or other restricted regions), self-hosting isn’t just cheaper — it’s the only option.
- Cantonese accuracy: language=yue with Whisper large-v3-turbo is noticeably better than the managed API for Cantonese
- Privacy: audio never leaves your infrastructure
- No rate limits: Lambda scales independently
Architecture
Telegram voice message
↓
OpenClaw (gateway)
↓
Lambda Function URL (auth via token)
↓
Lambda (VPC, 10GB RAM, 900s timeout)
↓
EFS /mnt/whisper-models/{model-slug}
↓
faster-whisper (CTranslate2, INT8)
↓
Transcript
Lambda configuration:
- Memory: 10,240 MB — actual usage is ~2.2GB (INT8 model), but Lambda allocates CPU proportional to memory. 10GB gives ~6 vCPUs vs ~2.3 vCPUs at 4GB, cutting warm transcription from ~16s to ~10s. You’re paying for CPU, not RAM.
- Timeout: 900s (handles long audio files)
- VPC: Default VPC (no NAT Gateway)
- EFS: Mounted at /mnt/whisper-models
Memory vs. cost trade-off (tested, 3 runs each):
| Config | Cold Start | Warm (2.5s audio) | GB-seconds/invocation |
|---|---|---|---|
| 4,096 MB | ~60s | ~21s | 84 (~$0.00140) |
| 6,144 MB | ~60s | ~16s | 96 (~$0.00160) |
| 8,192 MB | ~60s | ~18s | 144 (~$0.00240) |
| 10,240 MB | ~60s | ~15s | 150 (~$0.00250) |
Cold start is ~60s across all configs — it’s EFS I/O bound, not CPU bound, so more memory doesn’t help here. Warm inference time does scale with memory (more vCPUs = faster CTranslate2 decoding). Interestingly, 4GB is the cheapest per invocation — the warm time savings at higher memory don’t offset the extra GB-seconds. Within the free tier, cost differences are negligible regardless.
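The GB-seconds column above can be reproduced from memory size and warm time alone (assuming the standard x86 Lambda rate of $0.0000166667 per GB-second):

```python
# Reproduce the GB-seconds column in the memory/cost table.
# Assumption: x86 Lambda pricing of $0.0000166667 per GB-second.
PRICE_PER_GB_SECOND = 0.0000166667

configs = {4096: 21, 6144: 16, 8192: 18, 10240: 15}  # memory MB -> warm seconds
for memory_mb, warm_s in configs.items():
    gb_seconds = (memory_mb / 1024) * warm_s
    cost = round(gb_seconds * PRICE_PER_GB_SECOND, 5)
    print(f'{memory_mb} MB: {gb_seconds:.0f} GB-s, ~${cost}')
```

Running this gives 84, 96, 144, and 150 GB-seconds respectively, matching the table: the 4GB config really is cheapest per invocation despite being the slowest warm.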
API Compatibility
The adaptor exposes two endpoints so it works as a drop-in replacement for existing integrations:
OpenAI compatible (/v1/audio/transcriptions):
curl -X POST https://<function-url>/v1/audio/transcriptions \
-H "Authorization: Token <secret>" \
-F "file=@audio.ogg" \
-F "language=zh"
{"text": "transcript here"}
Deepgram compatible (/v1/listen):
curl -X POST https://<function-url>/v1/listen?language=yue \
-H "Authorization: Token <secret>" \
-H "Content-Type: audio/ogg" \
--data-binary @audio.ogg
Performance Tip: Always Specify Language
When no language is specified, Whisper runs language detection on the first audio chunk — adding noticeable overhead. For a 2.5s voice message:
| Request | Response Time |
|---|---|
| No language (auto-detect) | ~22s |
| language=zh (Mandarin) | ~14s |
| language=yue (Cantonese) | ~10s |
That’s a 2x speedup just from passing a language hint. Two ways to do it:
Option A — per-request query param (recommended, keeps Lambda language-agnostic):
# Deepgram endpoint
curl -X POST https://<function-url>/v1/listen?language=yue \
-H "Authorization: Token <secret>" \
-H "Content-Type: audio/ogg" \
--data-binary @audio.ogg
# OpenAI endpoint
curl -X POST https://<function-url>/v1/audio/transcriptions \
-H "Authorization: Token <secret>" \
-F "file=@audio.ogg" \
-F "language=yue"
Option B — Lambda env var (simpler if you only ever transcribe one language):
WHISPER_LANGUAGE=yue
I use Option A — the language is set in my OpenClaw config (language: "yue" in the audio model), which passes it as ?language=yue to the Lambda on every request.
Real-time Factor
Once warm, the Lambda transcribes faster than real-time for typical voice messages:
| Audio Duration | Warm Response Time | Real-time Factor |
|---|---|---|
| 2.5s | ~10s | 4x |
| 33s | ~23s | 0.68x ✅ faster than real-time |
The 2.5s result looks slow (4x), but Whisper processes audio in 30-second chunks — the overhead is fixed regardless of audio length. For longer messages, the real-time factor drops well below 1x.
Open Source
The project is open source at gabrielkoo/aws-lambda-whisper-adaptor.
Key features:
- Any faster-whisper model via HF_MODEL_REPO env var
- GitHub Actions workflow to sync models from HuggingFace → S3
- Pre-built Docker image: ghcr.io/gabrielkoo/aws-lambda-whisper-adaptor:latest
- Configurable language detection via WHISPER_LANGUAGE env var or per-request parameter
Gotchas (Things I Learned the Hard Way)
float16 doesn’t work on CPU Lambda
CTranslate2 requires GPU for efficient float16 computation. If you set WHISPER_COMPUTE_TYPE=float16, you’ll get:
ValueError: Requested float16 compute type, but the target device or backend
do not support efficient float16 computation.
Use int8 instead. CTranslate2 quantizes the FP16 model weights at load time — same model, faster CPU inference.
Not all “faster-whisper” models are compatible
I tested the official SYSTRAN repos (Systran/faster-whisper-large-v3 and Systran/faster-distil-whisper-large-v3). Results were not great:
- faster-distil-whisper-large-v3: loaded, but returned English for Cantonese audio
- faster-whisper-large-v3: failed outright — RuntimeError: Unable to open file 'model.bin'
The community-converted models (Zoont/..., mobiuslabsgmbh/...) work reliably. Stick to those until the SYSTRAN format issues are resolved.
/tmp is too small for model downloads
Lambda’s /tmp defaults to 512MB (configurable up to 10GB, but costs extra). More practically: when syncing models locally before uploading to S3, your local /tmp may also be too small — we hit this downloading a 1.5GB model on a Raspberry Pi where /tmp only had ~856MB free. Use a directory with more space (e.g. your home directory) for local model prep.
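A cheap guard before any local model download avoids a half-written cache. This helper is illustrative (not part of the project); the 1.5 GB threshold matches the FP16 model that bit us:

```python
# Check free disk space before downloading a model to a local directory.
# The threshold (~1.5 GB, the FP16 model size) is an illustrative assumption.
import shutil

def has_free_space(path: str, needed_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= needed_bytes

MODEL_SIZE = int(1.5 * 1024**3)
if not has_free_space('/tmp', MODEL_SIZE):
    print('Not enough space in /tmp, use another directory (e.g. your home dir)')
```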
Changing HF_MODEL_REPO forces a new EFS bootstrap
When you update the HF_MODEL_REPO env var, Lambda spins up a fresh container. The new container checks for a marker file at /mnt/whisper-models/.ready-{model-slug} — which doesn’t exist yet — and triggers a full S3→EFS bootstrap (~55s for INT8). Plan model switches accordingly.
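The mechanism is just string derivation, so it's easy to see why a repo change invalidates the cache (paths as in the bootstrap code above):

```python
# The EFS marker path is derived from HF_MODEL_REPO, so a new repo value
# produces a new slug, the marker check misses, and a fresh S3 bootstrap runs.
def marker_path(hf_model_repo: str) -> str:
    slug = hf_model_repo.replace('/', '--')
    return f'/mnt/whisper-models/.ready-{slug}'

print(marker_path('Zoont/faster-whisper-large-v3-turbo-int8-ct2'))
# /mnt/whisper-models/.ready-Zoont--faster-whisper-large-v3-turbo-int8-ct2
```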
Race condition on EFS marker (low risk)
If two Lambda instances cold-start simultaneously while EFS is empty, both will attempt to download from S3. This is safe — the downloads are idempotent (same files, same paths), and the marker file write is the last step. Worst case: two containers download the same model in parallel, both succeed, and subsequent invocations use the cached EFS copy. No data corruption risk.
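If the duplicate download ever mattered (say, a much larger model), an advisory file lock would serialize the bootstrap. This is a sketch, not something the project does — EFS supports POSIX locks, and the same pattern is shown here against a local temp directory:

```python
# Sketch: serialize the first-run bootstrap with an advisory lock so only
# one instance downloads. Not part of the project; bootstrap_once and the
# lock/marker filenames are illustrative.
import fcntl
import os
import tempfile

def bootstrap_once(cache_dir: str, download) -> None:
    """Run `download()` exactly once per cache_dir, even under concurrency."""
    marker = os.path.join(cache_dir, '.ready')
    lock_path = os.path.join(cache_dir, '.bootstrap.lock')
    with open(lock_path, 'w') as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is held
        try:
            if not os.path.exists(marker):
                download()
                open(marker, 'w').close()  # marker written last, as in the real code
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

calls = []
with tempfile.TemporaryDirectory() as d:
    bootstrap_once(d, lambda: calls.append(1))
    bootstrap_once(d, lambda: calls.append(1))  # marker exists, download skipped
print(len(calls))  # 1
```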
Pre-warming
Cold starts happen when Lambda hasn’t been invoked recently. For predictable usage patterns (e.g. a morning standup bot), pre-warm the Lambda before you need it:
#!/bin/bash
# prewarm.sh — trigger Lambda init before expected usage
curl -s -o /dev/null \
-X POST "$WHISPER_LAMBDA_URL/v1/listen?language=yue" \
-H "Authorization: Token $WHISPER_API_SECRET" \
-H "Content-Type: audio/ogg" \
--data-binary @sample.ogg
echo "Lambda pre-warmed"
Schedule with cron: 0 8 * * * /path/to/prewarm.sh (runs at 8am daily).
Alternatively, use an EventBridge rule to ping the Lambda every few minutes — though at that frequency, Provisioned Concurrency starts making more sense cost-wise.
Conclusion
The Lambda + EFS + S3 architecture achieves:
- ~60s cold start (INT8 model, after first bootstrap); warm invocations with language=yue run in ~10s
- ~$0.21/month storage cost
- Zero idle cost (scales to zero)
- Deepgram and OpenAI compatible APIs
The key insight: EFS is the missing piece. It provides persistent, fast storage that Lambda can access without a NAT Gateway (using the free S3 VPC Gateway Endpoint for bootstrapping).
I couldn’t find any existing write-up of Whisper on Lambda using EFS for persistent model caching — most approaches either bundle the model in Docker (3-minute cold starts) or re-download from S3 on every cold start (no caching between instances). If you’ve seen this done before, I’d love to know.
Two things worth knowing before you deploy:
- SYSTRAN doesn’t publish an official CT2 model for large-v3-turbo — use a community-converted one like Zoont/faster-whisper-large-v3-turbo-int8-ct2
- Always pass a language parameter if you know it — cuts response time roughly in half
If you’re building voice transcription on AWS and want Whisper-quality accuracy without the SageMaker complexity, give it a try.