Building Your Own Modal: A First-Principles Guide to Serverless GPU Infrastructure
Why Modal Matters
Modal has quietly become one of the most impressive pieces of infrastructure in the ML ecosystem. The core value proposition is deceptively simple: write Python, decorate it, and it runs in the cloud with GPUs. No Dockerfiles. No Kubernetes manifests. No SSH sessions into EC2 instances. Just @app.function(gpu="A100") and you’re training models on hardware that costs $30,000.
But underneath that simple API is a genuinely hard engineering problem. How do you make remote execution feel local? How do you spin up GPU containers in milliseconds instead of minutes? How do you serialize arbitrary Python functions with their closures and dependencies?
I’ve been thinking about how I’d build this from scratch. Not because I’m planning to compete with Modal, but because understanding the architecture reveals fundamental insights about distributed computing, container orchestration, and the physics of cold starts.
The Architecture
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Client SDK    │────▶│   Control Plane  │────▶│   Worker Pool   │
│ (local Python)  │     │   (API + Queue)  │     │  (containers)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
This looks simple but each arrow represents months of engineering. Let’s break it down.
Function Serialization: The Hard Part
The fundamental challenge is capturing a Python function and everything it needs to run. This includes:
- The function bytecode itself
- Closure variables (anything referenced from enclosing scopes)
- Module dependencies (imports)
- Global state (unfortunately)
```python
import cloudpickle

def serialize_function(fn):
    return cloudpickle.dumps(fn)
```
Cloudpickle does the heavy lifting here, but it’s not magic. It walks the function’s __code__ object, captures __globals__, and recursively serializes referenced objects. This works remarkably well until it doesn’t—try serializing a function that references a database connection or a file handle and watch it explode.
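To make that concrete, here’s a small sketch (the names are mine): cloudpickle happily captures globals and closure variables, but a function that drags in a live file handle fails at dump time.

```python
import cloudpickle

scale = 10  # global, captured via __globals__

def make_scaler():
    offset = 3  # closure variable, captured via __closure__
    def scaled(x):
        return x * scale + offset
    return scaled

# Run as a script: cloudpickle serializes __main__ functions by value
blob = cloudpickle.dumps(make_scaler())
restored = cloudpickle.loads(blob)
assert restored(2) == 23  # bytecode, closure, and globals all survived the round trip

log = open("/tmp/run.log", "w")  # a live OS resource
def logged(x):
    log.write(f"{x}\n")
    return x

try:
    cloudpickle.dumps(logged)  # the file handle can't be serialized
except TypeError as e:
    print("serialization failed:", e)
```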
Modal gets around this with careful API design. Functions decorated with @app.function are analyzed at definition time, and their dependencies are explicitly declared through the Image class. This is why you write Image().pip_install("torch") instead of just having torch in your requirements.txt. The system needs to know exactly what goes into the container before your function ever runs.
The Decorator API
```python
class App:
    def __init__(self, name):
        self.name = name
        self.functions = {}

    def function(self, image=None, gpu=None, memory=None):
        def decorator(fn):
            wrapped = RemoteFunction(fn, image, gpu, memory)
            self.functions[fn.__name__] = wrapped
            return wrapped
        return decorator


class RemoteFunction:
    def __init__(self, fn, image, gpu, memory):
        self.fn = fn
        self.config = {"image": image, "gpu": gpu, "memory": memory}

    def remote(self, *args, **kwargs):
        payload = serialize_function(self.fn)
        return submit_job(payload, args, kwargs, self.config)

    def local(self, *args, **kwargs):
        return self.fn(*args, **kwargs)
```
The .remote() vs .local() distinction is crucial. During development, you call .local() to test on your machine. In production, .remote() ships the function to the cloud. Same code, different execution context. This is what makes Modal feel seamless.
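Wiring the sketches together, usage of this toy API (which only mirrors Modal’s shape) would look something like this:

```python
app = App("demo")

@app.function(gpu="A100")
def square(x):
    return x * x

print(square.local(3))     # runs in this process: 9
handle = square.remote(3)  # would serialize the function and enqueue it on the control plane
```

The decorator replaces square with a RemoteFunction, so the same symbol works in both execution contexts.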
Image Building: Two Approaches
Approach A: Dockerfile generation (simpler)
```python
class Image:
    def __init__(self):
        self.commands = ["FROM python:3.11-slim"]

    def pip_install(self, *packages):
        self.commands.append(f"RUN pip install {' '.join(packages)}")
        return self

    def to_dockerfile(self):
        return "\n".join(self.commands)
```
This works but it’s slow. Every pip_install becomes a Docker layer, and rebuilding means re-running pip from scratch if anything changes.
Approach B: Layered snapshots (what Modal actually does)
Instead of generating Dockerfiles, build images incrementally:
- Start from a base image with Python and common packages pre-installed
- Hash each .pip_install() / .apt_install() call by its contents
- Check if a layer with that hash already exists
- If yes, reuse it. If no, create it and cache it.
This is why Modal rebuilds are fast. When you change one dependency, only that layer gets rebuilt. The graph of layer dependencies is a DAG, and Modal caches aggressively at every node.
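A minimal sketch of that caching scheme (my own simplification, not Modal’s actual builder): each step is keyed by the hash of its parent layer plus its own command, so an unchanged prefix of the chain never gets rebuilt.

```python
import hashlib

class LayerCache:
    def __init__(self):
        self.layers = {}  # layer_hash -> built filesystem snapshot

    def key(self, parent_hash, command):
        # A layer's identity is its parent plus the exact command that produced it
        return hashlib.sha256(f"{parent_hash}|{command}".encode()).hexdigest()

    def build(self, parent_hash, command, build_fn):
        h = self.key(parent_hash, command)
        if h not in self.layers:  # miss: actually run the build step
            self.layers[h] = build_fn(self.layers.get(parent_hash), command)
        return h                  # hit: reuse the cached snapshot

cache = LayerCache()
base = cache.build(None, "FROM python:3.11-slim", lambda parent, cmd: {"cmd": cmd})
torch_layer = cache.build(base, "pip install torch", lambda parent, cmd: {"cmd": cmd})
# Changing "pip install torch" invalidates only that layer; base keeps its hash and is reused.
```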
The Cold Start Problem
This is where things get interesting. Cold start = time from job submission to function execution. Let’s break down where time goes:
| Phase | Time | Notes |
|---|---|---|
| Scheduling decision | 10-50ms | Fast if you’re not stupid |
| Image pull | 5-60s | The killer |
| Container creation | 100-500ms | Varies by runtime |
| Python interpreter init | 200-500ms | Unavoidable |
| Import dependencies | 1-10s | Second killer |
| App init (load model) | 1-60s | User code, hard to optimize |
For ML workloads, a 30-second cold start is unacceptable. Imagine a user hitting your inference endpoint and waiting half a minute for PyTorch to import. This is why Modal’s cold start optimization is their core competitive advantage.
Strategy 1: Warm Pools
The most effective solution is to just… not have cold starts. Keep containers running and idle, ready to accept work:
```python
class WarmPoolManager:
    def __init__(self):
        self.pools = {}  # image_hash -> list of warm containers

    async def get_container(self, image_hash, config):
        pool = self.pools.get(image_hash, [])
        if pool:
            return pool.pop()  # Instant!
        return await self.create_container(image_hash, config)  # Slow

    async def return_container(self, container, image_hash):
        await container.reset()  # Clear state, keep process
        # setdefault so the first return for a new image creates its pool
        self.pools.setdefault(image_hash, []).append(container)
```
The economics here are interesting. A warm A100 container costs ~$2/hour even when idle. But if your customers are paying $3/hour when active and experiencing 10x better latency, the math works out. Modal keeps per-customer, per-function warm pools. Expensive, but necessary.
Strategy 2: Lazy Image Pulling with eStargz
Traditional container pulls download the entire image before starting. But what if you could start immediately and fetch files on-demand?
eStargz (Seekable tar.gz) reformats container layers so individual files can be fetched via HTTP range requests. The container starts with a stub filesystem, and files are pulled lazily on first access.
```bash
# Convert an image to the eStargz format with ctr-remote (ships with stargz-snapshotter);
# the :esgz target tag is just an illustrative name
ctr-remote image optimize --oci docker.io/myimage:latest docker.io/myimage:esgz
```
In practice, this means your container starts in ~500ms, and the first import of torch takes an extra second while the .so files are fetched. Total time: 1.5s instead of 30s. The files are cached locally after first access, so subsequent runs are instant.
Strategy 3: CRIU Snapshots
CRIU (Checkpoint/Restore In Userspace) can snapshot a running Linux process—memory, file descriptors, network connections, everything—and restore it later. The same snapshot-and-restore idea is how AWS Lambda’s SnapStart gets cold starts down to a few hundred milliseconds, though Lambda checkpoints whole Firecracker microVMs rather than individual processes.
```python
# First run: initialize everything (slow, ~30s)
container = start_container()
container.exec("python -c 'import torch; model = load_model()'")

# Checkpoint the entire process state
container.checkpoint("/snapshots/my-model-ready")

# Later cold starts: restore from checkpoint (~100ms)
container = restore_checkpoint("/snapshots/my-model-ready")
```
For ML workloads where model loading dominates cold start time, this is transformative. Instead of loading a 7GB model from disk every time, you restore a process that already has the model in memory.
The catch: GPU state is tricky. CUDA contexts don’t checkpoint cleanly, so you need to reinitialize the GPU after restore. There are workarounds (checkpoint before GPU init, then init on restore), but it’s not seamless.
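The usual pattern looks roughly like the sketch below, where checkpoint_here() is a hypothetical stand-in for whatever triggers the actual CRIU dump: do everything that can live in host RAM before the snapshot, and defer all CUDA work until after restore.

```python
import torch

def checkpoint_here():
    """Hypothetical hook: in a real system this would trigger a CRIU dump
    of the current process; here it's just a marker."""
    pass

# Phase 1: runs once, before the checkpoint. Imports and model weights sit in
# host RAM, so they're captured in the snapshot. "model.pt" is a placeholder path.
model = torch.load("model.pt", map_location="cpu")
checkpoint_here()

# Phase 2: runs on every restore. CUDA contexts don't survive checkpointing,
# so the GPU is initialized fresh each time.
model = model.cuda()
```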
Strategy 4: Fork-Based Isolation
Instead of starting new containers, fork from a warm parent:
```python
# Parent process (warm, imports loaded)
import os

import torch
import transformers

while True:
    job = queue.get()  # queue and run_job come from the surrounding worker code
    pid = os.fork()
    if pid == 0:
        # Child: already has torch in memory via COW
        run_job(job)
        os._exit(0)
    else:
        os.waitpid(pid, 0)
```
Copy-on-write semantics mean the fork is nearly instant. The child process shares memory pages with the parent until it writes to them. For read-heavy workloads (inference), this means you get the parent’s entire import tree for free.
This is how Gunicorn’s prefork model works, and it’s remarkably effective. The limitation is isolation—forked processes share file descriptors and can interfere with each other in subtle ways.
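One detail the snippet above glosses over is getting results back out of the child. A common trick (a generic sketch, not tied to any particular framework) is a pipe per fork:

```python
import os
import pickle

def run_forked(fn, *args):
    """Run fn(*args) in a forked child and return its result via a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: inherits the parent's memory via COW
        os.close(r)
        with os.fdopen(w, "wb") as pipe:
            pickle.dump(fn(*args), pipe)
        os._exit(0)
    os.close(w)   # parent: read until the child closes its end
    with os.fdopen(r, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return pickle.loads(data)

print(run_forked(pow, 2, 10))  # 1024
```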
Let’s Do Some Math
How much does all this optimization matter? Let’s compute.
Assume you’re running an inference service that handles 1000 requests/day, with each request taking 2 seconds of GPU time. Without warm pools:
- Cold start: 30s average
- Compute: 2s
- Total latency: 32s
- Daily GPU hours: 1000 × 32s = 8.9 hours
With warm pools (assume 90% warm hit rate):
- 900 requests × 2s = 1800s
- 100 requests × 32s = 3200s
- Total: 5000s = 1.4 hours
- Daily GPU hours: 1.4 hours compute + 24 hours warm pool = 25.4 hours
Wait, that’s worse! The warm pool costs more than the cold starts saved.
But here’s what the math misses: latency matters. Users won’t wait 32 seconds. They’ll leave. If warm pools increase your conversion rate by 50%, the economics flip entirely. This is why Modal can charge premium prices—they’re not just selling compute, they’re selling latency.
Let’s redo the math with CRIU snapshots:
- Cold start with snapshot: 200ms
- 900 warm requests × 2s = 1800s
- 100 cold requests × 2.2s = 220s
- Total: 2020s = 0.56 hours
- No warm pool cost
- Daily GPU hours: 0.56 hours
Now we’re talking. Snapshots give you the latency benefits of warm pools without the idle cost. This is why AWS invested so heavily in Firecracker + snapshotting for Lambda.
The Control Plane
Let’s talk about the orchestration layer. You need:
- API server: Receives job submissions, returns handles
- Job queue: Distributes work to workers
- Scheduler: Decides which worker runs which job
- Metadata store: Tracks job state, logs, results
Technology choices:
| Component | Options | My pick |
|---|---|---|
| API | FastAPI, gRPC | gRPC for perf, FastAPI for dev speed |
| Queue | Redis Streams, NATS, Kafka | Redis Streams (simple, fast enough) |
| Scheduler | Kubernetes, Nomad, custom | Kubernetes for GPUs, custom for everything else |
| Database | Postgres, CockroachDB | Postgres + S3 for blobs |
```python
import json
import pickle
from uuid import uuid4

async def submit_job(function_payload, args, kwargs, config):
    job_id = uuid4()

    # Store function and args in S3 (cheap, durable)
    await s3.put(f"jobs/{job_id}/function.pkl", function_payload)
    await s3.put(f"jobs/{job_id}/args.pkl", pickle.dumps((args, kwargs)))

    # Queue for execution
    await redis.xadd("jobs", {
        "job_id": str(job_id),
        "config": json.dumps(config),
        "image_hash": config["image"].hash(),
    })
    return JobHandle(job_id)
```
The worker side is simpler:
```python
async def worker_main():
    while True:
        _, job = await redis.xread("jobs", block=0)
        fn = cloudpickle.loads(await s3.get(f"jobs/{job.id}/function.pkl"))
        args, kwargs = pickle.loads(await s3.get(f"jobs/{job.id}/args.pkl"))
        try:
            result = fn(*args, **kwargs)
            await s3.put(f"jobs/{job.id}/result.pkl", pickle.dumps(result))
            await redis.set(f"job:{job.id}:status", "completed")
        except Exception as e:
            await redis.set(f"job:{job.id}:status", f"failed:{e}")
```
GPU Scheduling
GPUs are the hard part. Unlike CPUs, you can’t easily time-slice a GPU between processes. Each job needs exclusive access to a GPU for the duration of its execution.
Kubernetes has reasonable GPU support via the NVIDIA device plugin:
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: worker
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    gpu-type: a100
```
But Kubernetes scheduling is slow (~5s to schedule a pod) and doesn’t understand GPU topology. If you need to co-locate two pods that communicate via NVLink, you’re on your own.
For serious GPU workloads, you probably want a custom scheduler that understands:
- GPU memory requirements (not all A100s are equal—40GB vs 80GB)
- Multi-GPU jobs (need GPUs on same node for NVLink)
- Preemption (can we evict a low-priority job to run a high-priority one?)
- Bin packing (fit small jobs onto partially-used GPUs)
This is genuinely hard. I suspect Modal has a custom scheduler, but they haven’t written about it publicly.
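To make those requirements concrete, here’s a toy placement function (entirely my own sketch, not Modal’s scheduler): filter nodes by GPU memory, count, and NVLink, then bin-pack by preferring the most-utilized node that still fits.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gpu_mem_gb: int   # per-GPU memory (40GB vs 80GB A100s)
    free_gpus: int
    total_gpus: int
    nvlink: bool

def place(job_gpus, job_mem_gb, needs_nvlink, nodes):
    candidates = [
        n for n in nodes
        if n.free_gpus >= job_gpus           # multi-GPU jobs must land on one node
        and n.gpu_mem_gb >= job_mem_gb
        and (n.nvlink or not needs_nvlink)
    ]
    if not candidates:
        return None                          # caller decides: queue, preempt, or scale up
    # Bin packing: prefer the node that is already most used,
    # keeping empty nodes free for large jobs
    return max(candidates, key=lambda n: (n.total_gpus - n.free_gpus) / n.total_gpus)

nodes = [Node("a", 80, 8, 8, True), Node("b", 40, 2, 8, False)]
print(place(job_gpus=2, job_mem_gb=40, needs_nvlink=False, nodes=nodes).name)  # "b"
```

A real scheduler layers preemption and priority on top of this, but the filter-then-score shape is the same.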
What Would It Cost?
Let’s estimate the cost to build a minimal Modal clone:
| Component | Effort | Notes |
|---|---|---|
| SDK + serialization | 2-4 weeks | Cloudpickle does the heavy lifting |
| Image builder | 4-8 weeks | Layer caching is tricky |
| Control plane | 4-8 weeks | API, queue, scheduler |
| Worker runtime | 2-4 weeks | Container management |
| Warm pools | 4-8 weeks | Predictive scaling is hard |
| CRIU integration | 4-8 weeks | GPU state is painful |
| Web UI | 4-8 weeks | Logs, monitoring, billing |
| GPU scheduling | 8-16 weeks | The hardest part |
Total: 8-16 months for a small team. And that gets you to feature parity with Modal circa 2022. They’ve had three more years to optimize.
The 80/20 Version
If I wanted 80% of Modal’s value with 20% of the effort:
- Skip warm pools initially. Accept 10-30s cold starts.
- Use Kubernetes. Don’t build a custom scheduler.
- Use Kaniko for in-cluster image builds.
- Use Redis for job queue and state.
- Use S3 for function/result storage.
- Skip CRIU. It’s powerful but complex.
This gets you a working system in 2-3 months. You can add warm pools and snapshotting later when cold starts become the bottleneck.
```python
# Minimal SDK - ~200 lines of code
@app.function(image=Image().pip_install("torch"), gpu="T4")
def train(config):
    import torch
    # ... training code ...
    return metrics

# Usage
handle = train.remote({"lr": 0.001})
result = handle.result()  # Blocks until complete
```
Closing Thoughts
Modal is impressive not because any single component is revolutionary, but because they’ve executed well on dozens of hard problems simultaneously. Function serialization, image building, cold start optimization, GPU scheduling, secrets management, volume mounts, web endpoints, cron jobs—each one is a project unto itself.
The fundamental insight is that developer experience matters. Modal could have built yet another Kubernetes wrapper with YAML files and kubectl commands. Instead, they asked: what if deploying to the cloud felt like running code locally? That question led them to solve problems that existing infrastructure ignored.
If you’re building ML infrastructure, the lesson isn’t “copy Modal.” It’s “understand your users’ pain points at a deep level and solve them end-to-end.” Modal’s users don’t care about containers or orchestration. They care about training models and running inference. Modal made the infrastructure invisible, and that’s why it works.
Thanks to Claude for helping me think through this architecture. The conversation that led to this post is preserved in my chat history if you want to see the iterative refinement process.