The Simplest Complex SWE Project

What makes software engineering hard or complex? It varies between projects & domains - but there’s a common flavor of complexity that experienced engineers implicitly understand.
How would you best convey this to a bright CS undergrad? College misleads about where complexity lives, often implying it lives somewhere between linked lists & LeetCode problems. This is one advantage of doing internships before entering the workforce: you gain a hard-to-convey understanding of where real-world challenges come from.
This blog post takes a stab at conveying that intuition. It’s written for aspiring software engineers who know how to code. I will provide a single project, with a list of complications that should allow you to grok real-world engineering complexity.
The core project should be doable in an hour. The entire list of complications should take you a month or more. I encourage you to read the entire list, and attempt as many of the complications as you’d like. You need not commit to technical decisions ahead of time; expect to change many of them as you progress.
The decisions here are representative of what a founder of an ML startup, or the 5th engineer there, might be thinking about - but not of what a Google engineer would be thinking about, as these decisions are already nailed down at large companies and most engineers there build on top of bespoke infra.
Feel free to ask an LLM to help you understand parts of this document, or about trade-offs associated with any decisions. You may benefit from understanding the decisions, even if you don’t ultimately implement them.

Task

Build a website that allows a user to submit a prompt, and uses an AI image generation model to create an image. Allow them to log in and have a gallery that persists past image generations.
For reference, here’s a website that does that, with its source code. Feel free to copy as much of it as helps.

Core decisions

Frontend & backend

React is the modern standard, and you should probably default to it. Some sample options:
  • Next.js for the frontend + backend
  • Next.js for the frontend + Python for the backend
    • For Python, you’ll likely pick amongst Flask, Litestar (formerly Starlite), FastAPI
  • Static React for the frontend + Express for the backend

Deployment

Focusing on options with a strong cheap or free offering, consider:
  • Vercel (for Next.js & other JS options)
  • Render (up to 1 small service for free)
  • Fly.io (~$3/mo for a small instance, bills waived under $5/mo)
  • Cloudflare Workers
Additionally, you may consider VPS-based options, eg: DigitalOcean, Hetzner - but these aren’t used in practice for modern SWE. Try to figure out why.
Bigger companies tend to deploy (& host their DB & storage) on a hyperscaler cloud (ie: GCP, AWS or Azure). I would usually recommend against this for you. Try to figure out why.

Database

Focusing on options with a strong cheap or free offering, consider:
  • Postgres-based offerings: Neon, Supabase
  • NoSQL offerings: Firebase
  • SQLite-based offerings: Cloudflare D1
You’ll also need to figure out what your DB schema ought to look like.
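As a starting point, here is one possible schema, sketched with Python's built-in sqlite3 so it runs anywhere. The table and column names (users, generations, status, image_url) are my assumptions for illustration, not a required design, and a Postgres version would use slightly different types.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: one row per user, one row per image generation.
SCHEMA = """
CREATE TABLE users (
    id          INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE generations (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    prompt      TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'done', 'failed')),
    image_url   TEXT,  -- NULL until the worker finishes
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- The gallery query filters by user and sorts by time, so index for that.
CREATE INDEX generations_by_user ON generations (user_id, created_at DESC);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the tables (in-memory by default)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


def gallery(conn: sqlite3.Connection, user_id: int) -> List[Tuple[str, str]]:
    """Return (prompt, image_url) for a user's finished images, newest first."""
    rows = conn.execute(
        "SELECT prompt, image_url FROM generations "
        "WHERE user_id = ? AND status = 'done' "
        "ORDER BY created_at DESC, id DESC",
        (user_id,),
    )
    return rows.fetchall()
```

Tracking a status column is one way to represent in-flight generations, which becomes useful once you get to the latency complications below.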

Auth

All options generally have strong free offerings:
  • DIY: Auth.js
  • Managed: Clerk, Auth0, Stytch
  • DB-coupled: Supabase, Firebase
You likely want to use an OAuth provider on top of that, eg: Sign-in with Google, which requires registering with Google and figuring out the config within your Google account.

Model Worker

You may want to have more control over the model worker. All options generally have strong free or cheap offerings. Three tiers of complexity:
  • Pre-hosted (Together, Replicate): an existing model is ready to use with a simple API
  • Custom serverless (Modal, Replicate): you must build your own model Docker image conforming to a protocol, but the provider will handle everything else for you.
  • Machine (Vast.ai, Prime Intellect): you must build your own model Docker image, and also deal with eg: queueing, scale-up.
The last tier is quite hard to start with - I’d avoid it for now and at most start with “Custom serverless”.

Complications

Custom domain

Deploy this to your own domain (~$12/y) - you could use a subdomain of a personal website (eg: imagegen.venki.dev).
  • What registrar do you use? (I’d recommend Cloudflare)
  • How do you configure DNS?

Image persistence

You could start storing images in a DB, but you likely want to use an S3-like bucket instead. I’d recommend Cloudflare R2.
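A minimal sketch of the bucket approach: upload the bytes under a derived key, and store only that key in your DB. The key layout, the ObjectStore protocol and the InMemoryStore stand-in are all hypothetical names of mine. Since R2 exposes an S3-compatible API, I'm assuming a boto3 S3 client (whose put_object takes the same keyword arguments) could play the ObjectStore role in a real deployment.

```python
import hashlib
from typing import Any, Dict, Protocol, Tuple


class ObjectStore(Protocol):
    """The one method this sketch needs from an S3-like client."""

    def put_object(self, *, Bucket: str, Key: str, Body: bytes, ContentType: str) -> Any: ...


def object_key(user_id: int, image_bytes: bytes, ext: str = "webp") -> str:
    """Derive a stable key from the content, so re-uploads are idempotent."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return f"users/{user_id}/{digest}.{ext}"


def save_image(store: ObjectStore, bucket: str, user_id: int, image_bytes: bytes) -> str:
    """Upload the image and return the key to record in the DB row."""
    key = object_key(user_id, image_bytes)
    store.put_object(Bucket=bucket, Key=key, Body=image_bytes, ContentType="image/webp")
    return key


class InMemoryStore:
    """Stand-in for a real bucket, useful for local tests."""

    def __init__(self) -> None:
        self.objects: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, *, Bucket: str, Key: str, Body: bytes, ContentType: str) -> None:
        self.objects[(Bucket, Key)] = Body
```

Keeping the DB row small (a key, not a blob) is the main point; how you then serve the image (public bucket, signed URLs, a CDN) is a separate decision.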

Minimize latency

Reduce latency between submitting a prompt & receiving an image. Some ideas:
  • Avoid unnecessary network roundtrips to authenticate a user (use JWTs)
  • Ensure that your frontend gets the image as soon as the image is done.
    • Also: ensure that if you refresh your page before the image is done, it still shows up without additional refreshes, ie: some sort of polling or websocket.
  • Profile latency to understand how much of it comes from various pieces (eg: round-trips between server ↔ db, auth, worker)
  • Sending large images to a client might be slow; look into compressed formats like WebP, or intermediate placeholders you can send first, like BlurHash
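To make the JWT idea concrete: the server can verify a signed token locally, with no DB or auth-service round trip per request. Below is a stdlib-only sketch of HS256 signing and verification, with sign_token and verify_token as hypothetical helper names. It's for illustration only; in a real app you'd likely use a maintained JWT library, and managed auth providers often issue asymmetrically signed tokens instead, which need a different verification path.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: Dict[str, Any], secret: bytes) -> str:
    """Create an HS256-signed token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_token(
    token: str, secret: bytes, now: Optional[float] = None
) -> Optional[Dict[str, Any]]:
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison, checked before trusting any token content.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # wrong shape, bad base64, bad JSON
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        return None  # this sketch requires an unexpired "exp" claim
    return claims
```

The trade-off is revocation: a token stays valid until it expires, so short expiries (plus some refresh mechanism) are the usual compromise.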

Maximize throughput

Models often get the best throughput by batching requests. How do you allow for batching while limiting latency?
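One common answer is micro-batching: wait a short, bounded time for more requests to arrive, then run whatever you have. Here's a minimal asyncio sketch of that idea. MicroBatcher, max_batch and max_wait_s are hypothetical names, and run_batch stands in for your actual batched model call.

```python
import asyncio
from typing import Awaitable, Callable, List

BatchFn = Callable[[List[str]], Awaitable[List[str]]]


class MicroBatcher:
    """Group requests into batches of up to max_batch, waiting at most max_wait_s."""

    def __init__(self, run_batch: BatchFn, max_batch: int = 4, max_wait_s: float = 0.05):
        self._run_batch = run_batch
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Enqueue one prompt and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, future))
        return await future

    async def run_forever(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request exists, then start the clock.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = await self._run_batch([prompt for prompt, _ in batch])
            except Exception as exc:
                # Fail every caller in the batch rather than hanging them.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)
```

Here max_wait_s is the latency you're willing to add in exchange for fuller batches: the first request in a batch waits at most that long before the model runs.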

Custom Model

Make modifications to the weights and/or inference code, such that you can’t get inference off-the-shelf. This will make hosting the model notably more challenging.

Scaling up

How do various components of your system scale under load? It’s generally pretty trivial for your frontend & backend, but your model worker is where you’ll most need to understand this.
If you’re managing your own GPUs, you’ll have a lot more complexity here as you decide your own strategies about how you load-balance, queue & scale. For example, do you use a standard HTTP load balancer, or a message queue, or something else?

Scaling-to-zero and cold-starts

GPUs are expensive to keep always-on, especially when you have a custom model which is going to need a dedicated GPU.
One trick is to scale-to-zero, turning off all GPUs when you’re not using them. However, this means that you’re going to need to deal with cold-starts, ie: greater latency when you have zero running instances.
How do you deal with this, and minimize cold-start latency?
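One small piece of the answer is making sure the expensive load happens once per worker process, not once per request. The sketch below shows that pattern with a cached loader. FakeModel and its sleep are stand-ins I made up to simulate weight loading; a real worker would load actual weights onto the GPU there.

```python
import time
from functools import lru_cache
from typing import Tuple


class FakeModel:
    """Stand-in for a real model; the sleep simulates loading weights."""

    def __init__(self, load_seconds: float) -> None:
        time.sleep(load_seconds)

    def generate(self, prompt: str) -> str:
        return f"image-for:{prompt}"


@lru_cache(maxsize=1)
def get_model() -> FakeModel:
    """Load the model on first use, then reuse it for the life of the process."""
    return FakeModel(load_seconds=0.2)


def handle_request(prompt: str) -> Tuple[str, float]:
    """Return the result plus how long this request spent waiting on model load."""
    start = time.perf_counter()
    model = get_model()
    load_wait_s = time.perf_counter() - start
    return model.generate(prompt), load_wait_s
```

This only keeps warm requests fast. It doesn't shrink the first cold start itself, which is where things like container image size and how weights get onto the machine come in.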

Toolchains & Typing

How do you manage packages for your repo? Consider learning about & using uv for Python, or bun or pnpm for TypeScript.
Ensure that your repos are strongly typed, linted, and formatted. Investigate setting up these configurations. For Python, I’d expect to use Pyright (typing) and Ruff (formatting + linting). For TypeScript, I’d expect to use TypeScript (typing), ESLint (linting) and Prettier (formatting).

Continuous Deployment

New commits to the main branch of your repo should automatically deploy all services: the frontend, backend & worker.
You likely need to use GitHub Actions to set this up.

Dev & Staging env setup

How should new developers use the repo while writing new code? And how do they test it out? This might be a particularly messy question for ML models that require a GPU to test.
How would you deploy a staging environment that mirrors the production environment 1:1?
How do you configure your worker, server & model to talk to the right instances of each other? (ie: dev ↔ dev, prod ↔ prod)
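One common approach is to keep the code identical across environments and inject the wiring through environment variables, failing fast when something is missing. A minimal sketch follows. The variable names (APP_ENV, DATABASE_URL, WORKER_URL) and the Settings class are assumptions of mine, not a convention your host necessarily uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_ENVS = ("dev", "staging", "prod")


@dataclass(frozen=True)
class Settings:
    env: str
    database_url: str
    worker_url: str


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, refusing to start half-configured."""
    env = environ.get("APP_ENV", "dev")
    if env not in VALID_ENVS:
        raise ValueError(f"APP_ENV must be one of {VALID_ENVS}, got {env!r}")
    missing = [name for name in ("DATABASE_URL", "WORKER_URL") if not environ.get(name)]
    if missing:
        # Crash at startup instead of silently falling back to another env's services.
        raise RuntimeError(f"missing required settings for {env}: {', '.join(missing)}")
    return Settings(
        env=env,
        database_url=environ["DATABASE_URL"],
        worker_url=environ["WORKER_URL"],
    )
```

Each deployment (dev, staging, prod) then sets its own values in the host's dashboard or secrets store, so a staging server can't end up pointing at the prod worker through a hard-coded default.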

Analytics & Monitoring

You might want to try integrating one or more types of visibility into the state of your app, ie:
  • Web Analytics (Google Analytics, Cloudflare Analytics, PostHog): How are people finding your website? Where are they from? What pages are they visiting?
  • App Analytics (PostHog, Amplitude): What sorts of actions are people taking in your app?
  • Error monitoring (PostHog, Sentry): Are errors being thrown by your app - if so, how often, and what are the stack traces like?
  • Logging: If you run into a bug in production, how are you going to see what was happening in the various pieces of code at the time of the incident?
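For the logging question, one approach that tends to pay off is structured (JSON) logs that carry a request ID, so you can follow a single generation across the server and the worker. Here's a sketch using only Python's standard logging module. The field names and the request_id convention are my assumptions.

```python
import json
import logging
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"request_id": ...}); None if absent.
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Create a logger that writes JSON lines to the stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Usage looks like logger.info("generation started", extra={"request_id": "req-123"}). If the server passes the same ID along to the worker, searching your logs for that one ID reconstructs the whole request.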

Frontend

We’ve side-stepped most of the frontend complexity, but there are questions you can tackle like:
  • What sorts of component libraries do you use? (eg: shadcn/ui, Mantine, Chakra UI)
  • What sorts of styling approaches do you use? (eg: Tailwind CSS, styled-components, Sass)
  • What sort of state management library do you use? (eg: Redux, MobX, Zustand, Vanilla React)
    • (This app is perhaps too simple for this)
  • What sort of data fetching library do you use? (eg: useSWR, RTK Query, TanStack Query)
You may decide to add more features if you’d like (eg: paid subscriptions, sharing, public feeds).
 
