Every SaaS product we've built started with a version of the same conversation: “We want to move fast now but not paint ourselves into a corner.” That tension is real, and it doesn't resolve itself. The architecture decisions you make at v0.1 compound — for better or worse — as you scale. Here's what we've learned about making the right calls early.
Most early-stage SaaS products use one of three tenancy models: a shared database and schema with tenant-scoped rows, a shared database with a separate schema per tenant, or a database per tenant. Each has tradeoffs.
The row-level tenancy model (one database, all tenants, every table has a tenant_id column) is the most common starting point and usually the right one for early stage. It's operationally simple and makes feature development fast. The risk is data isolation — a missing WHERE tenant_id = ? clause is a cross-tenant data leak. You mitigate this with row-level security in the database, not just application-level guards.
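Database-level RLS is the backstop; the application layer should still make an unscoped query impossible to write by accident. A minimal sketch of the application side, assuming a hypothetical `withTenant` helper that every data-access call goes through:

```typescript
// Hypothetical helper: all query filters pass through withTenant, which
// injects tenant_id so a data-access path can never omit the scope.
type Filter = Record<string, unknown>;

function withTenant(tenantId: string, filter: Filter = {}): Filter {
  if (!tenantId) {
    // Fail loudly rather than run a cross-tenant query.
    throw new Error("tenant context missing: refusing to run an unscoped query");
  }
  // tenant_id always comes from the session-derived argument,
  // never from caller-supplied filter values.
  return { ...filter, tenant_id: tenantId };
}
```

Pair this with a Postgres row-level security policy on each table so that even a query that bypasses the helper cannot cross tenants.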
Database-per-tenant becomes worth the operational overhead when you're selling to enterprise customers who require data isolation by contract, or when your compliance posture (HIPAA, SOC 2) makes shared infrastructure risky. You can usually defer this until you have a customer who explicitly requires it.
The migration from shared to isolated tenancy is painful but survivable. The thing you can't easily fix retroactively is a data model that doesn't have tenant isolation at all — where tenant context lives in the session instead of the data layer.
Authentication (who are you?) and authorization (what are you allowed to do?) are different problems that get conflated constantly. Authentication is largely solved — use an identity provider (Auth0, Clerk, Cognito) and don't build it yourself. Authorization is product-specific and almost always needs to be custom.
The pattern we reach for is a simple RBAC (role-based access control) layer with four or five roles per product, implemented close to the data layer. Don't scatter authorization checks throughout your application code — you'll miss a path. Build a permissions layer and route all data access through it.
The mistake that costs teams the most time is implementing authorization as a UI concern. Hiding a button in the frontend is not authorization — it's UX. Your API needs to enforce permissions independently of what the frontend shows.
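A sketch of what that permissions layer can look like, assuming hypothetical role and permission names (yours will differ), with enforcement in one place that every API path calls:

```typescript
// Hypothetical RBAC layer: roles map to permissions, and every data
// access path calls authorize() before touching the database.
type Role = "owner" | "admin" | "member" | "viewer";
type Permission = "project:read" | "project:write" | "billing:manage";

const rolePermissions: Record<Role, Permission[]> = {
  owner: ["project:read", "project:write", "billing:manage"],
  admin: ["project:read", "project:write"],
  member: ["project:read", "project:write"],
  viewer: ["project:read"],
};

function can(role: Role, permission: Permission): boolean {
  return rolePermissions[role].includes(permission);
}

// Enforced server-side, regardless of what the UI shows or hides.
function authorize(role: Role, permission: Permission): void {
  if (!can(role, permission)) {
    throw new Error(`forbidden: ${permission}`);
  }
}
```

The frontend can still call `can()` to decide what to render, but the API calls `authorize()` on every request, so hiding a button is never the only line of defense.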
The moment your product needs to do something that takes longer than a request/response cycle — send an email, process an upload, run a report — you need a job queue. The temptation is to do async work inline using a hack (a setTimeout, a non-awaited promise, a background thread) and ship it. Resist this. Inline async work doesn't retry on failure, doesn't have visibility, and becomes a mystery when something goes wrong in production.
Set up a proper job queue (BullMQ on Redis, or a managed service like Inngest) early, even if you only have two jobs. The infrastructure investment is small. The operational benefit — visibility into what's running, automatic retries, dead letter queues — is significant from the moment you have real users.
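To make the contrast concrete, here is a toy illustration of the retry behavior a real queue gives you for free. This is not a queue — it has none of the durability, visibility, or backoff of BullMQ or Inngest — it just shows the semantics that inline async hacks lack:

```typescript
// Toy in-memory sketch: a job that retries on failure instead of
// silently dying, which is what a setTimeout or fire-and-forget
// promise would do. Illustration only; use a real queue in production.
async function runWithRetries<T>(
  job: () => Promise<T>,
  attempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await job();
    } catch (err) {
      lastError = err; // a real queue would also log and back off here
    }
  }
  // A real queue would move the job to a dead letter queue for inspection.
  throw lastError;
}
```

A flaky email send that fails twice and succeeds on the third attempt completes here; the inline version would have failed silently on the first error.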
A feature flag system is underrated as an architectural investment. The ability to deploy code to production that isn't yet active, to roll features out gradually, and to kill a feature instantly without a rollback changes how your team works.
You don't need LaunchDarkly on day one. A simple features table in your database with tenant and user overrides is enough to start. The key is building the pattern into your codebase early — if (featureEnabled('new-dashboard', user)) — so it becomes the standard way to ship anything non-trivial.
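A minimal sketch of that pattern, assuming flag rows loaded from a hypothetical `features` table, with the most specific override winning:

```typescript
// Hypothetical shape of a row in the features table: a global default
// plus per-tenant and per-user overrides.
interface FeatureRow {
  feature: string;
  enabled: boolean; // global default
  tenantOverrides: Record<string, boolean>;
  userOverrides: Record<string, boolean>;
}

function featureEnabled(
  rows: FeatureRow[],
  feature: string,
  ctx: { tenantId: string; userId: string },
): boolean {
  const row = rows.find((r) => r.feature === feature);
  if (!row) return false; // unknown flags default to off
  // Most specific override wins: user, then tenant, then global.
  if (ctx.userId in row.userOverrides) return row.userOverrides[ctx.userId];
  if (ctx.tenantId in row.tenantOverrides) return row.tenantOverrides[ctx.tenantId];
  return row.enabled;
}
```

Defaulting unknown flags to off means a typo in a flag name fails safe rather than shipping a feature by accident.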
Logs tell you what happened. Traces tell you why it was slow. Metrics tell you whether the system is healthy. You need all three, and they need to be connected. A structured logging setup that emits trace IDs lets you correlate a slow request in your APM tool with the exact log lines that explain it.
Start with structured logging (JSON, not plaintext) from day one. Add distributed tracing before you have more than three services. Set up alerting before you have paying customers — not after. The cost of instrumenting a system properly from the start is much lower than debugging a production incident in a system you can't see inside.
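The logging side of this is small. A minimal sketch of a structured logger that emits one JSON object per line and always carries the trace ID (field names here are illustrative, not a standard):

```typescript
// Minimal structured logger sketch: one JSON object per line, always
// carrying the trace ID so log lines can be joined against APM traces.
function logEvent(
  traceId: string,
  level: "info" | "warn" | "error",
  message: string,
  fields: Record<string, unknown> = {},
): string {
  const entry = {
    ts: new Date().toISOString(),
    level,
    traceId, // the key that links this line to the distributed trace
    message,
    ...fields,
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}
```

Because every line is machine-parseable and trace-tagged, "show me all log lines for this slow request" becomes a filter, not an archaeology project.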
The teams that scale fastest aren't the ones who made the cleverest architectural choices at v0.1. They're the ones who made boring, conventional choices with good operational practices, and who built systems they could observe and reason about under pressure. Scaling is mostly an operational problem, not an architectural one — until it isn't. By then, if you've done the fundamentals well, you'll know exactly what to change.