Table of contents
- Introduction
- The Framework
- Step 1: Requirements
- Step 2: Scale Estimation
- Step 3: API Design
- Step 4: Data Model
- Step 5: Slug Generation
- Step 6: Full Architecture
- Step 7: Failure Handling
- Extension: Adding Analytics Dashboard
- Extension: Custom Domains
- What Makes This a Strong Interview Answer
- Key Takeaways
Introduction
A URL shortener is the perfect first system design problem. Simple enough to understand fully, deep enough to touch every concept — scaling, caching, databases, availability, and algorithm design.
This post follows a framework you can apply to any system design problem.
The Framework
- Understand the requirements
- Estimate the scale
- Design the API
- Design the data model
- High-level architecture
- Deep dive into bottlenecks
- Handle failures
Step 1: Requirements
Functional:
- User gives a long URL → system returns a short URL
- User visits short URL → system redirects to original URL
- Short URLs should be as short as possible
- Optional: custom slug, expiry date
Non-functional:
- High availability — if the system is down, every shortened link out there breaks
- Low latency redirects — must feel instant
- No collisions — two long URLs cannot get the same short URL
- Scalable — billions of URLs, trillions of redirects
Step 2: Scale Estimation
Numbers drive architecture decisions.
Writes (URL creation):
100 million new URLs/day
100,000,000 / 86,400 = ~1,200 writes/second
Reads (redirects):
Assume 100:1 read-to-write ratio
10 billion redirects/day
10,000,000,000 / 86,400 = ~115,000 reads/second
Storage:
~500 bytes per URL record
100M URLs/day × 365 days × 10 years = 365B records
365B × 500 bytes = ~180 TB over 10 years
Summary:
- 115,000 reads/second
- 1,200 writes/second
- 180 TB storage
- Extremely read-heavy — 100:1 ratio
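The arithmetic above is easy to sanity-check in a few lines of Python:

```python
# Sanity-check the back-of-envelope numbers from the estimates above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

writes_per_day = 100_000_000
writes_per_sec = writes_per_day / SECONDS_PER_DAY   # ~1,157, call it ~1,200

reads_per_day = writes_per_day * 100                # 100:1 read-to-write ratio
reads_per_sec = reads_per_day / SECONDS_PER_DAY     # ~115,740, so ~115,000

records_10_years = writes_per_day * 365 * 10        # 365 billion records
storage_tb = records_10_years * 500 / 1e12          # 500 bytes each: ~182.5 TB

print(f"{writes_per_sec:,.0f} writes/s, {reads_per_sec:,.0f} reads/s, {storage_tb:.1f} TB")
# → 1,157 writes/s, 115,741 reads/s, 182.5 TB
```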
Step 3: API Design
Create short URL:
POST /api/urls
{ longUrl: "...", customSlug: "mylink", expiryDays: 30 }
Response:
{ shortUrl: "https://bit.ly/abc123", slug: "abc123", expiresAt: "..." }
Redirect:
GET /{slug}
→ HTTP 301 or 302 redirect to long URL
301 vs 302 — matters more than it looks:
| | 301 (Permanent) | 302 (Temporary) |
|---|---|---|
| Browser caches redirect | Yes — forever | No |
| Every click hits your server | No | Yes |
| Analytics tracking | Lost after first click | Every click tracked |
If you need click analytics → use 302. If you want to minimise server load → use 301.
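A framework-agnostic sketch of that decision — `store` is a stand-in for the real slug storage, and `handle_redirect` is an illustrative helper, not part of any particular web framework:

```python
# Decide the redirect response for a slug. `store` is any mapping from slug
# to long URL (Redis/Cassandra in the real system, a plain dict here).
def handle_redirect(slug, store, track_analytics=True):
    long_url = store.get(slug)
    if long_url is None:
        return 404, None
    # 302: every click comes back to us, so analytics sees it.
    # 301: browsers cache the redirect and skip our server on repeat clicks.
    status = 302 if track_analytics else 301
    return status, long_url

store = {"abc123": "https://example.com/some/very/long/path"}
print(handle_redirect("abc123", store))         # → (302, 'https://example.com/some/very/long/path')
print(handle_redirect("abc123", store, False))  # → (301, same URL)
print(handle_redirect("missing", store))        # → (404, None)
```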
Step 4: Data Model
URLs table:
id → unique identifier (primary key)
slug → "abc123" (indexed, unique)
long_url → original URL
created_at → timestamp
expires_at → nullable
user_id → nullable
click_count → integer
Which database?
Apply the access pattern framework:
- Access pattern: “give me long URL for this slug” — pure key lookup
- Scale: 115,000 reads/second, 180 TB
- Structure: simple, consistent
- Joins needed: no
→ Cassandra for URL data (massive scale, simple key lookup, AP system)
→ PostgreSQL for user accounts (relational, ACID for payments/subscriptions)
This is polyglot persistence in practice.
Step 5: Slug Generation
This is the unique algorithmic challenge. Three approaches:
Approach 1: Random String
Generate random 6-7 character base62 string.
Base62 = a-z + A-Z + 0-9 = 62 characters
7 chars = 62^7 = 3.5 trillion combinations ✅
Problem: collisions are possible. Need to check DB on every generation — expensive at scale.
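A sketch of the random approach, including the uniqueness check that makes it expensive (`exists_in_db` is a hypothetical predicate standing in for a database query):

```python
import secrets
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars

def random_slug(length=7):
    # Cryptographically random, so slugs are not guessable.
    return "".join(secrets.choice(BASE62) for _ in range(length))

def create_slug(exists_in_db):
    # The hidden cost: every attempt needs a database round-trip, and
    # retries become more frequent as the keyspace fills up.
    while True:
        slug = random_slug()
        if not exists_in_db(slug):
            return slug

taken = set()
slug = create_slug(taken.__contains__)
print(len(slug))  # → 7
```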
Approach 2: MD5/SHA256 Hash
Hash the long URL, take first 7 characters.
Problem: different URLs can produce same first 7 chars. Same long URL always produces same hash — two users shortening the same URL get the same slug, breaking per-user analytics.
Approach 3: Auto-Increment ID + Base62 Encoding ✅
Every new URL gets an auto-incrementing ID. Encode that ID in base62.
URL 1 → ID: 1 → slug: "1"
URL 1000000 → ID: 1000000 → slug: "4c92"
URL 10B → ID: 10000000000 → slug: "aUKYOA" (6 chars)
Why this wins:
- No collisions — IDs are unique by definition
- No DB lookup needed to check uniqueness
- Gets longer only as you grow
- Deterministic and simple
One downside: sequential IDs expose how many URLs have been created. Fix: add a random offset or shuffle the base62 alphabet privately.
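A minimal sketch of the encoder. The digit order and alphabet are a choice; this one (digits first, then lowercase, then uppercase) is consistent with the "4c92" example above, and the offset value is an arbitrary illustration:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # base62

def encode_base62(n):
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first

# Hide creation volume by offsetting the ID before encoding (arbitrary value).
OFFSET = 1_000_000_007

def slug_for_id(url_id):
    return encode_base62(url_id + OFFSET)

print(encode_base62(1_000_000))       # → 4c92
print(encode_base62(10_000_000_000))  # → aUKYOA
```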
Step 6: Full Architecture
[User]
↓
[Load Balancer]
(Active-Active)
↓ ↓
[App Server 1] [App Server 2]
↓
[Redis Cache]
(Cache Aside, 24hr TTL)
↓ (cache miss only)
[Cassandra Cluster]
(slug → long URL, replicated)
[PostgreSQL]
(user accounts, custom domains)
[Analytics Service]
← Kafka topic: "url.clicked"
(async — never slows the redirect)
Create URL flow:
POST /api/urls
→ Load Balancer → App Server
→ Generate auto-increment ID
→ Encode to Base62 slug
→ Save to Cassandra
→ Return short URL
Redirect flow:
GET /abc123
→ Load Balancer → App Server
→ Check Redis cache for "abc123"
→ Cache hit → redirect instantly ✅
→ Cache miss → lookup Cassandra
→ store in Redis (24hr TTL)
→ redirect user
Why the Cache Layer Is Critical Here
At 115,000 reads/second, popular slugs are hit millions of times. Without a cache, Cassandra would absorb all of that load unnecessarily.
With Redis (Cache Aside):
- ~90% cache hit rate for popular URLs
- Cassandra sees only ~11,500 reads/second
- Redis TTL handles expired URL eviction automatically
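The redirect lookup with Cache Aside might look like this sketch. `InMemoryCache` stands in for Redis (real code would pass the TTL to Redis's SET with EX), and the `ConnectionError` handling previews the failure behaviour in Step 7:

```python
CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as above

class InMemoryCache:
    """Stand-in for Redis; TTL is accepted but not enforced here."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value, ttl=None):
        self._data[key] = value

def resolve_slug(slug, cache, db):
    try:
        cached = cache.get(slug)
        if cached is not None:
            return cached                 # cache hit: Cassandra untouched
    except ConnectionError:
        pass                              # Redis down: degrade, don't fail
    long_url = db.get(slug)               # cache miss: Cassandra lookup
    if long_url is not None:
        try:
            cache.set(slug, long_url, ttl=CACHE_TTL_SECONDS)
        except ConnectionError:
            pass                          # caching is best-effort
    return long_url

cache, db = InMemoryCache(), {"abc123": "https://example.com/long"}
resolve_slug("abc123", cache, db)         # miss: fills the cache
print(resolve_slug("abc123", cache, db))  # → https://example.com/long (from cache)
```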
Step 7: Failure Handling
Redis goes down:
All reads fall through to Cassandra
System stays alive — degraded performance only
Redis restarts → cache warms up gradually
App server goes down:
Load Balancer health checks detect failure
Stops routing to failed server
Auto-scaling spins up replacement
Cassandra node goes down:
Data replicated across nodes (replication factor 3)
Other nodes serve the data — no downtime
No data loss
URL expiry:
Check expires_at on every read
If expired → return 404
Background job runs daily → deletes expired records
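The read-time expiry check is a few lines in the lookup path. A sketch, with the record shape assumed from the URLs table above:

```python
from datetime import datetime, timedelta, timezone

def effective_url(record, now=None):
    """Return the long URL, or None (serve 404) if the record has expired."""
    now = now or datetime.now(timezone.utc)
    expires_at = record.get("expires_at")     # nullable, as in the table
    if expires_at is not None and expires_at <= now:
        return None  # caller returns 404; the daily job deletes the row later
    return record["long_url"]

now = datetime.now(timezone.utc)
live = {"long_url": "https://example.com", "expires_at": now + timedelta(days=30)}
dead = {"long_url": "https://example.com", "expires_at": now - timedelta(days=1)}
print(effective_url(live, now))  # → https://example.com
print(effective_url(dead, now))  # → None
```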
Extension: Adding Analytics Dashboard
The ask: show URL creators click counts, countries, devices, over time.
Architecture:
User clicks short URL
→ Redirect happens instantly (critical path, unchanged)
→ Simultaneously publish to Kafka topic "url.clicked"
{ slug, userId, country, device, timestamp }
Analytics pipeline:
Kafka → Stream Processor (Flink/Spark Streaming)
→ Aggregates clicks per slug/country/device/hour
→ Stores in ClickHouse (columnar, fast aggregations)
Dashboard API:
→ Check Redis: "analytics:user456:last30days" (15 min TTL)
→ Cache hit → serve immediately
→ Cache miss → query ClickHouse → cache → serve
Key principle: analytics is completely off the critical path. The redirect never waits for analytics processing.
Why ClickHouse over Cassandra for analytics storage:
ClickHouse is a columnar database optimised for aggregation queries — “sum clicks grouped by country over last 30 days.” Cassandra is optimised for key lookups, not aggregations.
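The fire-and-forget publish can be sketched with a local queue and worker thread standing in for the Kafka producer and consumer; event fields follow the payload shown above:

```python
import queue
import threading

# Stand-in for the "url.clicked" Kafka topic: a local queue.
click_events = queue.Queue()
processed = []

def publish_click(slug, user_id, country, device, timestamp):
    # Fire-and-forget: enqueue and return immediately, so the redirect never
    # waits on analytics (the real system would use an async Kafka send).
    click_events.put({"slug": slug, "userId": user_id, "country": country,
                      "device": device, "timestamp": timestamp})

def analytics_worker():
    # Real pipeline: Kafka consumer → Flink/Spark → ClickHouse. Here we
    # just record the event to show the decoupling.
    while True:
        event = click_events.get()
        if event is None:
            break                      # shutdown sentinel
        processed.append(event)
        click_events.task_done()

threading.Thread(target=analytics_worker, daemon=True).start()
publish_click("abc123", "u456", "NL", "mobile", "2024-01-01T12:00:00Z")
click_events.join()                    # wait for the worker (demo only)
print(len(processed))                  # → 1
```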
Extension: Custom Domains
The ask: premium users use their own domain — mycompany.com/abc123 instead of bit.ly/abc123.
New components needed:
custom_domains table (PostgreSQL):
domain → "mycompany.com"
user_id → "u456"
verified → true
ssl_cert_status → "provisioned"
Request flow:
mycompany.com/abc123 arrives
→ DNS points mycompany.com to your servers
→ App server reads Host header → "mycompany.com"
→ Lookup custom_domains table → find user u456
→ Lookup slug "abc123" for user u456
→ Redirect
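A sketch of that resolution step; the dicts stand in for the PostgreSQL `custom_domains` table and a per-user slug store, and the scoping scheme is an illustrative assumption:

```python
DEFAULT_DOMAIN = "bit.ly"

def resolve(host, slug, custom_domains, urls):
    if host == DEFAULT_DOMAIN:
        return urls.get((None, slug))            # shared namespace
    domain = custom_domains.get(host)
    if domain is None or not domain["verified"]:
        return None                              # unknown/unverified host: 404
    # Slugs on a custom domain are scoped to the owning user.
    return urls.get((domain["user_id"], slug))

custom_domains = {"mycompany.com": {"user_id": "u456", "verified": True}}
urls = {
    (None, "abc123"): "https://example.com/shared",
    ("u456", "abc123"): "https://example.com/premium",
}
print(resolve("mycompany.com", "abc123", custom_domains, urls))  # → https://example.com/premium
print(resolve("bit.ly", "abc123", custom_domains, urls))         # → https://example.com/shared
```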
Why PostgreSQL for custom_domains, not Cassandra:
- Volume is tiny — maybe 10,000 premium users total, nearly zero load
- Data is relational — domain belongs to user, linked to subscription
- ACID matters — domain setup involves multiple steps (verify → insert → provision SSL → activate) that must all succeed together or roll back
- Cassandra handles mutable, frequently updated data poorly — deletes leave tombstones that add overhead over time
Rule: use Cassandra when scale is massive and structure is simple. Use PostgreSQL when data is relational, transactional, and volume is manageable.
The real engineering challenge: SSL certificate provisioning per custom domain. Every
mycompany.com needs its own HTTPS cert. Services like Let’s Encrypt automate this, but it’s the
hardest part of the feature.
What Makes This a Strong Interview Answer
Most answers stop at: “Store URLs in a database, look up on click, redirect.”
A complete answer adds:
- Scale estimation — knowing it’s 100:1 read-heavy shapes every decision
- Base62 encoding — not “generate a random string”
- Cache layer — Redis with Cache Aside, 24hr TTL, explains why
- Polyglot persistence — Cassandra for slugs, PostgreSQL for users
- 301 vs 302 — shows product thinking alongside technical thinking
- Failure handling — what each component’s failure looks like
- Analytics via Kafka — decoupled from critical path
Key Takeaways
- Scale estimation first — numbers decide architecture.
- Auto-increment ID + Base62 encoding is the cleanest slug generation approach.
- Redis cache is not optional here — 115,000 reads/second needs it.
- Different data types → different databases (polyglot persistence).
- Always decouple non-critical work (analytics, notifications) via queues.
- Every component that can fail needs a failure plan.
Part of the system design series. Next: designing a notification system — fan-out at scale.