Table of contents
- Introduction
- The Framework
- Step 1: Requirements
- Step 2: Scale Estimation
- Step 3: API Design
- Step 4: Data Model
- Step 5: Slug Generation
- Step 6: Full Architecture
- Step 7: Failure Handling
- Extension: Adding Analytics Dashboard
- Extension: Custom Domains
- What Makes This a Strong Interview Answer
- Key Takeaways
Introduction
A URL shortener is the perfect first system design problem. Simple enough to understand fully, deep enough to touch every concept — scaling, caching, databases, availability, and algorithm design.
This post follows a framework you can apply to any system design problem.
The Framework
- Understand the requirements
- Estimate the scale
- Design the API
- Design the data model
- High-level architecture
- Deep dive into bottlenecks
- Handle failures
Step 1: Requirements
Functional:
- User gives a long URL → system returns a short URL
- User visits short URL → system redirects to original URL
- Short URLs should be as short as possible
- Optional: custom slug, expiry date
Non-functional:
- High availability — if the system is down, every shortened link out there breaks
- Low latency redirects — must feel instant
- No collisions — two long URLs cannot get the same short URL
- Scalable — billions of URLs, trillions of redirects
Step 2: Scale Estimation
Numbers drive architecture decisions.
Writes (URL creation):
100 million new URLs/day
100,000,000 / 86,400 = ~1,200 writes/second
Reads (redirects):
Assume 100:1 read-to-write ratio
10 billion redirects/day
10,000,000,000 / 86,400 = ~115,000 reads/second
Storage:
~500 bytes per URL record
100M URLs/day × 365 days × 10 years = 365B records
365B × 500 bytes = ~180 TB over 10 years
Summary:
- 115,000 reads/second
- 1,200 writes/second
- 180 TB storage
- Extremely read-heavy — 100:1 ratio
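The arithmetic above is easy to sanity-check in a few lines of Python:

```python
# Sanity-check the back-of-envelope numbers from the estimates above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

writes_per_day = 100_000_000
writes_per_sec = writes_per_day / SECONDS_PER_DAY   # ~1,157, call it ~1,200

reads_per_day = writes_per_day * 100                # 100:1 read-to-write ratio
reads_per_sec = reads_per_day / SECONDS_PER_DAY     # ~115,740, so ~115,000

records_10_years = writes_per_day * 365 * 10        # 365 billion records
storage_tb = records_10_years * 500 / 1e12          # 500 bytes each: ~182.5 TB

print(f"{writes_per_sec:,.0f} writes/s, {reads_per_sec:,.0f} reads/s, {storage_tb:.1f} TB")
# → 1,157 writes/s, 115,741 reads/s, 182.5 TB
```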
Step 3: API Design
Create short URL:
POST /api/urls
{ longUrl: "...", customSlug: "mylink", expiryDays: 30 }
Response:
{ shortUrl: "https://bit.ly/abc123", slug: "abc123", expiresAt: "..." }
Redirect:
GET /{slug}
→ HTTP 301 or 302 redirect to long URL
301 vs 302 — matters more than it looks:
| | 301 (Permanent) | 302 (Temporary) |
|---|---|---|
| Browser caches redirect | Yes — forever | No |
| Every click hits your server | No | Yes |
| Analytics tracking | Lost after first click | Every click tracked |
If you need click analytics → use 302. If you want to minimise server load → use 301.
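A framework-agnostic sketch of that decision — `store` is a stand-in for the real slug storage, and `handle_redirect` is an illustrative helper, not part of any particular web framework:

```python
# Decide the redirect response for a slug. `store` is any mapping from slug
# to long URL (Redis/Cassandra in the real system, a plain dict here).
def handle_redirect(slug, store, track_analytics=True):
    long_url = store.get(slug)
    if long_url is None:
        return 404, None
    # 302: every click comes back to us, so analytics sees it.
    # 301: browsers cache the redirect and skip our server on repeat clicks.
    status = 302 if track_analytics else 301
    return status, long_url

store = {"abc123": "https://example.com/some/very/long/path"}
print(handle_redirect("abc123", store))         # → (302, 'https://example.com/some/very/long/path')
print(handle_redirect("abc123", store, False))  # → (301, same URL)
print(handle_redirect("missing", store))        # → (404, None)
```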
Step 4: Data Model
URLs table:
id → unique identifier (primary key)
slug → "abc123" (indexed, unique)
long_url → original URL
created_at → timestamp
expires_at → nullable
user_id → nullable
click_count → integer
Which database?
Apply the access pattern framework:
- Access pattern: “give me long URL for this slug” — pure key lookup
- Scale: 115,000 reads/second, 180 TB
- Structure: simple, consistent
- Joins needed: no
→ Cassandra for URL data (massive scale, simple key lookup, AP system)
→ PostgreSQL for user accounts (relational, ACID for payments/subscriptions)
This is polyglot persistence in practice.
Step 5: Slug Generation
This is the unique algorithmic challenge. Three approaches:
Approach 1: Random String
Generate random 6-7 character base62 string.
Base62 = a-z + A-Z + 0-9 = 62 characters
7 chars = 62^7 = 3.5 trillion combinations ✅
Problem: collisions are possible. Need to check DB on every generation — expensive at scale.
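A sketch of the random approach, including the uniqueness check that makes it expensive (`exists_in_db` is a hypothetical predicate standing in for a database query):

```python
import secrets
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars

def random_slug(length=7):
    # Cryptographically random, so slugs are not guessable.
    return "".join(secrets.choice(BASE62) for _ in range(length))

def create_slug(exists_in_db):
    # The hidden cost: every attempt needs a database round-trip, and
    # retries become more frequent as the keyspace fills up.
    while True:
        slug = random_slug()
        if not exists_in_db(slug):
            return slug

taken = set()
slug = create_slug(taken.__contains__)
print(len(slug))  # → 7
```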
Approach 2: MD5/SHA256 Hash
Hash the long URL, take first 7 characters.
Problem: different URLs can produce same first 7 chars. Same long URL always produces same hash — two users shortening the same URL get the same slug, breaking per-user analytics.
Approach 3: Auto-Increment ID + Base62 Encoding ✅
Every new URL gets an auto-incrementing ID. Encode that ID in base62.
URL 1 → ID: 1 → slug: "1"
URL 1000000 → ID: 1000000 → slug: "4c92"
URL 10B → ID: 10000000000 → slug: "aUKYOA" (6 chars)
Why this wins:
- No collisions — IDs are unique by definition
- No DB lookup needed to check uniqueness
- Gets longer only as you grow
- Deterministic and simple
One downside: sequential IDs expose how many URLs have been created. Fix: add a random offset or shuffle the base62 alphabet privately.
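A minimal sketch of the encoder. The digit order and alphabet are a choice; this one (digits first, then lowercase, then uppercase) is consistent with the "4c92" example above, and the offset value is an arbitrary illustration:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # base62

def encode_base62(n):
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first

# Hide creation volume by offsetting the ID before encoding (arbitrary value).
OFFSET = 1_000_000_007

def slug_for_id(url_id):
    return encode_base62(url_id + OFFSET)

print(encode_base62(1_000_000))       # → 4c92
print(encode_base62(10_000_000_000))  # → aUKYOA
```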
Step 6: Full Architecture
[User]
↓
[Load Balancer]
(Active-Active)
↓ ↓
[App Server 1] [App Server 2]
↓
[Redis Cache]
(Cache Aside, 24hr TTL)
↓ (cache miss only)
[Cassandra Cluster]
(slug → long URL, replicated)
[PostgreSQL]
(user accounts, custom domains)
[Analytics Service]
← Kafka topic: "url.clicked"
(async — never slows the redirect)
Create URL flow:
POST /api/urls
→ Load Balancer → App Server
→ Generate auto-increment ID
→ Encode to Base62 slug
→ Save to Cassandra
→ Return short URL
Redirect flow:
GET /abc123
→ Load Balancer → App Server
→ Check Redis cache for "abc123"
→ Cache hit → redirect instantly ✅
→ Cache miss → lookup Cassandra
→ store in Redis (24hr TTL)
→ redirect user
Why the Cache Layer Is Critical Here
At 115,000 reads/second, popular slugs are hit millions of times. Without a cache, Cassandra would absorb all of that load unnecessarily.
With Redis (Cache Aside):
- ~90% cache hit rate for popular URLs
- Cassandra sees only ~11,500 reads/second
- Redis TTL handles expired URL eviction automatically
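The redirect lookup with Cache Aside might look like this sketch. `InMemoryCache` stands in for Redis (real code would pass the TTL to Redis's SET with EX), and the `ConnectionError` handling previews the failure behaviour in Step 7:

```python
CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as above

class InMemoryCache:
    """Stand-in for Redis; TTL is accepted but not enforced here."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value, ttl=None):
        self._data[key] = value

def resolve_slug(slug, cache, db):
    try:
        cached = cache.get(slug)
        if cached is not None:
            return cached                 # cache hit: Cassandra untouched
    except ConnectionError:
        pass                              # Redis down: degrade, don't fail
    long_url = db.get(slug)               # cache miss: Cassandra lookup
    if long_url is not None:
        try:
            cache.set(slug, long_url, ttl=CACHE_TTL_SECONDS)
        except ConnectionError:
            pass                          # caching is best-effort
    return long_url

cache, db = InMemoryCache(), {"abc123": "https://example.com/long"}
resolve_slug("abc123", cache, db)         # miss: fills the cache
print(resolve_slug("abc123", cache, db))  # → https://example.com/long (from cache)
```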
Step 7: Failure Handling
Redis goes down:
All reads fall through to Cassandra
System stays alive — degraded performance only
Redis restarts → cache warms up gradually
App server goes down:
Load Balancer health checks detect failure
Stops routing to failed server
Auto-scaling spins up replacement
Cassandra node goes down:
Data replicated across nodes (replication factor 3)
Other nodes serve the data — no downtime
No data loss
URL expiry:
Check expires_at on every read
If expired → return 404
Background job runs daily → deletes expired records
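The read-time expiry check is a few lines in the lookup path. A sketch, with the record shape assumed from the URLs table above:

```python
from datetime import datetime, timedelta, timezone

def effective_url(record, now=None):
    """Return the long URL, or None (serve 404) if the record has expired."""
    now = now or datetime.now(timezone.utc)
    expires_at = record.get("expires_at")     # nullable, as in the table
    if expires_at is not None and expires_at <= now:
        return None  # caller returns 404; the daily job deletes the row later
    return record["long_url"]

now = datetime.now(timezone.utc)
live = {"long_url": "https://example.com", "expires_at": now + timedelta(days=30)}
dead = {"long_url": "https://example.com", "expires_at": now - timedelta(days=1)}
print(effective_url(live, now))  # → https://example.com
print(effective_url(dead, now))  # → None
```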
Extension: Adding Analytics Dashboard
The ask: show URL creators click counts, countries, devices, over time.
Architecture:
User clicks short URL
→ Redirect happens instantly (critical path, unchanged)
→ Simultaneously publish to Kafka topic "url.clicked"
{ slug, userId, country, device, timestamp }
Analytics pipeline:
Kafka → Stream Processor (Flink/Spark Streaming)
→ Aggregates clicks per slug/country/device/hour
→ Stores in ClickHouse (columnar, fast aggregations)
Dashboard API:
→ Check Redis: "analytics:user456:last30days" (15 min TTL)
→ Cache hit → serve immediately
→ Cache miss → query ClickHouse → cache → serve
Key principle: analytics is completely off the critical path. The redirect never waits for analytics processing.
Why ClickHouse over Cassandra for analytics storage:
ClickHouse is a columnar database optimised for aggregation queries — “sum clicks grouped by country over last 30 days.” Cassandra is optimised for key lookups, not aggregations.
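The fire-and-forget publish can be sketched with a local queue and worker thread standing in for the Kafka producer and consumer; event fields follow the payload shown above:

```python
import queue
import threading

# Stand-in for the "url.clicked" Kafka topic: a local queue.
click_events = queue.Queue()
processed = []

def publish_click(slug, user_id, country, device, timestamp):
    # Fire-and-forget: enqueue and return immediately, so the redirect never
    # waits on analytics (the real system would use an async Kafka send).
    click_events.put({"slug": slug, "userId": user_id, "country": country,
                      "device": device, "timestamp": timestamp})

def analytics_worker():
    # Real pipeline: Kafka consumer → Flink/Spark → ClickHouse. Here we
    # just record the event to show the decoupling.
    while True:
        event = click_events.get()
        if event is None:
            break                      # shutdown sentinel
        processed.append(event)
        click_events.task_done()

threading.Thread(target=analytics_worker, daemon=True).start()
publish_click("abc123", "u456", "NL", "mobile", "2024-01-01T12:00:00Z")
click_events.join()                    # wait for the worker (demo only)
print(len(processed))                  # → 1
```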
Extension: Custom Domains
The ask: premium users use their own domain — mycompany.com/abc123 instead of bit.ly/abc123.
New components needed:
custom_domains table (PostgreSQL):
domain → "mycompany.com"
user_id → "u456"
verified → true
ssl_cert_status → "provisioned"
Request flow:
mycompany.com/abc123 arrives
→ DNS points mycompany.com to your servers
→ App server reads Host header → "mycompany.com"
→ Lookup custom_domains table → find user u456
→ Lookup slug "abc123" for user u456
→ Redirect
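A sketch of that resolution step; the dicts stand in for the PostgreSQL `custom_domains` table and a per-user slug store, and the scoping scheme is an illustrative assumption:

```python
DEFAULT_DOMAIN = "bit.ly"

def resolve(host, slug, custom_domains, urls):
    if host == DEFAULT_DOMAIN:
        return urls.get((None, slug))            # shared namespace
    domain = custom_domains.get(host)
    if domain is None or not domain["verified"]:
        return None                              # unknown/unverified host: 404
    # Slugs on a custom domain are scoped to the owning user.
    return urls.get((domain["user_id"], slug))

custom_domains = {"mycompany.com": {"user_id": "u456", "verified": True}}
urls = {
    (None, "abc123"): "https://example.com/shared",
    ("u456", "abc123"): "https://example.com/premium",
}
print(resolve("mycompany.com", "abc123", custom_domains, urls))  # → https://example.com/premium
print(resolve("bit.ly", "abc123", custom_domains, urls))         # → https://example.com/shared
```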
Why PostgreSQL for custom_domains, not Cassandra:
- Volume is tiny — maybe 10,000 premium users total, nearly zero load
- Data is relational — domain belongs to user, linked to subscription
- ACID matters — domain setup involves multiple steps (verify → insert → provision SSL → activate) that must all succeed together or roll back
- Cassandra handles mutable, frequently updated data poorly — deletes leave tombstones that add overhead over time
Rule: use Cassandra when scale is massive and structure is simple. Use PostgreSQL when data is relational, transactional, and volume is manageable.
The real engineering challenge: SSL certificate provisioning per custom domain. Every
mycompany.com needs its own HTTPS cert. Services like Let’s Encrypt automate this, but it’s the
hardest part of the feature.
What Makes This a Strong Interview Answer
Most answers stop at: “Store URLs in a database, look up on click, redirect.”
A complete answer adds:
- Scale estimation — knowing it’s 100:1 read-heavy shapes every decision
- Base62 encoding — not “generate a random string”
- Cache layer — Redis with Cache Aside, 24hr TTL, explains why
- Polyglot persistence — Cassandra for slugs, PostgreSQL for users
- 301 vs 302 — shows product thinking alongside technical thinking
- Failure handling — what each component’s failure looks like
- Analytics via Kafka — decoupled from critical path
Key Takeaways
- Scale estimation first — numbers decide architecture.
- Auto-increment ID + Base62 encoding is the cleanest slug generation approach.
- Redis cache is not optional here — 115,000 reads/second needs it.
- Different data types → different databases (polyglot persistence).
- Always decouple non-critical work (analytics, notifications) via queues.
- Every component that can fail needs a failure plan.
Part of the system design series. Next: designing a notification system — fan-out at scale.