Key System Characteristics

Before designing a large-scale distributed system, architects must define the baseline characteristics and Non-Functional Requirements (NFRs) the system must achieve.

1. Scalability

Scalability is the capability of a system to handle a growing amount of work or its potential to be enlarged to accommodate that growth.

Horizontal Scaling (Scaling Out): Adding more servers to a pool of resources.
Preferred for distributed systems because capacity can keep growing as nodes are added, with no single-machine ceiling.
Utilizes cheaper commodity hardware.
Typically requires stateless application servers and data partitioning (sharding), both of which add design complexity.
Vertical Scaling (Scaling Up): Upgrading the power of an existing server (more CPU, more RAM, faster disks).
Simpler to implement but has a hard physical ceiling.
Leaves a Single Point of Failure (SPOF): if that one server goes down, the whole system goes down.
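
The data partitioning that horizontal scaling calls for can be sketched with simple hash-based routing. This is a minimal illustration, not a production scheme: the function name `shard_for` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash.

    Hypothetical helper for illustration. A cryptographic hash is used only
    because it is stable across processes, unlike Python's built-in hash().
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, "-> shard", shard_for(user_id, num_shards=4))
```

One known limitation of plain modulo routing is that changing `num_shards` remaps most keys, which is part of why partitioning is described above as complex.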

2. Reliability

Reliability is the probability that a system will not fail in a given period. A distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail.

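Implementation: Achieved through redundancy of both services and data, so that no single component failure stops the system.
If a server fails, a healthy replica must take over with minimal interruption (failover).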
Data must be replicated across multiple physical disks, racks, or geographical data centers.
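
To see why redundancy improves reliability, the sketch below computes the chance that every replica is down at once. It assumes replica failures are independent, which is a simplification: correlated failures (shared rack, shared power, same software bug) make real systems less reliable than this estimate. The function name and the 1% failure probability are illustrative assumptions.

```python
def prob_all_replicas_fail(p_fail: float, replicas: int) -> float:
    """Probability that all replicas fail in the same period.

    Assumes independent failures (an idealization), so the joint
    probability is simply p_fail multiplied by itself `replicas` times.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_fail ** replicas


if __name__ == "__main__":
    # Hypothetical per-replica failure probability of 1% in a given period.
    for n in (1, 2, 3):
        print(f"{n} replica(s): P(all fail) = {prob_all_replicas_fail(0.01, n):.2e}")
```

Under that independence assumption, each added replica multiplies the chance of total failure by the per-replica failure probability.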

3. Availability

Availability is the percentage of time a system remains operational and accessible to clients under normal conditions.

Measurement: Expressed in "nines" (e.g., 99.9% uptime allows ~8.76 hours of downtime per year; 99.999%, or "five nines", allows ~5.3 minutes per year).
Reliability vs. Availability: A reliable system is generally also available, because it keeps working through component failures. The converse does not hold: an available system is not necessarily reliable (e.g., it may stay online but return stale or incorrect data after a partial backend failure).
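
The downtime budget implied by an availability target is simple arithmetic. The sketch below assumes a 365-day year; the function name is chosen for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

For 99.9% this gives about 525.6 minutes (8.76 hours) per year, and for 99.999% about 5.26 minutes per year.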

4. Efficiency

Efficiency measures how well a system performs its required tasks, using two primary metrics:

Latency (Response Time): Time required to process a single request and return a response. Interactive web systems commonly target low latency (often cited as under 200 ms), though the right target depends on the workload.
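Throughput: Number of operations a system completes over a specific period (e.g., requests per second or MB/s). It is sometimes loosely called bandwidth, although bandwidth strictly refers to maximum capacity rather than achieved rate.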

Note: A system can have high throughput but poor latency if each request takes a long time to process while the system handles millions of them concurrently.
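
The relationship between latency, concurrency, and throughput can be made concrete with Little's Law, which states that in a steady state the average number of requests in flight equals throughput multiplied by average latency. The sketch below rearranges it to estimate throughput; the function name and the figures are illustrative assumptions, not measurements.

```python
def throughput_rps(in_flight: int, latency_s: float) -> float:
    """Estimate steady-state throughput in requests per second.

    Uses Little's Law rearranged: throughput = in-flight requests / latency.
    Only valid for a stable system where arrivals and completions balance.
    """
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return in_flight / latency_s


if __name__ == "__main__":
    # High throughput despite poor latency: many slow requests in parallel.
    print(throughput_rps(in_flight=1_000_000, latency_s=2.0))  # 500000.0
    # Low latency but modest throughput: few fast requests in parallel.
    print(throughput_rps(in_flight=10, latency_s=0.05))
```

The first case illustrates the note above: each caller waits two seconds, yet the system as a whole completes half a million requests per second.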

5. Serviceability / Manageability

Serviceability is the simplicity and speed with which a system can be repaired, maintained, or updated.

Implementation: Achieved through aggressive automation, CI/CD pipelines, distributed tracing, and centralized logging.
If it takes weeks to deploy a database schema change or isolate a bug, the system lacks serviceability.