📊 Site Reliability Engineer

Engineering reliability—what happens when you treat operations as a software problem

What Does an SRE Do?

Site Reliability Engineers are software engineers who specialize in reliability. They write code to automate operations, define and measure service level objectives (SLOs), respond to incidents, and work with development teams to build more reliable systems.

SRE originated at Google in 2003, when Ben Treynor was asked to run a production team. As he later described it, SRE is "what happens when you ask a software engineer to design an operations team." Google caps operational work at 50% of an SRE's time, so at least half of the job is engineering work rather than day-to-day operations.

The key insight of SRE is the error budget: if your SLO is 99.9% availability, you have a 0.1% error budget. Development teams can "spend" this budget on features and velocity. When the budget is exhausted, reliability work takes priority.
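The error budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library or tool: the function names `error_budget_minutes` and `budget_remaining`, the 30-day window, and the request counts are all assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget that is still unspent.

    1.0 means untouched, 0.0 means exhausted, a negative value means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO, one million requests, 250 failures.
    print(f"{error_budget_minutes(0.999):.1f} minutes of downtime per 30 days")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget remains")
```

With a 99.9% SLO, a 30-day window allows roughly 43.2 minutes of downtime. In the hypothetical request-based case, 250 failures out of an allowance of about 1,000 leaves about 75% of the budget to spend.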

📜 Brief History

2003: Google hired Ben Treynor to lead a production team. He formed the first SRE team, applying software engineering principles to operations.

2003-2015: SRE grew within Google, developing practices like error budgets, SLOs, blameless postmortems, and toil budgets. The term remained largely internal.

2016: Google published the "Site Reliability Engineering" book, sharing practices with the world. The industry took notice.

2017-Present: SRE adoption exploded. Companies like Netflix, LinkedIn, Dropbox, and thousands more adopted SRE practices. "The Site Reliability Workbook" (2018) provided practical guidance.

🔄 SRE vs. DevOps

As Ben Treynor, Google's VP of Engineering, writes in the SRE book, one can view SRE as "a specific implementation of DevOps with some idiosyncratic extensions."

DevOps

  • Cultural movement
  • Broad principles (CALMS)
  • Any implementation
  • Focus on collaboration

SRE

  • Specific role/practice
  • Concrete methods (SLOs, error budgets)
  • Engineering-first approach
  • Focus on reliability

📏 SLIs, SLOs, and SLAs

Service Level Indicator (SLI)

A quantitative measure of service behavior: latency, error rate, throughput, availability.

Service Level Objective (SLO)

A target value for an SLI: "99.9% of requests complete in under 200ms." Internal goal.

Service Level Agreement (SLA)

A contract with consequences: "If availability drops below 99.9%, customer gets credits." External commitment.

Common SLI Types

Availability

99.9% of requests succeed

Is the service up and responding?

Latency

95th percentile < 200ms

How fast does the service respond?

Throughput

Handles 10k requests/second

How much traffic can it handle?

Error Rate

Less than 0.1% 5xx errors

What fraction of requests fail?
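To make the distinction concrete, here is a rough sketch of how the availability, error-rate, and latency SLIs above might be computed from raw request records and then compared against SLO targets. The `Request` record, the function names, and the choice of the nearest-rank percentile method are assumptions for illustration; real systems usually derive these numbers from a metrics backend rather than in-memory lists.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record: HTTP status code and latency."""
    status: int
    latency_ms: float


def availability(requests: list[Request]) -> float:
    """SLI: fraction of requests that did not fail with a 5xx status."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def error_rate(requests: list[Request]) -> float:
    """SLI: fraction of requests that failed with a 5xx status."""
    return 1.0 - availability(requests)


def latency_percentile(requests: list[Request], pct: int) -> float:
    """SLI: latency at the given percentile, using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(requests: list[Request],
              availability_target: float = 0.999,
              p95_target_ms: float = 200.0) -> bool:
    """SLO check: both SLIs must satisfy their targets."""
    return (availability(requests) >= availability_target
            and latency_percentile(requests, 95) < p95_target_ms)


if __name__ == "__main__":
    # Made-up sample: 18 fast successes, 1 slower success, 1 slow failure.
    sample = ([Request(200, 100.0)] * 18
              + [Request(200, 150.0), Request(500, 900.0)])
    print("availability:", availability(sample))
    print("p95 latency (ms):", latency_percentile(sample, 95))
    print("meets SLO:", meets_slo(sample))
```

The SLI functions only measure; the targets live in `meets_slo`. That separation mirrors the definitions above: an SLI is the measurement, an SLO is the target applied to it, and an SLA would attach contractual consequences to missing a target.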

🛠️ Key Skills

Essential

Linux & Systems

Deep understanding of OS internals, networking, performance

Essential

Programming

Python, Go, or similar for automation and tooling

Essential

Observability

Metrics, logs, traces, distributed tracing systems

Core

Incident Management

On-call, incident response, postmortems, runbooks

Core

Cloud Platforms

AWS, GCP, or Azure at scale

Core

Kubernetes

Container orchestration, operators, service mesh

Important

Capacity Planning

Load testing, scaling strategies, cost optimization

Advanced

Chaos Engineering

Fault injection, game days, resilience testing

📈 Career Path

Junior SRE / SWE

0-2 years

Learning systems, on-call rotation, toil reduction

Site Reliability Engineer

2-5 years

Service ownership, reliability improvements, automation

Senior SRE

5-8 years

Architecture reviews, cross-team reliability, mentoring

Staff SRE / Tech Lead

8-12 years

Org-wide reliability strategy, complex incidents

Principal SRE / Director

12+ years

Industry influence, executive partnerships, culture

⚙️ Understanding Toil

Toil is work that is manual, repetitive, automatable, and tactical, that lacks enduring value, and that scales linearly with service growth. SREs aim to keep toil under 50% of their time.

Examples of Toil

  • Manual deployments
  • Repetitive ticket handling
  • Manual scaling operations
  • Copying and pasting configs

Not Toil (Engineering)

  • Writing automation scripts
  • Designing monitoring systems
  • Improving deployment pipelines
  • Building self-healing systems
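One simple way to check the 50% guideline is to log hours by activity and compute the share spent on toil. The sketch below assumes a hypothetical time log keyed by activity name; the categories and hour values are invented for illustration, and in practice teams often estimate this from surveys or ticket data instead.

```python
TOIL_CAP = 0.5  # the 50% guideline described above


def toil_fraction(hours_by_activity: dict[str, float],
                  toil_activities: set[str]) -> float:
    """Share of logged hours that went to activities classified as toil."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(hours for activity, hours in hours_by_activity.items()
               if activity in toil_activities)
    return toil / total


if __name__ == "__main__":
    # Hypothetical one-week log for a single engineer.
    week = {
        "manual deployments": 10.0,
        "ticket handling": 6.0,
        "automation scripts": 20.0,
        "monitoring design": 4.0,
    }
    toil = {"manual deployments", "ticket handling"}
    fraction = toil_fraction(week, toil)
    status = "within" if fraction < TOIL_CAP else "over"
    print(f"toil: {fraction:.0%} ({status} the cap)")
```

Deciding which activities count as toil is the hard part; the arithmetic only makes the trend visible so a team can see whether automation work is actually reducing it.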

🚀 Getting Started

  1. Build software engineering skills: SREs are engineers first—learn to code well
  2. Understand distributed systems: CAP theorem, consensus, failure modes
  3. Read the SRE books: Free from Google—essential reading for the field
  4. Practice incident response: Join on-call rotations, participate in postmortems
  5. Learn observability: Prometheus, Grafana, distributed tracing
  6. Enter from SWE or DevOps: Most SREs transition from adjacent roles
  7. Target SRE-focused companies: Google, Meta, LinkedIn have strong SRE cultures

© CubiCube AI. Built for Nexartis AI/Agents infrastructure.