Site Reliability Engineer
Engineering reliability—what happens when you treat operations as a software problem
What Does an SRE Do?
Site Reliability Engineers are software engineers who specialize in reliability. They write code to automate operations, define and measure service level objectives (SLOs), respond to incidents, and work with development teams to build more reliable systems.
SRE originated at Google in 2003, when Ben Treynor was asked to run a production team. He later described the result as "what happens when you ask a software engineer to design an operations team." SREs spend at least 50% of their time on engineering work rather than on operations.
The key insight of SRE is the error budget: if your SLO is 99.9% availability, you have a 0.1% error budget. Development teams can "spend" this budget on features and velocity. When the budget is exhausted, reliability work takes priority.
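The arithmetic behind an error budget can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names, the 30-day window, and the "features" / "reliability" labels are assumptions chosen for the example.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability SLO over the window, in minutes."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


def budget_status(slo: float, downtime_minutes: float, window_days: int = 30):
    """Return (remaining budget in minutes, which kind of work takes priority)."""
    remaining = allowed_downtime_minutes(slo, window_days) - downtime_minutes
    # Hypothetical policy: once the budget is spent, reliability work comes first.
    priority = "features" if remaining > 0 else "reliability"
    return remaining, priority


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of error budget.
print(round(allowed_downtime_minutes(0.999), 1))
print(budget_status(0.999, downtime_minutes=50.0)[1])
```

Real error budgets are often counted in failed requests rather than minutes of downtime; the same subtraction applies either way.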
📜 Brief History
2003: Google hired Ben Treynor to lead a production team. He formed the first SRE team, applying software engineering principles to operations.
2003-2015: SRE grew within Google, developing practices like error budgets, SLOs, blameless postmortems, and toil budgets. The term remained largely internal.
2016: Google published the "Site Reliability Engineering" book, sharing practices with the world. The industry took notice.
2017-Present: SRE adoption exploded. Companies like Netflix, LinkedIn, Dropbox, and thousands more adopted SRE practices. "The Site Reliability Workbook" (2018) provided practical guidance.
🔄 SRE vs. DevOps
As Google's VP of Engineering Ben Treynor says: "SRE is a specific implementation of DevOps with some idiosyncratic extensions."
DevOps
- Cultural movement
- Broad principles (CALMS)
- Any implementation
- Focus on collaboration
SRE
- Specific role/practice
- Concrete methods (SLOs, error budgets)
- Engineering-first approach
- Focus on reliability
📏 SLIs, SLOs, and SLAs
Service Level Indicator (SLI)
A quantitative measure of service behavior: latency, error rate, throughput, availability.
Service Level Objective (SLO)
A target value for an SLI: "99.9% of requests complete in under 200ms." Internal goal.
Service Level Agreement (SLA)
A contract with consequences: "If availability drops below 99.9%, customer gets credits." External commitment.
Common SLI Types
Availability
99.9% of requests succeed
Is the service up and responding?
Latency
95th percentile < 200ms
How fast does the service respond?
Throughput
Handles 10k requests/second
How much traffic can it handle?
Error Rate
Less than 0.1% 5xx errors
What fraction of requests fail?
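As a rough sketch of how the SLIs above might be computed from request records and compared against SLO targets, consider the following Python. The `Request` record, the function names, and the nearest-rank percentile method are assumptions made for illustration; production systems usually derive these figures from a metrics backend instead of raw lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def meets_slo(requests: list[Request],
              availability_target: float = 0.999,
              p95_target_ms: float = 200.0) -> dict[str, bool]:
    """Compare measured SLIs against the example targets."""
    p95 = percentile([r.latency_ms for r in requests], 95)
    return {
        "availability": availability(requests) >= availability_target,
        "latency": p95 < p95_target_ms,
    }


# 99 fast successes and one slow failure.
sample = [Request(200, 100.0)] * 99 + [Request(500, 900.0)]
print(availability(sample), meets_slo(sample))
```

Here the error rate is simply one minus the availability, since both count 5xx responses as failures.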
🛠️ Key Skills
Linux & Systems
Deep understanding of OS internals, networking, performance
Programming
Python, Go, or similar for automation and tooling
Observability
Metrics, logs, traces, distributed tracing systems
Incident Management
On-call, incident response, postmortems, runbooks
Cloud Platforms
AWS, GCP, or Azure at scale
Kubernetes
Container orchestration, operators, service mesh
Capacity Planning
Load testing, scaling strategies, cost optimization
Chaos Engineering
Fault injection, game days, resilience testing
📈 Career Path
Junior SRE / SWE
0-2 years: Learning systems, on-call rotation, toil reduction
Site Reliability Engineer
2-5 years: Service ownership, reliability improvements, automation
Senior SRE
5-8 years: Architecture reviews, cross-team reliability, mentoring
Staff SRE / Tech Lead
8-12 years: Org-wide reliability strategy, complex incidents
Principal SRE / Director
12+ years: Industry influence, executive partnerships, culture
⚙️ Understanding Toil
Toil is work that is manual, repetitive, automatable, and tactical, that lacks enduring value, and that scales linearly with service growth. SREs aim to keep toil under 50% of their time.
Examples of Toil
- Manual deployments
- Repetitive ticket handling
- Manual scaling operations
- Copying and pasting configs
Not Toil (Engineering)
- Writing automation scripts
- Designing monitoring systems
- Improving deployment pipelines
- Building self-healing systems
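One way a team might check itself against the 50% toil ceiling is to log hours by category and compute the toil share. The sketch below assumes a hypothetical weekly time log keyed by category name; the categories and hours are invented for illustration.

```python
TOIL_LIMIT = 0.5  # the 50% ceiling described above


def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Share of logged hours spent on categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for cat, h in hours_by_category.items() if cat in toil_categories)
    return toil / total


# Hypothetical week: 22 of 40 hours went to toil.
week = {"manual deployments": 10, "ticket handling": 12, "automation": 18}
share = toil_fraction(week, {"manual deployments", "ticket handling"})
print(f"{share:.0%} toil, over limit: {share > TOIL_LIMIT}")
```

Which categories count as toil is a judgment call for each team; the calculation only makes the trend visible.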
🚀 Getting Started
- Build software engineering skills: SREs are engineers first—learn to code well
- Understand distributed systems: CAP theorem, consensus, failure modes
- Read the SRE books: Free from Google—essential reading for the field
- Practice incident response: Join on-call rotations, participate in postmortems
- Learn observability: Prometheus, Grafana, distributed tracing
- Enter from SWE or DevOps: Most SREs transition from adjacent roles
- Target SRE-focused companies: Google, Meta, and LinkedIn have strong SRE cultures