SRE/Infra Engineer Resources
TIER 1
SRE / Systems Engineering — From Google
- SRE Book → https://sre.google/sre-book/table-of-contents/ — Not a cover-to-cover book. Read chapters on error budgets, SLOs, toil, and on-call first. That’s the philosophy.
- SRE Workbook → https://sre.google/workbook/table-of-contents/ — More hands-on than the SRE book. The NALSD (Non-Abstract Large System Design) chapter is gold for infra engineers doing system design.
- Building Secure and Reliable Systems → https://google.github.io/building-secure-and-reliable-systems/raw/toc.html — Read this after the SRE book. It’s where security and reliability intersect — very relevant for banking.
- LinkedIn School of SRE → https://linkedin.github.io/school-of-sre/ — Underrated. Structured curriculum by people who run LinkedIn-scale infra. Great for going deep on Linux, networking, and databases from an SRE lens.
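The error-budget idea from the SRE Book chapters above comes down to simple arithmetic: the budget is whatever the SLO does not promise. Here is a minimal sketch of that calculation. The names `ErrorBudget` and `allowed_downtime_minutes` are made up for illustration and do not come from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical helper)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the SLO window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO does not promise.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched budget, 0.0 = fully spent, negative = SLO violated.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / self.allowed_failures


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view of the same budget, in minutes per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
    print(f"downtime allowed at 99.9% over 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

A remaining fraction near zero is the usual signal to slow down releases and spend time on reliability. The exact policy is a team decision, not something this sketch encodes.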
Distributed Systems
- Designing Data-Intensive Applications (DDIA) by Martin Kleppmann — This is by far the most practical book you’ll ever find about distributed systems. Read it slowly, twice. The chapters on replication, partitioning, and distributed transactions are the ones infra people actually live with. Non-negotiable. https://dataintensive.net/
- Database Internals by Alex Petrov — A fantastic book for anyone wondering how a database works. Read it after DDIA, because it goes into more detail. Covers B-trees, LSM-trees, and consensus algorithms at the implementation level.
- Understanding Distributed Systems by Roberto Vitillo — Lighter than DDIA, better as a first pass if DDIA feels heavy. Good for building mental models before going deep.
TIER 2 — LINUX & SYSTEMS INTERNALS (Your home territory)
- The Linux Programming Interface (TLPI) by Michael Kerrisk — The single most complete reference on Linux system calls, processes, signals, IPC, and sockets. 1500+ pages. Don’t read linearly, use it like a bible. Best Linux book written.
- How Linux Works by Brian Ward — Read this first before TLPI. It’s the entry ramp. Covers boot process, kernel, filesystems, and networking in accessible depth.
- Linux Kernel Development by Robert Love — After TLPI. Gets into actual kernel internals — scheduling, memory management, VFS. You don’t need this on day 1, but you’ll want it in year 2-3.
The person you must follow: Brendan Gregg
Brendan Gregg developed the USE Method (Utilization, Saturation, and Errors), a methodology for performance analysis. He created flame graphs and pioneered eBPF as an observability technology. He is arguably the most influential living practitioner of Linux performance work.
- Blog → https://www.brendangregg.com — Required reading. Start with “Linux Performance Analysis in 60,000 Milliseconds” (the Netflix post). Bookmark the Linux performance tools diagram — it’s on walls at Netflix and Google.
- Systems Performance, 2nd Edition by Brendan Gregg — Covers performance analysis methods and Linux tools including perf, Ftrace, and eBPF. Addresses hardware, kernel, and application internals, and how they perform. This is the reference for anyone debugging production systems.
- BPF Performance Tools by Brendan Gregg — Goes deep on eBPF. Read Systems Performance first.
- YouTube → Search “Brendan Gregg LISA” and “Brendan Gregg Linux Performance” — his USENIX/LISA talks are 40–90 mins and worth every minute. Specifically: Linux Systems Performance (LISA 2019) on YouTube.
TIER 3 — SYSTEM DESIGN (HLD for Infra people)
- ByteByteGo → https://bytebytego.com — Alex Xu’s newsletter and YouTube channel. The YouTube shorts on “how X works at scale” are legitimate. The newsletter goes deeper. Actually useful, not just interview prep.
- Educative.io — Grokking the System Design Interview is overused but solid for building vocabulary fast. Don’t over-rely on it.
- High Scalability → http://highscalability.com — Technical blog posts about systems architecture. Real write-ups on how companies like Twitter, Netflix, and Uber actually built their systems. Read the “Building X” case studies.
What to add:
- Martin Kleppmann’s Cambridge Lectures (free on YouTube) — 8-lecture series on distributed systems from the DDIA author himself. Covers Lamport clocks, Raft, consensus, linearizability. Best free distributed systems course available.
- MIT 6.824 Distributed Systems (since renumbered 6.5840) → https://pdos.csail.mit.edu/6.824/ — Free course from MIT. Read the papers (Raft, MapReduce, Spanner, ZooKeeper). These are what real systems are built on.
- The System Design Primer → https://github.com/donnemartin/system-design-primer — GitHub repo with 200k+ stars. Good reference map, not a replacement for books.
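Lamport clocks, one of the topics in the Kleppmann lectures listed above, fit in a few lines of code. This is a minimal sketch of the standard rule: increment on every local event, and on receive jump to one past the larger of the two timestamps. The class and method names are chosen here for illustration.

```python
class LamportClock:
    """Logical clock for one process (illustrative sketch, not thread-safe)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending counts as an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """Merge the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, msg_time) + 1
        return self.time


if __name__ == "__main__":
    a, b = LamportClock(), LamportClock()
    a.tick()                  # a does some local work
    stamp = a.send()          # a sends a message to b
    print(b.receive(stamp))   # b's clock moves past the send timestamp
```

The guarantee is one-directional: if event X happened before event Y, X has the smaller timestamp, but a smaller timestamp alone does not prove that one event caused the other.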
TIER 4 — BLOGS PRACTITIONERS ACTUALLY READ DAILY
- Brendan Gregg’s Blog — highly technical posts about systems internals, performance, and SRE.
- Everything Sysadmin by Tom Limoncelli — blog posts about SysAdmin/DevOps/SRE.
- High Scalability — technical blog posts about systems architecture.
More:
- Netflix Tech Blog → https://netflixtechblog.com — The place where Brendan Gregg’s team published during his years at Netflix. Real production war stories. The chaos engineering posts, the eBPF posts, the performance posts are all must-reads.
- Cloudflare Blog → https://blog.cloudflare.com — Legitimately excellent deep technical writing. Their posts on TCP, BGP, DNS, and DDoS mitigation are some of the best public writing on networking at scale.
- Meta Engineering Blog → https://engineering.fb.com — How Facebook scales Cassandra, Memcached, TAO (their social graph DB). Real architecture decisions explained.
- AWS Architecture Blog → https://aws.amazon.com/blogs/architecture/ — If you’re on AWS every day, read this. Reference architectures + engineering decisions.
- Uber Engineering → https://www.uber.com/blog/engineering/ — Their posts on Kafka at scale, Schemaless (their DB), and microservices migrations are case studies in real infra decision-making.
- Cindy Sridharan’s Blog → https://copyconstruct.medium.com — Blog posts about distributed systems and their management. Her writing on observability and testing in production is some of the clearest thinking in the field.
- Increment Magazine → https://increment.com — A digital magazine about how teams build and operate software systems at scale. Every issue is free online. The on-call, reliability, and infrastructure issues are excellent.
TIER 5 — AWESOME LISTS (You already know these, use them as maps not destinations)
- awesome-sre → https://github.com/dastergon/awesome-sre — The definitive SRE resource aggregator.
- howtheysre → https://github.com/upgundecha/howtheysre — A curated knowledge repository of SRE best practices, tools, and culture adopted by leading technology organizations, compiled from engineering blogs, conferences, and meetups. Goldman Sachs, Airbnb, LinkedIn, Netflix — how they all actually do SRE.
- system-design-primer → https://github.com/donnemartin/system-design-primer
- roadmap.sh/devops → https://roadmap.sh/devops — Good for knowing what you don’t know yet.
TIER 6 — YOUTUBE CHANNELS (Actually watch these)
- Brendan Gregg — USENIX/LISA talks as above
- USENIX/LISA Conference talks → https://www.youtube.com/@USENIXAssociation — Real talks from engineers at Google, Netflix, Cloudflare. Not tutorial fluff.
- SREcon → Search YouTube. Real practitioner talks from the industry’s SRE conference.
- Hussein Nasser → YouTube — Excellent on networking, databases, and backend architecture. One of the better technical educators on YouTube who actually goes deep.
- Martin Kleppmann → YouTube — His Cambridge lecture series on distributed systems.
NEWSLETTERS WORTH SUBSCRIBING
- SRE Weekly → https://sreweekly.com — Curated weekly SRE reading list. People actually use this.
- KubeWeekly — If you’re living in K8s, this matters.
- ByteByteGo Newsletter — Alex Xu sends a system design deep-dive weekly.
NETWORKING
Books — Read in this order
Computer Networking: A Top-Down Approach by Kurose & Ross — The standard university textbook, but genuinely good. Starts at the application layer (HTTP, DNS, email) and works down to physical. Best first networking book for someone who already does systems work, because you see how the thing you use every day actually functions. Use the 7th or 8th edition.
TCP/IP Illustrated, Vol. 1 by W. Richard Stevens — This unique book illustrates the TCP/IP protocol with hands-on examples. Instead of going through the RFC, Stevens uses popular diagnostic tools to show the protocol in action — providing a much greater understanding of TCP mechanisms such as connection establishment, timeouts, sliding windows, retransmissions, and fragmentation. This is the book senior engineers have on their desks. Read it after Kurose & Ross. Vol. 1 is the one you actually need. The series is a career-long reference.
Computer Networks by Andrew Tanenbaum — Exhaustive at every level from basic EE modulation to Layer 7 applications. More academic than Stevens but covers breadth no other book matches. Treat it as a deep-dive reference, not a cover-to-cover read.
Network Warrior by Gary A. Donahue — Practitioners love this one. Real-world routers, switches, firewalls, and troubleshooting from someone who actually ran production networks. Less theory, more “this is what you’ll actually do.” Read this alongside Stevens.
High Performance Browser Networking by Ilya Grigorik (free online: https://hpbn.co) — The author was a web performance engineer at Google. Covers TCP, UDP, TLS, HTTP/1.1, HTTP/2, WebSockets, WebRTC from a performance lens. Directly applicable to cloud and SRE work. Free.
Blogs for Networking
Cloudflare Blog → https://blog.cloudflare.com — Already in the previous list, but worth repeating here specifically for networking. Their posts on BGP, DNS, anycast routing, TCP tuning, and DDoS are the best public writing on production networking that exists. Not vendor docs, actual engineering war stories.
Julia Evans (b0rk) → https://jvns.ca — Her posts on DNS, TCP, networking tools (tcpdump, curl, dig) are the clearest explanations of networking internals you’ll find anywhere. The zines she makes on DNS and networking are legitimately excellent for building mental models fast.
IETF RFCs — Not a blog, but learn to read RFCs. RFC 9293 (TCP, which obsoletes the original RFC 793), RFC 791 (IP), RFC 1035 (DNS). You won’t read them for fun, but when you’re debugging something weird at 2am, the RFC is the source of truth. Bookmark https://www.rfc-editor.org.
PERFORMANCE OPTIMIZATION
Already covered Brendan Gregg heavily. Adding what’s missing:
Web Scalability for Startup Engineers by Artur Ejsmont — Bridges the gap between Linux performance and cloud-scale application architecture. Good for understanding caching layers, CDNs, queue-based architectures, and how performance decisions at the infra layer ripple into application behavior.
The Art of Capacity Planning by John Allspaw — Allspaw ran ops at Flickr and Etsy. This short book is about predicting system load and planning infrastructure ahead of demand. Old but the mental models are timeless. Directly useful in SRE/infra work.
High Performance MySQL by Baron Schwartz — If you’re dealing with any database-backed infrastructure (banking — you will be), this is the reference. Query optimization, indexing strategy, replication, partitioning. The performance engineering mindset in here applies beyond MySQL.
USE Method by Brendan Gregg → https://www.brendangregg.com/usemethod.html — Not a book, a free methodology page. Utilization, Saturation, Errors — a systematic checklist for debugging resource bottlenecks. Print this out. It’s the framework working SREs actually use when something is on fire.
Netflix Tech Blog performance posts → https://netflixtechblog.com — Search “performance” in the Netflix blog. The posts on JVM profiling, eBPF observability, and their “Linux Performance Analysis in 60,000 Milliseconds” article are field manuals.
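The USE Method entry above describes a per-resource checklist, which can be sketched as a small function: for every resource, look at errors, saturation, and utilization, and report whatever stands out. `Reading`, `use_findings`, and the 0.7 utilization threshold are illustrative assumptions here, and collecting the numbers (from `/proc`, `vmstat`, `iostat`, and so on) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One resource sampled over an interval (hypothetical structure)."""
    resource: str        # e.g. "cpu", "disk", "network"
    utilization: float   # fraction of time busy or capacity used, 0.0 to 1.0
    saturation: float    # queued work, e.g. run-queue length or wait-queue depth
    errors: int          # error events counted in the interval


def use_findings(readings: List[Reading], util_threshold: float = 0.7) -> List[str]:
    """Walk every resource and flag errors, saturation, and high utilization."""
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append(f"{r.resource}: {r.errors} errors")
        if r.saturation > 0:
            findings.append(f"{r.resource}: saturated (queue={r.saturation:g})")
        if r.utilization >= util_threshold:
            findings.append(f"{r.resource}: utilization {r.utilization:.0%}")
    return findings


if __name__ == "__main__":
    sample = [
        Reading("cpu", utilization=0.95, saturation=3, errors=0),
        Reading("disk", utilization=0.20, saturation=0, errors=2),
        Reading("network", utilization=0.10, saturation=0, errors=0),
    ]
    for line in use_findings(sample):
        print(line)
```

The value of the method is coverage: every resource gets all three questions, so a quiet-looking disk with a non-zero error count is not skipped just because the CPU looks busier.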
COST OPTIMIZATION / FINOPS
This is a newer discipline but growing fast, especially in banking/enterprise where cloud bills hit nine figures.
Cloud FinOps by J.R. Storment & Mike Fuller (O’Reilly) — The definitive book on cloud financial management. Covers the FinOps lifecycle: inform, optimize, operate. Written by the founders of the FinOps Foundation. This is the primary text for anyone who wants to own cost governance seriously.
Efficient Cloud FinOps by Sánchez & García (Packt, 2024) — A practical guide covering FinOps for AWS, Azure, and GCP — the three phases of inform, optimize, and operate — with real-world case studies and architectural patterns. More hands-on than the O’Reilly book above.
FinOps Foundation → https://www.finops.org — Free resources, frameworks, and the FOCUS open cost specification. This is the industry body. Their website has playbooks, training, and the state of FinOps annual report. Bookmark it.
AWS Well-Architected Framework — Cost Optimization Pillar → https://aws.amazon.com/architecture/well-architected/ — Free from AWS. Covers rightsizing, reserved instances, savings plans, idle resource detection, and Graviton migration. Not exciting reading, but this is what AWS actually recommends and what interviewers and architects reference.
AWS Cost Management Blog → https://aws.amazon.com/blogs/aws-cloud-financial-management/ — Where new FinOps features get announced and explained. Bookmark and skim monthly.
Kubernetes Cost Optimization → https://www.kubecost.com/blog — Kubecost’s blog covers K8s-specific cost breakdown, namespace-level attribution, and spot instance strategies. Very practical since K8s cost visibility is notoriously hard.
SCALABILITY
Most of this overlaps with distributed systems (DDIA, High Scalability blog) already covered. Additions:
The Art of Scalability by Abbott & Fisher — The only book specifically about organizational and architectural scalability together. Introduces the AKF Scale Cube (X/Y/Z axis scaling). Used by engineers at PayPal, eBay. More strategic than tactical but gives you vocabulary for HLD conversations.
Release It! by Michael Nygard — How production systems fail and how to design them so they don’t. Covers circuit breakers, bulkheads, timeouts, and cascading failure patterns. One of the most practically useful books for SRE/infra people that isn’t labeled “SRE.”
Google SRE Workbook — NALSD Chapter → https://sre.google/workbook/non-abstract-large-system-design/ — Already in the previous list, but calling it out again specifically for scalability. The Non-Abstract Large System Design framework is Google’s internal methodology for designing systems at scale. Free.
All Things Distributed by Werner Vogels → https://www.allthingsdistributed.com — Werner Vogels is Amazon’s CTO. His blog covers distributed systems, eventual consistency, and AWS architectural decisions. Long-form posts from someone who actually runs the world’s largest cloud.
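The circuit breaker pattern mentioned in the Release It! entry above can be sketched in a few dozen lines. This is a minimal single-threaded illustration, not Nygard's implementation: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are names and values chosen here. After `max_failures` consecutive failures the breaker opens and fails fast. Once `reset_after` seconds have passed it lets one trial call through (the half-open state) and closes again if that call succeeds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock                       # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None   # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call re-opens immediately; otherwise open at the threshold.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast is what stops a slow dependency from tying up every caller's threads, which is how the cascading failures the book describes tend to start. A production version would also need locking and per-dependency metrics.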
CORE SUBJECTS
Operating Systems
Operating Systems: Three Easy Pieces (OSTEP) by Arpaci-Dusseau — Perhaps the most popular OS book today. It breaks down a really wide and deep topic into bite-sized chunks. Each chapter is typically less than 20 pages, written in clear, accessible language, making complex concepts digestible without sacrificing depth. Covers universal concepts like how virtual memory works, what happens during a context switch, how file systems manage data, and how concurrency is handled at the OS level. Completely free → https://pages.cs.wisc.edu/~remzi/OSTEP/
This is the OS book to read before Linux Kernel Development. It explains the why behind what you’ll see in the kernel code. The “Dinosaur Book” (Operating System Concepts by Silberschatz, Galvin, and Gagne) is the classic academic alternative, but OSTEP is more readable and free.
Databases
Database Internals by Alex Petrov — Already in the previous list. How databases actually work at the storage engine level. B-trees, LSM-trees, write-ahead logs, distributed consensus. Read after DDIA.
Readings in Database Systems (The Red Book) — http://www.redbook.io — Free online. Curated collection of the most important database research papers with commentary from Peter Bailis, Joe Hellerstein, and Michael Stonebraker. This is what database PhDs read. Not a textbook, a paper collection — dip in when you want to understand why something was designed a certain way.
CMU Database Course (free on YouTube) — Andy Pavlo’s 15-445/645 course at CMU. Full lectures, slides, and assignments. One of the best database courses ever put on the internet, free. Covers storage models, indexing, query execution, concurrency control, and distributed databases. YouTube: search “CMU Database Systems Andy Pavlo.”
Algorithms & Data Structures — The Honest Take
For SRE/infra/platform engineering, you don’t need LeetCode hard. What you actually need:
The Algorithm Design Manual by Steven Skiena — Better for engineers than CLRS (Introduction to Algorithms). Focuses on recognizing which algorithmic approach fits which problem. The second half is a catalog of algorithm problems by type. Used by working engineers, not just academia.
Introduction to Algorithms (CLRS) by Cormen et al — The reference text. Dense. You don’t read it, you look things up in it. Useful when you need to understand time complexity of something in a storage system or a scheduling algorithm.
What’s actually relevant for infra/SRE work from algorithms: consistent hashing (load balancing, distributed caches), binary search (log analysis, sorted data structures), graph algorithms (network topology, dependency resolution), heap/priority queues (task schedulers), bloom filters (existence checks in distributed systems). You’ll encounter all of these in Kubernetes internals, etcd, and distributed databases. Learn these specifically rather than grinding LeetCode.
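Two items from that list, consistent hashing and binary search, come together in a hash ring. The sketch below is an illustration under stated assumptions: `HashRing`, the use of SHA-256, and 100 virtual nodes per server are choices made here, not how any particular system does it. Each node is hashed onto the ring many times, and a key belongs to the first node point at or after the key's hash, found with `bisect`.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    # Stable across processes, unlike the built-in hash() for strings.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._points: List[int] = []       # sorted positions on the ring
        self._owner: Dict[int, str] = {}   # ring position -> node name
        self._vnodes = vnodes
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring has no nodes")
        # Binary search for the first point clockwise from the key's hash,
        # wrapping around to the start of the ring if we run off the end.
        i = bisect.bisect_right(self._points, _hash(key))
        return self._owner[self._points[i % len(self._points)]]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.remove("cache-c")
    moved = sum(1 for k in keys if ring.lookup(k) != before[k])
    print(f"{moved} of {len(keys)} keys moved after removing one node")
```

Only the keys owned by the removed node move. With naive `hash(key) % n` placement, changing `n` would reshuffle most keys, which is why distributed caches and databases favor the ring.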
FULL MAP — Everything Together
To avoid hunting across two responses, here’s the complete categorized list:
SRE / Systems Engineering
- Google SRE Book → https://sre.google/sre-book/table-of-contents/
- Google SRE Workbook → https://sre.google/workbook/table-of-contents/
- Building Secure & Reliable Systems → https://google.github.io/building-secure-and-reliable-systems/raw/toc.html
- LinkedIn School of SRE → https://linkedin.github.io/school-of-sre/
- awesome-sre → https://github.com/dastergon/awesome-sre
- howtheysre → https://github.com/upgundecha/howtheysre
Linux & Systems Internals
- How Linux Works — Brian Ward
- The Linux Programming Interface — Michael Kerrisk
- Linux Kernel Development — Robert Love
- Systems Performance 2nd Ed — Brendan Gregg
- BPF Performance Tools — Brendan Gregg
- Brendan Gregg’s blog/talks → https://www.brendangregg.com
Networking
- Computer Networking: A Top-Down Approach — Kurose & Ross
- TCP/IP Illustrated Vol. 1 — W. Richard Stevens
- Computer Networks — Tanenbaum
- Network Warrior — Gary Donahue
- High Performance Browser Networking (free) → https://hpbn.co
- Julia Evans blog → https://jvns.ca
- Cloudflare Blog → https://blog.cloudflare.com
Distributed Systems & System Design
- Designing Data-Intensive Applications — Martin Kleppmann → https://dataintensive.net
- Database Internals — Alex Petrov
- Understanding Distributed Systems — Roberto Vitillo
- MIT 6.824 → https://pdos.csail.mit.edu/6.824/
- Martin Kleppmann Cambridge lectures (YouTube)
- System Design Primer → https://github.com/donnemartin/system-design-primer
- ByteByteGo → https://bytebytego.com
- High Scalability → http://highscalability.com
- All Things Distributed → https://www.allthingsdistributed.com
Performance Optimization
- USE Method → https://www.brendangregg.com/usemethod.html
- Web Scalability for Startup Engineers — Artur Ejsmont
- The Art of Capacity Planning — John Allspaw
- High Performance MySQL — Baron Schwartz
- Netflix Tech Blog → https://netflixtechblog.com
Scalability
- The Art of Scalability — Abbott & Fisher
- Release It! — Michael Nygard
- SRE Workbook NALSD Chapter → https://sre.google/workbook/non-abstract-large-system-design/
Cost Optimization / FinOps
- Cloud FinOps — Storment & Fuller (O’Reilly)
- Efficient Cloud FinOps — Sánchez & García (Packt, 2024)
- FinOps Foundation → https://www.finops.org
- AWS Well-Architected Cost Pillar → https://aws.amazon.com/architecture/well-architected/
- AWS Cloud Financial Management Blog → https://aws.amazon.com/blogs/aws-cloud-financial-management/
- Kubecost Blog → https://www.kubecost.com/blog
Core CS
- OSTEP (free) → https://pages.cs.wisc.edu/~remzi/OSTEP/
- The Red Book (DB papers, free) → http://www.redbook.io
- CMU Database Course → YouTube, Andy Pavlo
- The Algorithm Design Manual — Steven Skiena
- CLRS — Reference only
Blogs / Newsletters (Daily/Weekly)
- Brendan Gregg → https://www.brendangregg.com
- Netflix Tech Blog → https://netflixtechblog.com
- Cloudflare Blog → https://blog.cloudflare.com
- Meta Engineering → https://engineering.fb.com
- Uber Engineering → https://www.uber.com/blog/engineering/
- Cindy Sridharan → https://copyconstruct.medium.com
- Julia Evans → https://jvns.ca
- Increment → https://increment.com
- SRE Weekly → https://sreweekly.com
- ByteByteGo newsletter
- AWS Architecture Blog → https://aws.amazon.com/blogs/architecture/
The Honest Reading Order for You Specifically
Given where you are right now, here’s the sequence that makes sense:
- How Linux Works (3–4 weeks) — fills gaps fast
- LinkedIn School of SRE (parallel, read online) — practical framing
- Google SRE Book (selective chapters, not cover to cover)
- TLPI (long-term reference, dip in and out)
- DDIA (3 months, slow and deliberate)
- Brendan Gregg’s Systems Performance (year 2)
- Linux Kernel Development (year 2–3, when you’re debugging kernel-level stuff)
- BPF Performance Tools (year 3+)