Practicing Site Reliability Engineering

2 days - Advanced

This course is tailored for professionals eager to master Site Reliability Engineering (SRE). Learn Google's cutting-edge approaches to ensuring system reliability and performance, blending essential SRE principles with hands-on experience in key operational tasks.

devops, sre, cloud

Training details

Description

As one of the founders of the SRE discipline puts it, Site Reliability Engineering is "what happens when you ask software engineers to manage IT infrastructure and operations". SRE applies proven software development techniques to operational tasks such as monitoring, deployment, and incident management.

An SRE approach aims to optimize both teams and the technical systems they operate. The goal is to make it easier for systems to evolve without compromising reliability and availability requirements. This involves continuously monitoring system health, automating as many tasks as possible, and ongoing team learning.

Through a mix of theoretical modules that revisit the foundations of SRE and practical work that applies those concepts, this training gives participants a deeper understanding of the techniques, methods, and tools needed to implement an SRE approach in their own environment.

Objectives

Target Audience

Anyone involved in operating or managing a production IT system.

Prerequisites

Familiarity with common DevOps terminology and concepts.

Pedagogical method

The training combines theory, discussion of participants' own contexts, and feedback from the trainers' practical experience, supplemented by practical work and simulations.

Evaluation and follow-up mode

Skill acquisition is evaluated throughout the session via workshops and practical applications. A post-training satisfaction survey is systematically conducted, and a training certificate is issued to participants, detailing the training objectives, nature, program, duration, and formalized learnings.

Program

  1. Day 1

    • Introduction to Site Reliability Engineering
      • History and emergence of the SRE discipline
      • Its integration at Google
      • Correlation with the DevOps movement
    • Operating Production Systems
      • Different roles and responsibilities of an SRE team
      • Ensuring application and service reliability
      • Managing the error budget
      • Minimizing toil
    • SRE: Ensuring Service and Application Reliability
      • Software lifecycle
      • Definitions of Reliability
    • Monitoring
      • Concepts and Definitions (Monitoring vs. Observability)
      • Effective alert systems
      • Applied statistics in monitoring
    • On-call organization
      • Efficient incident diagnosis
      • Error report writing
      • Practical Exercise: "Diagnosing and Resolving a Production Incident"
    • Production Readiness Review for service or application management
    • Release Engineering: Change Management
  2. Day 2

    • SRE: Managing the Error Budget
      • Risk management in IT systems
      • Tools for SRE team management: SLI, SLO, SLA, Error Budget
      • Practical Exercise: "Setting up SLI/SLO/Error Budget for a Service/Application"
    • SRE: Automating Services
      • Economic constraints - team scalability
      • Addressing toil: low-value-added tasks
      • Identifying toil
      • Allocating time for automation
      • System operation automation organization
      • Practical Exercise: "Identifying Automatable Tasks"
    • Organization and Culture
      • SRE vs DevOps
      • SRE Team in an Agile Organization
      • Integration and Impact on the Rest of the Organization
      • Establishing a Learning Culture
      • Psychological Safety
      • Blameless Postmortem
      • Integrating a New SRE

Contact us to discuss your project

Send us an email and we will get back to you as soon as possible: [email protected]