Training details

Description

As one of the founders of the SRE discipline puts it, Site Reliability Engineering is "what happens when you ask software engineers to manage IT infrastructure and operations". SRE applies proven software development techniques to operational tasks such as monitoring, deployment, and incident management.

An SRE approach aims to optimize both teams and the technical systems they operate. The goal is to help systems evolve without compromising reliability and availability requirements. This involves continuously monitoring system health, automating as many tasks as possible, and fostering ongoing team learning.

Through a mix of theoretical modules that revisit the foundations of SRE and practical work that applies the concepts, this training gives participants a deeper understanding of the techniques, methods, and tools essential for implementing an SRE approach in their environment.

Objectives

Learn the principles and practices of Site Reliability Engineering
Identify different roles in an SRE team
Learn to set performance and reliability goals and define the means to achieve them
Monitor the reliability of your platform
Facilitate dialogue with development/product teams through the management of a shared "error budget"
Effectively handle incidents, turning them into improvement and learning opportunities

Target Audience

Anyone involved in operating or managing a production IT system, including:

Ops and System Administrators
Information System Managers (CIO, CTO, etc.)
Developers
Consultants
Integrators
Operators

Prerequisites

Familiarity with common DevOps terminology and concepts.

Pedagogical method

The training combines theoretical presentations, discussions of participants' own contexts, and field experience shared by the trainers, supplemented by practical work and simulations.

Proportion of presentations: 70%
Proportion of practical cases: 20%
Proportion of experience sharing: 10%

Evaluation and follow-up mode

Skill acquisition is evaluated throughout the session via workshops and practical applications. A post-training satisfaction survey is systematically conducted, and a training certificate is issued to participants, detailing the training objectives, nature, program, duration, and formalized learnings.

Program

Day 1
- Introduction to Site Reliability Engineering
  - History and emergence of the SRE discipline
  - Its integration at Google
  - Relationship with the DevOps movement
- Operating Production Systems
  - Different roles and responsibilities of an SRE team
  - Ensuring application and service reliability
  - Managing the error budget
  - Minimizing toil
- SRE: Ensuring Service and Application Reliability
  - Software lifecycle
  - Definitions of Reliability
- Monitoring
  - Concepts and Definitions (Monitoring vs. Observability)
  - Effective alert systems
  - Applied statistics in monitoring
- On-call organization
  - Efficient incident diagnosis
  - Writing incident reports
  - Practical Exercise: "Diagnosing and Resolving a Production Incident"
- Production Readiness Review for service or application management
- Release Engineering: Change Management

Day 2
- SRE: Managing the Error Budget
  - Risk management in IT systems
  - Tools for SRE team management: SLI, SLO, SLA, Error Budget
  - Practical Exercise: "Setting up SLI/SLO/Error Budget for a Service/Application"
- SRE: Automating Services
  - Economic constraints - team scalability
  - Addressing toil: low-value-added tasks
  - Identifying toil
  - Allocating time for automation
  - Organizing the automation of system operations
  - Practical Exercise: "Identifying Automatable Tasks"
- Organization and Culture
  - SRE vs DevOps
  - SRE Team in an Agile Organization
  - Integration and Impact on the Rest of the Organization
  - Establishing a Learning Culture
  - Psychological Safety
  - Blameless Postmortem
  - Integrating a New SRE

Practicing Site Reliability Engineering