Practicing Site Reliability Engineering
2 days - Advanced
This course is tailored for professionals eager to master Site Reliability Engineering (SRE). Learn Google's cutting-edge approaches to ensuring system reliability and performance, blending essential SRE principles with hands-on experience in key operational tasks.
Training details
Description
As one of the SRE discipline founders puts it, Site Reliability Engineering is "what happens when you ask software engineers to manage IT infrastructure and operations". SRE involves applying proven software development techniques to operational tasks such as monitoring, deployment, and incident management.
An SRE approach aims to optimize both teams and the technical systems they operate. The goal is to let systems evolve without compromising their reliability and availability requirements. This involves continuously monitoring system health, automating as many tasks as possible, and ongoing team learning.
Through a mix of theoretical modules that revisit the foundations of SRE and practical work that applies the concepts, this training gives participants a deeper understanding of the techniques, methods, and tools needed to implement an SRE approach in their own environment.
Objectives
- Learn the principles and practices of Site Reliability Engineering
- Identify different roles in an SRE team
- Learn to set performance and reliability goals, and define the means to achieve them
- Monitor the reliability of your platform
- Facilitate dialogue with development/product teams through the management of a shared "error budget"
- Effectively handle incidents, turning them into improvement and learning opportunities
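To make the "error budget" objective concrete, here is a minimal Python sketch of the underlying arithmetic. It is an illustration only, not course material: the function names, the 30-day window, and the request-based view of availability are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) allowed by an availability SLO over a window.

    slo_target is a fraction such as 0.999; window_days is an assumed
    rolling window, not a value prescribed by the course.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of failed requests the SLO tolerates over the period.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

In this sketch, a team that has spent most of its budget would slow down releases and focus on reliability work, while a team with budget to spare can keep shipping changes; this is the dialogue with development and product teams that the objective refers to.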
Target Audience
Anyone involved in operating or managing a production IT system, including:
- Ops and System Administrators
- Information System Managers (CIO, CTO, etc.)
- Developers
- Consultants
- Integrators
- Operators
Prerequisites
Understanding and knowledge of common DevOps terminology and concepts.
Pedagogical method
The training combines theoretical presentations, discussion of participants' own contexts, and feedback from the trainers' practical experience, supplemented by practical work and simulations.
- Proportion of presentations: 70%
- Proportion of practical cases: 20%
- Proportion of experience sharing: 10%
Evaluation and follow-up
Skill acquisition is evaluated throughout the session via workshops and practical applications. A post-training satisfaction survey is systematically conducted, and each participant receives a training certificate detailing the training objectives, nature, program, duration, and skills acquired.
Program
Day 1
- Introduction to Site Reliability Engineering
- History and emergence of the SRE discipline
- Its integration at Google
- Relationship to the DevOps movement
- Operating Production Systems
- Different roles and responsibilities of an SRE team
- Ensuring application and service reliability
- Managing the error budget
- Minimizing toil
- SRE: Ensuring Service and Application Reliability
- Software lifecycle
- Definitions of Reliability
- Monitoring
- Concepts and Definitions (Monitoring vs. Observability)
- Effective alert systems
- Applied statistics in monitoring
- On-call organization
- Efficient incident diagnosis
- Writing incident reports
- Practical Exercise: "Diagnosing and Resolving a Production Incident"
- Production Readiness Review for service or application management
- Release Engineering: Change Management
Day 2
- SRE: Managing the Error Budget
- Risk management in IT systems
- Tools for SRE team management: SLI, SLO, SLA, Error Budget
- Practical Exercise: "Setting up SLI/SLO/Error Budget for a Service/Application"
- SRE: Automating Services
- Economic constraints - team scalability
- Addressing toil: low-value-added tasks
- Identifying toil
- Allocating time for automation
- Organizing the automation of system operations
- Practical Exercise: "Identifying Automatable Tasks"
- Organization and Culture
- SRE vs DevOps
- SRE Team in an Agile Organization
- Integration and Impact on the Rest of the Organization
- Establishing a Learning Culture
- Psychological Safety
- Blameless Postmortem
- Integrating a New SRE
Contact us to discuss your project
Send us an email and we will get back to you as soon as possible: [email protected]