
Empowering Production Support Teams: Training, Reflection, and Cultivating Culture for Success

 

In the world of Managed Services and Site Reliability Engineering (MS/SRE), the significance of comprehensive training in handling production incidents cannot be overstated. Even the best-engineered solutions will experience outages. As systems grow in scale and complexity, failures will happen. These situations can be critical challenges for a company, potentially leading to significant revenue loss or reputational damage. The cause is often something completely unanticipated, never encountered before, non-standard, or a complex convergence of issues that requires root cause analysis. The decisions SRE teams make in these high-pressure, high-stakes moments, often against urgent timelines, are vital. So, the big questions are:

  • How do you train an MS/SRE team to make good decisions when in the pilot’s seat during a large incident?
  • Experience goes a long way. How do you ramp up new team members?
  • How do you maintain a good MS/SRE culture and mindset to address these situations?

Here are 3 key “habits” that are often overlooked in training production support teams:


1. Building Competence: Using a Mentorship Model 

“The truth is realized in an instant. The act is practiced step by step.”  
-Zen saying 

Despite being drowned in metrics and data, a well-seasoned engineer from the SRE team can take control during a crisis and make the right decisions. The same situation can be completely overwhelming for a less-experienced engineer. It takes time to understand normal and exceptional system behaviors, observability, and incident management. One of the best models for transferring this nuanced knowledge is a well-run mentorship program, a crucial part of the SRE roadmap. For critical responsibility areas, it is important to assign a mentor to any engineer ramping up on a system. The pairing continues until a level of competence is established, which may take some time. Constant open dialogue, working sessions, and assessments continue until the domain knowledge is transferred.

2. Learn from the Past to Improve the Future: The critical importance of post-mortems 

“If it can be destroyed by the truth, it deserves to be destroyed by the truth.” 
-Carl Sagan 

One of the most overlooked learning opportunities lies in incident post-mortems. Different disciplines that handle high-stakes, high-pressure situations have their own form of post-incident review. Fighter pilots have post-sortie debriefs. Surgical teams have post-operation debriefs. Professional sports teams watch post-game film. Bringing all key stakeholders and the SRE teams together in the post-incident review is important. Discussing what happened, what was observed, why certain decisions were made, and what actions were taken has high learning value on multiple levels. At a procedural level, it allows the SRE team to evaluate and improve the effectiveness of its incident management practices. At a solution level, all teams gain an understanding of any vulnerabilities within the system and can plan actions to improve reliability. The best learning opportunities come from understanding the point of view of whoever commanded the incident response. There is value in reviewing the data points, observations, planning, and thinking that took place during the crisis. An open dialogue that draws on expertise across teams helps everyone understand what went well and what went badly in how the incident was handled. On our SRE teams, we place heavy focus on post-mortems. We use the more complex post-mortems as teaching points and often refer to them when training new team members.

3. Fostering Success by Establishing a Learning Culture 

“If you run alone, you run fast. If you run together, you run far.”  
- Zambian Proverb 

Cultivating a robust and positive team culture is what allows consistent, long-term MS/SRE success. A culture that places a premium on open communication, supported by the right communication tools, establishes a foundation for sharing insights, rapid learning, and strong collaboration. The easier you make it for the SRE teams to collaborate, the easier it is to keep everybody on the same page and working from a shared mindset.

Another key recommendation is to establish, track, and manage KPIs for the SRE team. As a quote often attributed to W. Edwards Deming puts it, “If you cannot measure it, you cannot manage it.” Measuring KPIs provides a framework to manage and understand how your teams are performing. It can highlight where the team is running well and where it can improve. By keeping the KPIs open and transparent across the team and across projects, they serve as visible goals and indicators for each team member. Over time, this allows the collective team to maintain strong standards and grow through continuous improvement.
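As a rough illustration of what tracking KPIs can look like in practice, here is a minimal Python sketch that computes two commonly used incident metrics: mean time to acknowledge and mean time to resolve. The article does not name specific KPIs, so the choice of these two metrics, the `Incident` record, and its field names are assumptions made for the example rather than a description of any particular team's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident opening and someone acknowledging it."""
    seconds = [(i.acknowledged - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident opening and its resolution."""
    seconds = [(i.resolved - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    # Made-up sample data, purely for demonstration.
    sample = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 11, 0)),
        Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15), datetime(2024, 1, 2, 14, 30)),
    ]
    print("MTTA:", mean_time_to_acknowledge(sample))
    print("MTTR:", mean_time_to_resolve(sample))
```

Publishing numbers like these on a shared dashboard is one way to keep the KPIs visible to every team member. Which metrics matter most will depend on the systems and commitments involved.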

Establishing a culture that emphasizes a shared focus on learning and continuous improvement allows the SRE team to tackle challenges cohesively and keep our customers happy.

Summary 

“You are the average of the 5 habits you repeat most.”  
- James Clear 

We have covered 3 “habits” that we find most useful in ramping up, improving, and empowering the training of SRE teams. These practices especially prepare teams for the scenarios that are hardest to train for. The true key to success here is building that culture of a shared mindset, a strong focus on continuous learning, and ingraining these SRE practices into the team’s DNA.

Corporations need to focus their efforts on creating business value, improving the bottom line, and fostering innovation. Any organization reliant on its systems should entrust their operation to teams capable of running them reliably. A strong SRE company will minimize downtime, continually improve system reliability, and help maintain a solid reputation. At AAXIS, our teams achieve this by embracing the habits in this article to help manage systems that process 7B+ in eCommerce transactions per year through our site reliability engineering services and managed services.

Want your systems in capable hands? Book an appointment today. 

 
