Published on 2023-02-10
Sky - Support & Service Management
Quickly supporting a major client
Sky engaged Nexteam to provide incident management support across their website. The work focused on handling incidents raised by their monitoring platforms and other teams, and on proactively improving their monitoring systems and wider support processes. We were also asked to review their existing processes and make recommendations, with the aim of transitioning to an SRE model.
Incident Handling
We are responsible for diagnosing alerts generated by the monitoring system (Dynatrace), which are created and tracked in ServiceNow. The team finds the root cause of each alert by examining the relevant systems and logs.
Identifying gaps in monitoring and addressing them
As part of our engagement we also performed a gap analysis of their existing monitoring and alerting to ensure it covered all key journeys. The team added monitoring in Dynatrace APM and Dynatrace Synthetics to close the gaps, and optimised the existing alerting to tune out false-positive alerts.
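One common way to tune out false positives is to alert only once a check has failed several times in a row, so that a single transient failure does not page anyone. The sketch below illustrates that idea in plain Python. It is not Dynatrace configuration, and the function name `alert_indices` and the threshold of three failures are assumptions made for illustration.

```python
from typing import Iterable, List


def alert_indices(check_results: Iterable[bool], consecutive_failures: int = 3) -> List[int]:
    """Return the positions at which an alert would fire.

    check_results: one entry per synthetic check run, True = passed, False = failed.
    consecutive_failures: hypothetical threshold; an alert fires once, at the
    moment the current failure streak first reaches this length.
    """
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")

    alerts: List[int] = []
    streak = 0
    for index, passed in enumerate(check_results):
        if passed:
            streak = 0  # any success resets the streak
            continue
        streak += 1
        if streak == consecutive_failures:
            alerts.append(index)  # fire once per streak, not on every failure
    return alerts


if __name__ == "__main__":
    runs = [True, False, True, False, False, False, False, True]
    print(alert_indices(runs))
```

A higher threshold suppresses more noise but delays detection of a real outage, so the value is a trade-off to agree per journey.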
Setting up KPIs (Core Web Vitals, SLOs and SLIs)
We also applied industry-standard measurements to all of the products. This allowed us to bring consistency across the platform and identify areas to improve. We started by monitoring Core Web Vitals and then set up Service Level Indicators (SLIs) and Service Level Objectives (SLOs), in line with SRE best practice.
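To make the relationship between an SLI and an SLO concrete, here is a minimal sketch based on one Core Web Vital, Largest Contentful Paint (LCP). The SLI is the fraction of page loads whose LCP falls within Google's published "good" threshold of 2.5 seconds, and the SLO is a target for that fraction. The 75% target, the sample values and the names `lcp_slo_report` and `SloReport` are assumptions for illustration, not the figures used on this engagement.

```python
from dataclasses import dataclass
from typing import Sequence

LCP_GOOD_SECONDS = 2.5  # Google's published "good" threshold for LCP


@dataclass
class SloReport:
    sli: float                     # fraction of page loads with a good LCP
    target: float                  # the SLO, as a fraction
    met: bool                      # True when sli >= target
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, negative = overspent


def lcp_slo_report(lcp_samples: Sequence[float], target: float = 0.75) -> SloReport:
    """Compute an LCP-based SLI and compare it with a hypothetical SLO target."""
    if not lcp_samples:
        raise ValueError("at least one LCP sample is required")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    total = len(lcp_samples)
    good = sum(1 for seconds in lcp_samples if seconds <= LCP_GOOD_SECONDS)
    sli = good / total

    # The error budget is the number of slow page loads the SLO tolerates.
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    remaining = 1 - actual_bad / allowed_bad

    return SloReport(sli=sli, target=target, met=sli >= target,
                     error_budget_remaining=remaining)


if __name__ == "__main__":
    samples = [1.0, 2.0, 2.5, 3.0, 1.2, 1.8, 2.4, 2.2]  # made-up LCP values in seconds
    print(lcp_slo_report(samples))
```

The same shape works for the other Core Web Vitals by swapping the threshold, which is one way a single definition can give consistent measurements across products.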
Grafana Dashboards
We also built multiple dashboards in Grafana, drawing on multiple data sources, for the engineering, product and marketing teams.
Streamlining the On-call Support Process
We implemented a new out-of-hours on-call process, which includes rota management and on-call payment improvements for the engineering team. The new process reduces how often engineers have to be on call, and streamlines support so that fewer people can cover the platform without any of them being on call frequently.
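As an illustration of the rota-management side, the sketch below builds a simple weekly round-robin rota, which spreads shifts evenly so that each engineer is on call once every N weeks for a pool of N people. The weekly shift length, the engineer names and the function `build_rota` are assumptions for illustration. The actual rota rules used here are not described in this write-up.

```python
from datetime import date, timedelta
from typing import List, Tuple


def build_rota(engineers: List[str], start: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one engineer per week, cycling through the pool in order.

    Returns (week_start_date, engineer) pairs.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    if weeks < 0:
        raise ValueError("weeks must not be negative")

    return [
        (start + timedelta(weeks=week), engineers[week % len(engineers)])
        for week in range(weeks)
    ]


if __name__ == "__main__":
    # Hypothetical pool of three engineers, starting on a Monday.
    for week_start, engineer in build_rota(["Asha", "Ben", "Chloe"], date(2023, 1, 2), 6):
        print(week_start.isoformat(), engineer)
```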