You’ll want to know about the department that our role is in…
Our Service Reliability Team bridges the gap between operations and software development. Its responsibilities are wide-ranging: managing and monitoring system availability, performance, and efficiency; responding to incidents; and ensuring that software is deployed properly and delivers a reliable service to end users, so that when a finished product is ready for production there are no surprises.
The department operates between the hours of 08:00 and 18:00, Monday to Friday. The breakdown of your normal hours of work will be by agreement with your Line Manager.
The role:
An SRE Monitoring Engineer is responsible for ensuring that all systems, applications, and networks function efficiently by continuously monitoring their performance, availability, and security. They set up and maintain the monitoring tools, insights, and alerts that confirm software applications and systems are running properly. The focus is on system and application monitoring (logs, metrics, and events), covering both existing and open-source monitoring tooling.
Tasks & responsibilities include:
- Implement and monitor system checks for early detection of potential problems
- Develop visualizations in Grafana and Azure Application Insights for end-user experience, application, infrastructure, and security
- Apply strong technical skills, good business knowledge, and investigative techniques to identify and resolve issues promptly and efficiently.
- Work on initiatives and continuous improvement processes around proactive application health, monitoring, reporting, and technical support.
- Act proactively to help the organization uncover performance bottlenecks across the system.
The successful candidate will have:
- Experience working with various monitoring tools (Grafana, Application Insights, etc.)
- Hands-on experience in designing and building dashboards
- Hands-on experience in setting up and supporting incident management workflows
- Automation skills and the ability to automate a full DevOps/GitOps pipeline. Must understand infrastructure and configuration, CI/CD pipelines, and application performance monitoring.
- Good technical knowledge of implementing, troubleshooting, and performance tuning hardware, operating systems, and system services.
Additional ‘desirable’ but not essential skills:
- Terraform
- Kubernetes
- Progressive Delivery tooling (e.g. Argo, Flux)