We’re an award-winning global outsourcer providing contact center and back office services on behalf of our global clients. Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!
Role objective
The Head of Site Reliability Engineering is a hybrid technical‑leadership role. You will:
• Own reliability of production services running on AWS while steering the roadmap for platform resilience and building out the SRE team.
• Lead and grow a remote team of SREs—coaching, hiring, performance‑managing, and fostering a blameless culture.
• Set and enforce Service Level Objectives (SLOs), error budgets, and incident response processes.
• Drive automation via Infrastructure‑as‑Code (Pulumi / TypeScript), CI/CD, and observability pipelines.
• Represent the SRE discipline to product, engineering, and senior leadership across our global business.
• Take a hands-on part in monitoring and incident response, which will be critical as the team grows.
This role offers the opportunity to build reliability engineering from the ground up for a mission-critical IoT platform.
Key Responsibilities
Leadership & People Management
• Build and manage an SRE team, initially 3-6 engineers, including goal setting, career development, regular 1:1s, and annual performance reviews.
• Ensure operational system knowledge is documented and that the team stays current on operating and troubleshooting procedures.
• Recruit, onboard, and mentor new engineers; scale the team to meet business growth.
• Maintain an inclusive, psychologically‑safe culture centred on learning and continuous improvement.
• Own, and participate in, the on‑call roster for the team, ensuring equitable rotations and sustainable workloads.
Service Level Management & Reliability
• Define, monitor, and enforce SLOs and error budgets across all production systems.
• Continuously analyse error‑budget burn to halt risky deployments and guide capacity decisions.
• Champion a data‑driven reliability mindset throughout engineering and product teams.
Infrastructure Automation & Management
• Architect and implement Infrastructure‑as‑Code in Pulumi/TypeScript for our AWS‑hosted infrastructure (EKS, MSK, SingleStore, MongoDB, S3, etc.).
• Lead large‑scale migration or modernisation projects (e.g., Kubernetes upgrades, multi‑AZ resilience).
• Eliminate toil: any manual task that consumes more than 2 engineer‑days per quarter, or that is frequently repeated, becomes an automation candidate.
Incident Response & Post‑Mortem Leadership
• Participate in the on-call monitoring and response roster.
• Serve as an escalation point and incident commander.
• Ensure post‑mortems are published within 48 hours with actionable “never again” tasks tracked to closure.
• Improve runbooks and game‑day exercises; train engineers on incident command principles.
Security & Compliance
• Enforce least‑privilege IAM policies and champion DevSecOps practices.
• Contribute to SOC 2 & ISO 27001 evidence collection and continuous control monitoring.
• Oversee security patch pipelines, vulnerability management, and secrets hygiene.
Operational Excellence & Continuous Improvement
• Own reliability KPIs (MTTR, change failure rate, mean time between failures).
• Lead quarterly reliability reviews and drive the reliability roadmap.
• Partner with Product on capacity forecasts and cost‑optimisation initiatives.
Join the A-Team and experience the A-Life!