This position is ideally suited for experienced software engineers who are passionate about building high-quality systems and are interested in expanding their expertise in observability, distributed systems, and developer experience. You will help design, build and maintain systems that empower engineers across us to monitor, understand, and troubleshoot their services more effectively.
Our observability team is responsible for delivering scalable and user-friendly solutions to over 150 engineers working across more than 20 teams. Were focused on enabling rapid incident detection and resolution, improving our reliability posture, and supporting a culture of continuous improvement.
What you'll be doing:
Design, build, and maintain observability tools and infrastructure that help our engineers provide actionable insights into the performance and reliability of our systems.
Collaborate with other engineers and teams to enhance the developer experience around monitoring, logging, alerting, and tracing.
Develop and evolve our internal tooling to simplify the process of instrumenting and observing services.
Partner with engineering teams to improve incident response and recovery workflows, and ensure systems meet internal SLOs/SLAs and reliability targets.
Support the migration from our legacy ELK stack to a modern observability platform using Prometheus, Mimir, Grafana, Honeycomb, Loki, Quickwit, and OpenTelemetry.
Contribute to knowledge sharing and the ongoing development of best practices in observability across the organisation.
What you'll need:
4+ years of professional experience as a software engineer, with a strong foundation in building and maintaining production systems.
Proficiency in one or more modern programming languages such as Python, Java, JavaScript, or Ruby.
Familiarity with Kubernetes, AWS, and infrastructure-as-code tools such as Terraform.
Experience working with observability tools and platforms (e.g. Prometheus, Grafana, ELK, Honeycomb, Loki, or similar).
A strong interest in developer experience and platform tooling, with the ability to empathise with engineering teams as internal customers.
Excellent communication skills, with the ability to collaborate effectively across teams and explain complex technical concepts clearly.
A proactive mindset focused on long-term impact, sustainable engineering practices, and continuous improvement.
Preferred Qualifications:
Experience with OpenTelemetry or distributed tracing systems.
Understanding of observability-driven development and service reliability principles (e.g. SRE, MTTR, SLIs/SLOs).
Experience optimising observability systems for cost and performance at scale.
Knowledge of microservices architectures and how to monitor and debug distributed systems.
Contributions to open-source projects in the observability or monitoring space