Work on reliability for our top enterprise solutions
Leverage your understanding of the systems we use to assist in resolving production issues in real time
Bring your knowledge and perspective on reliability to preemptive (design reviews) and corrective (postmortems) discussions
Lead service reliability reviews and audits and present findings to stakeholders
Find patterns and pain points that hinder availability, and produce large-scale solutions
Generate and prioritize tasks for infrastructure teams to aid in improving uptime and reducing blast radius
Analyze system performance and scalability requirements, identify bottlenecks, and then propose and implement solutions to optimize system capacity
You’re a senior R&D employee with 5+ years of experience managing large engineering projects
You’re experienced with monitoring, logging, and tracing mechanisms
You have an excellent understanding of how web applications work – from browsers and caches to the database, and back
You’re skilled in site reliability engineering principles, including scalability, availability, performance, and fault tolerance
You’re highly motivated by the idea of automating failure remediation processes
You’re great at jumping between multiple tasks, and you know how to analyze risk and prioritize accordingly
You enjoy critical thinking and problem solving and are seasoned in conflict resolution
3+ years of experience coding and/or running production systems in the cloud – a big advantage