What you'll do:
Automate Deployment and Operation
Oversee deployment of Kafka and RabbitMQ clusters, including Confluent Cloud and Confluent for Kubernetes (CFK). Build automation pipelines to ensure repeatability and resiliency across environments.
Monitor and Support Production Systems
Own production stability of global Kafka clusters. Handle on-call rotations, incident management, troubleshooting, and scaling challenges.
Improve Infrastructure Observability
Build and maintain observability systems: dashboards, alerting pipelines, and metrics collection (Prometheus, Grafana, etc.).
Optimize System Performance
Collaborate with peers on benchmarking and optimization initiatives. Work on tuning Kafka brokers, cluster configurations, and runtime parameters.
Provide Developer Support and Training (Infra-focused)
Help developers configure topics, quotas, and consumers appropriately. Train service owners to interpret monitoring data and avoid pitfalls.
Develop and Maintain Infrastructure
Contribute to building infrastructure tools and scripts (IaC, Helm charts, etc.) that make provisioning and managing clusters reliable and efficient.
Secure Infrastructure Access
Configure and maintain secure access patterns across streaming infrastructure, ensuring proper authentication and role-based access controls are enforced for both developers and services.
What you'll bring:
8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
Deep hands-on Kafka experience, including deploying, maintaining, scaling, and monitoring clusters.
Experience with RabbitMQ.
Extensive experience with Docker, Kubernetes, Helm, and GitOps-style deployments.
Infrastructure as Code experience (Terraform, Pulumi, etc.).
Strong skills in scripting and automation (Python, Bash, etc.).
Familiarity with Confluent Cloud, Confluent for Kubernetes, and similar tools.
Solid understanding of authentication and authorization mechanisms in distributed systems.
Production-support mindset, with a proven track record of troubleshooting and incident resolution.
Strong collaboration and communication skills, especially with development teams that depend on platform support.
Experience with Istio Service Mesh (bonus).
Experience with GovCloud (bonus).