Responsibilities
Design, build, and maintain scalable cloud infrastructure (AWS, GCP).
Implement Infrastructure as Code (IaC) using tools such as Terraform and Helm.
Manage and optimize Kubernetes clusters and containerized workloads.
Ensure system reliability, scalability, and security across production and non-production environments.
Improve monitoring, logging, and alerting systems for proactive issue detection and resolution.
Develop and optimize CI/CD pipelines to improve software delivery speed and reliability.
Collaborate closely with developers, operations, and product teams.
Participate in on-call rotations and incident response, driving postmortems and improvements.
Mentor engineers and contribute to building a strong DevOps culture.
Continuously evaluate and adopt new tools, technologies, and processes to improve infrastructure and operations.
Requirements
5+ years of experience as a DevOps Engineer, Site Reliability Engineer (SRE), or in a similar role.
Strong experience with cloud platforms (AWS, GCP).
Expertise with Kubernetes, Docker, and container orchestration.
Hands-on experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD, etc.).
Proficiency in Infrastructure as Code (Terraform, Helm, CloudFormation, Ansible).
Strong knowledge of Linux systems, networking, and security best practices.
Experience with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
Proficiency in scripting/programming (Python, Bash).
Familiarity with databases, caching, and messaging systems (MySQL, PostgreSQL, Redis, Kafka, etc.).
Excellent problem-solving and troubleshooting skills.
Strong communication skills and ability to work collaboratively across teams.