Required AI-Ops & Cloud Platform Engineer
About the job:
As an AIOps Engineer, you will revolutionize enterprise operations by building intelligent systems that autonomously monitor, predict, and optimize business-critical infrastructure. You'll architect AI-driven operational platforms that transform reactive IT management into proactive, self-healing systems, enabling organizations to achieve unprecedented levels of reliability, efficiency, and performance at scale.
What You'll Do:
Monitor and optimize LLM application performance (latency, token usage, drift, failures)
Automate anomaly detection and remediation using Python and ML-based tooling
Design and manage cloud infrastructure (AWS, Azure, or GCP) using Terraform
Build dashboards, alerts, and predictive models to ensure system reliability
Ensure infrastructure is scalable, secure, and cost-effective
Requirements:
3+ years of experience in DevOps, SRE, or Cloud Engineering
Proficient in at least one major cloud provider: AWS, Azure, or GCP
Hands-on experience with Terraform and Python automation
Proven ability to design and implement cloud-native architectures
Experience building secure Landing Zones that follow network and security best practices
Experience with monitoring tools such as Prometheus, Datadog, or ELK
Comfortable with Kubernetes, Docker, and Serverless infrastructures
CI/CD experience using Azure DevOps, GitHub Actions, or GitLab
Bonus Points:
Experience with LLMOps and vector databases (e.g., Pinecone, Weaviate)
Background in anomaly detection or AI/ML-based alerting systems
Knowledge of FinOps practices and cloud cost optimization
This position is open to all candidates.