MLOps Engineer – AI Infra Group
Tel Aviv · Full-time
We are looking for someone passionate about building intuitive, production-grade AI infrastructure. This group builds scalable, high-performance AI systems for internal users and external customers, designed to run seamlessly across cloud and on-premise environments using the latest hardware advancements.
Responsibilities
Design, build, and maintain scalable Kubernetes-based infrastructure for ML workloads across on-premise and cloud environments
Architect hybrid infrastructure solutions enabling seamless model flow from on-premise training environments to cloud-based inference deployments
Implement model registry and artifact management strategies that support cross-environment synchronization, versioning, and governance
Design secure, efficient data and model transfer mechanisms between on-premise and cloud (networking, storage replication, caching strategies)
Implement and manage GPU scheduling, resource allocation, and cluster autoscaling for heterogeneous compute environments
Build and maintain CI/CD pipelines for ML systems, including model versioning, testing, and promotion across environments
Develop observability solutions (logging, monitoring, alerting) for ML infrastructure across hybrid deployments
Collaborate with ML Engineers to define infrastructure requirements and SLAs for training and serving workloads
Requirements:
5+ years of experience in infrastructure engineering, platform engineering, or DevOps, preferably supporting ML or data-intensive workloads
Experience designing and operating hybrid cloud architectures (on-premise + cloud) with focus on data/model synchronization
Familiarity with model registry solutions (MLflow or cloud-native registries) and artifact management at scale
Experience with GPU compute infrastructure, device plugins, and resource scheduling (e.g., NVIDIA GPU Operator)
Proficiency in IaC tools (Terraform) and GitOps practices (ArgoCD)
Experience with monitoring and observability stacks (Prometheus, Grafana, ELK)
Familiarity with ML workflows to understand workload characteristics and requirements
This position is open to all candidates.