ML Engineer – AI Infra Group
Tel Aviv | Full-time
We are looking for someone who is passionate about building intuitive, production-grade AI infrastructure. The AI Infra group builds scalable, high-performance AI systems for internal users and external customers, designed to run seamlessly across cloud and on-premise environments on the latest hardware.
Responsibilities
Design and optimize LLM serving infrastructure using inference engines (vLLM, TensorRT-LLM, Triton Inference Server)
Implement and tune distributed inference strategies including tensor parallelism, pipeline parallelism, and multi-node serving
Develop and apply model compression techniques to optimize cost, latency, and memory footprint while maintaining model quality
Build self-service fine-tuning platforms that enable data scientists to run experiments (LoRA, QLoRA, full fine-tuning) in a standardized, reproducible, and governed manner
Optimize inference performance through batching strategies, KV-cache tuning, and speculative decoding
Develop reusable APIs, abstractions, and platform services for model deployment, scaling, and lifecycle management
Collaborate with AI researchers and product teams to productionize models and meet latency/throughput requirements
Evaluate and benchmark new model architectures, compression methods, and serving frameworks
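To give a flavor of the optimization work above, here is a back-of-the-envelope KV-cache sizing sketch; the function name and the model figures are illustrative, not part of the role description:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size for a decoder-only transformer.

    Counts the K and V tensors (hence the factor of 2) cached per layer;
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)

# A 7B-class model (32 layers, 32 KV heads, head_dim 128) at a 4096-token
# context needs about 2 GiB of KV cache per sequence in fp16:
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)  # → 2.0
```

Numbers like these are what drive the batching, KV-cache tuning, and quantization decisions listed above: cutting KV heads via grouped-query attention or halving precision directly multiplies how many concurrent sequences fit on a GPU.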
Requirements:
5+ years of experience in software engineering or ML engineering, with a significant focus on ML systems or backend infrastructure
Strong proficiency in Python and deep learning frameworks (PyTorch)
Hands-on experience with LLM inference engines (vLLM, TensorRT-LLM, Triton Inference Server)
Deep understanding of transformer architectures and LLM-specific optimizations (attention mechanisms, KV-cache, quantization techniques like GPTQ, AWQ, GGUF)
Experience with distributed training/fine-tuning frameworks (Ray, DeepSpeed, FSDP)
Ability to build developer-facing tools and platforms with clear APIs and documentation
Understanding of GPU performance profiling and optimization
Familiarity with LLM evaluation methodologies and benchmarking
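As an illustration of why LoRA-style fine-tuning (mentioned above) matters for a self-service platform, a quick parameter count; the function name and dimensions are illustrative:

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters a LoRA adapter adds to one weight matrix:
    a (rank x d_in) down-projection A plus a (d_out x rank) up-projection B.
    """
    return rank * (d_in + d_out)

full = 4096 * 4096                            # dense 4096x4096 projection
lora = lora_trainable_params(4096, 4096, rank=8)
print(lora, f"{100 * lora / full:.2f}%")      # 65536 params, ~0.39% of the full matrix
```

Training well under 1% of a layer's parameters is what lets many data scientists share GPU capacity for experiments in a standardized, reproducible way.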
This position is open to all candidates.