Senior AI Engineer
Role details
Job location
Tech stack
Job description
-
Develop the Machine Learning Platform management system.
-
Design and implement intuitive user interfaces and APls for seamless interaction with the platform.
-
Ensure robust access control and security measures for the Machine Learning Platform.
-
Regularly evaluate and enhance platform performance, scalability, and reliability. Integrate tools for data versioning, experiment tracking, and workflow orchestration.
-
Build the toolchains, service, pipeline for model development workflow, and model serving architecture.
-
Create automated pipelines for data preprocessing, feature engineering, and dataset versioning.
-
Develop Cl/CD pipelines for deploying models into production environments with minimal downtime.
-
Enable support for distributed model training and hyperparameter optimization.
-
Incorporate A/B testing frameworks for evaluating multiple model deployments.
-
Collaborate with data scientists and engineers to streamline the model development lifecycle.
-
Prioritize various metrics for model training and inferencing monitoring. Implement logging and monitoring tools to track model performance, resource utilization, and throughput.
-
Develop dashboards to visualize key metrics such as latency, accuracy, and drift detection in realtime.
-
Establish alerting mechanisms to detect and respond to anomalies or performance degradation.
-
Continuously refine metric prioritization based on stakeholder feedback and evolving business goals.
-
Develop and maintaining the high-performance LLM training GPU infrastructure and cluster.
-
Optimize GPU utilization for large-scale training workloads, ensuring minimal resource wastage.
-
Implement fault-tolerant and distributed training strategies for handling large language models (LLMs).
-
Evaluate and integrate emerging hardware technologies, such as TPUs, into the training infrastructure.
-
Regularly update cluster configurations to support new frameworks and model architectures.
-
Manage scheduling and resource allocation for multi-tenant GPU clusters.
-
Understand the auto scale for inference service and multi-models for dynamical loading.
-
Design systems that dynamically allocate resources based on real-time demand for inference services.
-
Develop mechanisms for loading and unloading models in memory to optimize latency and resource usage.
-
Implement strategies for caching frequently used models to improve inference performance.
-
Experiment with serverless architectures to further enhance scalability and cost efficiency.
-
Ensure compatibility with edge devices and deploy lightweight models for edge inference.
-
Support, troubleshoot, and resolve any issues during the training and inferencing.
-
Create detailed runbooks for common troubleshooting scenarios to reduce resolution times.
-
Perform root cause analysis for failures and implement long-term fixes to prevent recurrence.
-
Collaborate with DevOps and IT teams to ensure the stability of underlying infrastructure.
-
Develop self-healing systems that can automatically recover from common training or inference issues.
-
Provide technical support and guidance to data scientists and engineers working on the platform.
Requirements
Requires a Bachelor's degree in Communications Engineering, Artificial Intelligence, Software Engineering, a related field, or a foreign degree equivalent. Must have 2 years of experience in job offered or related occupation. Must have 2 years of experience in:
-
Designing, Implementing, or optimizing large-scale distributed training systems using technologies like Horovod, DeepSpeed, PyTorch Distributed, or Ray;
-
Tensor/model parallelism and pipeline parallelism;
-
Utilizing cloud-native or on-prem infrastructure (Kubernetes, Docker, Slurm) to support scalable, fault-tolerant, and resource-efficient AI workloads across multi-node GPU clusters;
-
Using Performance Profiling and Optimization to diagnose and improve end-to-end training performance by optimizing data pipelines (e.g., DALI, tf.data), minimizing communication overhead (e.g., NCCL, gRPC), and tuning hardware-specific kernels (e.g., CUDA, Triton);
-
Systems Programming and Automation in systems-level programming with Python, Bash, and C++ or Go;
-
Automating deployment and orchestration of AI workloads and monitoring using Prometheus, Grafana, Weights & Biases.
Benefits & conditions
$209,000.00
Maximum: $275,400.00
In addition to the base salary and/or OTE listed Zoom has a Total Direct Compensation philosophy that takes into consideration; base salary, bonus and equity value.
Note: Starting pay will be based on a number of factors and commensurate with qualifications & experience.
We also have a location based compensation structure; there may be a different range for candidates in this and other locations., As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways. Click Learn for more information.