Senior Machine Learning Engineer, GenAI Data
Job description
- High-Scale Data Orchestration: Architect and maintain automated pipelines for the ingestion, cleaning, and pre-processing of multi-modal datasets (video, 3D) spanning petabytes of data
- Synthetic Data Generation: Leverage image and video generation models to scale multi-modal synthetic datasets
- Research-to-Production Bridge: Partner with research teams to create training data for research experiments, and research and implement synthetic data creation pipelines
- Scalable Evaluation Frameworks: Build and own evaluation, automating both heuristic-based metrics and human-in-the-loop interfaces to evaluate and benchmark training datasets and in-house foundation models
- Model Deployment & API Architecture: Design and optimize high-throughput, low-latency inference APIs for internal and external consumer access
- Autonomous SOTA Tracking: Actively participate in literature reviews and paper reading groups to identify and implement the latest optimizations in generative modeling
- Resource Efficiency & Observability: Implement monitoring of pipeline health and optimize data loading to ensure GPUs are used efficiently
Requirements
Do you have experience in Systems engineering? Do you have a Bachelor's degree?

In this role, you will partner directly with our AI researchers to advance beyond experimental datasets and into the realm of dynamic, high-fidelity data synthesis and evaluation. You will bridge the gap between research prototypes that work locally and systems that scale to millions of users. You will design, implement, and scale robust, high-performance infrastructure to crawl, create, curate, store, and serve the massive datasets required for these models. We are seeking accomplished software engineers with a passion for data, experience building large distributed systems, and a commitment to writing high-quality, well-tested code to solve complex data challenges at scale. Your contributions will ensure that our foundation models receive the highest quality data, thereby supporting the next generation of creative AI.

- 8+ years of experience as a research-focused data systems engineer (preferably working with 3D and video foundation models)
- Expertise in building scalable ML data pipelines for both batch and real-time environments, including experience processing very large datasets (petabytes or more)
- Versatile: You're a generalist who is already comfortable with several languages and technologies and adaptable in any situation
- Team-Player & Technical Leader: You are a collaborative team member who actively mentors peers, drives technical excellence, and takes ownership of leading and delivering key features and projects across team boundaries
- Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management
- Experience with cloud data platforms and distributed processing technologies (e.g., Spark, Ray, Kubeflow, S3)
- Passion for the potential of generative AI, particularly in creative domains like 3D/4D content
- A Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a similar technical field
You are:
- MLOps Experience: Knowledge of experiment tracking (Weights & Biases, MLflow) and versioning for massive datasets.
- Custom Tooling Development: Experience building internal "human-in-the-loop" tools for data labeling specific to video or 3D.
- C++ Knowledge: Ability to optimize the performance of data loaders and comfort modifying engine code.
- Game development and digital content creation tools: Experience making Roblox games or using Blender, Unreal Engine, or Unity.
Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).