MLOps & AI Infrastructure Engineer
Job description
... enterprise Feature Store, combining:
  ... into online stores.
  Databricks / Spark pipelines for offline feature computation, backfills and training datasets.
  Point-in-time correctness for offline training and backtesting.
  Low-latency, high-throughput online feature serving with clear SLAs, TTL semantics and multi-tenant safety.
Help data scientists and domain teams onboard new features safely and consistently across Flink and Databricks.
Offline-online parity checks, data quality, drift and freshness monitoring for critical feature groups.
Unified feature retrieval APIs (online/offline/batch) and SDK/CLI usage from models and services.

MLOps platform implementation (training, serving, observability)
Implement and improve training and evaluation pipelines:
  Promotion flows from dev → staging → production, following platform standards.
Work on online and batch inference paths:
  Model packaging and deployment.
Integrate and extend agents and AI services (built by the AI Team and MLOps) to automate key parts of the Feature Store and MLOps workflows (health checks, drift and quality analysis, documentation/specs, incident triage, FinOps suggestions, etc.).
Design these automations with clear guardrails: observable, auditable and easy to roll back, always keeping humans in control of production decisions.
Access control, secrets management and PII handling in features and models.
... Data Science squads and the AI Team to understand requirements and unblock use cases.
Contribute to internal documentation, RFCs, examples and onboarding guides so other engineers and data scientists can adopt the platform more easily.

Requirements
Solid experience as a Senior Engineer working on MLOps, data platforms, or large-scale backend / distributed systems.
Hands-on experience with big data / streaming technologies (e.g. Spark, Flink, Kafka, Kinesis, or similar).
Proven track record building production-grade ML pipelines:
  Experiment tracking and reproducible training flows.
  CI/CD for models and data pipelines.
  Online and batch inference at scale.
Familiarity with cloud-based ML platforms and containerized deployments (e.g. ...
Data and model drift, freshness and quality checks.
Comfortable communicating with Data Scientists, ML Engineers and Infra/SRE, translating requirements into concrete technical solutions.
Log/metric/incident analysis or documentation generation.

Flexibility: we have flexible schedules and we are driven by performance.
Language classes: we provide free English, Spanish, or Portuguese classes.
Social budget: you'll get a monthly budget to chill out with your team (in person or remotely) and deepen your connections.

Also, you can check out our webpage, LinkedIn and YouTube for more about dLocal.