MLOps & AI Infrastructure Engineer

dLocal
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English, Spanish, Portuguese

Tech stack

API
Artificial Intelligence
Big Data
Continuous Integration
Distributed Systems
Machine Learning
Azure
AI Infrastructure
Delivery Pipeline
Spark
Backend
Containerization
AI Platforms
Low Latency
Apache Flink
Kafka
Data Management
Machine Learning Operations
Video Streaming
Data Pipelines
Databricks

Job description

... enterprise Feature Store, combining:
... into online stores.
Databricks / Spark pipelines for offline feature computation, backfills and training datasets.
Point-in-time correctness for offline training and backtesting.
Low-latency, high-throughput online feature serving with clear SLAs, TTL semantics and multi-tenant safety.
Help data scientists and domain teams onboard new features safely and consistently across Flink and Databricks.
Offline-online parity checks, data quality, drift and freshness monitoring for critical feature groups.
Unified feature retrieval APIs (online/offline/batch) and SDK/CLI usage from models and services.

MLOps platform implementation (training, serving, observability)
Implement and improve training and evaluation pipelines:
Promotion flows from dev -> staging -> production, following platform standards.
Work on online and batch inference paths:
Model packaging and deployment.
Integrate and extend agents and AI services (built by the AI Team and MLOps) to automate key parts of the Feature Store and MLOps workflows (health checks, drift and quality analysis, documentation/specs, incident triage, FinOps suggestions, etc.).
Design these automations with clear guardrails: observable, auditable and easy to roll back, always keeping humans in control of production decisions.
Access control, secrets management and PII handling in features and models.
... Data Science squads and the AI Team to understand requirements and unblock use cases.
Contribute to internal documentation, RFCs, examples and onboarding guides so other engineers and data scientists can adopt the platform more easily.

Requirements

Solid experience as a Senior Engineer working on: MLOps, data platforms, or large-scale backend / distributed systems.
Hands-on experience with big data / streaming technologies (e.g. Spark, Flink, Kafka, Kinesis, or similar).
Proven track record building production-grade ML pipelines:
Experiment tracking and reproducible training flows.
CI/CD for models and data pipelines.
Online and batch inference at scale.
Familiarity with cloud-based ML platforms and containerized deployments (e.g. ...
Data and model drift, freshness and quality checks.
Comfortable communicating with Data Scientists, ML Engineers and Infra/SRE, translating requirements into concrete technical solutions.
Log/metric/incident analysis or documentation generation.

Flexibility: we have flexible schedules and we are driven by performance.
Language classes: we provide free English, Spanish, or Portuguese classes.
Social budget: you'll get a monthly budget to chill out with your team (in person or remotely) and deepen your connections.

Also, you can check out our webpage, LinkedIn and YouTube for more about dLocal.
