Staff Machine Learning (ML) Operations Engineer

CoAdvantage Corporation

Bradenton, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Compensation

$ 90K

Job location

Remote

Bradenton, United States of America

Tech stack

Artificial Intelligence

Airflow

Azure

Continuous Integration

Cursor

Graph Database

Python

Machine Learning

Neo4j

Search Technologies

Pulumi

Large Language Models

Snowflake

Multi-Agent Systems

AI Platforms

Kubernetes

Bicep

Machine Learning Operations

Terraform

Databricks

Job description

CoAdvantage is an HCM company providing payroll, ASO, and PEO services to 16,000 clients. We deliver payroll, benefits, HR compliance, time/PTO, and risk management solutions, and we are building a governed AI platform that will become a primary source of differentiation versus AI-native competitors. The AI program runs three substrates (engineering knowledge graph, analytics feature store, customer knowledge store) and a multi-agent harness. The Principal AI Architect designs the platform. The Staff MLOps Engineer makes it operationally real, repeatable, and safe to deploy.

What You'll Own

You are the operational backbone of the AI platform.

Build and own the deployment pipelines for models, agents, prompts, evals, feature definitions, and KG/vector indices. Everything that touches production goes through a pipeline you wrote.
Operate the feature store (offline + online), the knowledge graph infra (ADO-KG and Customer Graph), and the vector indexing layer: ingestion, materialization, freshness, drift, lineage.
Stand up the eval harness as CI: every agent, prompt, and model change runs its eval suite on PR; a regression that breaks an eval blocks merge.
Wire the observability plane: traces for every agent step, prompts and tool calls captured with PII redaction, cost and latency SLOs per surface, drift monitors, on-call runbooks.
Operate the HITL queue infrastructure: routing, SLAs, audit, and the feedback loop back into evals and the KG.
Own incident response for AI surfaces: cross-tenant leakage, prompt injection, agent loop runaway, capability drift, KG poisoning. You write the runbooks and you carry the pager.
Manage cost, capacity, and model routing across LLM tiers (frontier vs. cheap-and-fast). Agents should land on the right tier automatically, with budgets and circuit breakers.
Own secrets, identity, and AuthZ enforcement at the infra layer. Tenant scoping must be enforced independently of the LLM, every time.
You will write a lot of code. You will not be a "platform PM".

How We Work

AI-first coding. Claude Code, Copilot, and successor tools are the default development surface. We expect you to author pipelines, IaC, runbooks, eval harnesses, and operators with agentic coding tools in the loop.
Build your own agentic workflows. Repetitive ops work (incident triage, drift investigation, eval failure root-cause, capacity forecasting) gets automated as an agentic workflow you author and own.
Every workflow is testable. Every pipeline, every agentic ops workflow, every runbook has tests: unit, integration, eval-on-PR, replay against a golden incident set.
Ambiguity is the job. Specs from the Architect will be 80% complete on purpose. You fill the last 20% by shipping, instrumenting, and reporting back what the operational reality is.
You estimate. Every workstream returns with a timeline, a confidence interval, an explicit list of dependencies, and the smallest version you could ship in a week.
You suggest the tools. We expect specific opinions on orchestration (Dagster vs. Airflow vs. Prefect), serving, tracing, feature store, registry, and vector indexer, and the willingness to defend them.

First 90 Days

Ship the deployment pipeline for one agent end-to-end: code → eval-on-PR → staged rollout → traced production → drift monitor → rollback. It should be in use by the first production agent.
Stand up eval-as-CI: PRs to any agent, prompt, or model run their suite automatically; failures block merge; results posted to the PR.
Bring up an online feature store for the MLR Pricing / Premium Tiering model with a freshness watchdog and a fail-closed posture on stale features.
Define and implement the cross-tenant leakage probe as a continuous CI check against the Customer Knowledge Store retrieval layer.
Publish the incident runbook set for the four catastrophic-tier risks (cross-tenant leakage, prompt injection, KG poisoning, agent loop runaway) and rehearse one of them with the team.

Requirements

5+ years of production software / platform engineering; at least 3 years operating ML or LLM systems at meaningful scale.
Strong Python + at least one IaC stack (Terraform, Pulumi, Bicep). Comfortable in containers, Kubernetes, and a major cloud (Azure preferred).
Working fluency with agentic coding tools (Claude Code, Cursor, Copilot, or equivalent) as a daily driver, and the code commits, IaC, or runbooks to show for it.
Production experience operating at least one of: feature store, vector index, knowledge graph, LLM serving stack, with on-call responsibility.
Built and operated CI/CD for ML or LLM systems, including model/agent registries, eval gates, staged rollouts, and rollback.
Hands-on with observability for AI: traces, prompt/output capture, PII redaction, cost telemetry, drift detection.
Comfort designing and shipping testable agentic workflows: composing tools, writing the eval, defining the success criterion, gating on it.
Track record of carrying the pager and writing the postmortems.

Preferred Experience

Experience under HIPAA, SOC 2, or state-level payroll/tax compliance regimes.
PEO / HCM / payroll / benefits / insurance domain familiarity.
Production use of Databricks, Azure AI Search, Snowflake, Langfuse / Phoenix, Feast or Databricks Feature Store, Neo4j or TigerGraph.
Experience automating ops work with agentic workflows you authored (not just consumed).
OSS contributions to MLOps / LLMOps tooling.

Benefits & conditions

Bradenton, FL (Remote). $80,000 - $90,000 a year. Full-time.

Health insurance
Dental insurance
Vision insurance
Life insurance
401(k) matching
Paid time off (PTO)
Paid holidays
Remote work
Bonus
