Staff Machine Learning (ML) Operations Engineer

CoAdvantage Corporation
Bradenton, United States of America
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
$90K

Job location

Remote
Bradenton, United States of America

Tech stack

Artificial Intelligence
Airflow
Azure
Continuous Integration
Cursor
Graph Database
Python
Machine Learning
Neo4j
Search Technologies
Pulumi
Large Language Models
Snowflake
Multi-Agent Systems
AI Platforms
Kubernetes
Bicep
Machine Learning Operations
Terraform
Databricks

Job description

CoAdvantage is an HCM company providing payroll, ASO, and PEO services to 16,000 clients. We deliver payroll, benefits, HR compliance, time/PTO, and risk management solutions, and we are building a governed AI platform that will become a primary source of differentiation versus AI-native competitors. The AI program runs three substrates (engineering knowledge graph, analytics feature store, customer knowledge store) and a multi-agent harness. The Principal AI Architect designs the platform. The Staff MLOps Engineer makes it operationally real, repeatable, and safe to deploy.

What You'll Own

You are the operational backbone of the AI platform.

  • Build and own the deployment pipelines for models, agents, prompts, evals, feature definitions, and KG/vector indices. Everything that touches production goes through a pipeline you wrote.
  • Operate the feature store (offline + online), the knowledge graph infra (ADO-KG and Customer Graph), and the vector indexing layer: ingestion, materialization, freshness, drift, lineage.
  • Stand up the eval harness as CI: every agent, prompt, and model change runs its eval suite on PR; a regression that breaks an eval blocks merge.
  • Wire the observability plane: traces for every agent step, prompts and tool calls captured with PII redaction, cost and latency SLOs per surface, drift monitors, on-call runbooks.
  • Operate the HITL queue infrastructure: routing, SLAs, audit, and the feedback loop back into evals and the KG.
  • Own incident response for AI surfaces: cross-tenant leakage, prompt injection, agent loop runaway, capability drift, KG poisoning. You write the runbooks and you carry the pager.
  • Manage cost, capacity, and model routing across LLM tiers (frontier vs. cheap-and-fast): agents should land on the right tier automatically, with budgets and circuit breakers.
  • Own secrets, identity, and AuthZ enforcement at the infra layer: tenant scoping must be enforced independently of the LLM, every time.
  • You will write a lot of code. You will not be a "platform PM".
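
To make the tier-routing bullet concrete, here is a minimal sketch of a router that respects a budget and a circuit breaker. It is illustrative only: the `TierRouter` class, its fields, and the per-call costs are hypothetical and do not describe the actual platform.

```python
from dataclasses import dataclass


@dataclass
class TierRouter:
    """Hypothetical router: picks a model tier under a budget with a circuit breaker."""

    budget_usd: float          # total spend allowed for this surface (assumed unit)
    frontier_cost_usd: float   # assumed flat per-call cost of the frontier tier
    cheap_cost_usd: float      # assumed flat per-call cost of the cheap tier
    max_failures: int = 3      # consecutive frontier failures before the breaker opens
    spent_usd: float = 0.0
    frontier_failures: int = 0

    def choose(self, needs_frontier: bool) -> str:
        """Return "frontier", "cheap", or "blocked" for the next call."""
        breaker_open = self.frontier_failures >= self.max_failures
        remaining = self.budget_usd - self.spent_usd
        if needs_frontier and not breaker_open and remaining >= self.frontier_cost_usd:
            return "frontier"
        if remaining >= self.cheap_cost_usd:
            return "cheap"
        # Budget exhausted: refuse the call rather than overspend.
        return "blocked"

    def record(self, tier: str, ok: bool) -> None:
        """Account for a finished call and update the breaker state."""
        if tier == "frontier":
            self.spent_usd += self.frontier_cost_usd
            self.frontier_failures = 0 if ok else self.frontier_failures + 1
        elif tier == "cheap":
            self.spent_usd += self.cheap_cost_usd
```

In this sketch a request that asks for the frontier tier silently degrades to the cheap tier when the breaker is open or the remaining budget is too small, and is blocked outright once even the cheap tier no longer fits the budget.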

How We Work

  • AI-first coding. Claude Code, Copilot, and successor tools are the default development surface. We expect you to author pipelines, IaC, runbooks, eval harnesses, and operators with agentic coding tools in the loop.
  • Build your own agentic workflows. Repetitive ops work (incident triage, drift investigation, eval failure root-cause, capacity forecasting) gets automated as an agentic workflow you author and own.
  • Every workflow is testable. Every pipeline, every agentic ops workflow, every runbook has tests: unit, integration, eval-on-PR, replay against a golden incident set.
  • Ambiguity is the job. Specs from the Architect will be 80% complete on purpose. You fill the last 20% by shipping, instrumenting, and reporting back what the operational reality is.
  • You estimate. Every workstream returns with a timeline, a confidence interval, an explicit list of dependencies, and the smallest version you could ship in a week.
  • You suggest the tools. We expect specific opinions on orchestration (Dagster vs. Airflow vs. Prefect), serving, tracing, feature store, registry, and vector indexer, and the willingness to defend them.

First 90 Days

  • Ship the deployment pipeline for one agent end-to-end: code → eval-on-PR → staged rollout → traced production → drift monitor → rollback. Used by the first production agent.
  • Stand up eval-as-CI: PRs to any agent, prompt, or model run their suite automatically; failures block merge; results posted to the PR.
  • Bring up an online feature store for the MLR Pricing / Premium Tiering model with a freshness watchdog and a fail-closed posture on stale features.
  • Define and implement the cross-tenant leakage probe as a continuous CI check against the Customer Knowledge Store retrieval layer.
  • Publish the incident runbook set for the four catastrophic-tier risks (cross-tenant leakage, prompt injection, KG poisoning, agent loop runaway) and rehearse one of them with the team.
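
As an illustration of the "fail-closed posture on stale features" mentioned above, the following minimal sketch refuses to return a feature value that is older than its freshness limit. The function name `read_fresh`, the `StaleFeatureError` exception, and the idea of passing the value and its timestamp directly are assumptions for the example, not a description of the real feature store.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class StaleFeatureError(RuntimeError):
    """Raised when a feature is older than its freshness limit (hypothetical)."""


def read_fresh(
    value: float,
    updated_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> float:
    """Return the feature value only if it is fresh; otherwise fail closed."""
    now = now or datetime.now(timezone.utc)
    age = now - updated_at
    if age > max_age:
        # Fail closed: refuse to serve rather than score on stale data.
        raise StaleFeatureError(f"feature is {age} old, limit is {max_age}")
    return value
```

A caller would catch `StaleFeatureError` and decline to score, or route to a human, instead of falling back to the stale value.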

Requirements

  • 5+ years of production software / platform engineering; at least 3 years operating ML or LLM systems at meaningful scale.
  • Strong Python + at least one IaC stack (Terraform, Pulumi, Bicep). Comfortable in containers, Kubernetes, and a major cloud (Azure preferred).
  • Working fluency with agentic coding tools (Claude Code, Cursor, Copilot, or equivalent) as a daily driver, and the code commits, IaC, or runbooks to show for it.
  • Production experience operating at least one of: feature store, vector index, knowledge graph, LLM serving stack, with on-call responsibility.
  • Built and operated CI/CD for ML or LLM systems, including model/agent registries, eval gates, staged rollouts, and rollback.
  • Hands-on with observability for AI: traces, prompt/output capture, PII redaction, cost telemetry, drift detection.
  • Comfort designing and shipping testable agentic workflows: composing tools, writing the eval, defining the success criterion, gating on it.
  • Track record of carrying the pager and writing the postmortems.

Preferred Experience

  • Experience under HIPAA, SOC 2, or state-level payroll/tax compliance regimes.
  • PEO / HCM / payroll / benefits / insurance domain familiarity.
  • Production use of Databricks, Azure AI Search, Snowflake, Langfuse / Phoenix, Feast or Databricks Feature Store, Neo4j or TigerGraph.
  • Experience automating ops work with agentic workflows you authored (not just consumed).
  • OSS contributions to MLOps / LLMOps tooling.

Benefits & conditions

Bradenton, FL (Remote). $80,000 - $90,000 a year, full-time.

  • Health insurance
  • 401(k) matching
  • Paid time off
  • Vision insurance
  • Dental insurance
  • Paid holidays
  • Remote work
  • Bonus
  • Life insurance
