Kubernetes MLOps Engineer

OpenKyber LLC

3 days ago

Role details

Contract type

Temporary to permanent

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

Java

Microsoft Windows

Artificial Intelligence

Azure

Bash

Cloud Computing

Continuous Integration

Linux

GitHub

Monitoring of Systems

Medical Software

Python

Lynx

Reliability Engineering

Ansible

Software Engineering

Systems Architecture

Java Application Server

Reliability of Systems

Kubernetes

Infrastructure Automation Frameworks

Machine Learning Operations

Dynatrace

Human in the Loop

Docker

Job description

Location: San Francisco, CA or Irving, TX 75039 (largely remote)

Duration: 6-week contract with possibility of extension or conversion to an FTE role

Notes: While this position is primarily remote, occasional onsite presence may be required in the future in either San Francisco, CA or Irving, TX, depending on the candidate's location.

Work hours: Monday to Friday, 8:00 am to 5:00 pm Pacific Time

Description: Lynx is transforming from a traditional production support model to an automation-first, AI-assisted reliability platform following its migration to Azure cloud. This senior/staff-level Site Reliability Engineer role focuses on operating, stabilizing, and improving highly available systems while driving reliability automation using agentic AI across services. You will design, operate, and support scalable, observable production systems in Azure, while participating in and leading on-call rotations and high-severity incident response.

Responsibilities include root cause analysis, blameless post-incident reviews, and implementing corrective actions. You will own and enhance observability using Dynatrace (dashboards, alerts, SLIs/SLOs), troubleshoot production issues across Java-based services, Kubernetes, and cloud infrastructure, and collaborate with cross-functional teams to reduce risk and operational toil. A key focus of this role is designing and building AI-driven automation for incident ingestion, triage, investigation, and remediation using multi-agent patterns, with appropriate guardrails and human-in-the-loop controls. You will also develop automation for incident communication, reporting, and continuous improvement, while remaining accountable for system reliability and AI-driven operations in production.

The role combines software engineering and systems expertise to automate workflows, improve performance, and enhance system resilience using tools such as Azure, Kubernetes, Docker, GitHub Actions, Dynatrace, Python, Bash, and Ansible. Additional responsibilities include developing CI/CD pipelines, managing infrastructure, improving monitoring and observability, and supporting Java applications in production environments.

As a senior leader, you will define reliability standards, influence system architecture, lead incident response efforts, and serve as an escalation point for critical issues. You will drive proactive reliability improvements through monitoring and reporting, communicate system health and risks to leadership, and mentor team members while supporting hiring and onboarding efforts. You will also ensure compliance with HIPAA and with organizational security and regulatory requirements.

Requirements

Do you have experience in Windows?

Qualifications: Participation in a scheduled on-call rotation is required. 7+ years of Site Reliability Engineering or Production Engineering experience. Strong experience with Azure cloud infrastructure, Kubernetes, Docker, Java production systems, CI/CD (GitHub Actions), and observability platforms (Dynatrace preferred). Demonstrated experience automating infrastructure and operational workflows. Deep understanding of SRE principles (SLIs, SLOs, error budgets). Experience with Ansible. Solid understanding of Linux and Windows system administration. Experience working with onsite and offshore teams. Strong communication skills (written and verbal). Strong organizational skills and attention to detail. Experience in healthcare software or compliance solutions is a plus. Strong analytical and problem-solving skills.

Preferred / Differentiating Qualifications: Experience designing automation that replaces or materially reduces on-call toil. Experience building or orchestrating AI agents applied to operational workflows. Familiarity with multi-agent architectures or distributed automation systems. Strong judgment around risk management, safety boundaries, and human-in-the-loop design. Experience working in healthcare or regulated environments.

If you think this position is right up your alley, I'd love to talk to you, and I promise a prompt response either way. If you're looking for rewarding employment and a company that puts its employees first, we'd like to work with you.

About the company

Company Overview: Amerit Consulting is an extremely fast-growing staffing and consulting firm. Amerit Consulting was founded in 2002 to provide consulting, temporary staffing, direct hire, and payrolling services to Fortune 500 companies nationally, as well as to small and mid-sized organizations at the local and regional level. Currently, Amerit has over 2,000 employees in 47 states. We develop and implement solutions that help our clients operate more efficiently, deliver greater customer satisfaction, and see a positive impact on their bottom line. We create value by bringing together the right people to achieve results. Our clients and employees say they choose to work with Amerit because of how we work with them: with service that exceeds their expectations and a personal commitment to their success.
