ML Engineer (LLM Systems)

CYNNOVATIVE, LLC
8 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

API
Amazon Web Services (AWS)
Azure
Continuous Integration
Software Debugging
Distributed Systems
Memory Management
Fault Tolerance
Python
Machine Learning
TensorFlow
Software Engineering
Strategies of Testing
Data Logging
PyTorch
Large Language Models
Caching
GIT
Build Management
Containerization
Kubernetes
Information Technology
Low Latency
HuggingFace
Machine Learning Operations
Software Version Control
Docker

Job description

As a Senior ML Engineer (LLM Systems) at Cynnovative, you will be responsible for developing and managing tools that facilitate LLM experimentation and deployment. This role is crucial in ensuring seamless integration and operation of machine learning models in various environments, supporting U.S. national security efforts.

NOTE: This role requires an active TS/SCI security clearance and is located on-site in Northern Virginia.

Responsibilities May Include

Design and build scalable LLM systems for high-throughput experimentation and inference

  • Optimize inference performance (latency, throughput)
  • Batching, caching, and request scheduling
  • Efficient GPU/CPU utilization and memory management
  • Design and deploy containerized ML services (e.g., Docker, Kubernetes)

Lead development of experimentation infrastructure

  • Build frameworks for large-scale experiment sweeps and parallel execution
  • Support distributed execution across compute environments
  • Ensure fault tolerance, retry logic, and reproducibility

Ensure production readiness and operational reliability of LLM systems

  • Implement testing strategies and validation pipelines
  • Design APIs and model serving systems
  • Support deployment across dev/staging/prod environments
  • Maintain observability (logging, monitoring, tracing)
  • Debug and resolve issues in production systems
  • Deploy systems in secure or constrained environments

Collaborate cross-functionally

  • Work closely with applied mathematicians and research engineers
  • Provide technical leadership and mentorship
  • Establish engineering best practices

Requirements

  • B.S. in Computer Science, Software Engineering, or related field (M.S. or Ph.D. preferred)
  • Strong communication and cross-functional collaboration skills
  • Deep understanding of transformer architectures and LLM inference workflows
  • Hands-on experience building scalable ML systems
  • Proficiency in Python and ML frameworks (e.g., PyTorch, Hugging Face)
  • Experience with distributed systems or large-scale compute environments
  • Experience with containerization and cloud platforms (Docker, Kubernetes, AWS/GCP/Azure)
  • Familiarity with CI/CD workflows for ML systems
  • Experience with version control systems (Git)
  • U.S. Citizenship and active TS/SCI security clearance
  • Experience with high-performance LLM inference frameworks (vLLM, SGLang, etc.)
  • Experience deploying real-time inference APIs
  • Understanding of ML system tradeoffs (latency vs throughput vs cost)
  • Experience bridging research and production systems
  • Familiarity with cyber-related data, tools, and techniques

About the company

At Cynnovative, we leverage machine learning, computer science, and software engineering to address high-impact problems in the cyber domain, specifically those critical to U.S. national security. We primarily extend fundamental research to invent, design, develop, and deploy prototype solutions that address persistent problems in this domain.
