AI Engineer: Agentic Systems & On-Premise Infrastructure - MADISON WISCONSIN, NO REMOTE

EVO TECHNOLOGY INC
Madison, United States of America
14 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$ 84K

Job location

Madison, United States of America

Tech stack

Artificial Intelligence
Systems Engineering
Computer Clusters
Python
Performance Tuning
Retrieval-Augmented Generation
Large Language Models
Multi-Agent Systems
State Machines
Backend
FastAPI
Kubernetes
Bare Metal
Build Tools
TensorRT
Hardware Infrastructure

Job description

As a Senior AI Engineer, you will own the end-to-end lifecycle of our internal AI ecosystem, from optimizing local inference engines to orchestrating multi-agent swarms. You will build systems that don't just "chat," but reason, execute tools, and maintain state over long-running business processes. This is a role for a "full-stack" AI engineer: someone who understands both the mathematical foundations of LLMs and the systems engineering required to run them at scale on-premise.

  • Agentic Orchestration: Architect multi-agent systems using LangGraph or PydanticAI, implementing advanced patterns like Plan-and-Execute, Self-Reflection, and Multi-Agent Handoffs.
  • Production RAG: Build high-performance Retrieval-Augmented Generation pipelines utilizing hybrid search, cross-encoders, and re-ranking within our private vector stores.
  • On-Premise Optimization: Manage the full inference stack. You will be responsible for optimizing model performance (latency vs. throughput) on local GPU clusters using vLLM, TGI, or TensorRT-LLM.
  • Stateful Architecture: Design resilient workflows that handle long-running tasks, error recovery, and complex "human-in-the-loop" interactions.
  • Evaluation & Benchmarking: Develop rigorous, automated "LLM-as-a-judge" frameworks to evaluate agentic performance, ensuring reliability and grounding.

Requirements

  • Agentic Frameworks: Expert-level mastery of LangGraph (state machines/persistence) or PydanticAI (type-safe logic/structured outputs).
  • Python Mastery: Deep experience in asynchronous Python and building scalable, production-grade backend services (FastAPI, Pydantic).
  • Infrastructure & Deployment: Proven experience deploying open-weights models (Llama, Mistral, DeepSeek) on bare-metal or private Kubernetes environments.
  • The "Inner Workings": A strong grasp of LLM fundamentals, including attention mechanisms, KV caching, and the impact of tokenization on performance/cost.
  • Pretraining: Knowledge of data curation, pruning, and continued pretraining on domain-specific datasets.
  • Post-training: Hands-on experience with Supervised Fine-Tuning (SFT) and alignment techniques like DPO, ORPO, or RLHF.
  • Model Compression & Efficiency: Deep knowledge of quantization (GGUF, AWQ, EXL2), model distillation, and graph optimization to squeeze maximum performance out of local hardware.
  • VRAM Management: Ability to perform "VRAM math": calculating memory requirements for various model sizes (e.g., 70B vs 8B) relative to context window length, batch sizes, and KV cache pressure.
  • Data Sovereignty: A passion for building private systems where data privacy and infrastructure control are paramount.
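
For candidates wondering what "VRAM math" looks like in practice, here is a minimal back-of-the-envelope sketch. It is an illustration, not our internal tooling. It assumes FP16/BF16 weights and KV cache (2 bytes per value), grouped-query attention, and illustrative Llama-style shapes for the 8B and 70B cases; the names `ModelShape`, `kv_cache_bytes`, and `estimate_gib` are hypothetical, and the estimate ignores activations, fragmentation, and runtime overhead.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the few numbers the estimate needs."""
    params: float      # total parameter count
    n_layers: int      # transformer layers
    n_kv_heads: int    # key/value heads (fewer than query heads under GQA)
    head_dim: int      # dimension of each attention head


# Illustrative shapes (assumed, Llama-style); check the model's config before relying on them.
LLAMA_8B_LIKE = ModelShape(params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)
LLAMA_70B_LIKE = ModelShape(params=70e9, n_layers=80, n_kv_heads=8, head_dim=128)


def weight_bytes(shape: ModelShape, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (2.0 = FP16/BF16, roughly 0.5 for 4-bit)."""
    return shape.params * bytes_per_param


def kv_cache_bytes(shape: ModelShape, context_len: int, batch_size: int,
                   bytes_per_elem: float = 2.0) -> float:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem
    return per_token * context_len * batch_size


def estimate_gib(shape: ModelShape, context_len: int, batch_size: int,
                 bytes_per_param: float = 2.0, bytes_per_elem: float = 2.0) -> float:
    """Lower-bound estimate in GiB: weights plus KV cache, nothing else."""
    total = weight_bytes(shape, bytes_per_param) + kv_cache_bytes(
        shape, context_len, batch_size, bytes_per_elem)
    return total / GIB


if __name__ == "__main__":
    for name, shape in [("8B", LLAMA_8B_LIKE), ("70B", LLAMA_70B_LIKE)]:
        for batch in (1, 16):
            gib = estimate_gib(shape, context_len=8192, batch_size=batch)
            print(f"{name} fp16, ctx=8192, batch={batch}: ~{gib:.1f} GiB")
```

Under these assumptions the KV cache grows linearly with both context length and batch size, while the weights are a fixed cost. That is why a 70B model that fits at batch size 1 can run out of memory under concurrent load, and why quantizing weights alone does not relieve KV cache pressure.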
