AI Engineer: Agentic Systems & On-Premise Infrastructure - MADISON WISCONSIN, NO REMOTE
EVO TECHNOLOGY INC
Madison, United States of America
14 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $84K
Job location: Madison, United States of America
Tech stack
Artificial Intelligence
Systems Engineering
Computer Clusters
Python
Performance Tuning
Retrieval-Augmented Generation
Large Language Models
Multi-Agent Systems
State Machines
Backend
FastAPI
Kubernetes
Bare Metal
Build Tools
TensorRT
Hardware Infrastructure
Job description
As a Senior AI Engineer, you will own the end-to-end lifecycle of our internal AI ecosystem, from optimizing local inference engines to orchestrating multi-agent swarms. You will build systems that don't just "chat," but reason, execute tools, and maintain state over long-running business processes. This is a role for a "full-stack" AI engineer: someone who understands both the mathematical foundations of LLMs and the systems engineering required to run them at scale on-premise.
- Agentic Orchestration: Architect multi-agent systems using LangGraph or PydanticAI, implementing advanced patterns like Plan-and-Execute, Self-Reflection, and Multi-Agent Handoffs.
- Production RAG: Build high-performance Retrieval-Augmented Generation pipelines utilizing hybrid search, cross-encoders, and re-ranking within our private vector stores.
- On-Premise Optimization: Manage the full inference stack. You will be responsible for optimizing model performance (latency vs. throughput) on local GPU clusters using vLLM, TGI, or TensorRT-LLM.
- Stateful Architecture: Design resilient workflows that handle long-running tasks, error recovery, and complex "human-in-the-loop" interactions.
- Evaluation & Benchmarking: Develop rigorous, automated "LLM-as-a-judge" frameworks to evaluate agentic performance, ensuring reliability and grounding.
Requirements
- Agentic Frameworks: Expert-level mastery of LangGraph (state machines/persistence) or PydanticAI (type-safe logic/structured outputs).
- Python Mastery: Deep experience in asynchronous Python and building scalable, production-grade backend services (FastAPI, Pydantic).
- Infrastructure & Deployment: Proven experience deploying open-weights models (Llama, Mistral, DeepSeek) on bare-metal or private Kubernetes environments.
- The "Inner Workings": A strong grasp of LLM fundamentals, including attention mechanisms, KV caching, and the impact of tokenization on performance/cost.
- Pretraining: Knowledge of data curation, pruning, and continued pretraining on domain-specific datasets.
- Post-training: Hands-on experience with Supervised Fine-Tuning (SFT) and alignment techniques like DPO, ORPO, or RLHF.
- Model Compression & Efficiency: Deep knowledge of quantization (GGUF, AWQ, EXL2), model distillation, and graph optimization to squeeze maximum performance out of local hardware.
- VRAM Management: Ability to perform "VRAM math," that is, calculating memory requirements for various model sizes (e.g., 70B vs 8B) relative to context window length, batch sizes, and KV cache pressure.
- Data Sovereignty: A passion for building private systems where data privacy and infrastructure control are paramount.
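
For candidates wondering what the "VRAM math" in the requirements might look like in practice, here is a minimal sketch. It is an illustration under stated assumptions, not our internal tooling: the function names are hypothetical, the estimate covers only model weights and the KV cache (it ignores activations, framework overhead, and fragmentation), and the layer and head counts are the publicly documented Llama 3 8B and 70B configurations, used here only as examples.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone.

    bytes_per_param is 2 for FP16/BF16, roughly 0.5 for 4-bit quantization.
    """
    return params_billion * 1e9 * bytes_per_param / GIB


def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # assumes an FP16/BF16 cache
) -> float:
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token, per sequence in the batch."""
    values = 2 * layers * kv_heads * head_dim * context_len * batch_size
    return values * bytes_per_value / GIB


# Example configurations (assumed: Llama 3 8B and 70B, both using
# grouped-query attention with 8 KV heads and head_dim 128).
MODELS = {
    "8B": {"params_billion": 8, "layers": 32, "kv_heads": 8, "head_dim": 128},
    "70B": {"params_billion": 70, "layers": 80, "kv_heads": 8, "head_dim": 128},
}

if __name__ == "__main__":
    for name, cfg in MODELS.items():
        weights = weight_gib(cfg["params_billion"], bytes_per_param=2)
        cache = kv_cache_gib(
            cfg["layers"], cfg["kv_heads"], cfg["head_dim"],
            context_len=8192, batch_size=16,
        )
        print(f"{name}: weights ~{weights:.1f} GiB, KV cache ~{cache:.1f} GiB")
```

The point of the exercise is that weight memory is fixed by model size and precision, while KV cache memory grows linearly with both context length and batch size, which is why long contexts and high concurrency compete for the same VRAM.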