AI DevOps Engineer
Job description
We are looking for a hands-on AI DevOps Engineer to own and build out the operational backbone of our legal case management platform. You will be the go-to person for everything infrastructure, from development environments to production deployments across on-premises and cloud-hosted client sites, spanning both traditional systems and modern AI workloads such as LLMs, RAG pipelines, vector databases, and agent-based systems.
This is a high-impact, high-autonomy role. You will be the primary Ops resource, working alongside developers who currently handle infrastructure part-time. Your mission is to bring structure, reliability, and observability to our operations - establishing proper CI/CD pipelines, monitoring, alerting, and incident response processes.
- Design, build, and maintain CI/CD pipelines using Azure DevOps and Jenkins
- Manage build configurations, artifact publishing, and release orchestration
- Coordinate deployments across multiple client environments (on-prem and cloud)
- Maintain and improve source control workflows using Git
Infrastructure Management
- Provision, configure, and maintain Windows Server environments (dev, test, staging, production)
- Administer IIS web servers - application pools, bindings, SSL certificates, performance tuning
- Manage SQL Server instances - installation, configuration, backups, high availability (Always On)
- Maintain networking fundamentals - DNS, firewalls, load balancers, VPN connectivity
- Handle patch management and security hardening across all environments
Monitoring, Observability & AI Systems
- Stand up and maintain monitoring infrastructure using Zabbix, Grafana, and Loki
- Define and implement alerting rules for system health, performance, and availability
- Build dashboards that give the team real-time visibility into all environments
- Establish baseline metrics and SLAs for system performance
Incident Response & Troubleshooting
- Serve as the primary point of contact for production infrastructure issues
- Diagnose and resolve system outages, performance degradation, and deployment failures
- Conduct root cause analysis and implement preventive measures
- Document runbooks and operational procedures for common issues
Security & Compliance
- Implement and maintain access controls, following the principle of least privilege
- Manage SSL/TLS certificates across all environments
- Ensure backup and disaster recovery procedures are in place and regularly tested
- Support security audits and maintain awareness of data protection requirements (the legal industry handles sensitive PII)
Requirements
- 5+ years of Windows Server administration - this is a Windows shop and you must be an expert
- Expert-level Microsoft SQL Server - installation, configuration, backup/restore, performance tuning, Always On availability groups, index maintenance
- Expert-level IIS administration - application pools, URL rewrite, SSL bindings, troubleshooting, performance optimization
- CI/CD pipeline experience - Azure DevOps Pipelines and/or Jenkins, build automation, release management
- Scripting with PowerShell - automation of routine tasks, deployment scripts, system administration
- Source control - Git workflows, branching strategies, merge management
- Monitoring tools - hands-on experience with at least one observability stack (Zabbix, Grafana, Prometheus, or similar)
- Networking fundamentals - DNS, TCP/IP, firewalls, load balancers, VPN, SSL/TLS
- Backup & disaster recovery - designing and testing backup strategies, point-in-time recovery
- LLM Integration: OpenAI Chat Completions, Assistants API, Realtime API, function calling, streaming. Knowing only the Chat Completions API is not sufficient
- RAG Systems: Vector databases (Chroma or equivalent), embedding models (HuggingFace/OpenAI), chunking strategies, retrieval pipelines
- Agentic Patterns: Tool-calling agents, multi-step reasoning, agent orchestration frameworks (LangChain or equivalent)
- Microsoft Certification (MCSA, MCSE, or Azure equivalent) - strongly preferred
- Oracle Cloud Infrastructure (OCI) experience - compute, networking, storage, block volumes
- Grafana + Loki experience for log aggregation and visualization
- Zabbix experience for infrastructure monitoring
- Python scripting for automation and tooling
- Docker / containerization basics
- Linux administration fundamentals
- AWS EC2 experience
- Familiarity with compliance frameworks (SOC 2 or similar)
- Experience supporting multi-tenant or client-deployed software products
What Makes You a Great Fit
- Ownership mentality - you will be building this function, not slotting into an existing team. You see gaps and fill them without being asked.
- Calm under pressure - production issues happen. You diagnose methodically, communicate clearly, and fix things fast.
- Automation-first mindset - if you do something twice, you script it. Manual processes are temporary; automation is the goal.
- Clear communicator - you can explain infrastructure issues to developers and stakeholders in plain language.
- Documentation habit - you write things down so the team doesn't depend solely on your memory.
- Pragmatic problem solver - you find the right solution for the situation, not the theoretically perfect one.
- Microsoft SQL Server: 3 years (Preferred)
- CI/CD: 4 years (Preferred)
- PowerShell: 3 years (Preferred)
- Disaster recovery: 3 years (Preferred)
- Python: 4 years (Preferred)
- Microsoft Windows Server: 5 years (Preferred)
- AI: 3 years (Preferred)
- LLM: 3 years (Preferred)
- Agentic AI: 1 year (Preferred)
Benefits & conditions
- 401(k)
- Health insurance
- Paid time off
- Vision insurance
- Dental insurance