Senior Site Reliability Engineer
Job description
Crytek is looking for an experienced Senior Site Reliability Engineer to support Hunt: Showdown's NetOps department in our Frankfurt Studio. The person in this position will serve as the key liaison between development teams and the network operations team. They will drive operational excellence, lead infrastructure initiatives, and work closely with production and architecture to ensure systems are highly available, scalable, and efficient. This position includes both operational and strategic responsibilities. This role is based on-site at our headquarters in Frankfurt, Germany, where you'll collaborate with world-class developers and benefit from our attractive relocation package.
Responsibilities
- Lead initiatives to improve reliability, scalability, and performance across our live game infrastructure.
- Serve as subject matter expert and mentor to junior and mid-level engineers.
- Operate and maintain hosted/cloud data-center environments on a daily basis.
- Install, configure, and patch system and game software.
- Define, monitor, and improve SLIs/SLOs to maintain 99.9%+ uptime.
- Own incident response and root cause analysis processes; create and maintain runbooks and playbooks.
- Evaluate and implement new technologies, conducting POCs and driving them to production.
- Maintain accurate, up-to-date documentation for systems, workflows, and processes.
- Lead capacity planning, scaling strategies, and disaster recovery efforts.
- Continuously optimize the reliability, observability, and cost efficiency of critical infrastructure.
Requirements
- Previous experience as a Site Reliability Engineer, Platform Engineer, or similar.
- Proven experience designing and operating large-scale, high-availability systems.
- Strong Linux administration skills.
- Experienced with containerization and orchestration technologies.
- Experience with CI/CD pipelines, automated deployment, and infrastructure as code.
- Solid understanding of network security principles.
- Hands-on experience with both bare-metal and cloud environments (preferably AWS).
- Proficient in automation tools such as Ansible and Terraform.
- Skilled with observability tools like OpenTelemetry, Prometheus, Mimir, and Grafana.
- Deep understanding of scalability, profiling, debugging, and performance testing.
- Strong grasp of web stack fundamentals (REST, HTTP, CDN, caching).
- Experience setting up monitoring, metrics, and proactive alerting for production systems written in Go, Java, or C++.
- Proficient in scripting with Shell and Python.
- Excellent communication and documentation skills in English.
- Willing to relocate to Frankfurt.
Pluses
- Experience with Zero Trust Networks, WireGuard, Nomad, MAAS, Foreman.
- Knowledge of capacity forecasting and cost optimization for large-scale systems.