Senior Site Reliability Engineer

Nexthink
Frankfurt (Oder), Germany
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Amazon Web Services (AWS)
Automation of Tests
Azure
Bash
Software as a Service
Cloud Computing
Protocol Stack
Computer Programming
Linux
Fault Tolerance
GitHub
Iterative and Incremental Development
Subnetting
Virtual Private Networks (VPN)
Python
Reliability Engineering
Service Design
Software Engineering
TCP/IP
Datadog
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Load Balancing
Istio
MTTR
Firewalls (Computer Science)
GitLab CI
Kubernetes
Information Technology
Terraform
Docker
Jenkins
Go
Microservices

Job description

We are looking for an experienced, proactive and innovative professional who is keen to join us as a Senior Site Reliability Engineer! The mission of Nexthink's SRE team is to strengthen our infrastructure and enhance our ability to deploy, monitor, and scale systems effectively and reliably. The team works closely with over 50 Product Engineering teams that develop our products and services, as well as with the Technical Platform Engineering, Security and Architecture teams, to understand reliability requirements, design and implement solutions, and promote their adoption.

  • Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
  • Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support rapid delivery cycles.
  • Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
  • Define and maintain SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
  • Develop infrastructure-as-code (Terraform or similar) for repeatable and auditable provisioning.
  • Build internal platform tools and automation to support provisioning, monitoring, and operational efficiency.
  • Monitor infrastructure and applications to ensure high-quality user experiences.
  • Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
  • Act as Incident Commander while on call, coordinating cross-team responses effectively to maintain SLAs.
  • Drive and refine incident response processes, reducing Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR).
  • Diagnose and resolve complex issues independently, minimizing the need for external escalation.
  • Work closely with software engineers to embed observability, fault tolerance, and reliability principles into service design.
  • Automate runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
  • Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
  • Contribute to security best practices, compliance automation, and cost optimization.

Requirements

  • Minimum Bachelor's degree in Computer Science or equivalent practical experience.
  • 5+ years of experience as a Site Reliability Engineer or Platform Engineer with strong knowledge of software development best practices.
  • Strong hands-on experience with public cloud services (AWS, GCP, Azure) and supporting SaaS products.
  • Strong programming or scripting skills (e.g., Python, Go, Bash), and experience with infrastructure-as-code (e.g., Terraform).
  • Proficiency with Kubernetes, container-based deployment (e.g., Docker) and related ecosystems (e.g., Helm).
  • Experience supporting multi-tenant microservices architectures.
  • Experience with CI/CD pipelines & tools (e.g., Jenkins, GitHub Actions, GitLab CI, FluxCD, Crossplane).
  • Experience managing monitoring solutions (e.g., Datadog).
  • Comfortable participating in a rotating on-call schedule, managing critical incidents, and leading post-incident reviews.
  • At ease with operating and managing production systems, striking the right balance between urgency and methodology.
  • Strong system-level troubleshooting skills and a proactive mindset toward incident prevention.
  • Deep understanding of Linux systems, networking, and common troubleshooting practices.
  • Solid understanding of the network stack (e.g., TCP/IP, VPN), cloud architectures (VPC, subnets, firewalls, load balancers), service mesh (e.g., Istio) and storage (e.g., S3, EBS).
  • Knowledge of zero-downtime deployment strategies such as blue/green and canary releases.
  • Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus.
  • Experience with chaos engineering or resilience testing practices.
  • Excellent problem-solving skills, collaborative mindset, and a strong grasp of agile, iterative development.
  • Self-driven, highly organised, and capable of independently managing priorities.
  • Curiosity to learn new things and discover new technologies.
  • Strong communication, presentation, and team collaboration skills.
  • Excellent written and verbal skills in English.

Prior experience with any of the above-mentioned tools is a bonus, but not a must! We encourage you to apply even if you do not meet every single requirement. We welcome candidates with different levels of background and experience. If you are excited about this role, please apply and our recruiters will assess your application.

Benefits & conditions

  • Permanent Contract and a competitive compensation package (Stock Options also included).
  • Amazing centrally located offices near the Bernabeu Stadium.
  • Private Health Insurance (Sanitas) and daily meal vouchers of 11 EUR, both fully covered by us.
  • Hybrid work model balancing office and remote work, with a structured approach for new hires to foster connections and onboarding.
  • Flexible hours and unlimited vacation (employees have unlimited paid time off on top of the 23 days of holidays we offer) plus 3 company-paid volunteer days.
  • Up to 25 EUR per month for a gym subscription.
  • Flexible retribution plan for kindergarten & transport tickets.
  • Reimbursement of up to 50% of the cost of English & Spanish classes.
  • Fresh fruit, cookies, and occasionally some soft drinks as well.
  • Regular company and team events like Pizza talks, Team Building activities, Christmas parties, hosting Meetups at the office and more!
  • Bonuses for referring successful hires after three months of continuous employment.
  • We offer a relocation package to people who are coming from another country.

Please note that not all the benefits listed above are available for temporary, contract, and internship roles. To ensure you have the most up-to-date information, we recommend checking with your Recruitment Partner.

About the company

Nexthink is the global leader in digital employee experience management. Our products allow enterprises to create highly productive digital workplaces for their employees by delivering optimal end-user experiences. Through a unique combination of real-time analytics, automation and employee feedback, Nexthink gives IT teams the insight they need to empower and even delight people at work.

Headquartered in Switzerland, with US headquarters in Boston, Nexthink also has offices in France, the UK, Germany, Spain and the UAE.
