[CH] Site Reliability Engineer (Monitoring & Incident Response Focus)
Job description
WellD is looking for a Site Reliability Engineer to join a key enterprise project.
In this role, you will be responsible for monitoring and ensuring the reliability of mission-critical applications, specifically authentication and user management systems used by millions of B2C and B2B customers.
As an SRE, you will act as the technical point of reference for service stability: not only monitoring system health, but also taking ownership of incident response, troubleshooting, and root cause analysis, while contributing to automation and improvements that increase resilience and reduce downtime.
You will work in a complex enterprise environment with state-of-the-art observability tools and have the opportunity to propose and implement new monitoring solutions.
Where you'll work
The position is based in Southern Switzerland, with a hybrid on-site/remote setup (4 days per week in the office and 1 day working remotely). While WellD values workplace flexibility, our priority remains our clients and teams. Please note that our remote work policy may be subject to change in accordance with Swiss and cross-border legislation.
What you'll do
- Monitor and ensure the availability and reliability of critical Java-based applications.
- Participate in incident response and root cause analysis, reducing downtime and preventing recurrence.
- Manage and optimize observability platforms (Splunk, Grafana, Dynatrace).
- Automate monitoring and diagnostic tasks using scripting and tools (Bash, Ansible).
- Administer Linux environments, including network configuration, firewalls, and certificates.
- During non-monitoring weeks, contribute to development or operational tasks (Java, Spring Boot or automations) to improve applications and tools.
- Act as a technical point of contact for internal and external teams (developers, business stakeholders, infrastructure).
- Propose and implement new monitoring and alerting solutions to improve overall system resilience.
Requirements
Must-have skills
- Strong experience with Linux system administration, including networking, firewall, and certificate management.
- Hands-on experience with monitoring/observability tools (Splunk, Grafana, Dynatrace).
- Software development skills, ideally in Java (Spring Boot, Maven).
- Scripting expertise (Bash, Ansible) for automation and diagnostics.
- Ability to analyze and resolve complex incidents in enterprise environments.
- An SRE mindset: focus on reliability, prevention, resilience, and automation.
- Fluency in both Italian and English.
Nice-to-have skills
- Knowledge of authentication protocols (SAML, OAuth/OIDC).
- Experience with Jenkins, Docker, Kafka, ArgoCD.
- Familiarity with distributed and secure system architectures.
- Proficiency in German is a plus.
Benefits & conditions
- Full-time permanent contract with competitive Swiss salaries.
- MacBook Pro and all the equipment you need to work comfortably.
- Swiss public transportation pass.
- Annual training budget for conferences, certifications, and courses.
- Access to our technical library (we'll order the books you need).
- Hack Days, Tech Lunches, MeetUps, team-building activities - fully supported by WellD.
- A stimulating environment with enterprise-grade monitoring tools and room to grow your skills.
Note: This position is open to Swiss or EU/Schengen citizens with valid work/residency permits.
Important to know
- This position is open to Swiss citizens, Swiss residents, or Schengen EU citizens (cross-border eligible)
- The role is hybrid, with office presence required in Lugano
- Italian and English are mandatory for this role
Compensation range
- Agno (Lugano): CHF65,000 - CHF75,000 gross annually*
- Other locations: Compensation will be discussed during the interview process
- Final compensation will be determined based on the candidate's qualifications, skills, and previous experience