Senior IT Infrastructure Resilience Engineer
Job description
Are you looking to join us as an IT Infrastructure Resilience Engineer? We are currently seeking a hands-on IT Infrastructure Resilience Engineer to strengthen the resilience, recoverability, and operational stability of our core infrastructure and critical banking services.
You will work across Windows and Linux, physical and virtual infrastructure, network and security fundamentals, backup/restore and disaster recovery, monitoring/observability, and automation. You will collaborate with application teams, Security, business lines, and managed service partners to design, build, automate, and continuously improve technical and operational controls that keep services available, recoverable, observable, and secure, while ensuring evidence and documentation meet DORA / DNB / EBA expectations.
You will drive Business Continuity and DR by coordinating recovery planning and exercises, aligning RTO/RPO and runbooks with business requirements, and steering vendors to deliver high-quality outcomes against SLAs/KPIs in a regulated environment.
Network Resilience & Security Fundamentals
- Support and improve network resilience: LAN/WAN basics, segmentation, firewalls, VPN, load balancers, redundancy concepts (e.g., Cisco switches, F5, Palo Alto, Fortinet).
- Apply security best practices: hardening, patching discipline, logging/monitoring integration, and audit evidence mindset.
Resilience Engineering
- Design and improve resilient infrastructure patterns (redundancy, failover, clustering, active-active/active-passive, elimination of SPOFs).
- Participate in major incidents, lead structured troubleshooting, perform RCA, and drive permanent improvements.
Virtualization & Compute Platform Resilience
- Operate and improve virtualization platforms and HA concepts (e.g., VMware vSphere/ESXi/vCenter; Hyper-V).
- Manage clusters, templates, migration (vMotion/Live Migration), snapshot policies, patching/upgrades, and performance troubleshooting.
- Improve recovery procedures for critical workloads on virtual platforms.
Backup, Restore & Cyber Recovery
- Continuously improve backup/restore capability across infrastructure and databases (e.g., Commvault, Azure; MS SQL, PostgreSQL, MySQL): policies, job health/monitoring, retention, encryption, immutability, and operational procedures.
- Ensure restore validation beyond "backup successful" by performing periodic recovery tests for systems and databases with documented results and audit-ready evidence.
- Partner with the Corporate Security Team on cyber recovery: ransomware-oriented restore validation, secure recovery procedures for both infrastructure and database platforms, and hardening of backup access and recovery workflows.
DR & Business Continuity
- Own end-to-end DR & BC readiness for critical services: define recovery steps, validate dependencies, maintain runbooks/playbooks and operational procedures, and keep documentation current and usable during real incidents.
- Drive governance and impact assessment: coordinate BC committees/forums, lead/support BIAs with business/service owners, validate RTO/RPO targets, and translate outcomes into recovery priorities, controls, and tracked actions.
- Plan and execute DR tests and crisis simulations (technical/operational and tabletop): coordinate stakeholders, scenarios and communications, and ensure results, gaps, evidence, and remediation are traceable, measurable, and audit-ready.
Banking Application Infrastructure
- Support onboarding and lifecycle of critical banking applications (SWIFT, screening, payments) from an infrastructure perspective.
- Ensure non-functional requirements: availability, recoverability, monitoring, patching readiness, secure access patterns, and change readiness.
Physical Server & Datacenter Technologies
- Hands-on support for physical server lifecycle: provisioning, firmware/BIOS baselines, RAID/storage configuration, diagnostics, break/fix coordination, and decommissioning.
- Work with enterprise server/desktop platforms (e.g., HPE ProLiant) and management interfaces.
- Improve datacenter resilience patterns: dual power, redundant networking, out-of-band management, and hardware health monitoring.
Observability, Reporting & Automation
- Build and maintain actionable monitoring/alerting/observability (availability, performance, capacity, correlation), e.g., with Dynatrace.
- Automate operational controls using PowerShell / Python / Bash (health checks, backup verification, config drift checks, reporting).
- Standardize processes and reduce key-person dependency via repeatable practices (e.g., GoAnywhere).
Requirements
Do you have experience in vSphere?
You are an open-minded colleague, very curious by nature and passionate about your job. You are not afraid to handle various tasks at the same time and meet tight deadlines. You think proactively and stay a step ahead, finding solutions with both internal and external stakeholders. You are a cooperative person who listens and invests in the work and in people to achieve common goals. You naturally think outside the box and navigate a fast-changing and complex environment by questioning the how and the why.
- 10+ years in Infrastructure Engineering / Systems Engineering / Platform Engineering / SRE, ideally in enterprise or regulated environments.
- Calm and focused under pressure.
- Comfortable working in a distributed team.
- Self-driven and team-oriented.
- Excellent communication and collaboration skills.
- Relevant certification (CCNA/CCNP, Microsoft Certified Expert (AI-102), Azure Solutions Architect (AZ-305), Microsoft Azure Fundamentals (AZ-900), Microsoft Azure Networking (AZ-700), Microsoft Azure Administrator (AZ-104)).
Nice to Have
- OpenShift/Kubernetes platform operations experience
- Citrix Virtual Apps & Desktops and/or RDS experience
- Azure experience
- Basic understanding of M365 Suite / Exchange (infrastructure-side support)
- Familiarity with DORA-aligned resilience expectations and audit readiness
Benefits & conditions
- Becoming part of a dynamic team in an international working environment.
- 30 vacation days.
- 13th month.
- 8% holiday payment.
- Laptop and mobile phone.
- Annual extra appreciation payment.
- Pension plan: defined contribution scheme.
- Collective health insurance: discount on additional health insurance.
- Educational budget and access to Coursera training.