SRE (Site Reliability Engineer)
Job description
We are seeking a highly skilled Site Reliability Engineer to join our dynamic IT team. The successful candidate will be responsible for ensuring the stability, scalability, and performance of our cloud-based and on-premises systems. This role involves developing automation solutions, managing infrastructure, and supporting software deployment processes across diverse environments. The ideal applicant will possess a strong background in system administration, cloud computing, and software development, with a keen eye for troubleshooting and incident management. This is an excellent opportunity for professionals passionate about maintaining high-availability systems and driving continuous improvement in system reliability.
- Design, implement, and maintain scalable and reliable infrastructure using tools such as Kubernetes, Terraform, Ansible, Puppet, Chef, and VMware.
- Monitor system performance with tools like New Relic, Splunk, Elasticsearch, and Nagios to proactively identify issues before they impact users.
- Automate deployment pipelines leveraging Jenkins, GitLab CI/CD, TFS, and other continuous integration tools to streamline software releases.
- Manage cloud environments including AWS, Azure, Google Cloud Platform (GCP), and OpenStack to optimise resource utilisation and cost-efficiency.
- Develop scripts using PowerShell, Bash (Unix shell), Python, Ruby, Perl, Groovy, or Go to automate routine tasks and improve operational efficiency.
- Troubleshoot complex issues related to web services such as REST APIs, web servers such as NGINX, and application servers such as WebSphere, WebLogic, or JBoss.
- Implement disaster recovery plans and perform incident response activities to minimise downtime during outages or security breaches.
- Collaborate with development teams on requirements gathering for new features or system upgrades following SDLC best practices.
- Maintain comprehensive documentation of system configurations and procedures aligned with ITIL standards for release management and change control.
Requirements
Do you have experience in WebLogic?
- Proven experience in a Site Reliability Engineering or DevOps role within a large-scale enterprise environment.
- Extensive knowledge of containerisation technologies such as Docker and Kubernetes.
- Hands-on experience with cloud platforms including AWS (Amazon S3, EC2), Azure (Virtual Machines), Google Cloud Platform (GCP), or OpenStack.
- Strong proficiency in scripting languages such as Python, PowerShell, Bash (Unix shell), Ruby, or Groovy for automation tasks.
- Familiarity with configuration management tools such as Ansible, Puppet, or Chef; Git-based version control platforms such as GitHub or GitLab; and CI/CD pipelines using Jenkins or TFS.
- Experience managing distributed systems architecture involving microservices and APIs over TCP/IP networks.
- Knowledge of databases including MySQL, Microsoft SQL Server (T-SQL), Oracle DBMS; along with experience in SQL optimisation and disaster recovery planning.
- Understanding of computer networking concepts such as DNS, TCP/IP, firewalls, and LAN/WAN configuration.
- Ability to troubleshoot software issues across various platforms including Linux (CentOS/Ubuntu) and Windows Server environments.
This role offers an engaging environment where technical expertise is valued and professional growth is encouraged through exposure to cutting-edge technology stacks and best practices in system reliability engineering.