Site Reliability Engineer

Alianza, Inc.
Charing Cross, United Kingdom
27 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Charing Cross, United Kingdom

Tech stack

Java
JavaScript
Amazon Web Services (AWS)
Software Applications
Azure
C++
Cloud Computing
Disaster Recovery
Distributed Data Store
Distributed Systems
Python
Performance Tuning
Reliability Engineering
Ruby
Software Systems
Cloud Platform System
System Availability
Reliability of Systems
Kubernetes
Kafka
Operational Systems

Job description

This position can be hybrid or fully remote within the UK.

You must currently have the right to work in the UK without requiring sponsorship, either now or in the future.

A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and scalability of Alianza's Cloud Platform systems and infrastructure.

Key Objectives include:

  • Run the production environment by monitoring availability and taking a holistic view of system health.
  • Improve reliability, quality, and time-to-market of software solutions.
  • Balance feature development speed and reliability with well-defined service-level objectives.

Key Responsibilities:

  • Monitoring and Maintenance:
      • Continuously monitor system health and performance, ensuring high availability and reliability of applications.
      • Detect and automatically handle failures, and prepare disaster recovery plans.
  • Automation and Improvement:
      • Build and maintain software and systems to manage platform infrastructure and applications.
      • Implement automation to reduce manual intervention and improve system efficiency.
  • Performance Optimization:
      • Measure and optimize system performance, pushing capabilities forward and innovating for continual improvement.
      • Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.
  • Collaboration and Consulting:
      • Partner with development teams to improve services through rigorous testing and release procedures.
      • Participate in system design consulting, platform management, and capacity planning.
  • Incident Management:
      • Provide primary operational support and engineering for multiple large-scale distributed software applications.
      • Participate in on-call rotations to respond to incidents and ensure system reliability.

Requirements

Do you have experience in Ruby?

  • Attention to Detail: The ability to perform tasks with thoroughness and accuracy, ensuring all aspects of the system are meticulously managed.

  • Problem-Solving Skills: The capability to analyze complex issues, identify root causes, and develop effective solutions to ensure system reliability and performance.

  • Technical Expertise: Proficiency in understanding and applying technical knowledge related to infrastructure, code, and tools, which can be enhanced through continuous learning and experience.

  • Automation Skills: The ability to design and implement automation processes to reduce manual intervention and improve system efficiency.

  • Communication Skills: The ability to clearly convey ideas, strategies, and updates to various stakeholders, ensuring alignment and transparency across the organization.

  • An inherent tendency to be precise and conscientious, ensuring high standards are maintained in all aspects of work.

  • Resilience: The innate ability to remain calm and composed under pressure, effectively managing stressful situations and leading the team through challenges.

  • Curiosity: A natural inclination to explore and learn new technologies and methodologies, driving innovation and continuous improvement.

  • Empathy: An inherent quality of understanding and valuing the perspectives and needs of team members and stakeholders, fostering a supportive and inclusive environment.

  • Adaptability: The ability to naturally adjust to changing circumstances and environments, ensuring effective responses to new challenges and opportunities.

Desired Skills/Qualifications

  • Technical Proficiency:
      • Understanding of high-level languages such as Python, Java, C/C++, Ruby, and JavaScript.
      • Experience with distributed storage technologies and dynamic resource management frameworks.
      • Experience with telco technology and Metaswitch software is a bonus.
  • Problem-Solving Skills:
      • Strong analytical skills to diagnose and resolve complex technical issues.
  • Communication Skills:
      • Excellent communication skills to collaborate effectively with cross-functional teams and convey technical concepts.
  • Experience with Cloud Platforms:
      • Hands-on experience with cloud platforms like AWS, GCP, or Azure. An understanding of cloud-native applications and services is vital for modern SRE roles.
  • Knowledge of Networking and Distributed Systems:
      • Strong understanding of networking fundamentals and experience with distributed systems such as Kubernetes, and with Kafka and other stream-processing technologies. This helps in managing large-scale, complex systems.
