Site Reliability Engineer

Alianza, Inc.

Charing Cross, United Kingdom

27 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Charing Cross, United Kingdom

Tech stack

Java

JavaScript

Amazon Web Services (AWS)

Software Applications

Azure

C++

Cloud Computing

Disaster Recovery

Distributed Data Store

Distributed Systems

Python

Performance Tuning

Reliability Engineering

Ruby

Software Systems

Cloud Platform System

System Availability

Reliability of Systems

Kubernetes

Kafka

Operational Systems

Job description

This position can be hybrid or remote in the UK.

You must currently have the right to work in the UK without requiring sponsorship, either now or in the future.

A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and scalability of Alianza's Cloud Platform systems and infrastructure.

Key Objectives include:

Run the production environment by monitoring availability and taking a holistic view of system health.
Improve reliability, quality, and time-to-market of software solutions.
Balance feature development speed and reliability with well-defined service-level objectives.

Key Responsibilities:

Monitoring and Maintenance:

Continuously monitor system health and performance, ensuring high availability and reliability of applications.
Detect and automatically handle failures, and prepare disaster recovery plans.

Automation and Improvement:

Build and maintain software and systems to manage platform infrastructure and applications.
Implement automation to reduce manual intervention and improve system efficiency.

Performance Optimization:

Measure and optimize system performance, pushing capabilities forward and innovating for continual improvement.
Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.

Collaboration and Consulting:

Partner with development teams to improve services through rigorous testing and release procedures.
Participate in system design consulting, platform management, and capacity planning.

Incident Management:

Provide primary operational support and engineering for multiple large-scale distributed software applications.
Participate in on-call rotations to respond to incidents and ensure system reliability.

Requirements

Do you have experience in Ruby?

Attention to Detail: The ability to perform tasks with thoroughness and accuracy, ensuring all aspects of the system are meticulously managed.
Problem-Solving Skills: The capability to analyze complex issues, identify root causes, and develop effective solutions to ensure system reliability and performance.
Technical Expertise: Proficiency in understanding and applying technical knowledge related to infrastructure, code, and tools, which can be enhanced through continuous learning and experience.
Automation Skills: The ability to design and implement automation processes to reduce manual intervention and improve system efficiency.
Communication Skills: The ability to clearly convey ideas, strategies, and updates to various stakeholders, ensuring alignment and transparency across the organization.
An inherent tendency to be precise and conscientious, ensuring high standards are maintained in all aspects of work.
Resilience: The innate ability to remain calm and composed under pressure, effectively managing stressful situations and leading the team through challenges.
Curiosity: A natural inclination to explore and learn new technologies and methodologies, driving innovation and continuous improvement.
Empathy: An inherent quality of understanding and valuing the perspectives and needs of team members and stakeholders, fostering a supportive and inclusive environment.
Adaptability: The ability to naturally adjust to changing circumstances and environments, ensuring effective responses to new challenges and opportunities.

Desired Skills/Qualifications

Technical Proficiency:

Understanding of high-level languages such as Python, Java, C/C++, Ruby, and JavaScript.
Experience with distributed storage technologies and dynamic resource management frameworks.
Experience with telco technology and Metaswitch software is a bonus.

Problem-Solving Skills:

Strong analytical skills to diagnose and resolve complex technical issues.

Communication Skills:

Excellent communication skills to collaborate effectively with cross-functional teams and convey technical concepts.

Experience with Cloud Platforms:

Hands-on experience with cloud platforms like AWS, GCP, or Azure. Understanding cloud-native applications and services is vital for modern SRE roles.

Knowledge of Networking and Distributed Systems:

Strong understanding of networking fundamentals and experience with distributed systems such as Kubernetes, and with Kafka and other stream-processing technologies. This helps in managing large-scale, complex systems.