Site Reliability Engineer
Job description
- Collaborate with software engineering and operations teams to design, build, and maintain cloud-based infrastructure using AWS and Terraform.
- Implement and enhance infrastructure-as-code (IaC) practices using Terraform to ensure reproducibility and scalability of infrastructure components.
- Monitoring and Incident Management:
  - Develop and maintain monitoring solutions to proactively identify performance bottlenecks, system outages, and other potential issues.
  - Participate in incident response and root cause analysis efforts to drive continuous improvement and prevent future incidents.
- Reliability and Performance Optimization:
  - Optimise system performance, reliability, and cost efficiency through continuous monitoring, performance tuning, and capacity planning.
  - Identify opportunities to automate manual processes and improve system resilience.
- Scripting and Automation:
  - Utilise Python or Bash scripting to create and maintain automation tools for various operational tasks and deployments.
  - Implement and improve continuous integration and continuous deployment (CI/CD) pipelines.
- Security and Compliance:
  - Collaborate with security teams to implement best practices for securing cloud infrastructure and services.
  - Ensure compliance with relevant industry standards and regulations.
- Deployment and Release Management:
  - Support CI/CD pipelines for application deployments and updates.
  - Contribute to the design and implementation of deployment strategies that promote zero-downtime releases.
- Documentation and Knowledge Sharing:
  - Maintain clear and up-to-date documentation for infrastructure configurations, processes, and incident resolution procedures.
  - Participate in knowledge sharing with team members to enhance overall expertise and skill sets.
Requirements
We are seeking a talented and experienced Site Reliability Engineer (SRE) to join our team. As an SRE, you will play a crucial role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure and services, primarily hosted on AWS. If you have a passion for problem-solving, a deep understanding of AWS services, hands-on experience with Terraform, and proficiency in scripting with Python or Bash, we invite you to apply for this exciting opportunity.
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- Proven experience as a Site Reliability Engineer or similar role.
- Technical Skills:
  - Extensive experience with Amazon Web Services (AWS) and its core services (EC2, S3, RDS, IAM, etc.).
  - Strong proficiency in infrastructure-as-code (IaC) tools, with a focus on Terraform.
  - Proficient in scripting with Python or Bash for automation and operational tasks.
  - Solid understanding of networking principles and protocols.
  - Knowledge of CI/CD pipelines and related tools.
- Problem-Solving and Analytical Abilities:
  - Ability to diagnose and resolve complex technical issues in a fast-paced environment.
  - Analytical mindset to proactively identify potential system weaknesses and performance bottlenecks.
- Collaboration and Communication:
  - Strong teamwork and collaboration skills to work effectively with cross-functional teams.
  - Excellent verbal and written communication skills.
Benefits & conditions
This is an equity-only position, offering a unique opportunity to gain a stake in a rapidly growing company and contribute directly to its success.