Cloud Site Reliability Engineer (SRE)

Insight Global
Alpharetta, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$ 166K

Job location

Alpharetta, United States of America

Tech stack

Amazon Web Services (AWS)
Computing Platforms
Azure
Bash
Cloud Computing
Cloud Engineering
DevOps
Distributed Computing Environment
Fault Tolerance
Identity and Access Management
Python
Powershell
Reliability Engineering
Ansible
Software Engineering
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Load Balancing
Cloud Platform System
Reliability of Systems
Infrastructure as Code (IaC)
Low Latency
Terraform
Splunk
Dynatrace
Serverless Computing
Docker

Job description

This role is primarily an operations incident response role for cloud issues in AWS, Azure, and GCP, and it includes cloud infrastructure management. The engineer will troubleshoot and perform performance analysis from a cloud perspective with the goal of reducing time to resolution. The role focuses on responding to incidents as well as strategizing and designing solutions that prevent incidents from recurring in the cloud environment. The engineer will collaborate with the NOC, network engineering teams, platform teams, and application support teams in addition to working with the cloud provider. Our goal is to modernize and stabilize our infrastructure. As we get pulled into incidents and issues, we want to resolve them quickly and then address the underlying problem so it does not recur.

We are seeking a Cloud Site Reliability Engineer (SRE) to drive the reliability, scalability, and performance of our cloud-based infrastructure. The ideal candidate combines software engineering expertise with advanced systems operations skills to maintain highly available systems while reducing operational toil. This role involves automation, monitoring, capacity planning, incident response, and cloud platform management across a dynamic, distributed environment. As a Cloud SRE, you will work closely with Engineering, Architecture, DevOps, and security teams to ensure seamless service experiences for our customers while contributing to platform design and operational efficiency.

Contract/Contract-to-Hire Roles

Requirements

Experience with Azure, AWS, or GCP (at least two of the three cloud platforms). The number of years does not matter; the candidate needs to be able to think through complex issues.
Experience with Splunk.
Terraform experience for Infrastructure as Code (IaC).
Cloud Infrastructure Management: Deploy, manage, and optimize cloud resources.
Python, PowerShell, Bash, or equivalent for automation and system management (one scripting language is fine).
VPCs, IAM, and serverless architectures.
Very collaborative team player: works with other teams (platforms, engineering, networking) and considers how a resolution applies in other areas.
Scaling, sizing, and performance.
Very strong infrastructure and systems analysis skills in the cloud.
Problem solver: able to get to the bottom of issues to remediate problems quickly, and then think about how to fix them for good.
Self-starter.
Ability to work in a high-pressure environment.
Automation, monitoring, capacity planning, incident response, and cloud platform management across a dynamic, distributed environment.
System Reliability & Availability: Design and maintain fault-tolerant, high-availability architectures across AWS, Azure, and GCP. Implement redundancy, load balancing, and automated failover strategies.
On-call rotation every 5-6 weeks.
Must be able to be onsite 5 days a week.

Our Engineers play a critical role in the success of our clients and are expected to effectively communicate our recommended solutions in a consultative role for each client. Therefore, a successful candidate will possess a high degree of self-management, personal accountability, strong communication skills, and teamwork. The ability to interact, engineer, and communicate collaboratively at the highest technical levels with customers, vendors, partners, and all members of staff is required.

Nice to Have Skills & Experience

Moogsoft for automation (such as disk latency and CPU utilization).
Dynatrace for reading logs.
Containers & Orchestration: Experience with Docker and Kubernetes.
Cloud FinOps and utilization experience.
Ansible playbooks (strong plus).

Benefits & conditions

$75/hr to $80/hr. Exact compensation may vary based on several factors, including skills, experience, and education. Employees in this role will enjoy a comprehensive benefits package starting on day one of employment, including options for medical, dental, and vision insurance. Eligibility to enroll in the 401(k) retirement plan begins after 90 days of employment. Additionally, employees in this role will have access to paid sick leave and other paid time off benefits as required under the applicable law of the worksite location.

Benefit packages for this role will start on the first day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.
