Site Reliability Engineer
Role details
Job location
Tech stack
Job description
- Maximize system uptime and availability, ensuring functional and performance SLAs are met.
- Establish end-to-end monitoring and alerting on all critical aspects.
- Solve complex problems for critical services and build automation to prevent problem recurrence.
- Influence and create new designs, architectures, standards, and methods for supporting the platform.
- Initiate and lead scripting and automation efforts to streamline system updates and upgrades.
- Set up critical infrastructure, tools, and frameworks to streamline the deployment cycle.
Requirements
- Demonstrated experience deploying, managing, and operating scalable, fault-tolerant Linux/Kubernetes/JVM-based infrastructure in AWS, GCP, or other public clouds.
- Expertise in Linux operating systems, networking, and database concepts.
- Experience with Cassandra (or a comparable NoSQL database).
- Expertise in cloud providers such as Amazon Web Services, Azure, and GCP.
- Experience with configuration management systems such as Ansible or Puppet.
- Experience using Ruby or Python to automate and monitor systems.
- Excellent problem-solving, critical thinking, and communication skills.
- Experience supporting commercial SaaS solutions in a DevOps or system administrator role.
- BS or MS in Computer Science or a related field, or equivalent professional experience.