Site Reliability Engineer

Zoho Corporation

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Remote

Tech stack

Amazon Web Services (AWS)

Application Performance Management

Automation of Tests

Azure

Bash

Cloud Engineering

Configuration Management

System Configuration

Continuous Delivery

Continuous Integration

Information Engineering

Linux

File Systems

Distributed Systems

DNS

Memory Management

Elasticsearch

Perl

Monitoring of Systems

Hypertext Transfer Protocols (HTTP)

Identity and Access Management

Python

Kernel-Based Virtual Machine

Load Testing

NoSQL

Reliability Engineering

Ansible

Prometheus

Ruby

Zero Trust Network Access

Server Administration

SQL Databases

TCP/IP

Virtualization Technology

Workflow Management Systems

Datadog

CircleCI

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

Load Balancing

System Availability

SaltStack

Grafana

CloudFormation

Containerization

GitLab CI

Kubernetes

Infrastructure Automation Frameworks

Deployment Automation

Cassandra

Kafka

Terraform

Splunk

Dynatrace

Docker

ELK

Jenkins

VMware

Job description

Design and implement a cloud platform to support OXIO backend services
Automate technical operations: deployments, scaling, recovery, etc.
Monitor and maintain mission-critical production infrastructure to ensure maximum uptime
Participate in an on-call rotation and foster a culture of continuous improvement through blameless postmortems
Enable the Engineering, Telecom, and Data Engineering teams by providing the tools they need to operate the services they build

Requirements

Do you have experience in server management automation?

Understanding of Linux/Unix systems (most systems are Linux-based).

Familiarity with Linux/Unix system internals like process management, filesystems, memory management, and networking.
Proficiency in at least one programming language (Python, Go, or Ruby) and strong skills in scripting (Bash, Perl).
Experience with infrastructure provisioning tools such as Terraform, CloudFormation, or Ansible.
Familiarity with containerization (Docker) and orchestration tools (Kubernetes).
Familiarity with monitoring tools like Prometheus, Grafana, or Datadog.
Knowledge of setting up alerts, analyzing logs, and creating dashboards for observability.
Familiarity with incident management practices (e.g., runbooks, postmortems).
Experience participating in an on-call rotation and handling incidents.
Experience in setting up and maintaining Continuous Integration/Continuous Delivery pipelines (Jenkins, GitLab CI, CircleCI, etc.).
Hands-on experience with cloud providers (AWS, Google Cloud, Azure).
Knowledge of virtualization technologies (VMware, KVM) and cloud-native architecture.
Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls.

Nice to have

Strong understanding of deployment strategies (canary releases, blue-green deployments, etc.).
Familiarity with high availability and failover mechanisms.
Familiarity with IAM (Identity and Access Management) and zero trust principles.
Experience working with distributed systems (e.g., Kafka, Cassandra, Elasticsearch).
Experience building custom monitoring tools or writing complex automation scripts.
Functional knowledge of database management (SQL and NoSQL).
Familiarity with distributed tracing (Jaeger, OpenTelemetry) and advanced log aggregation strategies (ELK stack, Splunk).
Familiarity with performance profiling tools and with optimizing application performance under heavy load.
Familiarity with load testing and identifying bottlenecks.
Familiarity with configuration management using SaltStack to maintain server configurations.
