Site Reliability Engineer

Zoho Corporation
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Remote

Tech stack

Amazon Web Services (AWS)
Application Performance Management
Automation of Tests
Azure
Bash
Cloud Engineering
Configuration Management
System Configuration
Continuous Delivery
Continuous Integration
Information Engineering
Linux
File Systems
Distributed Systems
DNS
Memory Management
Elasticsearch
Perl
Monitoring of Systems
Hypertext Transfer Protocols (HTTP)
Identity and Access Management
Python
Kernel-Based Virtual Machine
Load Testing
NoSQL
Reliability Engineering
Ansible
Prometheus
Ruby
Zero Trust Network Access
Server Administration
SQL Databases
TCP/IP
Virtualization Technology
Workflow Management Systems
Datadog
CircleCI
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Load Balancing
System Availability
SaltStack
Grafana
CloudFormation
Containerization
GitLab CI
Kubernetes
Infrastructure Automation Frameworks
Deployment Automation
Cassandra
Kafka
Terraform
Splunk
Dynatrace
Docker
ELK
Jenkins
VMware

Job description

  • Design and implement a cloud platform to support OXIO backend services
  • Automate technical operations: deployments, scaling, recovery, etc.
  • Monitor and maintain mission-critical production infrastructure to ensure maximum uptime
  • Participate in an on-call rotation and a culture of continuous improvement through blameless postmortems
  • Enable the Engineering/Telecom/Data Engineering teams by providing them with the tools to operate the services they build

Requirements

Do you have experience in server management automation?

  • Understanding of Linux/Unix systems (most systems are Linux-based).
  • Familiarity with Linux/Unix system internals like process management, filesystems, memory management, and networking.
  • Proficiency in at least one programming language (Python, Go, or Ruby) and strong skills in scripting (Bash, Perl).
  • Experience with infrastructure provisioning tools such as Terraform, CloudFormation, or Ansible.
  • Familiarity with containerization (Docker) and orchestration tools (Kubernetes).
  • Familiarity with monitoring tools like Prometheus, Grafana, or Datadog.
  • Knowledge of setting up alerts, analyzing logs, and creating dashboards for observability.
  • Familiarity with incident management practices (e.g., runbooks, postmortems).
  • Experience participating in an on-call rotation and handling incidents.
  • Experience in setting up and maintaining Continuous Integration/Continuous Delivery pipelines (Jenkins, GitLab CI, CircleCI, etc.).
  • Hands-on experience with cloud providers (AWS, Google Cloud, Azure).
  • Knowledge of virtualization technologies (VMware, KVM) and cloud-native architecture.
  • Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls.

Nice to have

  • Strong understanding of deployment strategies (canary releases, blue-green deployments, etc.).

  • Familiarity with high availability and an understanding of failover mechanisms.

  • Familiarity with IAM (Identity and Access Management) and zero trust principles.

  • Experience working with distributed systems (e.g., Kafka, Cassandra, Elasticsearch).

  • Building custom monitoring tools or writing complex automation scripts.

  • Functional knowledge of database management (SQL and NoSQL).

  • Familiarity with distributed tracing (Jaeger, OpenTelemetry) and advanced log aggregation strategies (ELK stack, Splunk).

  • Familiarity with performance profiling tools and optimizing application performance under heavy load.

  • Familiarity with load testing and identifying bottlenecks.

  • Familiarity with configuration management using SaltStack for maintaining server configurations.
