Site Reliability Engineer (SRE) / Platform Engineer

Perfict Global, Inc.

Reston, United States of America

1 month ago

Role details

Contract type

Temporary to permanent

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$133K

Job location

Remote

Reston, United States of America

Tech stack

Azure

Bash

Continuous Integration

Data as a Service

GitHub

Network Topologies

Identity and Access Management

Python

Key Management

Nginx

Octopus Deploy

OpenShift

Performance Tuning

Role-Based Access Control

Reliability Engineering

Ansible

Prometheus

Datadog

Scripting (Bash/Python/Go/Ruby)

Istio

Delivery Pipeline

Grafana

Git Flow

Kubernetes

HashiCorp

Kafka

Terraform

Jenkins

Job description

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks.
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
Advance the hybrid service model: migrations, integrations, reliability/latency tuning, cost and performance optimization.

Day-to-Day Responsibilities

Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
Provide platform tooling and enablement for application developers, data engineers, and operations teams.
Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.

Tech You'll Work With

Kubernetes / OpenShift
Azure (compute, networking, storage, and data services)
Automation & IaC: Terraform, Ansible, GitOps
Observability: Datadog, Prometheus, Grafana
Networking & Ingress: Nginx, service meshes, container networking
Messaging: Kafka, AMQ
Secrets & Access: HashiCorp Vault
CI/CD: ArgoCD, Jenkins, GitHub Actions
Scripting/Coding: Bash, Python, Go

Requirements

5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters.
Strong experience with Microsoft Azure (compute, networking, storage, and data services).
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
Proficiency with observability tooling (Datadog, Prometheus, Grafana).
Scripting/coding ability in Bash, Python, or Go.

Preferred / Stand-Out Skills

Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
Background leading incident response and postmortems with strong RCA and continuous improvement practices.

Work Model & Team

Hybrid: 2 days onsite in Reston, VA; 3 days remote.
You'll be part of the IT organization, collaborating daily with developers, data engineers, infrastructure operations, and security.

How to Succeed in This Role

You're a hands-on engineer who thrives in regulated, high-impact environments.
You favor automation over repetition, and observability over guesswork.
You collaborate openly, communicate clearly, and leave systems better than you found them.
