Site Reliability Engineer (SRE) / Platform Engineer
Perfict Global, Inc.
Reston, United States of America
8 days ago
Role details
Contract type: Temporary to permanent
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $133K
Job location: Remote; Reston, United States of America
Tech stack
Azure
Bash
Continuous Integration
Data as a Service
GitHub
Network Topologies
Identity and Access Management
Python
Key Management
Nginx
Octopus Deploy
OpenShift
Performance Tuning
Role-Based Access Control
Reliability Engineering
Ansible
Prometheus
Datadog
Scripting (Bash/Python/Go/Ruby)
Istio
Delivery Pipeline
Grafana
Git Flow
Kubernetes
HashiCorp
Kafka
Terraform
Jenkins
Go
Job description
- Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
- Stand up or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks.
- Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
- Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
- Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
- Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
- Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
- Advance the hybrid service model: migrations, integrations, reliability/latency tuning, cost and performance optimization.
Day-to-Day Responsibilities
- Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
- Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
- Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
- Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
- Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
- Provide platform tooling and enablement for application developers, data engineers, and operations teams.
- Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
- Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
- Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.
Tech You'll Work With
- Kubernetes / OpenShift
- Azure (compute, networking, storage, and data services)
- Automation & IaC: Terraform, Ansible, GitOps
- Observability: Datadog, Prometheus, Grafana
- Networking & Ingress: Nginx, service meshes, container networking
- Messaging: Kafka, AMQ
- Secrets & Access: HashiCorp Vault
- CI/CD: ArgoCD, Jenkins, GitHub Actions
- Scripting/Coding: Bash, Python, Go
Requirements
- 5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters.
- Strong experience with Microsoft Azure (compute, networking, storage, and data services).
- Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
- Proficiency with observability tooling (Datadog, Prometheus, Grafana).
- Scripting/coding ability in Bash, Python, or Go.
Preferred / Stand-Out Skills
- Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
- Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
- Background leading incident response and postmortems with strong RCA and continuous improvement practices.
Work Model & Team
- Hybrid: 2 days onsite in Reston, VA; 3 days remote.
- You'll be part of the IT organization, collaborating daily with developers, data engineers, infrastructure operations, and security.
How to Succeed in This Role
- You're a hands-on engineer who thrives in regulated, high-impact environments.
- You favor automation over repetition, and observability over guesswork.
- You collaborate openly, communicate clearly, and leave systems better than you found them.