Remote
Job description
We are seeking a Staff Site Reliability Engineer (SRE) to lead our global platform reliability and drive our next-generation observability strategy on Google Cloud Platform (GCP). In this role, you will leverage Grafana Labs' complete telemetry stack and AIOps methodologies to build intelligent, self-healing infrastructure. You will bring deep expertise in scaling enterprise-grade Google Kubernetes Engine (GKE) topologies, managing high-throughput Kafka event streams, and maintaining high-performance PostgreSQL, AlloyDB, and BigQuery ecosystems at massive scale. Crucially, you will provide deep technical leadership across the entire networking stack, diagnosing complex issues from physical-layer transport up to application-layer protocols.
-
Full-Stack Network Architecture: Architect, optimize, and troubleshoot complex networking infrastructure spanning Layer 1 through Layer 7, ensuring low-latency data transport, secure edge routing, and seamless service mesh integration.
-
Grafana Stack Architecture: Design, scale, and optimize our unified observability platform using the Grafana Labs suite (Grafana, Mimir, Loki, Tempo, and Beyla).
-
AIOps & Intelligent Alerting: Deploy machine learning models and automated anomaly detection to cut through telemetry noise, reduce alert fatigue, and predict network or data pipeline bottlenecks.
-
GKE Platform Engineering: Drive the architecture, scaling, security, and networking of production Google Kubernetes Engine (GKE) clusters.
-
Data & Event Streaming Reliability: Tune and maintain high-throughput Apache Kafka clusters to guarantee low-latency event delivery and high availability.
-
Large-Scale Database Management: Ensure the performance, scalability, and disaster recovery readiness of our transactional and analytical data tiers across PostgreSQL, AlloyDB, and BigQuery.
-
Automated Incident Response: Integrate AIOps insights with Grafana workflows to automate triage, accelerate root-cause analysis, and trigger auto-remediation scripts.
-
Technical Leadership: Champion the long-term technical roadmap for distributed infrastructure engineering and GCP cloud-native observability standards.
-
Mentorship: Coach senior and junior engineers on advanced debugging techniques, distributed systems thinking, and intelligent operations across a distributed workforce.
Requirements
-
Location/Work Style: Proven track record of high autonomy and successful delivery in a 100% remote engineering environment.
-
Experience: 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles.
-
Networking Expertise (L1-L7): Deep technical knowledge and debugging mastery across all OSI layers, including:
-
L1-L3: Physical/fiber infrastructure awareness, switching, and advanced routing protocols (BGP, OSPF).
-
L4: Transport layer tuning (TCP congestion control algorithms, UDP, QUIC).
-
L5-L7: Session management, TLS termination, DNS architecture, and advanced application protocols (HTTP/3, gRPC).
-
Orchestration & Containerization: Expert-level mastery of Google Kubernetes Engine (GKE) internals, custom controllers, multi-cluster networking, and GitOps workflows.
-
Data Infrastructure: Proven track record managing high-throughput Apache Kafka pipelines and large-scale data environments across PostgreSQL, AlloyDB, and BigQuery.
-
Grafana Ecosystem: Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale.
-
AIOps Implementation: Track record of applying AI/ML techniques to time-series anomaly detection, log clustering, and event correlation (e.g., Grafana Adaptive Metrics, BigPanda).
-
Infrastructure as Code: Advanced, production-scale expertise using HashiCorp Terraform exclusively to provision and manage multi-region GCP architectures.
-
Programming: High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts.
Preferred Attributes
-
Remote Communicator: Exceptional written and verbal communication skills, with an emphasis on creating clear documentation for asynchronous alignment.
-
GCP Expert: Deep knowledge of Google Cloud architectural best practices, Cloud SDN, Cloud Armor, Interconnect, Identity and Access Management (IAM), and cost optimization.
-
Systems Thinker: Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump).
Benefits & conditions
This position is fully remote. You can work from anywhere in the United States or Canada with a reliable internet connection, collaborating with a distributed engineering organization across multiple time zones.