Java, Kafka and Grafana Product Software Arch and Eng Senior Staff
Job description
The Senior Staff Engineer for NPE Observability is the preeminent technical strategist for Equinix's global telemetry fabric. In this senior contract role, you will bridge the gap between high-scale distributed software and global network hardware, driving the architectural standards for our most complex data-intensive initiatives. You will own the technical integrity of our streaming pipelines, ensuring telemetry from the global fleet is ingested, normalized, and processed with sub-second latency. As a master of our tech stack (Java, Kafka, Postgres, Grafana), you will define the "Gold Standard" for technical excellence within the Network Platform Engineering (NPE) group.
Responsibilities
Architectural Strategy & Technical Vision
- Core Stack Evolution: Architect and optimize our primary ingestion and storage engines utilizing Java and PostgreSQL, ensuring high availability and performance at scale.
- Real-Time Data Orchestration: Lead the design of high-throughput messaging systems using Apache Kafka to handle trillions of telemetry points with sub-second latency.
- Unified Visibility: Define the global standard for observability visualization in Grafana, building complex, high-performance dashboards that aggregate data from diverse telemetry sources.
High-Scale Engineering & Innovation
- Stream Processing Mastery: Architect massively parallel processing pipelines and stateful stream processing frameworks (utilizing tools like Apache Flink) to enable real-time anomaly detection.
- Advanced R&D: Evaluate and prototype emerging technologies such as Model-Driven Telemetry (MDT) and ClickHouse/Thanos for long-term metric storage and high-cardinality data analysis.
- Technical Roadmap Ownership: Drive the engineering team toward key milestones, ensuring the code we ship aligns with the 3-5 year long-term NPE vision.
Reliability & Systemic Leadership
- Service Standards: Define and monitor critical SLI/SLO metrics (e.g., P95 response times) to ensure the platform maintains world-class performance and global ITIL compliance.
- Incident Authority: Serve as the senior point of contact for complex root-cause analysis, identifying architectural weaknesses in the Java/Kafka/Postgres stack to prevent future outages.
- Stakeholder Synthesis: Translate complex product requirements into deep technical specifications, managing relationships with both internal software teams and external network vendors.
Requirements
- Tenure: 10+ years of professional experience in software engineering and distributed systems.
- Domain Expertise: 5+ years of experience specifically in large-scale network engineering, telemetry, or observability platforms.
- Java Expert: Mastery of Java for building high-performance, scalable backend services.