Principal Cloud Engineer - Observability in Hudson
Job description
You will work in a collaborative, transparent, and innovation-driven environment where engineering excellence, continuous learning, and open-source contribution are core to how we operate. This is a high-impact role where your expertise will influence platform architecture, engineering practices, and the developer experience across the organization.
What You'll Do
- Lead the design and implementation of cloud-native observability platforms across AWS and Azure environments
- Develop and operate highly scalable systems on AWS (EKS, core services) with a strong focus on reliability, performance, and automation
- Own the end-to-end lifecycle of observability tooling, including hosting, maintenance, scaling, and optimization
- Drive adoption of OpenTelemetry (OTel) standards for metrics, traces, logs, and profiling
- Build and enhance platform capabilities using Python and/or Go
- Architect and optimize CI/CD pipelines enabling rapid, secure, and reliable deployments
- Collaborate with cross-functional teams to improve system visibility, debugging capabilities, and performance insights
- Define and promote best practices for cloud engineering, observability, and platform reliability
- Mentor engineers and provide technical leadership across squads
Requirements
We are seeking a highly motivated Principal Cloud Engineer to join our Observability Platform team within Fidelity Architecture and Engineering. In this role, you will help design, build, and operate scalable, cloud-native observability solutions that support our most critical digital services.
- 10+ years of software engineering or cloud engineering experience
- Deep expertise in the AWS cloud stack, especially:
  - EKS (Kubernetes on AWS)
  - Core services (IAM, EC2, networking, storage, etc.)
- Strong experience working with Kubernetes and cloud-native ecosystems
- Hands-on experience with observability tools/platforms (e.g., Prometheus, Grafana, Datadog, OpenTelemetry), including hosting and operational ownership
- Proficiency in Python and/or Go for platform and tooling development
- Strong understanding of CI/CD practices and tools
- Experience working with Infrastructure as Code (Terraform, CloudFormation, etc.)
- Familiarity with the Azure cloud stack and hybrid/multi-cloud environments
- Working knowledge of OpenTelemetry (OTel) concepts and implementation
Bonus Skills
- Experience building or operating large-scale internal platforms
- Exposure to eBPF-based observability or advanced profiling solutions
- Experience integrating observability across multi-region / multi-cloud environments
- Experience or interest in applying AI/ML techniques to observability (e.g., anomaly detection, predictive insights, intelligent alerting, or AIOps)
- Active participation in or contributions to open-source projects
- Strong background in performance optimization and distributed systems