Site Reliability Engineer, iCloud
Job description
Apple Services' scale is BIG. Operating at our scale, across multiple geographies and serving hundreds of millions of users, presents unique challenges. As a Software Developer in SRE at Apple, you'll need to solve these problems using data, teamwork, and your own expertise. ASE Products Site Reliability teams are responsible for the reliability and performance of the server software stack that powers products like iCloud Photos, Mail, Drive, Backup, and many more. We do that by focusing on reliability best practices from service inception to production, collaborating deeply with product development teams to deliver a superlative product and shared vision while leveraging data and automation as first principles. We run a mix of open-source, vendor-licensed, and internally developed tools to manage the end-to-end SDLC of our products. You'll learn these tools and have opportunities to improve them.
- Engage with our product teams to understand requirements, and design and implement resilient and scalable infrastructure solutions.
- Operate, monitor, and triage all aspects of our production and non-production environments.
- Collaborate on code, infrastructure, design reviews, and process enhancements.
- Evaluate and integrate new technologies to improve system reliability, security, and performance.
- Develop and implement automation to provision, configure, deploy, and monitor Apple services.
- Participate in an on-call rotation, providing hands-on technical expertise during service-impacting events.
- Contribute to capacity planning, scale testing, and disaster recovery exercises.
- Approach operational problems with a software engineering mindset.
Requirements
- Strong sense of ownership, customer service, and integrity proven through clear communication.
- BS in Computer Science or a related field, or equivalent work experience.
- 5+ years of experience managing and scaling distributed systems in a public, private, or hybrid cloud environment.
- Strong experience with deploying, supporting, and supervising new and existing services, platforms, and application stacks.
- Experience with scale testing, disaster recovery, and capacity planning.
- Experience with observability platforms such as Splunk, Grafana, and Prometheus.
- Demonstrable fluency in at least one of the following languages: Java, Python, or Go.
- Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker.
- Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load balancing strategies.
- Understanding of the Linux operating system, including the kernel, memory, processes, threads, static and shared libraries, IPC, and signals.
- Experience in developing iOS apps using Xcode and Swift.
- Experience with OpenTelemetry standards and distributed tracing tools such as Jaeger.