Site Reliability Engineer
Job description
Are you ready to take your skills to the next level as a self-motivated and enthusiastic Site Reliability Engineer with hands-on experience supporting multiple connected Cloud-based products? Joining Trimble's Project Delivery Cloud Platform, you will take responsibility for the infrastructure of our cutting-edge reality capture solution running on Microsoft Azure, driving the reliability, scalability, and security of our service and infrastructure.
Trimble is a global technology company that connects the physical and digital worlds, transforming the ways work gets done. With relentless innovation in precise positioning, modeling and data analytics, Trimble enables essential industries including construction, geospatial and transportation. Whether it's helping customers build and maintain infrastructure, design and construct buildings, optimize global supply chains or map the world, Trimble is at the forefront, driving productivity and progress.
The Trimble AECO segment provides digital construction solutions that increase precision and productivity for Architecture, Engineering, Construction, and Owners.
What Makes This Role Great:
In this role, you will be the backbone of our Project Delivery Cloud Platform, directly influencing the reliability of cutting-edge reality capture solutions.
Key Exciting Responsibilities
- Develop and maintain infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments.
- Implement and enhance observability solutions for monitoring, logging, and alerting, using tools such as New Relic, Datadog, Sumo Logic, and Splunk.
- Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes.
- Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention.
- Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed.
- Lead incident response efforts and conduct deep-dive root cause analysis to implement long-term, innovative technical solutions.
- Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks.
- Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles.
- Participate in on-call rotations and handle critical incidents with confidence and expertise.
- Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team.
Requirements
- Bachelor's or Master's degree in Computer Engineering or a related field.
- At least 5 years of technical experience with a proven ability to take full ownership of production infrastructure.
- Excellent collaboration skills, with experience leading cross-functional work.
- Demonstrated success in managing infrastructure in production environments.
- Expertise in capacity planning and cost optimization for efficient operations.
- Extensive expertise managing cloud-hosted infrastructure, specifically on Microsoft Azure or AWS.
- Proficiency in high-level scripting languages such as Python and in Infrastructure as Code tools (Terraform), along with containerization.
- Demonstrated success with Kubernetes or other containerization technologies.
- Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, and GitHub.
- Experience with incident management processes and with monitoring tools such as Prometheus, Grafana, New Relic, Datadog, Splunk, CloudWatch, and Sumo Logic.
- Extensive understanding of networking and security concepts.
Bonus Points For:
- Specialized SRE observability experience with New Relic or Datadog.
- Familiarity with OpenTelemetry, AIOps, MLOps, or SecOps.