Manager Software Engineer - US
Job description
The Software Engineer is a senior, hands-on technical leader responsible for improving the reliability, availability, scalability, and performance of complex enterprise systems across both legacy and modern cloud platforms. This role emphasizes execution, deep technical expertise, and independent operation in high-stakes environments.
This individual is expected to lead reliability and performance initiatives, make informed technical decisions under pressure, and partner with engineering and operations teams to define and meet service level objectives (SLOs). The role drives improvements in observability (metrics, logs, traces), incident response and post-incident review, capacity planning, and automation to reduce operational toil. Additionally, the role requires the ability to analyze enterprise deployments and recommend architectural hardening, targeted application and infrastructure changes, and performance optimizations to improve end-to-end customer experience. The Performance and SRE leader also interfaces with software development, delivery, and customer-facing teams to coordinate rapid triage, root-cause analysis, and durable remediation of issues impacting system stability and performance.
What You'll Do
- Provides leadership in improving reliability and performance for targeted services, platforms, or programs
- Operates across hybrid environments (including mainframe workloads and AWS-hosted services) with diverse integrations and dependencies
- Drives incident reduction through measurable reliability goals (SLIs/SLOs), runbooks, and automation within defined engagement scopes
- Influences technical standards, operational practices, and architecture within local domains, partnering with teams to implement durable reliability improvements
Requirements
- 7+ years' hands-on experience as a Software Engineer in a Performance and Reliability Engineer role
- Bachelor's degree in an IT-related field, or a minimum of 7 years of experience
- Must be legally authorized to work in the United States without requiring sponsorship now or in the future.
- Mainframe and COBOL application experience
- Modern enterprise systems (e.g., Java-based, cloud-native, and other contemporary application platforms) and their operational characteristics
- AWS environments, including QA, UAT, and Production, with strong understanding of networking, compute, and storage fundamentals
- Performance engineering: profiling, load/stress testing, latency analysis, capacity planning, and tuning across application and infrastructure layers
- Ability to guide performance profiling with application teams (e.g., CPU/memory profiling, thread/heap analysis, database/query tuning), identify problem code paths, and recommend targeted fixes, translating findings into prioritized remediation work
- Regarded as a technical authority among peer engineers; strong communicator during incidents and while driving cross-team remediation
- Demonstrates calm, methodical execution in high-pressure situations; leads incident response, escalation, and post-incident reviews (blameless postmortems)
- Improves operational readiness through runbooks and on-call execution, including responding whenever there is a fire
- Expert in troubleshooting with observability data (metrics, logs, and traces) to isolate failure modes, quantify impact, and validate fixes
- Ability to use observability and monitoring tools (such as Dynatrace) to instrument services, analyze end-to-end transactions, and pinpoint bottlenecks or failures in urgent and non-urgent scenarios
- Skilled in diagnosing integration and dependency issues across enterprise platforms, identifying root causes in interconnected systems, and recommending durable fixes to restore and harden functionality
- Experienced with hosting and infrastructure components (virtual machines, container orchestration, CI/CD pipelines, and network appliances), enabling comprehensive troubleshooting, resilience improvements, and automation
- Frequently sought out for critical reliability or performance issues requiring immediate attention and clear leadership
- Improves outcomes by reducing mean time to restore (MTTR) and recurring incidents through automation, runbooks, and durable fixes
- Renowned for delivering measurable reliability and performance improvements (e.g., SLO attainment, latency reduction), not just frameworks or process artifacts