Software Engineering IC5

Microsoft
Redmond, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$275K

Job location

Redmond, United States of America

Tech stack

C
Java
JavaScript
Azure
C Sharp (Programming Language)
C++
Cloud Computing
Nvidia CUDA
Software Debugging
InfiniBand
Python
Remote Direct Memory Access
Prometheus
Software Engineering
Virtual Machines
Virtualization Technology
Grafana
Build Management
Containerization
Kubernetes
Information Technology
Bare Metal
Machine Learning Operations

Job description

The CoreAI Infrastructure team builds the foundational accelerated compute platforms that power large-scale AI training and inference across Azure. Our mission is to deliver secure, reliable, and highly efficient GPU and CPU infrastructure that enables multitenant AI systems at global scale while maximizing utilization, performance, and developer productivity.

This role sits at the intersection of cloud infrastructure, systems software, virtualization, and container platforms, working closely with CoreAI, Azure Infrastructure, OS, Networking, and Hardware teams to deliver end-to-end platform capabilities.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities

As a Principal Engineer on the team, your responsibilities include:

  • Design and build GPU- and CPU-accelerated infrastructure for training and inference workloads, spanning bare metal, virtual machines, and containerized environments, with a focus on key observability metrics at scale.

  • Develop end-to-end observability and operational excellence systems for GPU/CPU device management, scheduling, isolation, and sharing (e.g., partial GPU allocation, multitenant usage).

  • Build and operate advanced orchestration, resource governance, and resource management scenarios using platforms such as AKS, Dynamic Resource Allocation (DRA), and related Kubernetes ecosystem capabilities to enable fair sharing, isolation, and efficient utilization of accelerated resources.

  • Build and evolve virtualization and container stacks to support modern AI workloads, including secure and confidential compute scenarios.

  • Optimize performance, reliability, and utilization across large GPU/CPU fleets, including scale-up and scale-out configurations.

  • Partner with networking and storage teams to enable high-performance interconnects (e.g., RDMA/InfiniBand-class networking) for distributed workloads.

  • Drive end-to-end platform features from design through production, including observability, diagnostics, and operational excellence.

  • Influence platform architecture and technical direction across teams through design reviews and technical leadership.

Requirements

  • Bachelor's Degree in Computer Science or a related technical field and 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; or equivalent experience.

Other Requirements:

  • Proven ability to design and operate large-scale production infrastructure with high reliability and performance requirements using Azure Kubernetes Service (AKS).

  • Strong problem-solving skills and the ability to debug complex, cross-layer system issues.

  • Demonstrated technical leadership, including mentoring engineers and driving cross-team architectural alignment.

  • Hands-on experience with virtualization and/or container platforms (e.g., VMs, Kubernetes, container runtimes).

  • Strong collaboration and communication skills, with the ability to work across organizational boundaries.

  • Expertise with distributed observability technologies (e.g., Prometheus, OpenTelemetry, Grafana) and experience designing or scaling telemetry pipelines for high-throughput production systems.

  • Advanced, hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools.

Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $142,800 - $274,800 per year. A different range applies to specific work locations within the San Francisco Bay Area and the New York City metropolitan area, where the base pay range for this role is USD $188,000 - $304,200 per year.

About the company

Microsoft is a global technology company headquartered in Redmond, Washington. Our mission is to empower every person and every organization on the planet to achieve more. We develop, license, and support a wide range of software products, services, and devices that help individuals and businesses realize their full potential.

Our flagship products include the Microsoft 365 productivity cloud, Windows operating system, Azure cloud platform, and Dynamics 365 business applications. We are also a leader in areas such as artificial intelligence, cybersecurity, developer tools, and gaming through Xbox and Game Pass.

With operations in more than 190 countries and over 220,000 employees worldwide, Microsoft is committed to responsible innovation, inclusive economic growth, and sustainability. We work closely with governments, industries, and communities to ensure that technology serves the public good and helps address some of the world’s most pressing challenges.

As we celebrate our 50th anniversary in 2025, we continue to look forward, investing in AI, cloud, and quantum computing to shape the future of work, education, and society at scale.
