Software Engineering IC5
Job description
Help build the infrastructure that powers training, evaluation, and data platforms for reliable deployment of world-class foundational AI models. We are on a mission to create state-of-the-art AI models and deploy them across Microsoft products at an unprecedented scale.
You'll collaborate across engineering and research to design, evolve, and operate core research infrastructure, so that product teams can train faster, evaluate more rigorously, and ship with confidence. You'll work closely with the teams that transform pre-trained models into the consumer Copilot experience.
Microsoft's mission is to empower every person and every organization to achieve more, and we build on values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive.
Microsoft Superintelligence Team
This role is part of Microsoft AI's Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence: ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society by advancing science, education, and global well-being.
We're also fortunate to partner with incredible product teams, giving our models the chance to reach billions of users and create immense positive impact. If you're a brilliant, highly ambitious, low-ego individual, you'll fit right in. Come and join us as we work on our next generation of models!
Responsibilities
- Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management.
- Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations; advocate for best practices in security, reproducibility, and cost efficiency.
- Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry).
- Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage.
- Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams.
- Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management.
- Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals.
- Develop internal portals and CLIs for job lifecycle management, experiment tracking, and metrics visualization to support operational efficiency.
- Manage GPU cluster operations (scheduling, isolation, utilization), high-performance computing (HPC), and experiment orchestration for machine learning training.
- Implement container security practices and maintain CI/CD pipelines to support robust, reproducible deployments.
Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area; the base pay range for this role in those locations is USD $158,400 - $258,000 per year.
Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area; the base pay range for this role in those locations is USD $188,000 - $304,200 per year.
Certain roles may be eligible for benefits and other compensation.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Requirements
- Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
- Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
- Apply strong software engineering fundamentals in distributed systems, networking, and storage while building large-scale distributed applications on cloud platforms.
- Build systems for AI research teams, with a solid understanding of training and evaluating large language models (LLMs).
- Leverage hands-on experience with Kubernetes, Docker, and the Linux container ecosystem to drive platform reliability and scalability.
- Orchestrate data and compute pipelines using tools like Airflow or Argo, manage streaming systems (Kafka/Event Hubs), and handle object storage (Azure Blob/S3-compatible).
About the company
Microsoft is a global technology company headquartered in Redmond, Washington. Our mission is to empower every person and every organization on the planet to achieve more. We develop, license, and support a wide range of software products, services, and devices that help individuals and businesses realize their full potential.
Our flagship products include the Microsoft 365 productivity cloud, Windows operating system, Azure cloud platform, and Dynamics 365 business applications. We are also a leader in areas such as artificial intelligence, cybersecurity, developer tools, and gaming through Xbox and Game Pass.
With operations in more than 190 countries and over 220,000 employees worldwide, Microsoft is committed to responsible innovation, inclusive economic growth, and sustainability. We work closely with governments, industries, and communities to ensure that technology serves the public good and helps address some of the world’s most pressing challenges.
As we celebrate our 50th anniversary in 2025, we continue to look forward, investing in AI, cloud, and quantum computing to shape the future of work, education, and society at scale.