Senior Researcher - Efficient AI

Microsoft
Redmond, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$ 235K

Job location

Redmond, United States of America

Tech stack

Microsoft Windows
Artificial Intelligence
Microsoft Online Services
C++
Cloud Computing
Software Quality
Nvidia CUDA
Software Debugging
Python
Routing
Open Source Technology
TensorFlow
Graphics Processing Unit (GPU)
PyTorch
Large Language Models
Caching
Parallel Computation
Generative AI
Gpu Programming
ONNX (Open Neural Network Exchange) Format
TensorRT
Unified Endpoint Management

Job description

Generative AI is transforming how people create, collaborate, and communicate-redefining productivity across Microsoft 365 for customers worldwide. At Microsoft, we operate one of the largest collaboration and productivity platforms in the world, serving hundreds of millions of consumer and enterprise users. Delivering these AI experiences at scale requires solving some of the hardest efficiency challenges in modern AI systems.

We are an applied research team focused on advancing efficiency across the AI stack, spanning models, ML frameworks, cloud infrastructure, and hardware. We drive mid- and long-term product innovation through close collaboration with research and product teams across the company. We communicate our research both internally and externally through internal technical reports, academic conference publications, open-source releases, and patents. Beyond producing research, we take responsibility for driving ideas through prototyping, validation, and production, with a bias toward real-world impact.

The candidate will work across the full stack-from large-scale serving systems to hardware- and kernel-level optimizations-exploring algorithmic, systems, and hardware/software co-design techniques. Areas of focus include batching, routing, scheduling, caching, endpoint configuration, and GPU architecture-aware optimizations. This role emphasizes end-to-end ownership, with responsibility for identifying high-impact problems and driving research ideas through prototyping, validation, and deployment to deliver measurable customer impact.

For more see: https://aka.ms/efficient-ai

Responsibilities

  • Formulate, develop, and evaluate new algorithmic and system-level approaches for end-to-end AI serving, using analytical modeling and large-scale measurement to study token-level latency, tail latency (p95/p99), throughput-per-dollar, cold-start behavior, warm pool strategies, and capacity planning under multi-tenant SLOs and variable sequence lengths.
  • Design and experimentally evaluate endpoint configuration and execution policies, including batching, routing, and scheduling strategies, tensor and pipeline parallelism, quantization and precision profiles, speculative decoding, and chunked or streaming generation, and drive the most promising approaches through robust rollout and validation into production.
  • Perform hardware- and kernel-aware optimization by collaborating closely with model, kernel, compiler, and hardware teams to align serving algorithms with attention/KV innovations and accelerator capabilities.
  • Build and benchmark experimental prototypes and large-scale measurements to validate research ideas and drive them toward production readiness; produce clear technical documentation, design reviews, and operational playbooks.
  • Publish research results, file patents, and, where appropriate, contribute to open-source systems and serving frameworks

Requirements

  • Doctorate in relevant field OR Master's Degree in relevant field AND 3+ years related research experience OR Bachelor's Degree in relevant field AND 4+ years related research experience OR equivalent experience.

Other Requirements: Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include but are not limited to the following specialized security screenings:

  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter., * Demonstrated experience in designing and optimizing efficient inference systems, combining foundations in algorithmic optimization, parallel computing, and request orchestration under strict SLO constraints with deep knowledge of attention and KVcache optimizations, batching and scheduling strategies, and costaware deployment.
  • 3+ years of experience with machine learning frameworks (e.g., PyTorch, TensorFlow) and inference serving frameworks (e.g., vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime, Ray Serve, DeepSpeed-MII).
  • 3+ years of experience in GPU programming and optimization, with expert knowledge of CUDA, ROCm, Triton, PTX, CUTLASS, or similar GPU programming frameworks.
  • Proficiency in C++ and Python for high-performance systems, with code quality and profiling/debugging skills
  • Research impact through publications and/or patents, coupled with handson experience taking research ideas through execution and delivery in production.

About the company

Microsoft is a global technology company headquartered in Redmond, Washington. Our mission is to empower every person and every organization on the planet to achieve more. We develop, license, and support a wide range of software products, services, and devices that help individuals and businesses realize their full potential.

Our flagship products include the Microsoft 365 productivity cloud, Windows operating system, Azure cloud platform, and Dynamics 365 business applications. We are also a leader in areas such as artificial intelligence, cybersecurity, developer tools, and gaming through Xbox and Game Pass.

With operations in more than 190 countries and over 220,000 employees worldwide, Microsoft is committed to responsible innovation, inclusive economic growth, and sustainability. We work closely with governments, industries, and communities to ensure that technology serves the public good and helps address some of the world’s most pressing challenges.

As we celebrate our 50th anniversary in 2025, we continue to look forward—investing in AI, cloud, and quantum computing to shape the future of work, education, and society at large scale.

Apply for this position