Systems Research Engineers
Job description
Distributed Systems Research & Development
- Architect, implement, and evaluate distributed system components for emerging AI and data-intensive workloads.
- Design modular and scalable infrastructure spanning heterogeneous clusters (CPU, GPU, accelerators).
- Develop efficient serving and scheduling systems optimized for large-scale AI workloads.
Performance Optimization & Profiling
- Conduct deep profiling and performance tuning of large-scale inference and data pipelines.
- Optimize key-value cache management and heterogeneous memory scheduling.
- Improve high-throughput inference serving using modern distributed ML frameworks.
- Apply systematic performance analysis methodologies to identify bottlenecks and scalability constraints.
Scalable Model Serving Infrastructure
- Develop frameworks enabling multi-tenant, low-latency, and fault-tolerant AI serving across distributed environments.
- Research techniques for:
  - Cache sharing
  - Data locality optimization
  - Resource orchestration
  - Cluster-level scheduling
- Prototype and evaluate new serving and inference architectures.
Research & Publications
- Translate novel system designs into publishable research contributions at leading systems and ML venues.
- Drive internal adoption of innovative methods and architectural improvements.
Cross-Team Collaboration
- Communicate technical insights and evaluation results clearly to multidisciplinary engineering and research teams.
- Collaborate across global research groups to align on long-term infrastructure strategy.
Requirements
We are seeking Systems Research Engineers with a strong interest in computer systems, distributed AI infrastructure, and performance optimization. These roles are well suited to recent PhD graduates or outstanding BSc/MSc engineers aiming to develop research-driven engineering expertise in operating systems, distributed systems, AI model serving, and machine learning infrastructure.
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field.
- Strong knowledge of:
  - Distributed systems
  - Operating systems
  - Machine learning systems
  - AI inference serving infrastructure
- Hands-on experience with LLM serving frameworks and distributed cache optimization.
- Proficiency in C/C++ for systems development.
- Experience using Python for research prototyping.
- Solid understanding of distributed algorithms and systems research methodology.
- Familiarity with profiling and performance analysis tools.
- Strong communication skills and a collaborative mindset.
Preferred Qualifications
- PhD in systems, distributed computing, or large-scale AI infrastructure.
- Publications in top-tier systems or ML conferences.
- Experience with:
  - Load balancing
  - State management
  - Fault tolerance
  - Resource scheduling in inference clusters
- Practical experience designing, deploying, or profiling high-performance cloud or AI infrastructure.