Cloud Site Reliability Engineer - DCS Cloud New

BYTEDANCE INC.

San Jose, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Compensation

$156K

Job location

San Jose, United States of America

Tech stack

Amazon Web Services (AWS)

Azure

C++

Cloud Computing

Cloud Engineering

Computer Clusters

Computer Security

Nvidia CUDA

Computer Networks

Databases

Linux

DevOps

Monitoring of Systems

Python

Kernel-Based Virtual Machine

Quick EMUlator (QEMU)

Reliability Engineering

Cloud Services

Software Engineering

Graphics Processing Unit (GPU)

Google Cloud Platform

Delivery Pipeline

Kubernetes

Information Technology

Free and Open-Source Software

Oracle Cloud Infrastructure

Docker

Job description

Our Infrastructure Engineering team supports the company's fast growth by building and operating hyper-scale datacenters, managing the life cycle of the server fleet, providing cloud solutions, and developing infrastructure services that are scalable and reliable.

We have three subgroups for this role:
- Cloud Host Delivery, Delivery & Standardization
- Cloud Host Operation, Operation Efficiency & Reliability
- Cloud Management & Security

Responsibilities - What You'll Do
- Design, build, scale, and operate ByteDance's global infrastructure, including large-scale systems spanning public and private clouds.
- Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
- Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
- Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
- Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.

Requirements

Minimum Qualifications
- Bachelor's degree or above in Computer Science, Software Engineering, Information Security, or a related field.
- 2+ years of experience in Linux operations, SRE, or DevOps.
- Proficient in at least one programming language such as Go, Python, or C++, with solid engineering capabilities in platform development, system tooling, and automation.
- Strong computer science fundamentals, with a deep understanding of Linux OS principles, computer networks, storage systems, GPU systems, and databases, along with systematic troubleshooting and root-cause analysis skills.
- Familiar with core reliability practices, including monitoring and alerting, capacity management, change management, canary/gray releases, incident response, and postmortem processes.
- Strong communication and collaboration skills, with the ability to proactively identify problems, drive cross-team execution, and demonstrate strong ownership and a results-oriented mindset.

Preferred Qualifications
- Hands-on experience operating public cloud platforms, or deep familiarity with major cloud providers such as OCI, AWS, Azure, and GCP, including an understanding of their underlying mechanisms.
- Experience with large-scale cloud host delivery, image/AMI systems, resource scheduling, network adaptation, and virtualization technologies such as KVM/QEMU.
- Familiar with containers and cloud-native ecosystems, including Docker, Kubernetes, and containerd, with a solid understanding of isolation mechanisms like cgroups and namespaces.
- Experience maintaining GPU clusters, including drivers, CUDA, MIG, topology awareness, troubleshooting, stress testing, and GPU delivery pipelines.
- Proven experience in reliability-focused initiatives such as failure drill systems, capacity governance, change governance, observability platforms, and resource cost optimization.
- Open-source contributions, technical blogs, patents, or technical sharing experience are highly preferred.
- Experience operating large-scale production environments is a strong plus.

Benefits & conditions

The base salary range for this position in the selected city is $156,000 - $387,600 annually.

Compensation may vary outside of this range depending on a number of factors, including a candidate's qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.

Benefits may vary depending on the nature of employment and the country work location. Employees have day one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year and 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).

The Company reserves the right to modify or change these benefits programs at any time, with or without notice.

For Los Angeles County (unincorporated) Candidates:

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:

Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;
Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and
Exercising sound judgment.

Training Provided
Regular team and company events
Free drinks, fruit or food
Subsidized public transport
Flexible working
Free Gym or Gym Subsidy
Private Medical/Dental healthcare
Annual Health Check
Bonus/Reward Scheme
Childcare Vouchers
Cycle to work scheme
Paid Overtime
Stock Options
Language Classes
Game Jams
Four Day Workweek

About the company

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.

Why Join ByteDance

Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect - and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.

As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.

Diversity & Inclusion

ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
