Senior Staff Software Engineer - Cloud
Job description
The Sr Staff Software Engineer - Cloud (Technical Lead Manager) is a key contributor within Brain Corp's engineering organization, leading the design and development of large-scale, high-availability systems powering Brain Corp's cloud platform. This platform connects our global fleet of autonomous robots, manages data ingestion from the field, and supports advanced machine learning pipelines for perception, analytics, and operational insights. This dual role will serve as both a technical leader and people manager, guiding a team of cloud engineers while contributing hands-on to the architecture, design, and implementation of next-generation cloud services. The engineer will work closely with ML engineers, data scientists, and infrastructure teams to build scalable cloud-based machine learning systems that handle massive volumes of image data and deliver efficient inference at scale.
- Lead and manage a team of cloud software engineers, providing technical mentorship, career guidance, and performance management
- Define and execute the cloud technical roadmap, ensuring alignment with Brain Corp's business and product goals
- Architect and implement high-availability, scalable, and secure systems on Google Cloud Platform (GCP) to support machine learning workloads and data ingestion at scale
- Design, build, and operate ML pipelines that process hundreds of thousands of images daily, enabling rapid model iteration and deployment
- Develop and optimize GPU resource management strategies, improving model serving throughput, latency, and cost efficiency
- Build canary and staging environments to ensure safe, progressive deployments and system resilience
- Collaborate cross-functionally with ML, DevOps, and robotics teams to define APIs, data models, and operational workflows for cloud-robot communication
- Implement Infrastructure-as-Code (IaC) solutions using Pulumi, Terraform, or equivalent, ensuring repeatable and automated deployments
- Establish and maintain cloud observability systems, ensuring reliability, performance, and security compliance
- Drive technical excellence, setting coding standards, reviewing designs, and promoting best practices in distributed systems and cloud ML architectures
- Stay current with advancements in GCP, ML infrastructure, and MLOps to continuously improve platform capabilities and team practices
Requirements
- Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field
- 10+ years of professional software engineering experience, including 3+ years in cloud architecture or large-scale distributed systems
- 3+ years of technical leadership or management experience, preferably in a Technical Lead Manager or team lead capacity
- Proven experience designing and operating GCP-based ML systems at scale
Required Knowledge, Skills, Abilities and Other Characteristics:
- Expert-level knowledge of Google Cloud Platform (GCP) services such as GKE, Dataflow, BigQuery, Cloud Run, Pub/Sub, Vertex AI, and Cloud Storage
- Strong proficiency in Go, Python, or TypeScript, with an emphasis on maintainable, production-quality code
- Deep understanding of machine learning pipelines: data ingestion, preprocessing, training, deployment, and inference
- Experience optimizing GPU workloads, autoscaling, and resource scheduling in cloud environments
- Proven success in designing high-availability and fault-tolerant distributed systems
- Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes)
- Familiarity with infrastructure-as-code tools (Pulumi, Terraform) and CI/CD systems (e.g., Jenkins, GitHub Actions)
- Strong understanding of security, networking, and observability in cloud environments
- Excellent problem-solving, communication, and leadership skills
- Ability to balance hands-on technical work with people management responsibilities
- Passion for robotics, automation, and enabling intelligence at scale
Things that Make a Difference:
- Experience in robotics data pipelines, fleet management, or IoT-scale data ingestion
- Experience self-hosting ML inference
- Hands-on experience with Vertex AI, Kubeflow, or TensorFlow Serving in production
- Background in event-driven architectures and message streaming (e.g., Pub/Sub, Kafka)
- Experience with SOC2/ISO27001-compliant systems and secure cloud practices
- Familiarity with Agile methodologies and modern DevOps culture
Physical Demands:
The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions. Essential functions may require maintaining the physical condition necessary for sitting, walking or standing for periods of time; operating a computer and keyboard; use of hands to finger and grasp; talk and hear at normal room levels; visual acuity to determine the accuracy, neatness, and thoroughness of the work assigned or to make general observations of facilities or structures; push or pull up to 20 pounds.
Benefits & conditions
In addition to base pay, our competitive total rewards package consists of:
- Hybrid Work Schedule: We operate on a hybrid model, with three days in the office (Monday, Tuesday, and Thursday).
- Flexible Hours: We are not a traditional 9-5 company and offer flexibility. Please note that as our HQ is in San Diego, some coordination may occur outside of local business hours.
- Unlimited PTO: We offer an unlimited paid time off policy.
- Paid Lunch: Lunch is provided/paid for by the company.
- Holiday Observance: We recognize all national holidays.
- Office Environment & Location: We maintain an informal work environment, and our office is conveniently located directly on a major train station hub.