Software Engineering Manager 1 - Streaming & Cloud Platform Reliability
Job description
We're looking for a hands-on Software Engineering Manager to lead a small team (2-4 developers) focused on improving the reliability of Mist's cloud platform by driving concrete postmortem action items from our incident management process.
This team owns follow-ups from production incidents, especially those involving our streaming data pipelines (Kafka / Flink / Storm) and core APIs. You'll work closely with senior engineers to turn incident learnings into durable engineering improvements.
This is a hybrid role requiring on-site collaboration multiple days per week in Cupertino, California. Due to the requirements of this position, this role requires a US Citizen or Green Card holder.
What You'll Do
- Own and drive post-incident follow-ups from our Incident Management process, turning incident reports into design and implementation work.
- Lead, mentor, and grow a 2-4 person engineering team, while contributing hands-on code in production services.
- Design, implement, and harden streaming topologies using Kafka, Storm, and/or Flink (e.g., stats, telemetry, alerts, pcaps).
- Improve reliability of core APIs (REST API, WebSocket, Webhooks, etc.), including auth, rate limiting, and DR-sensitive flows.
- Enhance observability and runbooks: add metrics/alerts, define SLOs, and codify playbooks for recurring incident patterns.
- Collaborate with SRE, Platform, and Data teams on DR, multi-region, and multi-cloud behavior (AWS, GCP, DR regions).
- Ensure robust testing and deployment practices (unit/integration tests, regression tests for past incidents, safe rollout/rollback).
- Direct, visible impact on the stability and reliability of Mist's cloud platform and AI-driven networking products.
- A focused charter with real, concrete backlogs driven by incidents, not vague "platform work."
- Close collaboration with strong senior engineers and SREs, with room to shape both technical direction and team culture.
Additional Skills: Accountability, Action Planning, Active Learning, Active Listening, Agile Methodology, Agile Scrum Development, Analytical Thinking, Bias, Coaching, Creativity, Critical Thinking, Cross-Functional Teamwork, Data Analysis Management, Data Collection Management, Data Controls, Design, Design Thinking, Empathy, Follow-Through, Group Problem Solving, Growth Mindset, Intellectual Curiosity, Long Term Planning, Managing Ambiguity
What We Can Offer You:
Health & Wellbeing
We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Requirements
- 7+ years total professional software engineering experience.
- 2+ years in a team lead role (mentoring, performance feedback, prioritization), while remaining hands-on technically.
- 5+ years building backend or distributed systems in Python, Go, or Java, with enough proficiency in at least one of these languages to lead design reviews and contribute production code.
- 3+ years designing, implementing, and operating distributed, event-driven systems using Kafka and at least one of Flink or Storm, or a comparable streaming framework.
- 3+ years building and operating RESTful APIs (service design, auth, rate limiting, client IP handling, versioning).
- 3+ years working with cloud-native infrastructure: Kubernetes, containerized microservices, CI/CD pipelines.
- 3+ years with production datastores such as Redis, Postgres, Cassandra/DataStax, S3/GCS, or similar distributed storage systems.
- 2+ years directly involved in production incident response: on-call participation, postmortems, and driving remediation work through to completion.
- Proven ability to debug latency, throughput, data correctness, and availability issues in streaming pipelines and/or APIs.
- Experience adding or improving metrics, logging, tracing, and alerts for production services.
Preferred Qualifications
- 2+ years working with big-data / analytics or ETL systems (e.g., Apache Spark, Airflow, Snowflake, or similar).
- Experience with webhook or event-delivery systems (idempotency, retries, ordering, DLQs).
- Exposure to multi-region / DR design: cross-cloud migrations, DNS and certificate management, environment-driven configuration.
- Familiarity with DevOps practices, CI/CD automation, and service ownership.
- Experience with observability stacks such as Prometheus, Grafana, Kibana/Elasticsearch.
Benefits & conditions
"The expected salary/wage range for this position is provided below. Actual offer may vary from this range based upon geographic location, work experience, education/training, and/or skill level.
- United States of America (California): Annual salary USD 155,500 - 315,000. The listed salary range reflects base salary. Variable incentives may also be offered."