Platform Engineer
Job description
with Terraform; implement repeatable environment provisioning, configuration management, and golden paths for teams.
* Establish CI/CD workflows (GitHub Actions/Jenkins/GitLab CI), build/test standards, and progressive delivery patterns that keep releases fast and low-risk.
* Implement logging, metrics, and tracing (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) with actionable SLOs, alerts, and runbooks; embed security and compliance by design.
* Collaborate closely with product and science teams to remove bottlenecks, eliminate manual steps, and evolve service and data interfaces that make operating image pipelines simple and reliable.
* Contribute to future-state architectures that improve scalability, resiliency, and operational efficiency; lead targeted refactors and platform improvements.
* Manage core automation and tooling, and educate teams on platform capabilities, CI/CD, configuration management, and infrastructure automation best practices.
Required (Must-have)
Requirements
* M.Sc. in Computer Science/Engineering (or equivalent) or comparable industry experience.
* Practical, production experience operating Kubeflow Pipelines for reproducible ML workflows at scale.
* Proven experience deploying and operating workloads on Kubernetes (EKS/GKE/AKS), including upgrades, autoscaling, RBAC, networking, and reliability; strong Unix/Linux fundamentals.
* Hands-on experience with AWS services (EKS, EC2, S3, IAM, CloudWatch; RDS a plus) and the ability to design secure, cost-aware architectures.
* Strong Terraform skills and Git-based workflows for repeatable infrastructure provisioning and configuration management.
* Practical experience with CI/CD platforms (GitHub Actions/Jenkins/GitLab CI), including artifact management, environment promotion, and progressive delivery.
* Solid Python and/or shell scripting for platform automation and toil reduction.
* Experience implementing logging, metrics, and tracing with SLOs, alerts, and runbooks (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) and a security-first mindset.
* Ability to lead technical initiatives, communicate trade-offs clearly, and collaborate effectively with engineering and science teams.
Desirable (Nice to have):
* Experience with MLflow, Feast, Argo, Airflow, Ray, and model versioning/monitoring.
* Familiarity with S3/object storage, artifact registries, and handling large image datasets; basic SQL/NoSQL exposure.
* Experience with digital pathology or large-scale image processing (e.g., whole-slide images) and tools like OpenSlide, scikit-image, or OpenCV.
* Experience tuning high-throughput pipelines, concurrency, memory usage, and integrating GPUs/accelerators.
* Experience with VPC design, ingress/egress, service meshes, secrets management, IAM, and policy as code.
* Experience in regulated environments (e.g., GxP), including data governance, privacy, and building software under regulated processes.
* Experience with Jira/Zendesk and with JavaScript/TypeScript for internal tools or dashboards.