Data Platform Engineer
Job description
We are looking for a skilled and experienced Data Platform Engineer to join our growing team. Data Platform Engineers take full ownership of delivering high-performing, high-impact data platform products and services, from a description of the problem customer Data Engineers are trying to solve all the way through to final delivery (and ongoing monitoring and operations). They are standard bearers for software engineering and quality coding practices within the team and are expected to mentor more junior engineers; they may even coordinate the work of more junior engineers on a large project. They devise useful metrics to ensure their services meet customer demand and have an impact, and they iterate in an agile fashion to deliver and improve on those metrics.
The Data Platform team builds and manages reusable components and architectures designed to make it both fast and easy to build robust, scalable, production-grade data products and services in the challenging biomedical data space. A Data Platform Engineer is a technical individual contributor who builds modern, cloud-native systems for standardizing and templatizing data engineering, with the following skills and experience.
Key Responsibilities
- Building a next-generation, metadata- and automation-driven data experience for GSK's scientists, engineers, and decision-makers, increasing productivity and reducing time spent on "data mechanics"
- Providing best-in-class AI/ML and data analysis environments to accelerate our predictive capabilities and attract top-tier talent.
- Aggressively engineering our data at scale, as one unified asset, to unlock the value of our unique collection of data and predictions in real-time.
- Automation of end-to-end data flows: faster, more reliable ingestion of high-throughput data in genetics, genomics, and multi-omics, to extract value from investments in new technology
- Enabling governance by design for external and internal data: practical, engineered solutions for controlled use and monitoring
- Innovative disease-specific and domain-specific data products: enabling computational scientists and their research unit collaborators to reach key insights sooner, shortening biopharmaceutical development cycles
- Supporting end-to-end code traceability and data provenance: increasing assurance of data integrity through automation and integration
- Improving engineering efficiency: driving extensible, reusable, scalable, maintainable, and traceable data and code through data engineering innovation and better resource utilization
Requirements
- Proficiency in Google Cloud Platform (GCP) - including Cloud Run, GKE, Cloud Storage, Artifact Registry, IAM, and related services
- Strong Python development skills for scripting, automation, and tooling around pipeline infrastructure
- Hands-on experience building and optimizing Docker containers, including multi-stage builds, image optimization, and container security best practices
- Solid understanding of CI/CD pipelines for automated container builds and deployments
- Demonstrated expertise in debugging and observability - including structured logging, distributed tracing, metrics collection, and use of tools such as Cloud Logging, Cloud Monitoring, or equivalent
- Experience diagnosing performance and reliability issues in containerized, cloud-native environments
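As a loose illustration of the structured-logging expectation above, the sketch below emits one JSON object per log record so that a log collector such as Cloud Logging can parse entries as structured fields rather than free text. The `pipeline` logger name and the chosen fields are illustrative assumptions, not a prescribed format.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line.

    Field names are illustrative; "severity" mirrors the level name,
    which Cloud Logging recognizes when parsing structured entries.
    """
    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        return json.dumps(entry)

# Attach the formatter to a stream handler on a hypothetical "pipeline" logger.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("container build finished")
```

Emitting one JSON object per line keeps logs machine-parseable end to end, which is what makes downstream metrics collection and tracing correlation practical in containerized environments.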
Preferred Skills:
- Familiarity with Nextflow for workflow orchestration
- Exposure to bioinformatics, genomics data, cell imaging, and histopathology image-processing workflows
- Experience with GCP Batch for running large-scale computational workloads