Data Engineer (SC Cleared)
Job description
A hands-on data engineering role within a large-scale, cloud-native data programme. You will design, build, maintain, and troubleshoot data pipelines that process large volumes of data across a modern AWS-native stack - using Apache Spark and PySpark for distributed data processing, Apache Airflow for orchestration, and a broad suite of AWS services for storage, compute, and analytics. You will apply strong analytical and engineering skills to deliver trusted, well-governed data assets to downstream users.
Data pipeline development
Build and maintain scalable data pipelines using Apache Spark and PySpark, processing and transforming large datasets across distributed cloud infrastructure.
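As a rough illustration of this kind of pipeline work, here is a minimal PySpark sketch - reading raw data from S3, applying a simple cleaning step, and writing a partitioned output. The bucket paths, column names, and transformation logic are hypothetical examples, not details of the programme.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch: read raw records from S3, apply a simple cleaning
# step, and write a partitioned curated output.
# All paths and column names are hypothetical examples.
spark = SparkSession.builder.appName("example-pipeline").getOrCreate()

raw = spark.read.parquet("s3://example-raw-bucket/claims/")

cleaned = (
    raw.dropDuplicates(["claim_id"])                 # de-duplicate on a business key
       .withColumn("received_date", F.to_date("received_ts"))
       .filter(F.col("amount").isNotNull())          # drop incomplete records
)

(cleaned.write
        .mode("overwrite")
        .partitionBy("received_date")                # partition for efficient querying
        .parquet("s3://example-curated-bucket/claims/"))
```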
Workflow orchestration
Configure and manage Apache Airflow DAGs for task orchestration, ensuring reliable scheduling, monitoring, and execution of data processing workflows.
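For illustration, a minimal DAG sketch wiring two dependent tasks with retries, assuming Airflow 2.4+; the DAG id, schedule, and task logic are hypothetical examples.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables standing in for real pipeline steps.
def extract():
    print("extract step")

def transform():
    print("transform step")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                               # run once per day
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t1 >> t2                                         # transform runs after extract succeeds
```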
Root cause analysis
Perform data analysis to identify and resolve root causes of pipeline failures and data quality issues - including reviewing EMR output logs and CloudWatch metrics.
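A hedged example of what that triage can look like in practice - scanning a CloudWatch Logs group (for example, one receiving EMR application logs) for recent ERROR events with boto3. The log group name is a hypothetical example.

```python
import time
import boto3

# Minimal triage sketch: scan a CloudWatch log group for ERROR lines
# from the last hour. The log group name is a hypothetical example.
logs = boto3.client("logs")

now_ms = int(time.time() * 1000)
resp = logs.filter_log_events(
    logGroupName="/aws/emr/example-cluster",   # hypothetical log group
    startTime=now_ms - 3600 * 1000,            # last 60 minutes
    filterPattern="ERROR",                     # CloudWatch Logs filter syntax
)

for event in resp["events"]:
    print(event["timestamp"], event["message"][:200])
```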
Data modelling
Apply understanding of dimensional data models and slowly changing dimensions (SCD) to design and maintain well-structured, analytically trusted data assets.
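As a sketch of the SCD Type 2 pattern (expire the current row, append a new version), assuming hypothetical dim_customer and staging_customer tables and a tracked address attribute:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-example").getOrCreate()

# Current dimension contents and the latest source snapshot.
# Table and column names are hypothetical examples.
dim = spark.table("dim_customer").alias("d")
src = spark.table("staging_customer").alias("s")

# Keys whose tracked attribute has changed since the current version.
changed_keys = (
    dim.filter("is_current = true")
       .join(src, F.col("d.customer_id") == F.col("s.customer_id"))
       .filter(F.col("d.address") != F.col("s.address"))
       .select("d.customer_id")
)

# Type 2: expire the current versions of the changed keys...
expired = (
    dim.filter("is_current = true")
       .join(changed_keys, "customer_id", "left_semi")
       .withColumn("is_current", F.lit(False))
       .withColumn("valid_to", F.current_date())
)

# ...and append the incoming rows as the new current versions.
new_rows = (
    src.join(changed_keys, "customer_id", "left_semi")
       .withColumn("is_current", F.lit(True))
       .withColumn("valid_from", F.current_date())
       .withColumn("valid_to", F.lit(None).cast("date"))
)
# In practice the expired and new rows are merged back into the
# dimension table (e.g. via a MERGE in Delta Lake or Iceberg).
```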
Infrastructure as code
Provision and manage cloud infrastructure using Terraform. Containerise solutions using Docker and manage deployments through GitLab CI/CD pipelines and release tagging.
Security & encryption
Apply understanding of both server-side and client-side encryption patterns within AWS. Work within IAM policies and data governance standards appropriate to a regulated government environment.
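For the server-side pattern, a minimal boto3 sketch; the bucket name, object key, and KMS alias are hypothetical examples. Client-side encryption would instead encrypt the payload before upload, typically with the AWS Encryption SDK.

```python
import boto3

# Minimal server-side encryption sketch: write an object to S3
# encrypted with a customer-managed KMS key. Bucket, key, and KMS
# alias are hypothetical examples.
s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-secure-bucket",
    Key="reports/2024/output.parquet",
    Body=b"...payload...",
    ServerSideEncryption="aws:kms",        # SSE-KMS
    SSEKMSKeyId="alias/example-data-key",  # customer-managed key alias
)
```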
Requirements
You will apply strong data analysis skills to identify root causes of data issues, work with dimensional data models and slowly changing dimensions, and implement infrastructure as code using Terraform. Familiarity with DWP engineering best practices and the ability to translate customer expectations into applied technical functionality are key to success in this role.
Technical skills required
Languages & analytics
- Python - primary language for pipeline development and data processing
- SQL - used for querying, transformation, and validation across data stores
- PySpark - for distributed data processing using Apache Spark on AWS EMR
- Familiarity with basic data structures for constructing robust, scalable solutions
Data processing & orchestration
- Apache Spark - understanding of distributed data processing architecture and execution
- Apache Airflow - configuring DAGs and managing task orchestration at scale
- Jupyter Notebooks - for exploratory data analysis and pipeline prototyping
- Understanding of dimensional data models and slowly changing dimensions (SCD Types 1, 2, 3)
- Data analysis skills to identify root causes of issues within pipelines and data assets
AWS services
- Amazon EMR - running Spark workloads and reviewing output logs
- Amazon Athena - ad hoc querying of data in S3 (see the sketch after this list)
- Amazon Textract and Comprehend - familiarity with AI/ML document extraction and NLP services
- AWS S3, IAM, CloudWatch, EC2, ECR - core platform services used day-to-day
- AWS console proficiency - navigating, configuring, and monitoring services
- Understanding of server-side and client-side encryption within AWS
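As referenced in the Amazon Athena item above, a minimal boto3 sketch of an ad hoc query over data in S3; the database, query, and results location are hypothetical examples.

```python
import time
import boto3

# Minimal Athena sketch: run an ad hoc query over data in S3 and
# print the first page of results. Names are hypothetical examples.
athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT received_date, count(*) FROM claims GROUP BY 1",
    QueryExecutionContext={"Database": "example_curated"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```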
Infrastructure, DevOps & delivery
- Terraform - Infrastructure as Code for provisioning and managing AWS environments
- Docker - containerisation of data engineering solutions
- GitLab - source code management, CI/CD pipeline configuration, release tagging, and component versioning
- Familiarity with DWP engineering best practices
- Ability to translate customer expectations into applied, functional technical solutions