Senior Data Engineer

Holcim
Municipality of Madrid, Spain
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Municipality of Madrid, Spain

Tech stack

Query Performance
Java
Agile Methodologies
Artificial Intelligence
Airflow
Amazon Web Services (AWS)
Automation of Tests
Big Data
Cloud Computing
Program Optimization
Software Quality
Continuous Integration
Data Infrastructure
ETL
Data Systems
Data Visualization
Data Warehousing
DevOps
Distributed Systems
GitHub
Identity and Access Management
Python
OpenShift
Scrum
Software Deployment
Software Engineering
SQL Databases
Data Streaming
Parquet
QlikSense
Data Storage Technologies
System Availability
Snowflake
Spark
Git
CloudFormation
Data Lake
Information Technology
Low Latency
Apache Flink
Avro
Data Analytics
QlikView
Real Time Data
Kafka
Terraform
Data Pipelines
Serverless Computing
Docker
Jenkins
Redshift
Databricks

Job description

We are seeking a seasoned Senior Data Engineer to design, build, and optimize our next-generation data platform. You will be responsible for architecting scalable data pipelines, managing large-scale distributed systems, and ensuring our data infrastructure in AWS and Databricks is robust and efficient. The ideal candidate is a Spark expert with a deep understanding of the AWS ecosystem and a passion for automation.

  • Pipeline Architecture: Design and implement complex batch and streaming ETL/ELT pipelines using Python, SQL, and Spark to process massive datasets (see the sketch after this list).
  • Cloud Infrastructure: Leverage AWS Data Analytics services to build scalable, secure, and cost-effective data solutions.
  • Orchestration & DevOps: Manage and automate data workflows using Airflow, while utilizing Docker and ECS for containerized application deployment.
  • System Optimization: Monitor and tune the performance of distributed systems (Spark Cluster) to ensure high availability and low latency.
  • Infrastructure as Code: Utilize AWS CloudFormation or Terraform to manage data infrastructure, ensuring repeatable and version-controlled environments.
  • Cost Optimization: Monitor and optimize AWS spend by selecting appropriate instance types (Spot vs. On-Demand) and refining data storage strategies.
  • Security & Compliance: Implement IAM roles, bucket policies, and encryption (KMS) to ensure data is secure at rest and in transit.
  • Collaboration: Work within an Agile framework to deliver iterative value, collaborating closely with Data Scientists and Stakeholders to translate business needs into technical reality.
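
As a rough illustration of the pipeline work described above, here is a minimal batch ETL sketch in PySpark. The bucket names, paths, and columns (a hypothetical "orders" dataset) are assumptions for illustration only and are not taken from this posting.

# Minimal PySpark batch ETL sketch; all paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders_daily_etl")
    .getOrCreate()
)

# Read raw CSV files landed by an upstream system (illustrative S3 path).
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-raw-bucket/orders/2024-01-01/")
)

# Basic cleansing and enrichment: drop malformed rows, derive a partition column.
cleaned = (
    raw.dropna(subset=["order_id", "order_ts"])
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("amount_eur", F.col("amount").cast("double"))
)

# Write partitioned Parquet back to the curated zone of the data lake
# for downstream consumers (Athena, Redshift Spectrum, Databricks).
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-curated-bucket/orders/")
)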

JOB DIMENSIONS

List of direct reports:

  • Up to 2 direct reports and around 15 externals

Key interfaces, stakeholders and relationships:

  • Internal:
      • GDS: product manager, application manager, data & analytics & AI team
      • Country business stakeholders
  • External: 3rd-party vendors

AWS Data Analytics services:

  • Amazon S3: Implementing data lake best practices, including partitioning, compression (Parquet/Avro), and lifecycle policies.
  • Amazon Redshift: Designing star/snowflake schemas and optimizing query performance for high-volume data warehousing.
  • Amazon Athena: Performing ad-hoc SQL analysis directly on S3 data.
  • Experience with open table formats (Iceberg/Delta).
  • Orchestration & Integration: Amazon MWAA (Managed Workflows for Apache Airflow) for deploying and scaling Airflow environments (a minimal DAG sketch follows this list).
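
For illustration of the orchestration side, a minimal Airflow DAG of the kind that might run on Amazon MWAA is sketched below. The DAG id, schedule, and task are hypothetical placeholders, not part of the actual environment.

# Minimal Airflow DAG sketch; DAG id, task, and schedule are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_orders_etl(**context):
    # Placeholder for triggering the actual Spark or Glue job.
    print(f"Running orders ETL for {context['ds']}")


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="orders_daily_etl",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_orders_etl",
        python_callable=run_orders_etl,
    )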

Requirements

  • Experience: Minimum of 4 years of hands-on experience in active Big Data environments and 2+ years specializing in Data Analytics within AWS.
  • Compute & Processing:
      • Amazon EMR: Architecting and managing Spark clusters for large-scale distributed processing.
      • AWS Glue: Developing serverless ETL jobs, managing the Data Catalog, and implementing Glue Crawlers.
      • Streaming (advantage): Amazon Kinesis or MSK (Managed Streaming for Apache Kafka) for real-time data ingestion.
  • Core Engineering: Expert-level proficiency in Spark, Python, and SQL.
  • Infrastructure & Tooling: Proven experience with Airflow for orchestration and Docker/ECS for containerization.
  • Good knowledge of Databricks and data mesh architectures, and a good understanding of how to implement and maintain Lakehouse data models (bronze/silver/gold layers) using Delta Lake for reliability, ACID transactions, time travel, and schema evolution (see the Delta Lake sketch after this list).
  • Solid software engineering practices: Git, CI/CD for data pipelines, automated testing, code quality and documentation.
  • Communication: Excellent written and oral English communication skills, with the ability to explain complex technical concepts to non-technical audiences.
  • Degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience.
  • Real-time Processing: Experience with streaming and distributed messaging applications such as Flink and Kafka.
  • Core Tech: Java programming.
  • Experience industrialising ML use cases.
  • Data Visualization: Experience with QlikView or QlikSense to support BI initiatives.
  • Agile: Experience working in a fast-paced Scrum or Kanban environment.
  • Certifications: AWS Certified Data Engineer (Associate/Professional) or AWS Certified Solutions Architect; Databricks Certified Data Engineer (Associate/Professional).
  • DevOps: Experience with OpenShift, GitHub Actions, or Jenkins for CI/CD of data workflows.
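
For illustration of the Lakehouse requirement above, a minimal sketch of bronze and silver Delta Lake layers follows. It assumes a Databricks (or Delta-enabled Spark) runtime; the paths, columns, and transformations are hypothetical and not the platform's actual design.

# Minimal Delta Lake medallion sketch (bronze -> silver); assumes a
# Databricks or Delta-enabled Spark runtime. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_medallion").getOrCreate()

bronze_path = "s3://example-lakehouse/bronze/orders/"
silver_path = "s3://example-lakehouse/silver/orders/"

# Bronze: append raw records as-is, adding ingestion metadata.
raw = spark.read.json("s3://example-raw-bucket/orders/")
(
    raw.withColumn("_ingested_at", F.current_timestamp())
    .write.format("delta")
    .mode("append")
    .save(bronze_path)
)

# Silver: deduplicate and conform types on top of bronze, relying on
# Delta's ACID guarantees; a gold layer would aggregate for consumption.
bronze = spark.read.format("delta").load(bronze_path)
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("amount_eur", F.col("amount").cast("double"))
)
silver.write.format("delta").mode("overwrite").save(silver_path)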

Apply for this position