Data Engineer

Tyne & Wear
Boldon Colliery, United Kingdom
7 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Boldon Colliery, United Kingdom

Tech stack

API
Artificial Intelligence
Amazon Web Services (AWS)
Fluid
Cloud Computing
Information Engineering
Middleware
JSON
Python
Search Technologies
Siemens NX
Parquet
Multi-Cloud
Data Lake
PySpark
Ansys

Job description

Role summary: The overall technical lead and architect. Designs the metadata schema, builds the simulation onboarding pipeline, deploys the metadata embedding pipeline and OpenSearch k-NN vector store, and authors the data export format spec for the AI/ML use case. This role is the deepest technical seat on the engagement.

Key responsibilities
Run the Sprint 1 architecture review of the existing UAT codebase (S3 + Glue + S3 Tables + OpenSearch + Athena) and deliver written gap findings.
Design the metadata schema, taxonomy, and field catalogue (Light, Brain, Power).
Tune data orchestration - Glue jobs, Athena queries, S3 Tables config, scheduling.
Lead the deep-dive technical sessions with analysts on visualization requirements.
Build and validate the simulation data onboarding pipeline against real data - including the 30 GB-per-run acoustic spectra dataset.
Configure and validate the OpenSearch k-NN vector store and the Bedrock embedding pipeline.
Author the AI/ML data export format specification and the AI onboarding pattern document.
Co-design the API middleware blueprint with the Cloud Infrastructure Architect.

Requirements

Must-have
Principal-level hands-on data engineering on AWS - 7+ years
Deep production experience with S3, S3 Tables, Glue, Athena, and OpenSearch (including k-NN / vector search)
Built and shipped vector embedding workloads
Strong metadata modelling and data taxonomy design experience for scientific or engineering domains
Comfort working with Parquet, JSON-LD, and large binary scientific data formats (mesh, time-series, spectra)
Python proficiency; PySpark / Glue job tuning experience

Nice-to-have / differentiators
Prior simulation / CAE / HPC data lake experience (Ansys, Siemens NX, BETA CAE, OpenFOAM, etc.)
Familiarity with surrogate model training data pipelines
Experience with SageMaker Unified Studio or comparable governed data-mesh tooling (in case integration is required)
Multi-cloud data engineering (AWS and GCP) experience
Published or contributed to AWS data architecture patterns or blueprints
