Data Engineer - Python, AI/ML

Motion Recruitment
Warren, United States of America

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Warren, United States of America

Tech stack

Artificial Intelligence
Airflow
Amazon Web Services (AWS)
Automation of Tests
Azure
Code Review
Databases
Continuous Integration
Data Validation
Data Deduplication
Python
PostgreSQL
Microsoft SQL Server
MySQL
NumPy
Power BI
SciPy
SQL Databases
Tableau
Workflow Management Systems
Cloud Platform System
Data Classification
Feature Engineering
Large Language Models
Spark
Git
Pandas
PySpark
Semi-structured Data
Scikit-Learn
Information Technology
Data Lineage
Data Management
Looker Analytics
Data Pipelines

Job description

  • Build and maintain Python and SQL pipelines for governance-related ingestion, cleaning, transformation, and validation of structured and semi-structured data.
  • Implement and operate data quality checks, schema validation, and integrity rules across pipelines; investigate and resolve quality issues.
  • Contribute to master data workflows: standardization, deduplication, and consolidation of data from heterogeneous sources into consistent reference and golden-record datasets.
  • Instrument pipelines for data lineage, metadata, and catalog tooling.
  • Develop pipelines that feed governance dashboards and reporting in Tableau, Power BI, or Looker.
  • Build reproducible, well-documented pipelines for compliance and audit reporting.
  • Contribute to AI/ML-assisted governance use cases: embedding-based data classification, anomaly detection on quality metrics, LLM-assisted catalog search, and exposure of governed datasets to AI assistants via the Model Context Protocol (MCP).
  • Partner with team leads, data stewards, and stakeholders to translate governance requirements into engineering work.
  • Follow team engineering practices: Git, code review, modular pipeline design, automated testing, CI/CD.

Requirements

  • Bachelor's or Master's degree in Computer Science, Data Science, Engineering, Statistics, or a related field.
  • 2+ years building data pipelines in Python (Pandas, NumPy, SciPy) and SQL.
  • Working experience with Apache Spark or PySpark and workflow orchestration (Apache Airflow).
  • Schema design across relational (PostgreSQL, MySQL, SQL Server) and analytical databases, including standardization across heterogeneous sources.
  • Experience implementing data quality validation, exploratory data analysis (EDA), and integrity enforcement on production datasets.
  • Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
  • Working familiarity with Python ML libraries (Scikit-Learn) for feature engineering and exploratory analysis.
  • Experience producing analytics-ready datasets for BI tools (Tableau, Power BI, or Looker).
  • Git, code review, and CI/CD practices.
  • Clear technical communication and a collaborative working style.
