INTL Senior Data Engineer - AOR
Job description
An employer is seeking a Data Engineer II (ML Training & Multi-Source Integration) to join a large healthcare client supporting the AI Insight & Next Best Action platform. The project focuses on building and scaling the data layer that powers machine learning models, including integrating multiple data sources, developing feature pipelines, and enabling high-quality, production-ready ML datasets.
Responsibilities will include:
Build and maintain Feature Store pipelines that ingest and process behavioral, clinical, engagement, and Rx data signals
Design and develop ML training datasets, including batch and real-time feature pipelines, dataset versioning, and training/evaluation splits
Integrate and normalize multi-source data such as Kafka event streams, Adobe Analytics data, and healthcare datasets
Develop and optimize large-scale data processing jobs using Apache Spark (Dataproc) for feature engineering and model input preparation
Monitor and improve data quality for ML models, including tracking feature freshness, identifying data drift, and ensuring pipeline reliability
Partner with engineering teams to define data schemas and event structures that support downstream machine learning workflows
Ensure secure and compliant handling of sensitive data, including masking, de-identification, and maintaining auditability within data pipelines
Support data resiliency efforts, including disaster recovery planning, data replication strategies, and dataset lifecycle management
Maintain clear documentation of data pipelines, feature definitions, and lineage to support model transparency and operational efficiency
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Requirements
- 5-7 years of data or ML data engineering experience (production environment, ideally GCP)
- Strong programming experience in Python, Java, or Node.js for building data pipelines and feature engineering
- Hands-on experience building ML training pipelines and Feature Stores (GCP Feature Store preferred)
- Deep experience with Apache Spark (PySpark/DataSpark) for large-scale data processing and feature engineering
- Strong experience working with BigQuery (complex SQL, data modeling, performance optimization)
- Experience with Kafka (streaming ingestion / event-driven pipelines)
Experience working with multi-source feature stores (behavioral, clinical, transactional data)
Knowledge of healthcare data domains (Rx, clinical, benefits)
Experience integrating Adobe Analytics or similar behavioral data platforms
Familiarity with Adobe Experience Platform APIs
Exposure to NIST / HITRUST frameworks in regulated data environments
Experience with GCP encryption (CMEK) for secure datasets