Data engineer

Prabhav Services Inc
Hanover, United States of America
3 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Hanover, United States of America

Tech stack

Query Performance
Java
API
Airflow
Amazon Web Services (AWS)
Application Release Automation
Azure
Big Data
Google BigQuery
Databases
Data Governance
ETL
Data Loss
Data Profiling
Data Warehousing
Hadoop
Python
PostgreSQL
MongoDB
MySQL
Scala
SQL Databases
Data Streaming
Talend
Unstructured Data
Workflow Management Systems
Google Cloud Platform
Informatica Powercenter
Snowflake
Spark
Indexer
Data Lake
PySpark
Gitlab-ci
Cassandra
Star Schema
Kafka
Stream Processing
Data Pipelines
Redshift

Job description

  • Pipeline Development: Build and maintain ETL/ELT pipelines for ingesting and transforming data.

  • Data Warehousing: Design and manage data warehouses and lakes (Snowflake, BigQuery, Redshift).

  • Big Data Processing: Optimize large-scale data workflows using Apache Spark or Hadoop.

  • Data Governance: Ensure data quality, lineage, and compliance with regulations.

  • Workflow Orchestration: Use Airflow or similar tools to schedule and monitor pipelines.

  • Integration: Connect APIs, databases, and streaming sources (Kafka).

  • Collaboration: Partner with analysts, data scientists, and business teams to deliver usable datasets.

  • Programming (Python, SQL, Scala, Java): Core for building pipelines and transformations.

  • Databases (MySQL, PostgreSQL, MongoDB, Cassandra): Supports structured and unstructured data.

  • Big Data (Apache Spark, Hadoop, Kafka): Enables processing of massive datasets.

  • ETL Tools (Airflow, dbt, Talend, Informatica): Automates and manages workflows.

  • Cloud Platforms (AWS: Glue, Redshift, S3; Azure: Synapse, Data Lake; Google Cloud Platform: BigQuery): Provides scalability and cost efficiency.

  • Data Modeling (star/snowflake schemas, partitioning): Ensures optimized storage and query performance.

  • Security (role-based access, encryption): Critical for compliance and governance.

Risks & Challenges

  • Data Quality Issues: Poor validation can lead to unreliable analytics.

  • Pipeline Failures: Inadequate monitoring may cause downtime and data loss.

  • Cost Overruns: Inefficient queries or storage can inflate cloud costs.

  • Compliance Risks: Missing GDPR/DPDP controls can lead to legal exposure.

Best Practices

  • Automate pipeline monitoring with Airflow/Kafka.
  • Use data profiling before ingestion to detect anomalies.
  • Implement partitioning and indexing for performance.
  • Collaborate closely with data science teams to align schema design.
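
As an illustration of the data profiling practice listed above, the following is a minimal sketch, not a prescribed implementation. It assumes incoming records arrive as Python dictionaries; the function names, field names, and the null-rate threshold are hypothetical and chosen only for the example.

```python
from collections import Counter


def profile_rows(rows, required_fields):
    """Return the share of missing values per required field (0.0 to 1.0)."""
    total = len(rows)
    null_counts = Counter()
    for row in rows:
        for field in required_fields:
            # Treat absent keys, None, and empty strings as missing.
            if row.get(field) in (None, ""):
                null_counts[field] += 1
    return {
        field: (null_counts[field] / total if total else 0.0)
        for field in required_fields
    }


def fields_over_threshold(null_rates, max_null_rate=0.1):
    """List fields whose missing-value rate exceeds the allowed threshold."""
    return sorted(f for f, rate in null_rates.items() if rate > max_null_rate)


# Hypothetical sample batch checked before ingestion.
sample = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 7.5},
    {"order_id": None, "amount": 3.0},
]
rates = profile_rows(sample, ["order_id", "amount"])
flagged = fields_over_threshold(rates, max_null_rate=0.2)
```

A pipeline could hold or quarantine a batch whenever the flagged list is non-empty, rather than loading it into the warehouse.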

Requirements

· Hands-on senior engineer, working directly with developers on design and implementation of modernization initiatives.

· Strong data engineer with more than 8 years of experience.

· Strong hands-on experience in Python.

· Strong hands-on experience in PySpark and stream processing with Kafka.

· Lead containerization and cloud onboarding of services.

· Drive adoption of GitLab CI/CD, M1 pipelines, and release automation.

· Champion modern testing practices.

· Drive Kafka adoption for event-driven design standards.
