Associate Principal - Data Engineering
Job description
PySpark Development (Primary Focus)
- Design and develop production-grade PySpark applications for large-scale batch and streaming data processing
- Implement advanced PySpark DataFrame API operations:
  - Complex transformations: window functions, pivot/unpivot, and nested struct handling
  - Multi-dataset joins: broadcast joins, sort-merge joins, and skew-handling strategies
  - Custom UDFs (User Defined Functions) and Pandas UDFs (vectorized UDFs) for performance-critical transformations
  - Aggregations and GroupBy operations optimized for large FMCG datasets
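As a concept check for the windowed-aggregation skills above, the sketch below computes a per-partition running total in plain Python, mirroring what PySpark's `F.sum(...).over(Window.partitionBy(...).orderBy(...))` produces; the column names are hypothetical:

```python
from itertools import groupby
from operator import itemgetter

def running_total(rows, partition_key, order_key, value_key):
    """Per-partition cumulative sum, analogous to a PySpark window
    aggregation with partitionBy + orderBy (illustrative only)."""
    out = []
    rows_sorted = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(rows_sorted, key=itemgetter(partition_key)):
        acc = 0
        for row in group:
            acc += row[value_key]
            out.append({**row, "running_total": acc})
    return out

sales = [
    {"store": "A", "day": 1, "units": 5},
    {"store": "A", "day": 2, "units": 3},
    {"store": "B", "day": 1, "units": 7},
]
result = running_total(sales, "store", "day", "units")
# result[1]["running_total"] -> 8 (store A, day 2)
```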
- Implement PySpark Structured Streaming for real-time data processing:
  - Kafka, Azure Event Hubs, and GCP Pub/Sub as streaming sources
  - Watermarking and windowing strategies for late-arriving data
  - Stateful streaming operations using mapGroupsWithState
  - Exactly-once and at-least-once delivery semantics
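The watermarking idea listed above can be illustrated without Spark: track the maximum event time seen, and drop events older than that maximum minus an allowed delay, assigning the rest to tumbling windows. This is a plain-Python analogue of `withWatermark` plus a window aggregation, not Spark itself; the delay and window sizes are arbitrary assumptions:

```python
from datetime import datetime, timedelta

def assign_window(event_time, window_minutes=10):
    """Floor an event timestamp to the start of its tumbling window."""
    minute = (event_time.minute // window_minutes) * window_minutes
    return event_time.replace(minute=minute, second=0, microsecond=0)

def process(events, watermark_delay=timedelta(minutes=15)):
    """Count events per window, discarding events that arrive later
    than the watermark (max event time seen minus the allowed delay)."""
    max_event_time = datetime.min
    counts, dropped = {}, []
    for e in events:  # events arrive in processing-time order
        max_event_time = max(max_event_time, e["event_time"])
        watermark = max_event_time - watermark_delay
        if e["event_time"] < watermark:
            dropped.append(e)      # too late: beyond the watermark
            continue
        w = assign_window(e["event_time"])
        counts[w] = counts.get(w, 0) + 1
    return counts, dropped

events = [
    {"event_time": datetime(2024, 1, 1, 12, 0)},
    {"event_time": datetime(2024, 1, 1, 12, 30)},
    {"event_time": datetime(2024, 1, 1, 12, 5)},  # straggler, past watermark
]
counts, dropped = process(events)
```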
- Apply advanced Spark performance tuning techniques:
  - Partition optimization: repartition vs. coalesce strategies
  - Handling data skew using salting and custom partitioners
  - Broadcast variable management and accumulator usage
  - Catalyst optimizer hints and AQE (Adaptive Query Execution) tuning
  - Executor sizing, memory fractions, and parallelism configuration
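The salting technique named above spreads a hot join key across several synthetic sub-keys. A minimal plain-Python sketch of the key rewrite (the hot-key set and bucket count are assumptions; in Spark this would be applied to both sides of the join as column expressions):

```python
import random

HOT_KEYS = {"BIG_CUSTOMER"}   # keys known to dominate the distribution (assumed)
SALT_BUCKETS = 4

def salt_key(key, rng=random):
    """On the large (fact) side, append a random salt suffix to hot keys
    so their rows spread across SALT_BUCKETS partitions instead of one."""
    if key in HOT_KEYS:
        return f"{key}#{rng.randrange(SALT_BUCKETS)}"
    return key

def explode_dim_key(key):
    """On the small (dimension) side, emit every salted variant of a hot
    key so the salted join still matches every fact row."""
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(SALT_BUCKETS)]
    return [key]
```

Non-hot keys pass through unchanged, so only the skewed portion of the data pays the duplication cost on the dimension side.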
- Develop and maintain reusable PySpark libraries for shared data processing capabilities
Python Engineering (Primary Focus)
- Build Python-based data services, automation scripts, and utility frameworks supporting the data platform
- Develop REST API integrations using Python (requests, httpx) for consuming SAP OData, Salesforce, and third-party FMCG APIs
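A typical shape for such integrations is following server-driven pagination until the feed is exhausted. The sketch below assumes an OData-style payload (`value` list plus a `nextLink`) and injects the page fetcher so it can be stubbed; in production `fetch_page` would wrap `requests.get(...).json()` or an httpx client:

```python
def fetch_all(fetch_page, first_url):
    """Follow pagination links, accumulating records from each page.
    fetch_page(url) must return a dict shaped like an OData response
    (an assumption for this sketch)."""
    records, url = [], first_url
    while url:
        page = fetch_page(url)
        records.extend(page.get("value", []))   # OData-style record list
        url = page.get("nextLink")              # absent link ends the loop
    return records

# Usage with a stubbed fetcher standing in for a real HTTP call:
PAGES = {
    "/orders?page=1": {"value": [1, 2], "nextLink": "/orders?page=2"},
    "/orders?page=2": {"value": [3]},
}
all_orders = fetch_all(PAGES.__getitem__, "/orders?page=1")  # -> [1, 2, 3]
```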
- Implement data validation and reconciliation frameworks using Python (Great Expectations, Pandera)
- Build Python-based orchestration scripts and helper utilities for Airflow DAGs and Databricks Workflows
- Apply software engineering best practices:
  - Unit testing with pytest and integration testing with Testcontainers
  - Type hints, docstrings, and modular design patterns
  - Virtual environments, dependency management (Poetry/pip), and packaging
- Implement Python-based data quality checks: completeness, consistency, and conformity validations
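Two of the checks named above, completeness and conformity, reduce to simple row predicates. A dependency-free sketch (field names and sample rows are hypothetical; a framework like Great Expectations would wrap the same logic in declarative expectations):

```python
def completeness(rows, required):
    """Fraction of rows where every required field is present and non-null."""
    if not rows:
        return 1.0
    ok = sum(1 for r in rows if all(r.get(f) is not None for f in required))
    return ok / len(rows)

def conformity(rows, field, allowed):
    """Rows whose value for `field` falls outside the allowed domain."""
    return [r for r in rows if r.get(field) not in allowed]

rows = [
    {"sku": "A1", "qty": 5, "uom": "EA"},
    {"sku": "A2", "qty": None, "uom": "CS"},
    {"sku": "A3", "qty": 2, "uom": "??"},
]
score = completeness(rows, ["sku", "qty"])        # 2 of 3 rows complete
bad_uom = conformity(rows, "uom", {"EA", "CS"})   # the "??" row
```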
Data Lakehouse & Cloud Platform (Primary Focus)
- Build and manage Data Lakehouse architectures on hyperscaler platforms:
  - Azure Databricks, GCP Dataproc, or AWS EMR for Spark cluster management
  - Delta Lake, Apache Iceberg, or Apache Hudi for ACID-compliant data lake storage
  - Medallion Architecture (Bronze/Silver/Gold) for progressive data refinement
- Implement Delta Lake features:
  - ACID transactions and schema enforcement
  - Time Travel for data versioning and rollback
  - Delta Live Tables (DLT) for declarative pipeline development
  - OPTIMIZE and Z-Ordering for query performance acceleration
  - Change Data Feed (CDF) for incremental data propagation
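Time Travel, listed above, means reads can target any committed version of a table. The toy class below is a conceptual analogue only (it is not Delta Lake, and real Delta stores transaction logs rather than full snapshots); it loosely mirrors reading with a `versionAsOf` option:

```python
class VersionedTable:
    """Toy illustration of time-travel semantics: each commit stores a
    snapshot, and reads can target any historical version."""

    def __init__(self):
        self._versions = []              # snapshot list; index = version

    def commit(self, rows):
        self._versions.append(list(rows))
        return len(self._versions) - 1   # version number of this commit

    def read(self, version_as_of=None):
        if not self._versions:
            return []
        v = len(self._versions) - 1 if version_as_of is None else version_as_of
        return list(self._versions[v])

t = VersionedTable()
v0 = t.commit([{"id": 1, "qty": 10}])
v1 = t.commit([{"id": 1, "qty": 10}, {"id": 2, "qty": 4}])
latest = t.read()                     # current state: two rows
rollback = t.read(version_as_of=v0)   # historical read: one row
```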
- Manage Databricks Workflows and Job Clusters for production pipeline execution
- Implement Databricks Auto Loader for incremental, scalable data ingestion from cloud storage
- Utilize Unity Catalog for data governance, lineage, and access control
Data Ingestion & Integration
- Build data ingestion pipelines from diverse FMCG data sources:
  - SAP S/4HANA: OData APIs, BAPI extracts, and IDoc-based feeds
  - Salesforce: REST API, Bulk API, and Platform Events
  - Operational databases: Oracle, Cloud SQL, Azure SQL, and Cloud Spanner
  - Streaming sources: Apache Kafka, Azure Event Hubs, and GCP Pub/Sub
  - File-based sources: SFTP, Azure Blob, GCS, and S3 (CSV, Parquet, Avro, JSON)
- Implement Change Data Capture (CDC) patterns for real-time database synchronization
- Design schema evolution strategies to handle upstream data model changes gracefully
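At its core, a CDC pattern replays an ordered stream of insert/update/delete events against a keyed target. The sketch below applies such events to an in-memory dict and takes one permissive stance on schema evolution (new fields in later events are merged in); the event shape and key column are assumptions:

```python
def apply_cdc(table, events):
    """Apply ordered CDC events to a table keyed by 'id'.
    Each event is {'op': 'insert'|'update'|'delete', 'row': {...}}."""
    for e in events:
        op, row = e["op"], e["row"]
        key = row["id"]
        if op == "delete":
            table.pop(key, None)
        elif op == "insert":
            table[key] = dict(row)
        elif op == "update":
            # merge, so columns added upstream appear without a schema change
            table.setdefault(key, {}).update(row)
    return table

state = apply_cdc({}, [
    {"op": "insert", "row": {"id": 1, "qty": 5}},
    {"op": "update", "row": {"id": 1, "qty": 7, "channel": "retail"}},
    {"op": "insert", "row": {"id": 2, "qty": 1}},
    {"op": "delete", "row": {"id": 2}},
])
```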
- Publish processed data to downstream consumers:
  - BigQuery, Azure Synapse, or Snowflake for BI and analytics
  - Feature stores (Feast, Databricks) for AI/ML model training
  - Power BI or Looker for business reporting
SQL & Data Modeling
- Write and optimize complex SQL queries for data extraction, transformation, and validation
- Design data warehouse schemas (Star and Snowflake models) for FMCG analytics domains
- Implement Spark SQL for large-scale analytical query processing
- Develop data quality SQL checks and reconciliation frameworks
- Optimize SQL performance: query plans, partition pruning, and predicate pushdown
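A minimal example of the star-schema modeling mentioned above, using SQLite for portability (a warehouse engine such as BigQuery or Synapse would be used in practice; table and column names are hypothetical):

```python
import sqlite3

# Minimal star schema: one fact table keyed to one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, units INTEGER,
                              FOREIGN KEY (product_id)
                                  REFERENCES dim_product (product_id));
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "beverages"), (2, "snacks")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 10), (1, 5), (2, 3)])

# Typical analytical rollup: join fact to dimension, aggregate by attribute.
rows = conn.execute("""
    SELECT d.category, SUM(f.units) AS total_units
    FROM fact_sales f
    JOIN dim_product d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
# rows -> [('beverages', 15), ('snacks', 3)]
```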
Requirements
Do you have experience in unit testing?
Benefits & conditions
LTIMindtree (part of Larsen and Toubro (L&T)) · Cincinnati, OH · $100,000 - $120,000 a year
Benefits/perks listed below may vary depending on the nature of your employment with LTIMindtree ("LTIM"):
Benefits and Perks:
- Comprehensive Medical Plan Covering Medical, Dental, Vision
- Short Term and Long-Term Disability Coverage
- 401(k) Plan with Company match
- Life Insurance
- Vacation Time, Sick Leave, Paid Holidays
- Paid Paternity and Maternity Leave
The range displayed on each job posting reflects the minimum and maximum salary target for the position across all US locations. Within the range, individual pay is determined by work location and job level, as well as additional factors including job-related skills, experience, and relevant education or training. Depending on the position offered, other forms of compensation may be provided as part of overall compensation, such as an annual performance-based bonus, sales incentive pay, and other forms of bonus or variable compensation.
Compensation range: $100,000.00 to $120,000.00 per year