Senior Data Engineer (Chinese Mandarin Speaker)

Bitus Labs

Irvine, United States of America

8 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English, Chinese

Experience level

Senior

Compensation

$130K

Job location

Irvine, United States of America

Tech stack

Query Performance

Java

Airflow

Amazon Web Services (AWS)

Apache HTTP Server

Code Review

Continuous Integration

Data Validation

Information Engineering

Data Governance

Data Infrastructure

ETL

Data Transformation

Data Security

DevOps

Digital Architecture

Memory Management

GitHub

Gradle

Identity and Access Management

Python

Maven

Online Analytical Processing

Operational Databases

Pair Programming

Performance Tuning

Query Optimization

Azure

SQL Databases

Data Streaming

SQL Optimization

Delivery Pipeline

Spark

Boto3

CloudFormation

Pandas

Build Management

Data Lake

PySpark

Data Lineage

Druid

Apache Flink

Production Code

Integration Frameworks

Kafka

Build Tools

Spark Streaming

Machine Learning Operations

Data Lakehouse

Vertica

Terraform

Data Pipelines

Programming Languages

Job description

We are looking for a Senior Data Engineer to join our Data Platform team and take ownership of building and scaling our AWS-based data lakehouse. You will architect and deliver robust, production-grade data pipelines, work closely with data scientists, analytics engineers, and product teams, and set the technical direction for how data flows across the organization. This is a hands-on engineering role: you will write production code in Java and Python every day, while also contributing to platform design decisions, mentoring junior engineers, and driving best practices around data quality, reliability, and governance.

Data Lakehouse Architecture & Development

Design and build scalable medallion-architecture data lakehouses (Bronze / Silver / Gold) on AWS S3 using Apache Iceberg table format.
Develop and maintain high-throughput ETL/ELT pipelines using AWS Glue, EMR (Spark), and Lambda.
Implement schema evolution, partitioning strategies, and compaction processes for Iceberg tables to optimize storage and query performance.
Write production-quality pipeline code in both Java and Python, selecting the appropriate language per performance and maintainability requirements.

Real-Time & Batch Streaming

Build and operate event-driven data pipelines using Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
Design exactly-once and at-least-once processing semantics for streaming workloads using Apache Flink or Spark Structured Streaming on EMR.

AWS Platform Engineering

Manage infrastructure as code using AWS CDK or Terraform for repeatable, auditable data platform deployments.
Optimize cost and performance across AWS services including S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
Implement data security best practices: IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation fine-grained access control.
Build and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or equivalent.

Data Quality & Governance

Implement data quality frameworks (e.g., Great Expectations, Deequ) and integrate validation steps into pipeline orchestration.
Define and enforce data contracts between producing and consuming systems.
Contribute to data cataloguing and lineage tracking using AWS Glue Data Catalog or Apache Atlas.

Collaboration & Technical Leadership

Partner with data scientists, ML engineers, and analysts to understand data requirements and deliver performant, well-documented datasets.
Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming.
Document architecture decisions (ADRs) and contribute to the internal engineering knowledge base.

Requirements

5+ years of professional data engineering experience, with at least 3 years on AWS cloud platforms.
Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
Experience with data lakehouse architectures - medallion pattern, open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).

Programming Languages

Java: Strong command of Java (8+) for Spark jobs, custom Iceberg connectors, and performance-critical pipeline components. Familiarity with Maven/Gradle build systems.
Python: Proficient in Python 3 for AWS Glue scripts, orchestration logic, data quality checks, and automation tooling. Experience with pandas, PySpark, boto3, and packaging best practices.

AWS Core Services

Storage & Compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK (Managed Kafka).
Orchestration: Step Functions, MWAA (Managed Airflow), or EventBridge Scheduler.
Querying: Athena, Redshift, or Redshift Spectrum.
Security & Governance: IAM, KMS, Lake Formation, Secrets Manager, VPC.
DevOps: AWS CDK or CloudFormation; CodePipeline or equivalent CI/CD tools.

Data Processing Frameworks

Apache Spark (PySpark and/or Spark Java API) - distributed transformations, performance tuning, memory management.
Apache Iceberg - table maintenance, time travel, snapshot management, partition evolution.
SQL - advanced SQL for data transformation, window functions, CTEs, query optimization.

Preferred / Nice to Have

AWS Certified Data Engineer - Associate or AWS Certified Solutions Architect certification.
Experience with dbt for SQL-based transformation layers on top of the lakehouse.
Familiarity with ML platform integration: feature stores (SageMaker Feature Store), model serving data needs, or MLflow experiment tracking.
Experience with real-time OLAP engines such as Apache Druid or ClickHouse.
Contributions to open-source data tooling or internal platform libraries.
Exposure to data mesh or data product thinking - defining domain ownership and data contracts.

Tech Stack at a Glance

Languages: Java (8+), Python 3
Cloud Platform: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
Processing: Apache Spark, Apache Flink, Spark Structured Streaming
Table Format: Apache Iceberg (primary), Delta Lake / Hudi (familiarity)
Streaming: Amazon Kinesis, MSK (Kafka), Kinesis Firehose
Orchestration: Apache Airflow (MWAA), AWS Step Functions
IaC & CI/CD: AWS CDK / Terraform, GitHub Actions / CodePipeline

Benefits & conditions

Full-time, Irvine, CA 92618

401(k)
401(k) matching
Dental insurance
Health insurance
Life insurance
Paid time off
Parental leave
Retirement plan
Vision insurance

Language:

Chinese (Required)
