Senior Data Engineer (Chinese Mandarin Speaker)

Bitus Labs
Irvine, United States of America
8 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English, Chinese
Experience level
Senior
Compensation
$ 130K

Job location

Irvine, United States of America

Tech stack

Query Performance
Java
Airflow
Amazon Web Services (AWS)
Apache HTTP Server
Code Review
Continuous Integration
Data Validation
Information Engineering
Data Governance
Data Infrastructure
ETL
Data Transformation
Data Security
DevOps
Digital Architecture
Memory Management
GitHub
Gradle
Identity and Access Management
Python
Maven
Online Analytical Processing
Operational Databases
Pair Programming
Performance Tuning
Query Optimization
Azure
SQL Databases
Data Streaming
SQL Optimization
Delivery Pipeline
Spark
Boto3
CloudFormation
Pandas
Build Management
Data Lake
PySpark
Data Lineage
Druid
Apache Flink
Production Code
Integration Frameworks
Kafka
Build Tools
Spark Streaming
Machine Learning Operations
Data Lakehouse
Vertica
Terraform
Data Pipelines
Programming Languages

Job description

We are looking for a Senior Data Engineer to join our Data Platform team and take ownership of building and scaling our AWS-based data lakehouse. You will architect and deliver robust, production-grade data pipelines, work closely with data scientists, analytics engineers, and product teams, and set the technical direction for how data flows across the organization. This is a hands-on engineering role - you will write production code in Java and Python every day, while also contributing to platform design decisions, mentoring junior engineers, and driving best practices around data quality, reliability, and governance.

Data Lakehouse Architecture & Development

  • Design and build scalable medallion-architecture data lakehouses (Bronze / Silver / Gold) on AWS S3 using Apache Iceberg table format.
  • Develop and maintain high-throughput ETL/ELT pipelines using AWS Glue, EMR (Spark), and Lambda.
  • Implement schema evolution, partitioning strategies, and compaction processes for Iceberg tables to optimize storage and query performance.
  • Write production-quality pipeline code in both Java and Python, selecting the appropriate language per performance and maintainability requirements.

Real-Time & Batch Streaming

  • Build and operate event-driven data pipelines using Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
  • Design exactly-once and at-least-once processing semantics for streaming workloads using Apache Flink or Spark Structured Streaming on EMR.

AWS Platform Engineering

  • Manage infrastructure as code using AWS CDK or Terraform for repeatable, auditable data platform deployments.
  • Optimize cost and performance across AWS services including S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
  • Implement data security best practices: IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation fine-grained access control.
  • Build and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or equivalent.

Data Quality & Governance

  • Implement data quality frameworks (e.g., Great Expectations, Deequ) and integrate validation steps into pipeline orchestration.
  • Define and enforce data contracts between producing and consuming systems.
  • Contribute to data cataloguing and lineage tracking using AWS Glue Data Catalog or Apache Atlas.

Collaboration & Technical Leadership

  • Partner with data scientists, ML engineers, and analysts to understand data requirements and deliver performant, well-documented datasets.
  • Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming.
  • Document architecture decisions (ADRs) and contribute to the internal engineering knowledge base.

Requirements

  • 5+ years of professional data engineering experience, with at least 3 years on AWS cloud platforms.
  • Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
  • Experience with data lakehouse architectures - medallion pattern, open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).

Programming Languages

  • Java: Strong command of Java (8+) for Spark jobs, custom Iceberg connectors, and performance-critical pipeline components. Familiarity with Maven/Gradle build systems.
  • Python: Proficient in Python 3 for AWS Glue scripts, orchestration logic, data quality checks, and automation tooling. Experience with pandas, PySpark, boto3, and packaging best practices.

AWS Core Services

  • Storage & Compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
  • Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK (Managed Kafka).
  • Orchestration: Step Functions, MWAA (Managed Airflow), or EventBridge Scheduler.
  • Querying: Athena, Redshift, or Redshift Spectrum.
  • Security & Governance: IAM, KMS, Lake Formation, Secrets Manager, VPC.
  • DevOps: AWS CDK or CloudFormation; CodePipeline or equivalent CI/CD tools.

Data Processing Frameworks

  • Apache Spark (PySpark and/or Spark Java API) - distributed transformations, performance tuning, memory management.
  • Apache Iceberg - table maintenance, time travel, snapshot management, partition evolution.
  • SQL - advanced SQL for data transformation, window functions, CTEs, query optimization.

Preferred / Nice to Have

  • AWS Certified Data Engineer - Associate or AWS Certified Solutions Architect certification.
  • Experience with dbt for SQL-based transformation layers on top of the lakehouse.
  • Familiarity with ML platform integration: feature stores (SageMaker Feature Store), model serving data needs, or MLflow experiment tracking.
  • Experience with real-time OLAP engines such as Apache Druid or ClickHouse.
  • Contributions to open-source data tooling or internal platform libraries.
  • Exposure to data mesh or data product thinking - defining domain ownership and data contracts.

Tech Stack at a Glance

Languages

Java (8+), Python 3

Cloud Platform

AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)

Processing

Apache Spark, Apache Flink, Spark Structured Streaming

Table Format

Apache Iceberg (primary), Delta Lake / Hudi (familiarity)

Streaming

Amazon Kinesis, MSK (Kafka), Kinesis Firehose

Orchestration

Apache Airflow (MWAA), AWS Step Functions

IaC & CI/CD

AWS CDK / Terraform, GitHub Actions / CodePipeline

Benefits & conditions

Full-time, Irvine, CA 92618

  • 401(k)
  • 401(k) matching
  • Dental insurance
  • Health insurance
  • Life insurance
  • Paid time off
  • Parental leave
  • Retirement plan
  • Vision insurance

Language:

  • Chinese (Required)
