Databricks Data Engineer
Job description
We are seeking a highly skilled and motivated Databricks Certified Engineer to design, build, and optimize scalable data pipelines and ETL workflows using the Databricks Data Intelligence Platform. The ideal candidate will be responsible for writing robust Python and Spark code, ensuring data quality, and implementing data governance across cloud environments (AWS, Azure, or GCP). This role requires expertise in large-scale data processing, data warehousing principles, and cloud-native solutions.
Roles & Responsibilities:
- Pipeline Development: Design, build, and maintain scalable ETL/ELT data pipelines using PySpark, Delta Lake, Auto Loader, and Databricks Workflows.
- Data Transformation & Processing: Design and process batch and streaming data to support the Medallion Architecture (Bronze, Silver, Gold layers).
- Data Governance & Security: Implement access controls and data masking policies using Unity Catalog to secure Personally Identifiable Information (PII) and ensure compliance.
- Performance Tuning: Optimize Spark jobs, troubleshoot memory bottlenecks, and adjust cluster configurations for cost and compute efficiency.
- Proactive Risk Identification: Identify and address data complexities, hidden challenges, and potential risks in data pipelines and the Databricks ecosystem early, so that data solutions remain robust, secure, and efficient.
- Cross-Functional Collaboration: Partner with Data Scientists and Analysts to curate datasets, support machine learning models (MLflow), and provide integrated reporting.
- Develop and maintain comprehensive documentation for data pipelines, data models, and ETL processes.
- Participate in code reviews to maintain high-quality code standards.
- Troubleshoot and resolve issues in data pipelines and Databricks clusters.
Requirements
Do you have experience in Spark? Do you have a Bachelor's degree?
- Primary Skill Set:
  o Databricks Platform Expertise: In-depth knowledge of the Databricks Data Intelligence Platform, including notebooks, Delta Lake, MLflow, Unity Catalog, Auto Loader, and Databricks Workflows.
  o Databricks Certification: Relevant Databricks certification (Associate or Professional level) validating foundational or advanced skills in the platform.
- Secondary Skill Set:
  o PySpark: Strong proficiency in developing complex data transformations and analytics using PySpark.
  o Apache Iceberg: Experience with Apache Iceberg for open table format management.
- Programming Languages:
  o Python: Expert-level proficiency in Python for data manipulation, scripting, and application development.
  o SQL: Advanced proficiency in SQL for data querying and manipulation.
  o Shell Scripting: Experience with shell scripting for automation and job orchestration.
- Cloud Platforms: Hands-on experience with Databricks deployed on major cloud providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
- Big Data Concepts: Deep understanding of distributed computing, data warehousing principles, ETL/ELT processes, and data modeling.
Good to Have Skills
- DevOps Basics: Familiarity with CI/CD tools (e.g., Databricks Asset Bundles, GitHub Actions, GitLab) and orchestration tools like Apache Airflow.
- Data Warehousing: Knowledge of Hive for data storage and querying.
- Container Orchestration: Familiarity with Kubernetes for deploying and managing containerized applications.
- Version Control: Experience with Git or other version control systems.
Databricks Certification Levels
Depending on seniority, candidates may possess different levels of Databricks credentials:
- Associate Level: Validates foundational skills in writing Spark code, building SQL queries, and utilizing the Databricks workspace.
- Professional Level: Validates advanced skills for production environments, focusing on complex streaming workloads, CI/CD, data governance (Unity Catalog), and high-level performance optimization.
Qualifications: Bachelor of Computer Science