Systems Engineer

Recursion Technologies, Inc.
Richardson, United States of America
7 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Tech stack

API
Data analysis
Systems Engineering
Big Data
Computer Engineering
Data Validation
Data Governance
ETL
Software Debugging
Document Management Systems
Distributed Data Store
GitHub
Hadoop
Hadoop Distributed File System
Monitoring of Systems
Identity and Access Management
Python
Kerberos (Protocol)
Network Troubleshooting
Log Analysis
Prometheus
SQL Databases
SQLAlchemy
Data Streaming
Data Logging
Scripting (Bash/Python/Go/Ruby)
Apache YARN
System Availability
Spark
Kubernetes
Information Technology
Kafka
Data Pipelines

Job description

· Design, develop, and maintain large-scale data processing pipelines using Apache Spark.

· Monitor and troubleshoot Spark job failures, including driver/executor crashes and performance bottlenecks.

· Manage and optimize workloads running on Hadoop (HDFS, YARN) clusters.

· Provide support for onboarding new data pipelines and services into the platform.

· Analyze and resolve resource allocation issues, such as exceeded CPU/memory quotas, in Kubernetes environments.

· Build and maintain ETL pipelines for ingesting, transforming, and loading large datasets.

· Ensure data quality, consistency, and integrity across distributed data systems.

· Implement alerting rules and thresholds using monitoring platforms (e.g., Prometheus-based systems).

· Work with Kafka to manage data streaming pipelines, including topic configuration and access control.

· Troubleshoot Kafka consumer/producer issues, including lag, permissions, and connectivity errors.

· Implement and maintain data retention policies in Lakehouse architectures.

· Perform log analysis and debugging using distributed logging tools.

· Coordinate with infrastructure teams to resolve cluster-level or networking issues.

· Configure and manage storage paths, table-level retention, and lifecycle policies for datasets.

· Develop and execute SQL queries for data analysis, validation, and reporting.

· Automate workflows and monitoring using Python scripts and APIs (e.g., GitHub API, SQLAlchemy).

· Continuously improve system efficiency and scalability, and reduce costs.

· Analyze alerts from monitoring systems and take proactive action to prevent outages.

· Investigate production incidents (P1/P2) and perform root cause analysis (RCA).

· Collaborate with cross-functional teams (developers, SREs, data engineers) to resolve system issues.

· Conduct data validation and reconciliation between upstream and downstream systems.

· Maintain dashboards and observability tools (e.g., Hubble, internal monitoring systems).

· Optimize performance of distributed jobs by tuning configurations and execution plans.

· Handle identity and access management issues across systems (Kerberos, service accounts, ACLs).

· Support migration and integration of new data technologies into existing ecosystems.

· Work on Kubernetes-based Spark deployments and troubleshoot pod scheduling and quota issues.

· Ensure high availability and reliability of data pipelines and streaming jobs.

· Participate in on-call rotations and respond to critical production alerts.

· Document system architecture, troubleshooting steps, and operational procedures.

· Ensure compliance with organizational data governance and security standards.

Requirements

· Experience with Python.

· Bachelor's degree in Computer Science, Computer Engineering, Computer Information Systems, Information Technology, or Data Science (required).
