Systems Engineer

Recursion Technologies, Inc.

Richardson, United States of America

7 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Richardson, United States of America

Tech stack

API

Data analysis

Systems Engineering

Big Data

Computer Engineering

Data Validation

Data Governance

ETL

Software Debugging

Document Management Systems

Distributed Data Store

GitHub

Hadoop

Hadoop Distributed File System

Monitoring of Systems

Identity and Access Management

Python

Kerberos (Protocol)

Network Troubleshooting

Log Analysis

Prometheus

SQL Databases

SQLAlchemy

Data Streaming

Data Logging

Scripting (Bash/Python/Go/Ruby)

Apache YARN

System Availability

Spark

Kubernetes

Information Technology

Kafka

Data Pipelines

Job description

· Design, develop, and maintain large-scale data processing pipelines using Apache Spark.

· Monitor and troubleshoot Spark job failures, including driver/executor crashes and performance bottlenecks.

· Manage and optimize workloads running on Hadoop (HDFS, YARN) clusters.

· Provide support for onboarding new data pipelines and services into the platform.

· Analyze and resolve resource allocation issues, such as exceeded CPU/memory quotas, in Kubernetes environments.

· Build and maintain ETL pipelines for ingesting, transforming, and loading large datasets.

· Ensure data quality, consistency, and integrity across distributed data systems.

· Implement alerting rules and thresholds using monitoring platforms (e.g., Prometheus-based systems).

· Work with Kafka to manage data streaming pipelines, including topic configuration and access control.

· Troubleshoot Kafka consumer/producer issues, including lag, permissions, and connectivity errors.

· Implement and maintain data retention policies in Lakehouse architectures.

· Perform log analysis and debugging using distributed logging tools.

· Coordinate with infrastructure teams to resolve cluster-level or networking issues.

· Configure and manage storage paths, table-level retention, and lifecycle policies for datasets.

· Develop and execute SQL queries for data analysis, validation, and reporting.

· Automate workflows and monitoring using Python scripts and APIs (e.g., GitHub API, SQLAlchemy).

· Continuously improve system efficiency and scalability while reducing cost.

· Analyze alerts from monitoring systems and take proactive action to prevent outages.

· Investigate production incidents (P1/P2) and perform root cause analysis (RCA).

· Collaborate with cross-functional teams (developers, SREs, data engineers) to resolve system issues.

· Conduct data validation and reconciliation between upstream and downstream systems.

· Maintain dashboards and observability tools (e.g., Hubble, internal monitoring systems).

· Optimize performance of distributed jobs by tuning configurations and execution plans.

· Handle identity and access management issues across systems (Kerberos, service accounts, ACLs).

· Support migration and integration of new data technologies into existing ecosystems.

· Work on Kubernetes-based Spark deployments and troubleshoot pod scheduling and quota issues.

· Ensure high availability and reliability of data pipelines and streaming jobs.

· Participate in on-call rotations and respond to critical production alerts.

· Document system architecture, troubleshooting steps, and operational procedures.

· Ensure compliance with organizational data governance and security standards.

Requirements

· Experience with Python.

· A Bachelor's degree in Computer Science, Computer Engineering, Computer Information Systems, Information Technology, or Data Science is required.
