Senior Java Engineer - HPC Cluster Development and Maintenance, 100%
Job description
We are seeking a strong Java engineer to develop and maintain a High-Performance Computing (HPC) cluster comprising hundreds of servers (on-premises, augmented by Microsoft Azure). This cluster provides critical computing power to a modern trading floor.
The cluster scheduling and control systems are primarily built in Java, with Apache Ignite as the clustering layer. Jobs are received from internal clients via legacy HTTP and ActiveMQ interfaces, while modern clients use an in-house API developed in Java and Python. System statistics are collected in MongoDB and Elastic/Kibana.
This role requires ensuring high availability for a mission-critical resource, developing a deep understanding of a large existing codebase, and proposing improvements. You will also maintain and enhance multiple monitoring systems for both infrastructure and client job submissions.
YOUR CHALLENGE
- Develop and maintain Java-based cluster scheduling and control systems.
- Ensure high availability and reliability of HPC resources.
- Maintain and improve monitoring systems for infrastructure and client submissions.
- Collaborate with internal stakeholders to balance resource allocation with business requirements.
- Troubleshoot and optimize performance across on-prem and Azure-augmented environments.
- Ensure compliance with banking regulatory requirements.
Requirements
- Distributed Execution Engines
- NoSQL Databases
- ActiveMQ (messaging)
- Front-end Web Technologies (legacy UI maintenance)
- Python (for API and tooling)
Familiarity with:
- Azure Portal and Azure SDK for Java
- Red Hat Linux
- Banking systems and regulatory requirements
- Language: Fluent English; German is a plus
Bonus Skills:
- Scala and C++
- Kubernetes (container orchestration)