Master Thesis - Graph-based Modeling of Toxicogenomics Data: A Neo4j

Helmholtz Zentrum für Umweltforschung GmbH
Leipzig, Germany

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Leipzig, Germany

Tech stack

Agile Methodologies
Bioinformatics
Collaborative Software
Computational Biology
Databases
Information Engineering
Data Files
Graph Database
Python
Neo4j
Parsing
Software Engineering
SQLite
Data Import/Export
Data Ingestion
GIT
Information Technology
Software Coding

Job description

The UFZ

The Helmholtz Centre for Environmental Research (UFZ) with its 1,100 employees has gained an excellent reputation as an international competence centre for environmental sciences. We are part of the largest scientific organisation in Germany, the Helmholtz Association. Our mission: Our research seeks to find a balance between social development and the long-term protection of our natural resources.

The job

The Comparative Toxicogenomics Database (CTD) contains millions of curated interactions between chemicals, genes, phenotypes, and diseases. Recent research has introduced CGPD tetramers, a structured four-step evidence path linking Chemical-Gene-Phenotype-Disease to support chemical grouping and cumulative risk assessment in regulatory toxicology.

Until now, these CGPD tetramers have been generated from a relational SQLite database that a custom workflow builds from structured CTD data files. However, CTD inherently describes a biological knowledge graph, and a graph database such as Neo4j is well suited to representing and querying these interconnected relationships. A graph-based implementation offers more flexible exploration, more intuitive visualization, and faster extraction of multi-step evidence paths.

This thesis aims to design a flexible and extensible Neo4j framework, implement automated data import from CTD, and develop Cypher queries to extract CGPD tetramers. By creating this graph-based backbone for capturing the molecular response patterns of chemicals, the project lays the groundwork for future extensions such as integrating transcriptomics data, directionality of effects, and advanced chemical grouping. You will not only work with cutting-edge technologies in data engineering and graph analytics but also contribute to an emerging research direction with scientific and regulatory impact. This project offers an excellent opportunity to combine software development with applied systems biology and toxicology.
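To give a feel for the relational starting point, the following is a minimal sketch of how a tetramer-like path could be assembled with joins in SQLite. It is not the existing UFZ workflow: the table names, column names, and placeholder identifiers are hypothetical, and it assumes, as a simplification, that a tetramer is just a chain of chemical-gene, gene-phenotype, and phenotype-disease records.

```python
import sqlite3

# Hypothetical toy schema; the real CTD-derived tables and columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chem_gene (chemical TEXT, gene TEXT);
    CREATE TABLE gene_pheno (gene TEXT, phenotype TEXT);
    CREATE TABLE pheno_disease (phenotype TEXT, disease TEXT);
    """
)

# Placeholder identifiers, not real CTD records.
conn.executemany(
    "INSERT INTO chem_gene VALUES (?, ?)",
    [("chemA", "gene1"), ("chemA", "gene2"), ("chemB", "gene2")],
)
conn.executemany(
    "INSERT INTO gene_pheno VALUES (?, ?)",
    [("gene1", "phenoX"), ("gene2", "phenoY")],
)
conn.executemany(
    "INSERT INTO pheno_disease VALUES (?, ?)",
    [("phenoX", "diseaseP")],
)

# Each hop of the evidence path becomes one JOIN in the relational model.
tetramers = conn.execute(
    """
    SELECT cg.chemical, cg.gene, gp.phenotype, pd.disease
    FROM chem_gene AS cg
    JOIN gene_pheno AS gp ON gp.gene = cg.gene
    JOIN pheno_disease AS pd ON pd.phenotype = gp.phenotype
    ORDER BY 1, 2, 3, 4
    """
).fetchall()

print(tetramers)
conn.close()
```

Every additional hop or filter adds another join in this model, which is one reason a graph representation can be more convenient for multi-step paths.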

Your tasks

  1. Graph Data Model & Database Setup
  • Design a Neo4j schema for relevant entities such as chemicals, genes, phenotypes (GO terms), diseases, tissues, and organisms
  • Implement node and relationship types representing CTD interactions
  2. Data Ingestion Pipeline
  • Build a Python-based workflow to download, parse, and import CTD data into the Neo4j instance
  • Ensure reproducible, versioned imports using indexes and stable identifiers
  3. CGPD Path Extraction
  • Develop Cypher queries to identify Chemical-Gene-Phenotype-Disease tetramers
  • Implement filtering options for organism, tissue group, and evidence
  4. Evaluation & Demonstration
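
The path-extraction task above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the node labels, relationship types, and property names in the Cypher string are hypothetical (the actual schema is part of the thesis work), the query is shown but not executed, and the pure-Python function mimics the same three-hop traversal on placeholder data so that the idea can be tried without a running Neo4j instance. It also assumes that a tetramer is a simple chain of three edges.

```python
from collections import defaultdict

# Hypothetical Cypher query; labels, relationship types and the
# `organism` property are assumptions, not a defined schema.
CGPD_QUERY = """
MATCH (c:Chemical)-[cg:INTERACTS_WITH]->(g:Gene)
      -[:ANNOTATED_WITH]->(p:Phenotype)
      -[:ASSOCIATED_WITH]->(d:Disease)
WHERE cg.organism = $organism
RETURN c.id AS chemical, g.id AS gene, p.id AS phenotype, d.id AS disease
"""


def build_index(edges):
    """Map each source node to its list of (target, properties) pairs."""
    index = defaultdict(list)
    for source, target, props in edges:
        index[source].append((target, props))
    return index


def cgpd_tetramers(chem_gene, gene_pheno, pheno_disease, organism=None):
    """Walk chemical -> gene -> phenotype -> disease over in-memory edges.

    If `organism` is given, only chemical-gene edges whose `organism`
    property matches are followed (mirroring the WHERE clause above).
    """
    gp = build_index(gene_pheno)
    pd = build_index(pheno_disease)
    results = set()
    for chemical, gene, props in chem_gene:
        if organism is not None and props.get("organism") != organism:
            continue
        for phenotype, _ in gp.get(gene, []):
            for disease, _ in pd.get(phenotype, []):
                results.add((chemical, gene, phenotype, disease))
    return sorted(results)


# Placeholder edges, not real CTD records.
chem_gene = [
    ("chemA", "gene1", {"organism": "orgX"}),
    ("chemA", "gene2", {"organism": "orgY"}),
    ("chemB", "gene1", {"organism": "orgY"}),
]
gene_pheno = [("gene1", "phenoX", {}), ("gene2", "phenoX", {})]
pheno_disease = [("phenoX", "diseaseP", {})]

all_paths = cgpd_tetramers(chem_gene, gene_pheno, pheno_disease)
org_x_paths = cgpd_tetramers(chem_gene, gene_pheno, pheno_disease, organism="orgX")
print(len(all_paths), org_x_paths)
```

In a graph database the whole traversal is expressed as a single path pattern, and filters such as organism, tissue group, or evidence become conditions on node or relationship properties.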

The Master's thesis will be supervised on site in Leipzig.

We offer

  • Excellent supervision that supports your personal and professional development

  • Exciting insights into the work of a leading research institute

  • The chance to work in interdisciplinary, international teams and benefit from a wide range of perspectives

  • The opportunity to contribute your own ideas and actively help shape the work right from the start

  • Modern technical equipment and IT service to optimally support your work

Your profile

  • Background in Computer Science/Bioinformatics/Biology/Chemistry
  • Solid programming skills in Python
  • Basic familiarity with databases; experience with Neo4j or graph concepts is a plus
  • Experience with collaborative software development and agile project management with Git
  • Fluent in spoken and written English

Contract limitations

Limited contract

Contact

Your contact for any questions you may have about the job: Dr. Sebastian Canzler, Department Computational Biology & Chemistry, Computation Systems Biology Group

Your application

Please submit your application via our online portal with your cover letter, CV (please omit your photo, age, or marital status) and relevant attachments.

Diversity and Inclusion

The UFZ has a strong commitment to diversity and actively supports equal opportunities for all employees regardless of their origin, religion, ideology, disability, age or sexual identity. We look forward to applications from people who are open-minded and enjoy working in diverse teams.

Application deadline: 30.04.2026
