Legal Data Acquisition Engineer (Scraping & Extraction)
Job description
We're looking for an engineer who loves the messy reality of web data: dynamic pages, broken markup, inconsistent PDFs, changing source structures, missing metadata, rate limits, anti-bot protections, and jurisdiction-specific publishing habits.
This is a behind-the-scenes role with huge product impact. You'll build and maintain the systems that continuously collect and extract legal content from websites, APIs, bulk files, and document repositories, and turn it into reliable inputs for our AI products.
If you enjoy scraping, parsing, reverse-engineering content structures, and designing robust ingestion pipelines that survive real-world change, this role is for you.
About Omnilex
Omnilex is a young, dynamic AI legal tech startup with roots at ETH Zurich. Our interdisciplinary team is building AI-native tools for legal research and answering complex legal questions across jurisdictions.
A core reason we stand out is our data foundation: combining external legal sources, customer-internal sources, and our own AI-first legal content. This role strengthens that foundation.
What You'll Work On
Your focus will be source acquisition, scraping, parsing, and extraction reliability for legal data.
Core responsibilities
Build and maintain resilient pipelines to ingest legal content from:
- public websites
- APIs
- document portals
- bulk datasets
- PDFs / HTML / XML / DOCX-like formats
Design scraping systems that are robust to:
- layout changes
- pagination quirks
- JavaScript-rendered sites
- inconsistent metadata
- rate limits and retry behavior (a retry/backoff sketch follows this list)
- Implement parsers and extractors for legal documents (statutes, decisions, guidance, commentaries, etc.)
Extract and structure:
- document text
- headings/sections
- citations and references (see the extraction sketch after this list)
- dates, courts, authorities, identifiers
- language / jurisdiction metadata
- Build source-specific adapters and reusable extraction components (rather than one-off scripts)
- Monitor source health and detect breakage quickly (e.g., selector failures, coverage drops, schema drift)
- Improve data quality with validation checks, deduplication, canonicalization, and content versioning
- Work closely with AI/data/search teammates so extracted data is optimized for downstream indexing, RAG, and analytics
- Document source behavior and operational playbooks so ingestion remains maintainable as we scale
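To make the robustness bullet concrete, here is a minimal sketch of per-host rate limiting with retry and exponential backoff. Everything in it (the `politeFetch` name, the delay, the user-agent string, the thresholds) is an illustrative assumption, not a description of our production stack.

```typescript
// Minimal sketch: a polite fetcher with a per-host delay and exponential
// backoff on transient failures. Names and thresholds are illustrative.
const lastHitPerHost = new Map<string, number>();
const MIN_DELAY_MS = 1_000; // assumed politeness delay between hits to one host

async function politeFetch(url: string, maxRetries = 4): Promise<string> {
  const host = new URL(url).host;

  for (let attempt = 0; ; attempt++) {
    // Respect the per-host rate limit before every attempt.
    const waitMs = (lastHitPerHost.get(host) ?? 0) + MIN_DELAY_MS - Date.now();
    if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
    lastHitPerHost.set(host, Date.now());

    let res: Response | undefined;
    try {
      res = await fetch(url, { headers: { "user-agent": "omnilex-ingest/0.1" } });
    } catch {
      res = undefined; // network-level error: treat as transient
    }
    if (res?.ok) return await res.text();

    // Retry only transient failures: network errors, 429, and 5xx.
    const transient = res === undefined || res.status === 429 || res.status >= 500;
    if (!transient || attempt >= maxRetries) {
      throw new Error(`GET ${url} failed (${res?.status ?? "network error"})`);
    }

    // Exponential backoff with jitter before the next attempt.
    const backoffMs = 500 * 2 ** attempt + Math.random() * 250;
    await new Promise((r) => setTimeout(r, backoffMs));
  }
}
```

The point of the sketch is the separation of transient from permanent failures; a real pipeline would also respect robots.txt and per-source crawl policies.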
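A second sketch, for the "extract and structure" item above. ECLI identifiers and BGE-style citations are real formats used by Swiss and European courts, but the patterns and the `ExtractedDoc` shape below are simplified illustrations; real extractors are source-specific and far more careful.

```typescript
// Minimal sketch: structuring citations and dates out of extracted text.
// ECLI and BGE are real citation formats, but these patterns are simplified
// and the ExtractedDoc shape is a made-up example.
interface ExtractedDoc {
  text: string;
  eclis: string[];        // e.g. "ECLI:CH:BGER:2019:4A_125/2019"
  bgeCitations: string[]; // e.g. "BGE 141 III 28"
  decisionDate?: string;  // ISO 8601 if a dd.mm.yyyy date was found
}

const ECLI_RE = /ECLI:[A-Z]{2}:[A-Z0-9]+:\d{4}:[A-Z0-9._/-]+/g;
const BGE_RE = /BGE\s+\d{1,3}\s+(?:I{1,3}|IV|V)\s+\d{1,4}/g;
const DATE_RE = /\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b/; // common in Swiss sources

function extractMetadata(text: string): ExtractedDoc {
  const date = DATE_RE.exec(text);
  return {
    text,
    eclis: [...new Set(text.match(ECLI_RE) ?? [])],
    bgeCitations: [...new Set(text.match(BGE_RE) ?? [])],
    decisionDate: date
      ? `${date[3]}-${date[2].padStart(2, "0")}-${date[1].padStart(2, "0")}`
      : undefined,
  };
}
```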
What Success Looks Like
In this role, success is not "number of scrapers written." Success looks like:
- high source coverage across target jurisdictions
- fast detection and repair when sources change
- clean, structured extractions with fewer downstream fixes
- stable ingestion SLAs and predictable runtimes/costs
- reusable tooling that makes each new source faster to add than the last
Requirements
- Degree in Computer Science, Data Science, Software Engineering, or a related field, or equivalent practical experience
- Strong hands-on engineering experience with TypeScript (backend/data pipeline context)
- Real experience building web scraping / crawling / extraction pipelines in production
- Strong understanding of HTML/DOM parsing, HTTP, pagination, sessions/cookies, and common web data edge cases
- Experience working with messy document formats (especially PDFs) and text extraction challenges
- Good SQL skills (PostgreSQL) and experience storing structured and unstructured content (see the storage sketch after this list)
- Strong debugging skills and a pragmatic mindset: you can make unreliable sources reliable
- Ability to work with ownership in a fast-moving startup
- Full-time availability; on-site in Zurich at least two days per week (hybrid)
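To give a feel for the PostgreSQL and deduplication points, here is a minimal sketch of content-addressed document storage using the `pg` client. The schema, names, and canonicalization rule are assumptions for illustration only.

```typescript
// Minimal sketch: content-addressed storage with dedup via a canonical hash.
// Table and column names are illustrative, not our actual schema.
import { createHash } from "node:crypto";
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Assumed schema (illustrative):
//   CREATE TABLE documents (
//     source_id   text NOT NULL,
//     source_url  text NOT NULL,
//     content_sha text NOT NULL,        -- hash of the canonicalized text
//     body        text NOT NULL,
//     fetched_at  timestamptz NOT NULL DEFAULT now(),
//     UNIQUE (source_url, content_sha)  -- unchanged content => no new row
//   );

async function upsertDocument(sourceId: string, url: string, body: string): Promise<void> {
  // Canonicalize before hashing so whitespace churn does not create
  // spurious "new versions" of an unchanged document.
  const canonical = body.replace(/\s+/g, " ").trim();
  const sha = createHash("sha256").update(canonical).digest("hex");

  // ON CONFLICT DO NOTHING makes re-crawls idempotent; genuinely changed
  // content gets a new content_sha and therefore a new version row.
  await pool.query(
    `INSERT INTO documents (source_id, source_url, content_sha, body)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (source_url, content_sha) DO NOTHING`,
    [sourceId, url, sha, body],
  );
}
```

The UNIQUE constraint is what turns re-crawls of unchanged pages into cheap no-ops while still versioning real changes.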
Preferred Qualifications
- Familiarity with modern scraping and browser automation tools (e.g. Playwright, Puppeteer; see the sketch after this list)
- Experience with PDF/document tooling, OCR pipelines, and parsing libraries
- Experience designing queue-based or worker-based ingestion systems
- Experience with Azure (including storage/search services), Docker, and CI/CD
- Working proficiency in German and professional proficiency in English
- Swiss work permit or EU/EFTA citizenship
- Experience with legal or regulatory document structures (Switzerland / Germany / EU / US is a plus)
- Familiarity with downstream AI/search use cases (chunking, embeddings, indexing, citation traceability)
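And for the browser-automation point, a minimal Playwright sketch for a JavaScript-rendered listing page; the `.result-item` selector (and the shape of the page it implies) is hypothetical.

```typescript
// Minimal sketch: rendering a JavaScript-heavy listing with Playwright.
// The ".result-item" selector is hypothetical.
import { chromium } from "playwright";

async function fetchRenderedListing(url: string): Promise<string[]> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle" });

    // Wait until the client-side app has rendered the results list,
    // then pull the link texts out of the live DOM.
    await page.waitForSelector(".result-item");
    return await page.locator(".result-item a").allTextContents();
  } finally {
    await browser.close();
  }
}
```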
Nice-to-Have Strengths (But Not Required)
- You enjoy source forensics: inspecting network calls, hidden endpoints, export formats, and content variants
- You think in terms of reusable extraction architecture, not just one-off fixes
- You care about observability and operational quality, not just "it ran once on my machine"
- You like collaborating with product/AI teams to understand what metadata actually matters downstream
Benefits & conditions
- High leverage impact: your work directly improves coverage, freshness, and trust in legal AI answers
- Ownership: own the ingestion/scraping layer end-to-end for key legal sources
- Real engineering challenges: dynamic websites, parsing complexity, document extraction, reliability at scale
- Interdisciplinary team: work closely with engineers, legal experts, and AI specialists
- Compensation: CHF 8'000-12'000 per month + ESOP (employee stock options), depending on experience and skills