AI Engineer - Synthetic Data Generation
Job description
- Build multi-step generation pipelines (10+ steps): from DB selection → pseudonymization → extraction → translation → normalization → deduplication → validation → classification → rating → export. (Code sketches illustrating several of these responsibilities follow this list.)
- LLM integration, production-grade: Design robust prompt suites for extraction, translation, classification, and rating; enforce structured JSON outputs; handle retries, partial failures, and weird model behavior.
- Quality assurance & filtering: Implement scoring systems (multi-criteria, consistent rubrics), dedup/near-dup suppression, and deterministic validators (especially for citations).
- Citation processing at legal-grade precision: Extract, normalize, and validate citations across languages and formats (e.g., Art. 336c Abs. 1 OR, BGE 137 III 266 E. 3.2), including abbreviation mapping and normalization rules.
- Cost & throughput optimization: Use batch APIs where appropriate, tune reasoning effort, control concurrency, count tokens, and keep runs cost-efficient (without sacrificing quality).
- Developer tooling & CLI workflows: Build CLIs with progress tracking, configurable concurrency, and solid ergonomics for long-running jobs.
- Testing across levels: Write unit/smoke/integration tests for pipelines and validators (including mocked LLMs where sensible and real API runs where needed).
- Cross-team collaboration: Work closely with legal experts to define what "good" looks like for exam questions/commentaries, and translate that into measurable QA checks.
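To give a concrete flavor of a few of these responsibilities, here are some short, hedged sketches. None of this is our actual codebase; every name in them is illustrative. First, pipeline steps as typed async transforms, so that mismatched step boundaries fail at compile time rather than mid-run:

```ts
// Illustrative sketch only: Step and chain are hypothetical names, not a library.
type Step<In, Out> = {
  name: string;
  run: (input: In) => Promise<Out>;
};

// Compose two steps; the compiler enforces that output and input types line up.
function chain<A, B, C>(first: Step<A, B>, second: Step<B, C>): Step<A, C> {
  return {
    name: `${first.name} -> ${second.name}`,
    run: async (input) => second.run(await first.run(input)),
  };
}

// Example: a crude extraction step feeding a normalization step.
const extract: Step<{ html: string }, { text: string }> = {
  name: "extract",
  run: async ({ html }) => ({ text: html.replace(/<[^>]+>/g, " ").trim() }),
};
const normalize: Step<{ text: string }, { text: string }> = {
  name: "normalize",
  run: async ({ text }) => ({ text: text.replace(/\s+/g, " ") }),
};
const extractThenNormalize = chain(extract, normalize);
```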
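Second, one common way to enforce structured JSON outputs with retries, assuming the zod validation library; callModel is a placeholder for whatever LLM client is in use:

```ts
import { z } from "zod";

// Schema the model's answer must satisfy; anything else triggers a retry.
const Classification = z.object({
  label: z.enum(["relevant", "irrelevant"]),
  confidence: z.number().min(0).max(1),
});

// callModel is a stand-in, not a real client: it takes a prompt, returns raw text.
async function classifyWithRetry(
  callModel: (prompt: string) => Promise<string>,
  prompt: string,
  maxAttempts = 3,
): Promise<z.infer<typeof Classification>> {
  let lastError = "unknown";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const nudge =
      attempt === 1 ? "" : `\nPrevious output was invalid (${lastError}). Return only valid JSON.`;
    const raw = await callModel(prompt + nudge);
    try {
      const parsed = Classification.safeParse(JSON.parse(raw));
      if (parsed.success) return parsed.data;
      lastError = parsed.error.message;
    } catch {
      lastError = "response was not parseable JSON";
    }
  }
  throw new Error(`no valid structured output after ${maxAttempts} attempts: ${lastError}`);
}
```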
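Third, the kind of deterministic citation validator meant above, restricted to the two shapes quoted in the bullet. The patterns are deliberately narrow; real rules would need to cover many more variants (lit., Ziff., ATF/DTF in French/Italian, and so on):

```ts
// Statute citation, e.g. "Art. 336c Abs. 1 OR" (simplified pattern).
const STATUTE = /^Art\.\s\d+[a-z]?(\sAbs\.\s\d+)?(\slit\.\s[a-z])?\s[A-Z][A-Za-z]+$/;

// Federal Supreme Court decision, e.g. "BGE 137 III 266 E. 3.2" (simplified).
const BGE = /^BGE\s\d{1,3}\s(I|II|III|IV|V)\s\d+(\sE\.\s\d+(\.\d+)*)?$/;

function validateCitation(raw: string): { ok: boolean; kind?: "statute" | "bge" } {
  const s = raw.replace(/\s+/g, " ").trim(); // normalize whitespace first
  if (STATUTE.test(s)) return { ok: true, kind: "statute" };
  if (BGE.test(s)) return { ok: true, kind: "bge" };
  return { ok: false };
}

// validateCitation("Art. 336c Abs. 1 OR")    -> { ok: true, kind: "statute" }
// validateCitation("BGE 137 III 266 E. 3.2") -> { ok: true, kind: "bge" }
```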
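Fourth, for cost and throughput control (and the configurable concurrency mentioned under CLI tooling), a dependency-free concurrency limiter plus a deliberately crude token estimate; estimateTokens is a rough heuristic, not a real tokenizer:

```ts
// Rough pre-flight token estimate (~4 characters per token); illustrative only.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Run a worker over items with at most `concurrency` calls in flight.
async function runWithConcurrency<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    // Safe without locks: the increment is synchronous between awaits.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => lane()));
  return results;
}

// Usage sketch: budget check up front, then cap at 4 concurrent calls.
// const budget = prompts.reduce((n, p) => n + estimateTokens(p), 0);
// const outputs = await runWithConcurrency(prompts, callModel, 4);
```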
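Finally, a Jest-style unit test that scripts a fake model (one malformed response, then a valid one) to exercise the retry path of the classifyWithRetry sketch above; the import path is hypothetical:

```ts
import { describe, expect, it, jest } from "@jest/globals";
import { classifyWithRetry } from "./structured-output"; // hypothetical module path

describe("structured output retry", () => {
  it("recovers from one malformed response", async () => {
    // Scripted fake model: first call malformed, second call valid JSON.
    const fakeModel = jest
      .fn(async (_prompt: string): Promise<string> => '{"label":"irrelevant","confidence":0.5}')
      .mockResolvedValueOnce("not json at all")
      .mockResolvedValueOnce('{"label":"relevant","confidence":0.9}');

    const result = await classifyWithRetry(fakeModel, "Classify this passage.");
    expect(result.label).toBe("relevant");
    expect(fakeModel).toHaveBeenCalledTimes(2);
  });
});
```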
Requirements
Do you have experience in SQL? Do you have a Master's degree? Do you get joy from turning messy legal texts into clean, structured, high-quality datasets that actually improve model behavior? Do you like building pipelines where every step is measurable: extraction quality, citation correctness, dedup rate, cost per item, throughput, and regression stability? Are you comfortable shipping pragmatic tooling (CLIs, validators, tests) around LLMs without hand-waving away edge cases? If so, we'd love to hear from you.
- Experience building backend/data tooling with TypeScript/Node.js (strict typing, generics, async patterns).
- Hands-on experience integrating LLM APIs (OpenAI/Anthropic or similar), including structured outputs (JSON), prompt iteration, and failure handling.
- Strong data pipeline mindset: ETL workflows, transformation steps, validation, and reproducibility.
- Solid SQL/PostgreSQL skills and experience with an ORM (Drizzle is a plus).
- Experience writing reliable tests (e.g., Jest) and maintaining CI-friendly pipelines.
- Fluent English; willing to work hybrid in Zurich (on-site at least two days/week), full-time.
Nice to have
- Familiarity with the Swiss legal system (court structure, citation norms, multilingual legal terminology).
- Working proficiency in German; French and/or Italian is a strong advantage.
- Experience with batch processing and cost-aware LLM operations (token budgeting, batching strategy, caching, early-exit).
- Practical text processing skills: regex-heavy parsing, dedup/near-dup detection, similarity search (e.g., BM25 / MiniSearch); see the sketch after this list.
- Familiarity with our environment: Yarn workspaces/monorepos, NestJS, and pragmatic CLI tooling.
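As referenced in the text-processing bullet above, a minimal near-duplicate check via word-shingle Jaccard similarity; a lightweight stand-in for BM25/MiniSearch on small batches, with all names illustrative:

```ts
// Break text into overlapping 3-word shingles for fuzzy comparison.
function shingles(text: string, size = 3): Set<string> {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + size <= words.length; i++) {
    out.add(words.slice(i, i + size).join(" "));
  }
  return out;
}

// Jaccard similarity: |intersection| / |union| of the two shingle sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  return inter / (a.size + b.size - inter);
}

// Flag pairs above a tunable similarity threshold as near-duplicates.
function isNearDuplicate(x: string, y: string, threshold = 0.8): boolean {
  return jaccard(shingles(x), shingles(y)) >= threshold;
}
```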
Benefits & conditions
- Direct impact: Your datasets will directly shape model quality and evaluation reliability in legal research and reasoning.
- Autonomy & ownership: Own the synthetic data pipeline end-to-end; prompts, validators, QA, exports, and cost controls.
- Team: Work with a sharp interdisciplinary group at the intersection of AI, engineering, and law.
- Compensation: CHF 7'000-11'000 per month + ESOP, depending on experience and skills.
We're excited to hear from candidates who love building robust, cost-aware LLM pipelines and care about precision (especially when citations and multilingual legal nuance matter). Apply today.
Job type: Full-time (100%), permanent
Pay: CHF 84'000-132'000 per year