Local Consultant (Expert in Information Collection and Database Production Systems)

United Nations
Vienna, Austria
24 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English, French, Portuguese
Experience level
Junior

Job location

Remote
Vienna, Austria

Tech stack

API
Artificial Intelligence
Optical Character Recognition (OCR)
Databases
Computer Engineering
Data Mining
JSON
Python
Natural Language Processing
NoSQL
Software Engineering
SQL Databases
Text Mining
Information Technology

Job description

The consultant will perform the following technical activities:
* Survey the formats, standards, and information fields existing in expert reports and in state and Federal Police data entry systems.
* Map the main variables of interest.
* Develop a methodological proposal for data extraction and structuring, including the use of Artificial Intelligence and Natural Language Processing (NLP) tools.
* Perform automated data extraction from a sample set of reports.
* Implement Optical Character Recognition (OCR) and NLP techniques.
* Build a database with the extracted data.
* Create a dictionary of variables and metadata, including details in formats such as JSON Schema (for use in OpenAI and Ollama APIs, for example).
* Write manuals and documents detailing the methodology for replication and institutional memory.
* Participate in periodic alignment meetings with the PNIDD and UNODC teams, reporting on progress, challenges, and necessary adjustments to the implementation schedule.
* Deliver the finalised and approved products in the established formats (.py or .zip for code and scripts; PDF for reports and documentation), observing the defined deadlines and quality requirements.

Requirements

An advanced university degree (Master's degree or equivalent) in Computer Science, Computer Engineering, Software Engineering, Data Science, Artificial Intelligence, Statistics, Economics, Social Sciences, or a related field is required. A first-level university degree in a similar field, in combination with two additional years of qualifying experience, may be accepted in lieu of the advanced university degree.
* One (1) year of proven experience in structured data extraction and processing (text, PDF, images) is required.
* Experience with Python or R alongside libraries focused on data extraction (e.g., pdfminer, PyMuPDF, and pandas for Python, and tesseract, tm, and/or stringi for R) is desirable.
* Experience in applying AI, OCR, and/or NLP for text mining is desirable.
* Experience with database integration and modelling (SQL, NoSQL, APIs, etc.) is desirable.

Languages
English and French are the working languages of the United Nations Secretariat. For this position, fluency in Portuguese, with oral and written proficiency, is required. Working knowledge of English is required. Knowledge of another United Nations official language is an advantage.

Benefits & conditions

results obtained with traditional approaches and with LLMs. It should also contain an illustrated workflow (data pipeline) and the quality and validation criteria for extractions.

PRODUCT 4: Technical document containing code (in Python or R, for example) for extracting data from a sample set of reports. The code delivered must have already been tested in different file formats. The document must also contain step-by-step instructions for applying the code.

PRODUCT 5: Technical document containing the results of the extraction from real samples of reports from some states and the Federal Police. It should contain an assessment of accuracy, limitations, and necessary adjustments.

PRODUCT 6: Technical document containing code revised after sample pre-testing and scripts and technical documentation for the OCR and NLP modules adapted to forensic reports. The document should include text pre-processing (cleaning, tokenisation, and data normalisation).

PRODUCT 7: Unified database with information extracted from reports with a relational or document-oriented structure, ready for analysis. It should include data from all states and the Federal District.

PRODUCT 8: Technical document containing a dictionary of standardised variables and metadata, with a clear definition of each variable, format, unit of measurement and rules for completion, including details in formats such as JSON Schema (for use in OpenAI and Ollama APIs, for example).

PRODUCT 9: Document containing a technical manual and final report on the methodology for replication, including instructions for using the tools, maintenance, and updating. It should also contain an analysis of challenges and recommendations for future expansion.

Work Location
Home based
Expected duration
02.2026 - 01.2027
