Local Consultant (Expert in Information Collection and Database Production Systems)
Role details
Job location
Tech stack
Job description
The consultant will carry out the following technical activities:
* Survey the formats, standards, and information fields existing in expert reports and in state and Federal Police data entry systems.
* Map the main variables of interest.
* Develop a methodological proposal for data extraction and structuring, including the use of Artificial Intelligence and Natural Language Processing (NLP) tools.
* Perform automated data extraction from a sample set of reports.
* Implement Optical Character Recognition (OCR) and NLP techniques.
* Build a database with the extracted data.
* Create a dictionary of variables and metadata, including details in formats such as JSON Schema (for use in OpenAI and Ollama APIs, for example).
* Write manuals and documents detailing the methodology, for replication and institutional memory.
* Participate in periodic alignment meetings with the PNIDD and UNODC teams, reporting on progress, challenges, and necessary adjustments to the implementation schedule.
* Deliver the finalised and approved products in the established formats (.py or .zip for code and scripts; PDF for reports and documentation), observing the defined deadlines and quality requirements.
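As an illustration of the variable dictionary activity, the following is a minimal sketch of how a dictionary entry might be expressed as a JSON Schema and used to check an extracted record. Every field name, description, and unit here (`report_id`, `issuing_unit`, `report_date`, `sample_mass_g`) is a hypothetical placeholder, not a variable defined in this posting, and the `validate_record` helper is an assumption that covers only required fields, basic types, and unexpected fields rather than the full JSON Schema specification.

```python
import json

# Hypothetical variable dictionary for one extracted report record.
# All field names, descriptions and units are illustrative assumptions.
REPORT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "ReportRecord",
    "type": "object",
    "properties": {
        "report_id": {
            "type": "string",
            "description": "Identifier printed on the report.",
        },
        "issuing_unit": {
            "type": "string",
            "description": "State or Federal Police unit that issued the report.",
        },
        "report_date": {
            "type": "string",
            "description": "Date of issue in ISO 8601 form (YYYY-MM-DD).",
        },
        "sample_mass_g": {
            "type": "number",
            "description": "Mass of the examined sample, in grams.",
        },
    },
    "required": ["report_id", "issuing_unit", "report_date"],
    "additionalProperties": False,
}

# Mapping from the JSON Schema type names used above to Python types.
_JSON_TYPES = {"string": str, "number": (int, float)}


def validate_record(record, schema=REPORT_SCHEMA):
    """Return a list of problems found in `record`.

    Simplified check: required fields, basic types, unexpected fields.
    """
    problems = []
    properties = schema["properties"]
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        if name not in properties:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected field: {name}")
            continue
        type_name = properties[name]["type"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _JSON_TYPES[type_name]):
            problems.append(f"wrong type for {name}: expected {type_name}")
    return problems


if __name__ == "__main__":
    # The serialised schema is what would be stored in the dictionary document.
    print(json.dumps(REPORT_SCHEMA, indent=2))
    sample = {"report_id": "A-001", "issuing_unit": "Unit X", "report_date": "2026-02-01"}
    print(validate_record(sample))
```

Keeping the dictionary as a machine-readable schema means the same definition could document the variables, validate extracted records, and, if the chosen LLM API supports schema-constrained output, describe the expected response structure.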
Requirements
* An advanced university degree (Master's degree or equivalent) in Computer Science, Computer Engineering, Software Engineering, Data Science, Artificial Intelligence, Statistics, Economics, Social Sciences, or a related field is required. A first-level university degree in a similar field, in combination with two additional years of qualifying experience, may be accepted in lieu of the advanced university degree.
* One (1) year of proven experience in structured data extraction and processing (text, PDF, images) is required.
* Experience with Python or R alongside libraries focused on data extraction (e.g., pdfminer, pyMuPDF, and pandas for Python, and tesseract, tm, and/or stringi for R) is desirable.
* Experience in applying AI, OCR, and/or NLP for text mining is desirable.
* Experience with database integration and modelling (SQL, NoSQL, APIs, etc.) is desirable.
Languages
English and French are the working languages of the United Nations Secretariat. For this position, fluency in Portuguese, with oral and written proficiency, is required. Working knowledge of English is required. Knowledge of another United Nations official language is an advantage.
Benefits & conditions
[...] results obtained with traditional approaches and with LLMs. It should also contain an illustrated workflow (data pipeline) and the quality and validation criteria for extractions.
PRODUCT 4: Technical document containing code (in Python or R, for example) for extracting data from a sample set of reports. The code delivered must already have been tested on different file formats. The document must also contain step-by-step instructions for applying the code.
PRODUCT 5: Technical document containing the results of the extraction from real samples of reports from some states and the Federal Police. It should contain an assessment of accuracy, limitations, and necessary adjustments.
PRODUCT 6: Technical document containing the code revised after sample pre-testing, together with scripts and technical documentation for the OCR and NLP modules adapted to forensic reports. The document should cover text pre-processing (cleaning, tokenisation, and data normalisation).
PRODUCT 7: Unified database with information extracted from reports, with a relational or document-oriented structure, ready for analysis. It should include data from all states and the Federal District.
PRODUCT 8: Technical document containing a dictionary of standardised variables and metadata, with a clear definition of each variable, its format, unit of measurement, and rules for completion, including details in formats such as JSON Schema (for use in OpenAI and Ollama APIs, for example).
PRODUCT 9: Document containing a technical manual and final report on the methodology for replication, including instructions for using, maintaining, and updating the tools. It should also contain an analysis of challenges and recommendations for future expansion.
Work Location: Home-based
Expected duration: 02.2026 - 01.2027