Data Engineer
Job description
The NATO Communications and Information Agency (NCIA), located in The Hague, Netherlands, is currently processing vast amounts of highly varied data coming from theatre for the purpose of efficient archiving. As part of these activities, the Exploiting Data Science and Artificial Intelligence (EDS&AI) team within the NCIA Chief Technology Office is tasked with applying Big Data and AI technology to prepare, run and adjust pipelines that process various source data into archiving formats and metadata, and to prepare that data for (semantic) search. NATO has an obligation to support national investigations into situations that occurred in theatre. To support the different teams involved as effectively as possible, the EDS&AI team brings the expertise to extract and exploit the vast and varied data at hand, using the Agency's high-performance computing classified sandbox. The EDS&AI team provides the core data science skills and technology needed for big data analysis and AI, and applies innovative technology to data whenever it is not possible to extract value with conventional approaches.
Role Duties and Responsibilities
- Setting up / improving pipelines that process all required documents and that uniquely identify and trace decisions and processing steps. This is to be conducted in the provided classified sandbox environment, with the provided high-performance hardware and toolsets.
- Implementing / improving (missing) pipeline steps for marking duplicate files, based on file attributes, path (structure) and content (similarity), and rules for considering a file or structure a duplicate.
- Extracting document-format records from Functional Area Systems (FAS) databases and from back-ups performed by other means. Archiving SMEs and system SMEs are available for guidance on target formats, source system structure and data interpretation. Each FAS is processed separately.
- Processing various office, image and video file types into the accepted archiving formats and monitoring progress, including extracting metadata and preparing semantic search indexes.
- Automating the registration of all processed documents and their semantic indexes with the sandbox natural language search tool.
- Automating the final copying of all non-duplicate and extracted archive documents, with content and metadata, to the NATO archiving system.
- Reporting status, progress and statistics of the (raw) files being processed into archive formats, metadata and search indexes.
- Delivering full reporting of results, a trace of the pipeline steps taken and (stakeholder-)accepted failures, with quarterly updates.
Requirements
- At least 3 years of practical experience in the field of data science and/or data analytics;
- Experience using data processing/visualization/analytics software packages and development environments, preferably KNIME, VS Code, GitLab, Power BI, Jupyter Lab, and Docker-based APIs;
- Experience with processing Big Data, creating and using containerized building blocks, and running containers (APIs) on Kubernetes clusters;
- Experience with programming/scripting in languages such as Python, R and SQL, and working with data formats such as CSV, XML and JSON;
- Experience performing content extraction from files/databases/systems, and with (LLM-based) embedding models, entity extraction, keyword extraction and content similarity measures;
- Creative, flexible and proactive in overcoming obstacles;
- Good drafting, communication and presentation skills in English, at both technical and non-technical levels;
- High attention to detail and accuracy.
Education
- Master's degree in Computer Science, Engineering or a relevant field.
- A higher degree in Data Science is preferred.
- Valid National or NATO Secret personal security clearance.