Data Engineer

Massachusetts Board of Library Commissioners
Boston, United States of America
9 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$ 94K

Job location

Remote
Boston, United States of America

Tech stack

Microsoft Outlook
Data Cleansing
Data Deduplication
Data Dictionary
Document-Oriented Databases
Python
Regular Expressions
Pandas

Job description

HBS's Baker Library is seeking a temporary Data Engineer to help launch a faculty citation data project aimed at better understanding how its collections support and influence scholarly research. This initiative involves identifying faculty publications, extracting their cited references, and analyzing the relationships within this data to generate meaningful insights into patterns of use and library collection impact. By analyzing citations, the project seeks to surface evidence of how Baker's resources contribute to the research ecosystem at HBS.

Reporting to Baker Library's User Needs and Assessment Librarian, this temporary Data Engineer role will focus on the final phase of the project, where a corpus of raw citation data has already been collected and aggregated from multiple sources. At this stage, the data requires careful cleaning, normalization, and transformation to ensure it is accurate, consistent, and suitable for analysis. The individual in this role will work with this messy dataset to standardize fields, resolve inconsistencies, and prepare the data for downstream analytical work. This phase is critical to ensuring the reliability and interpretability of the project's findings and will directly shape the quality of insights generated about Baker's impact.

Responsibilities

· Clean and normalize raw citation data by resolving inconsistencies in author names, publication titles, journal names, and other variables

· Co-develop and apply standardized schemas for field names and data structures to ensure consistency across the dataset

· Design and implement reproducible data cleaning workflows as reusable scripts

· Co-create or locate unique identifiers (e.g., for authors, works, journals) to enable accurate linking and deduplication across records

· Perform record linkage and deduplication using techniques such as fuzzy matching and string comparison

· Assess and improve data quality by identifying missing, inconsistent, or anomalous values and determining appropriate remediation strategies

· Conduct exploratory analysis to evaluate the completeness and reliability of the dataset, including identifying patterns of data gaps

· Collaborate with project stakeholders to align data cleaning decisions with project goals

· Explore connection points for citation data with other HBS administrative datasets

· Document data transformations, data dictionaries, and workflows to support transparency, reproducibility, and future project phases
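
As an illustration of the normalization, fuzzy matching, and deduplication work listed above, a minimal Python sketch might look like the following. It uses only the standard library. The function names, the 0.9 similarity threshold, and the sample titles are assumptions made for illustration and are not part of the project's actual workflow or data.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a lowercase letter, digit, or space with a space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def deduplicate(titles, threshold=0.9):
    """Greedy pass: keep a title only if it is not close to one already kept.

    The threshold is a hypothetical starting point and would need tuning
    against manually reviewed matches.
    """
    kept = []
    for title in titles:
        if all(similarity(title, k) < threshold for k in kept):
            kept.append(title)
    return kept


# Made-up sample records, for illustration only.
citations = [
    "Managing Innovation in Large Firms: A Field Study",
    "managing innovation in large firms - a field study.",
    "Pricing Strategy Under Uncertainty",
]
unique = deduplicate(citations)
print(unique)
```

The greedy pass compares each record with every record already kept, so it suits small batches. A larger corpus would likely need blocking on a shared key (for example, publication year) before pairwise comparison.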

This temporary, full-time role is 40 hours per week and 100% remote.

Requirements

· Experience working with messy, real-world datasets

· Advanced proficiency in R (preferred), using tidyverse libraries such as dplyr and tidyr, or in Python, using libraries such as pandas

· Familiarity with regular expressions (regex), string comparison, and fuzzy matching

· Proficient understanding of standardization principles and controlled vocabularies

· Ability to balance precision and pragmatism when making decisions in the absence of perfect information

· Comfort documenting processes and decisions for both technical and non-technical audiences

· Ability to work independently while also seeking input when project ambiguity or edge cases arise

· Ability to envision how data cleaning and manipulation serve larger project goals

· Basic understanding of academic publishing and citation formats

· Proficiency in Microsoft Office tools (Outlook email, Teams sites, folder management, file retrieval)

Benefits & conditions

$45.00 / hour. This is a temporary, full-time, remote position. Employees in fully remote positions must work all scheduled hours in a Harvard registered state in compliance with the University's Policy on Employment Outside of Massachusetts. Specific hours and work days will be determined by business needs and are subject to change with appropriate advance notice.
