Senior Data Scientist - Big Data R&D, Identity Graph & KYC
Job description
The Big Data R&D team develops cutting-edge big data and graph-based solutions for entity search, entity resolution, and identity matching that power Socure's KYC and compliance products.
As a Senior Data Scientist I, you will lead the design and deployment of advanced ML and graph algorithms on large-scale PII datasets, own end-to-end projects from problem definition through production validation, and serve as a key technical partner to Product, Engineering, and Client-facing teams. You will help define standards for feature engineering, experimentation, and data quality across our identity graph stack, with substantial impact on coverage, accuracy, and fairness.
What You'll Do
- Own the design, development, and evaluation of machine learning, statistical, and graph-based algorithms for entity resolution, identity trust scoring, and anomaly detection on massive datasets.
- Architect and optimize graph-based identity representations (identity graph structure, linkage rules, clustering) to improve match rates, reduce false positives/negatives, and support downstream fraud and KYC models.
- Build and maintain scalable data pipelines and feature stores in Spark/PySpark (or Scala), including data normalization, deduplication, and feature computation across large PII datasets in AWS/Databricks environments.
- Lead A/B tests and offline/online experimentation for new models, features, and data sources; define success metrics, design experiments, and ensure rigorous validation before rollout.
- Evaluate new internal and external data sources: explore signal quality, design backtests, quantify incremental value, and provide clear recommendations on vendor selection and integration.
- Partner closely with product managers and engineers to translate ambiguous business and regulatory requirements (e.g., KYC coverage, watchlist matching) into concrete modeling and data roadmaps.
- Provide deep analytical support to Socure's compliance and regulatory product suite, including investigative analyses, root-cause analysis for anomalies, and clear narratives for internal and external stakeholders.
- Contribute to model governance and documentation: clearly explain model logic, data dependencies, limitations, and monitoring plans to internal risk/compliance stakeholders.
- Mentor junior data scientists and engineers on best practices in data exploration, feature engineering, experimentation, and code quality.
- Communicate complex technical concepts and trade-offs in a concise, structured way to both technical and non-technical audiences (e.g., product reviews, customer meetings, internal briefings).
Requirements
- Master's degree with 3+ years of relevant industry experience, or Ph.D. with 1+ years of experience in applied ML / data science roles; background in Computer Science, Statistics, Mathematics, or related quantitative fields preferred.
- Strong proficiency in Python (preferred) or Scala, including experience with ML libraries such as scikit-learn, XGBoost, TensorFlow or PyTorch.
- Extensive experience with Spark or PySpark and distributed data systems (e.g., AWS EMR, Databricks) working on very large, messy datasets.
- Deep understanding of supervised and unsupervised learning, feature engineering, model evaluation, and experiment design (A/B testing, holdout strategies, stratification).
- Experience developing production-quality data pipelines and automated workflows using Airflow or similar orchestration tools.
- Practical familiarity with graph databases and/or graph frameworks (Neo4j, AWS Neptune, GraphFrames, DGL, PyTorch Geometric) and graph algorithms for clustering, link prediction, and community detection is strongly preferred.
- Solid SQL skills and experience working with large-scale analytical data stores.
- Experience in at least one of: identity verification, fraud detection, credit risk, or adjacent high-stakes domains is a plus.
- Demonstrated ability to lead medium-to-large projects end-to-end, make sound trade-off decisions under ambiguity, and influence cross-functional stakeholders with data and clear reasoning.
Please note that sponsorship is not available at this time and that you must be located within 45 miles of a talent hub to be considered.