Data Scientist II - Big Data R&D, Identity Graph & KYC
Job description
The Big Data R&D team is responsible for building the core identity graph and entity-resolution capabilities that power Socure's KYC and compliance products. In this role, you will help develop graph-based algorithms and data pipelines on massive PII datasets, support modelers with high-quality features, and evaluate new data sources that feed our identity and fraud products. You will work closely with senior data scientists and engineers while developing your skills in large-scale ML, distributed systems, and graph analytics.
- Contribute to the design and implementation of machine learning, data mining, statistical, and graph-based algorithms to analyze very large datasets for identity verification and anomaly detection.
- Analyze large datasets to help develop and refine entity-resolution and identity-matching algorithms that drive Socure's KYC and compliance solutions.
- Build and maintain components of data-processing pipelines (ETL, feature generation, normalization) using tools such as Spark/PySpark and AWS (e.g., EMR, S3).
- Support senior data scientists with feature engineering, data exploration, error analysis, and A/B test setup for new models and signals.
- Help evaluate new third-party and internal data sources: profile data quality, design offline experiments, and summarize impact on coverage and model performance.
- Implement and maintain SQL and Python/R code for data extraction, transformation, and validation; contribute to code reviews and basic testing.
- Provide analytical support to compliance and regulatory product teams, including ad hoc investigations, simple dashboards, and data deep dives.
- Communicate findings in a clear, structured way to peers and cross-functional partners (Product, Engineering, Client Analysis), focusing on key insights and trade-offs.
- Work effectively in a fast-paced, cross-functional environment; demonstrate ownership of well-scoped tasks and follow through to completion.
Requirements
Do you have experience in data-driven problem-solving?
Do you have a Master's degree?
- Master's degree with 2+ years of experience, or Ph.D. with 1+ years of experience in a data science or analytics role, or equivalent practical experience.
- Proficiency in at least one general-purpose programming language used in data science (Python or Scala).
- Solid experience writing and optimizing SQL for large datasets; comfort working in data lake / warehouse environments.
- Hands-on experience with Spark or PySpark and common ML libraries (e.g., scikit-learn, XGBoost); TensorFlow or PyTorch experience is a plus.
- Familiarity with UNIX environments and the AWS ecosystem (e.g., EMR, S3); Databricks experience is a plus.
- Working knowledge of supervised/unsupervised ML and basic statistics (similarity measures, clustering, evaluation metrics).
- Exposure to graph techniques or graph databases (Neo4j, AWS Neptune, GraphFrames) is a strong plus.
- Bonus: experience with Elasticsearch or DynamoDB, or with workflow tools such as Airflow for automating data pipelines.
- Ability to break down loosely defined problems, ask good clarifying questions, and iterate quickly with feedback.
Please note that sponsorship is not available at this time and that you must be located within 45 miles of a talent hub to be considered.