Senior Software Engineer AI
Job description
- Lead data pipeline development: Build and maintain PySpark ETL pipelines with high data quality and performance
- Manage integrations: Establish robust connections to client data sources via APIs and tools like Fivetran, Plaid, and BlackLine's own internal connector ecosystem
- Ensure reliability: Monitor pipeline performance, automate testing, and validate data accuracy
- Optimize for scale: Implement performance improvements (e.g., CDC mechanisms, indexing strategies) for large-scale datasets
- Collaborate & innovate: Work with business stakeholders to refine data requirements and integrate cutting-edge AI and big data technologies
You'll Get To:
Leadership and Strategy:
- Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs).
- Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments.
- Lead incident response and reliability strategies for ML/AI systems.
AI System Deployment and Integration:
- Collaborate with development teams to integrate AI solutions into existing workflows and applications.
- Ensure seamless integration with different platforms and technologies.
- Define and manage the MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance.
- Build CI/CD pipelines that automate LLM agent deployment, policy validation, and prompt evaluation workflows.
- Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics.
- Implement logging, metering, and auditing for agent behavior, function calls, and compliance alignment.
- Create scalable observability systems that track conversation outcomes, factual accuracy, latency, escalation patterns, and safety events.
- Architect end-to-end guardrails for AI agents, including prompt injection protection, identity-aware routing, and tool usage authorization.
- Collaborate cross-functionally to standardize authentication, authorization, and session governance for multi-agent runtimes.
Model Deployment and Integration:
- Architect and standardize model registries and feature stores to support version tracking, lineage, and reproducibility across environments.
- Lead the deployment of machine learning models into production environments, ensuring scalability, reliability, and efficiency.
- Collaborate with software engineers to integrate machine learning models into existing applications and systems.
- Implement and maintain APIs for model inference.
Infrastructure and Environment Management:
- Design and manage training infrastructure, including distributed training orchestration, GPU/TPU resource allocation, and automatic scaling.
- Implement CI/CD for model workflows using pipelines integrated with model validation, bias checks, and rollback automation.
- Build standardized experimentation frameworks for reproducible training, tuning, and deployment cycles (MLflow, W&B, Kubeflow).
- Manage and optimize the infrastructure required for machine learning operations in the cloud.
- Work closely with other teams to ensure the availability, security, and performance of machine learning systems.
Monitoring and Maintenance:
- Implement robust monitoring solutions for deployed machine learning models to detect issues and ensure performance.
- Collaborate with data scientists and engineers to address and resolve model performance and data quality issues.
- Conduct regular system maintenance, updates, and optimizations to ensure optimal performance of machine learning solutions.
Automation and Orchestration:
- Develop and maintain automation scripts and tools for managing machine learning workflows.
- Implement orchestration systems to streamline the end-to-end machine learning lifecycle, from data preparation to model deployment.
Collaboration with Data Science Teams:
- Collaborate with data scientists to understand model requirements and constraints for deployment.
- Facilitate the transition of machine learning models from research to production, ensuring scalability and efficiency.
Performance Optimization:
- Identify and implement optimizations to enhance the performance and efficiency of machine learning models in production.
- Conduct performance analysis and implement improvements based on resource utilization metrics.
Security and Compliance:
- Implement security measures to protect machine learning systems and data.
- Ensure compliance with regulatory requirements and industry standards related to machine learning and data privacy.
- Integrate audit controls, metadata storage, and lineage tracking across ML and AI workflows.
- Ensure complete monitoring and feedback loops, including event logs, evaluations, and automated retraining triggers.
- Enforce secure deployment patterns with Infrastructure-as-Code and cloud-native secrets management.
- Define SLAs, error budgets, and compliance reporting mechanisms for ML and AI systems.
Requirements
- 3+ years of programming experience in languages such as Python, Java, or Scala.
- Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow).
- Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure).
- Deep familiarity with LangChain, LangGraph, ADK, or similar frameworks for agentic system runtime management.
- Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation.
- Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking.
- Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads.
- Proficiency in containerization technologies (e.g., Docker, Kubernetes).
We're Even More Excited If You Have:
Operations and Infrastructure:
- Proficiency in scripting languages (e.g., Bash, Python) for automation.
- Experience with workflow orchestration tools (e.g., Apache Airflow).
- Expertise in managing and optimizing cloud-based infrastructure.
- Familiarity with DevOps practices and tools for automated deployment.
- Understanding of network configurations and security protocols.
Problem-solving and Critical Thinking:
- Ability to define problems, collect and analyze data, and propose innovative solutions. Strong critical thinking skills to evaluate models, identify limitations, and
Adaptability and Learning Agility:
- Comfortable working in a fast-paced, rapidly evolving environment. Proactive in staying up to date with the latest trends, techniques, and technologies in AI/data science.
Benefits & conditions
$145,000.00/Yr - $182,000.00/Yr