Senior AI Platform Engineer
Job description
We are building an AI Engineering function to enable productivity and agentic capabilities across the firm, for end users, developers, and business teams.
As a Senior AI Platform Engineer, you will design and own the shared platform that powers AI systems firm-wide: inference services, agentic platforms, developer tooling, and observability.
This is a financial services environment where data protection, auditability, and regulatory compliance are foundational requirements. You will ensure that AI capabilities are secure by default, auditable end-to-end, and easy for engineering teams to adopt.
You will report to the Head of AI Engineering and partner closely with Security Engineering, AI Integration/Application teams, and core infrastructure groups.
Responsibilities
Platform Infrastructure
- Design, build, and operate the core AI platform, including managed LLM inference services (Amazon Bedrock and related services), model access management, versioning, and routing across foundation models
- Design and operate shared integration layers, including MCP servers, an MCP registry/gateway, and authorization services that connect AI platforms with core firm systems
- Design and operate AI productivity data pipelines and dashboards for usage, cost, and adoption metrics
- Design the infrastructure that supports AI-assisted developer tooling (Linux VDI environments), office productivity integrations (M365/Excel), and autonomous agent frameworks
- Develop standardized inference and agentic AI platforms that teams can adopt across use cases, including reusable components for RAG, vector databases, and model integration patterns
Security & Guardrails
- Partner with Security Engineering to embed security controls across the full AI lifecycle
- Design, with the AI Security Engineer and infrastructure/platform teams, the controls that prevent destructive agent actions: filesystem permissions, IAM policies, network allowlists, sandbox configurations, and execution-time policy enforcement
- Architect a default-deny posture: agents and tools access only explicitly permitted resources, with no ability to modify or delete production data unless specifically authorized through a human-approval workflow
- Implement pre-execution guardrails (hooks, policy engines) that intercept and validate agent actions before they run
- Ensure AI workloads operate within the corporate network boundary: VPC endpoints, PrivateLink, no public internet egress for inference traffic
Enablement & Scale
- Build self-service onboarding so teams can consume AI platform services with appropriate access controls
- Design systems that enable cost-effective operation of AI workloads, including quota management and chargeback visibility
- Operate firm-wide AI applications and centrally managed AI services
- Define reference architectures and patterns that other engineering teams use to build on the platform
Requirements
- 10+ years as an infrastructure, platform, or systems engineer, with demonstrated experience building and operating shared services consumed by multiple teams, on-premises and on AWS
- Strong expertise in Amazon Bedrock (inference and AgentCore) and Azure OpenAI
- Strong expertise in designing and implementing MCP registries, gateways, servers, and authorization flows
- Hands-on experience supporting LLM-based workloads in production environments
- Experience designing and enforcing AI security controls at the platform layer in a regulated or security-sensitive environment
- Track record of building production-quality agentic AI patterns: tool use, function calling, MCP gateway/servers, retrieval-augmented generation, human-in-the-loop workflows
- Track record of building production-quality platforms and developer-facing services, with emphasis on usability, standardization, and reliability
- Strong written and verbal communication skills, with the ability to work effectively across security, application, and infrastructure teams
Preferred Qualifications
- Experience in financial services, healthcare, or another heavily regulated industry
- Experience with Microsoft 365 Copilot / Copilot agents
- Experience building observability pipelines (Splunk, ELK, Datadog, or Grafana)
- Familiarity with containerized and Kubernetes-based environments
- Experience with model fine-tuning workflows and ML lifecycle tooling
- Familiarity with DLP tooling and data classification frameworks
Benefits & conditions
The base salary range for this position is $200,000 - $325,000 per year.