Principal Engineer - Gen AI Platform Inferencing...

Wells Fargo
Concord, United States of America
8 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$305K

Job location

Concord, United States of America

Tech stack

Artificial Intelligence
Nvidia CUDA
Computer Programming
Concurrency Controls
Python
Load Testing
Open Source Technology
Openshift
Regression Testing
Prometheus
Software Engineering
Large Language Models
Grafana
Kubernetes Helm Charts
Caching
AI Platforms
Kubernetes
Infrastructure Automation Frameworks
Low Latency
Performance Monitor
Build Tools
TensorRT
Decoding

Job description

This is a software engineering role - you'll write code, build systems, and solve hard problems in the AI inference stack. You'll work deep inside frameworks like vLLM, SGLang, and NVIDIA Dynamo, extending and optimizing them to serve models at enterprise scale. You'll also build the automation, tooling, and deployment infrastructure that connects these runtimes to Kubernetes-native serving layers like KServe, KNative, and OpenShift AI.

If you've contributed to inference frameworks, written custom serving logic, or built production ML serving pipelines in Python, we want to hear from you.

In this role, you will:

  • Develop, extend, and optimize inference runtime configurations and integrations across vLLM, SGLang, NVIDIA Dynamo, TensorRT-LLM, and Triton

  • Write Python-based tooling and automation for model onboarding, serving configuration, performance benchmarking, and deployment pipelines

  • Build and maintain Kubernetes-native model serving infrastructure using KServe, KNative, and OpenShift AI - including custom serving runtimes and inference graphs

  • Implement and tune inference performance optimizations - continuous batching, speculative decoding, prefix caching, concurrency control, autoscaling policies, and disaggregated prefill/decode pipelines

  • Develop Helm charts, operators, and Kustomize overlays for deploying and managing inference workloads on OpenShift/OCP

  • Integrate inference platforms with GPU workload orchestrators (Run:AI or similar) - automating project provisioning, quota management, and workload scheduling

  • Build observability and testing harnesses - load testing frameworks, latency/throughput profiling scripts, and regression test suites for inference stack upgrades

  • Partner with AI/ML teams to productionize new models, defining serving architectures, resource requirements, and SLA targets

Employees support our focus on building strong customer relationships balanced with a strong risk-mitigating and compliance-driven culture, which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit's risk appetite and all risk and compliance program requirements.
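To give a flavor of the observability and load-testing work described above, here is a minimal sketch of the kind of Python tooling involved: a helper that turns per-request timing records from a load test into time-to-first-token and token-throughput statistics. The record format and function name are illustrative assumptions, not an existing internal tool.

```python
import statistics

def summarize_inference_run(records):
    """Summarize per-request timing records from an inference load test.

    Each record is a dict with:
      start      - wall-clock time the request was sent (seconds)
      first_tok  - wall-clock time the first token arrived (seconds)
      end        - wall-clock time the last token arrived (seconds)
      tokens     - number of output tokens generated
    """
    ttfts = [r["first_tok"] - r["start"] for r in records]
    total_tokens = sum(r["tokens"] for r in records)
    # Aggregate throughput over the whole run window (tokens/second).
    window = max(r["end"] for r in records) - min(r["start"] for r in records)
    return {
        "ttft_p50_s": statistics.median(ttfts),
        "ttft_max_s": max(ttfts),
        "tokens_per_s": total_tokens / window,
    }
```

In practice the records would come from an async client driving a vLLM or Triton endpoint, and the summary would feed regression gates for inference stack upgrades.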

Requirements

  • 7+ years in software engineering or platform engineering (work experience, training, military experience, or education)

  • 5+ years of programming experience in Python with experience building production systems

Desired Qualifications:

  • Experience with inference frameworks such as vLLM, SGLang, NVIDIA Dynamo, TensorRT-LLM, or Triton Inference Server

  • Experience with Kubernetes-native ML serving such as KServe, KNative, Seldon, or OpenShift AI

  • Experience with inference optimization: continuous batching, speculative decoding, KV-cache management, prefix caching, quantization-aware serving (FP8, AWQ, GPTQ), or tensor parallelism configuration

  • Experience with container platform development: writing Helm charts, operators, or custom controllers for OpenShift, GKE, or EKS

  • Experience with GPU workload orchestration (Run:AI, Kueue, Volcano): scripting workload automation, quota management, or scheduler integrations

  • Experience with performance and load testing: building benchmarking tools for token throughput, time-to-first-token, batch latency, and autoscaling behavior

  • Familiarity with NVIDIA GPU fundamentals (CUDA, MIG, NCCL), experience contributing to open-source inference projects, or background in ML observability tooling (Prometheus, Grafana, Arize)
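As a concrete illustration of the concurrency-control experience listed above, here is a minimal sketch of capping in-flight inference requests with an asyncio semaphore. The class and the placeholder model call are hypothetical, standing in for the admission control a real serving runtime applies.

```python
import asyncio

class ConcurrencyLimiter:
    """Cap the number of in-flight inference requests; excess requests
    wait their turn (a simple stand-in for serving-runtime admission control)."""

    def __init__(self, max_in_flight: int):
        self._sem = asyncio.Semaphore(max_in_flight)
        self._active = 0
        self.peak = 0  # highest concurrency actually observed

    async def run(self, coro_fn, *args):
        async with self._sem:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await coro_fn(*args)
            finally:
                self._active -= 1

async def fake_generate(prompt: str) -> str:
    # Placeholder for a real model call (e.g. an HTTP request to a vLLM server).
    await asyncio.sleep(0.01)
    return prompt.upper()

async def run_batch(prompts, max_in_flight=2):
    limiter = ConcurrencyLimiter(max_in_flight)
    results = await asyncio.gather(
        *(limiter.run(fake_generate, p) for p in prompts)
    )
    return results, limiter.peak
```

Production systems layer queueing, timeouts, and autoscaling signals on top of this basic pattern.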

Benefits & conditions

Wells Fargo provides eligible employees with a comprehensive set of benefits, many of which are listed below. Visit Benefits - Wells Fargo Jobs (https://www.wellsfargojobs.com/en/life-at-wells-fargo/benefits) for an overview of the following benefit plans and programs offered to employees.

  • Health benefits

  • 401(k) Plan

  • Paid time off

About the company

Wells Fargo maintains a drug free workplace. Please see our Drug and Alcohol Policy (https://www.wellsfargojobs.com/en/wells-fargo-drug-and-alcohol-policy) to learn more.

Apply for this position