Senior Machine Learning Engineer

Cloudflare

San Francisco, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Austin, United States of America

Tech stack

JavaScript

Artificial Intelligence

Airflow

Amazon Web Services (AWS)

Azure

Google BigQuery

Continuous Delivery

Continuous Integration

ETL

Data Systems

Distributed Systems

Monitoring of Systems

Python

PostgreSQL

Machine Learning

TensorFlow

Scientific Computing

TypeScript

Chatbots

PyTorch

React

Large Language Models

Multi-Agent Systems

Database Optimization

Reliability of Systems

Generative AI

Backend

FastAPI

Pytest

Containerization

Scikit Learn

Infrastructure Automation Frameworks

Information Technology

Cloudflare

Web Technologies

Machine Learning Operations

Virtual Agents

Terraform

GPT

Software Version Control

Data Pipelines

Serverless Computing

Docker

Job description

At Cloudflare, we're not looking for people who wait for a polished roadmap; we're looking for the builders who see the cracks in the Internet that everyone else has simply learned to live with. We value candidates who have the instinct to spot a "normalized" problem and the AI-native curiosity to create a solution using the latest tools. Our culture is built on iteration: leveraging AI to ship faster today and make it better tomorrow, while ensuring that every improvement, no matter how small, is shared across the team to lift everyone up. If you're the type of person who values curiosity over bureaucracy and who believes that AI is a partner in solving tough problems to keep the Internet moving forward, you'll fit right in.

We are looking for a visionary and hands-on Lead Machine Learning Engineer to join our Austin team. In this role, you will be the principal architect behind the next generation of our unified AI/ML platform, designing and building the infrastructure that powers everything from traditional predictive models to generative AI, large language models (LLMs), and autonomous agent frameworks.

You will own the end-to-end technical strategy, blueprint, and execution of scalable backend services and data pipelines that support AI-driven applications across go-to-market, engineering, and product teams. Because our products are initiated and owned entirely within the team, you will drive the vision from initial requirements and system design to global deployment, optimization, and long-term evolutionary ownership.

What you'll do

Architect and evolve a highly scalable, multi-tenant AI/ML platform that seamlessly unifies traditional ML (classification, regression, forecasting) and Generative AI/LLM orchestration.
Design and implement robust production-grade AI Agents and Advanced Chatbots.
Build reliable execution environments for Multi-Agent Systems, including state management, long-term memory architectures, and Model Context Protocol (MCP) server integrations.
Build high-throughput, low-latency application backends and orchestration layers.
Partner closely with data, platform, and full-stack engineers to ensure seamless feature delivery and reliable production operations.
Act as a technical anchor for the Data Science team - enforcing rigorous engineering standards, leading design and security reviews, evaluating build-vs-buy decisions, and mapping business requirements to robust technical designs.

Evaluate trade-offs and drive adoption of modern AI infrastructure tools, optimized embedding pipelines, vector databases, and serverless compute paradigms (such as Workers AI).

Requirements

Extensive experience as a Senior or Lead ML Engineer, with a proven track record of architecting and operating production-grade ML platforms, services and distributed backends.
Strong competency in Traditional ML lifecycles (feature stores, training pipelines, model monitoring) alongside deep experience in Generative AI patterns (RAG pipelines, context engineering, fine-tuning, guardrailing, and agentic AI systems).
Mastery of Python and robust experience with modern backend ecosystems.
Familiarity with (or willingness to collaborate on) full-stack technologies like React and TypeScript is highly valued.

A builder's mindset. You are comfortable navigating ambiguity, shaping your own technical roadmap, adapting as needed, and taking extreme ownership of system reliability, costs, and model performance.

3+ years of dedicated ML Engineering experience within a large-scale, enterprise environment (handling petabyte-scale data and working across globally distributed teams).
Proven ability to architect, scale, and secure reliable, highly observable distributed systems, with a track record of leveling up platform foundations.
Experience mentoring engineers, leading by example through high-quality code and rigorous design reviews, and fostering a culture of technical excellence.
Strong problem-solving skills with a demonstrated ability to independently drive complex projects through ambiguous spaces and collaborate cross-functionally with data engineers, full-stack teams, and analysts.

Hands-on proficiency in building production-grade GenAI applications and multi-agent systems using advanced LLM frameworks like LangGraph, LangChain, or Autogen. Deep understanding of agent harness primitives, state management, memory architectures, and tool-calling loop mechanics.
Experience establishing LLMOps foundations, including automated prompt tracking, LLM evaluation pipelines (e.g., Ragas, TruLens), vector database optimization, context/token management, and real-time guardrailing/moderation layers.
Deep experience in scientific computing using Python (Scikit-Learn, PyTorch, or TensorFlow) and deploying traditional systems for end-to-end training, batch/real-time inference, and model observability.

Strong experience with Docker and Kubernetes for containerization and orchestration, alongside Infrastructure-as-Code tools like Terraform and public cloud ecosystems (GCP, AWS, or Azure).
Hands-on experience with modern MLOps platform tools (e.g., Airflow, Argo Workflows, ArgoCD) and data systems including BigQuery, Postgres, and robust ETL/ELT practices.
Experience with full-stack web technologies and serverless/edge environments (FastAPI, TypeScript/JavaScript, Cloudflare Workers), with the agility to contribute across a multi-language stack.
Strong foundation in continuous integration/continuous deployment (CI/CD), testing frameworks (Pytest), and robust version control practices.

M.S. or Ph.D. in Computer Science, Statistics, Mathematics, or a related quantitative field.
Exceptional written and verbal communication skills, with the ability to translate complex technical architectures into clear concepts for both engineering peers and business stakeholders.

About the company

Cloudflare, Inc. is the leading connectivity cloud company on a mission to help build a better Internet. It empowers organizations to create an application modernization and AI strategy to consume, build, protect, and defend at scale. Cloudflare’s connectivity cloud delivers the most full-featured, unified platform of cloud-native products and developer tools, so any organization can power and protect their applications.

