Nico Martin

From ML to LLM: On-device AI in the Browser

His journey started with detecting filler words in his native dialect. It ended with running a private LLM in the browser to answer questions about any PDF.

#1 · about 2 minutes

Using machine learning to detect verbal filler words

A personal project to detect and count filler words in Swiss German speech highlights the limitations of standard speech-to-text APIs.
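
For context, a minimal sketch of the standard approach the project ran into: the browser's built-in Web Speech API returns polished transcripts, so disfluencies are stripped before your code ever sees them, and the spoken Swiss German dialect is not a supported recognition language.

```ts
// Sketch: Chrome's built-in speech recognition (webkit-prefixed).
// It returns cleaned-up transcripts, so filler words like "ähm" are
// usually dropped, and there is no Swiss German dialect option.
const recognition = new (window as any).webkitSpeechRecognition();
recognition.lang = 'de-DE'; // closest supported option, not the dialect
recognition.continuous = true;
recognition.onresult = (event: any) => {
  const latest = event.results[event.results.length - 1];
  console.log(latest[0].transcript); // fillers rarely appear here
};
recognition.start();
```

This gap is what motivates a custom on-device model that classifies raw audio instead of relying on a transcript.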

#2 · about 2 minutes

Comparing TensorFlow.js backends for performance

TensorFlow.js performance depends on the chosen backend, with WebGPU offering significant speed improvements over CPU, WebAssembly, and WebGL.
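
A small sketch of how the backend is selected in TensorFlow.js; each backend ships as its own package under the official @tensorflow scope:

```ts
import * as tf from '@tensorflow/tfjs'; // bundles the cpu and webgl backends
import '@tensorflow/tfjs-backend-wasm'; // registers the WebAssembly backend
import '@tensorflow/tfjs-backend-webgpu'; // registers the WebGPU backend

// Prefer WebGPU, then fall back through WebGL and WASM to plain CPU.
for (const backend of ['webgpu', 'webgl', 'wasm', 'cpu']) {
  if (await tf.setBackend(backend)) break; // resolves to false if unsupported
}
await tf.ready();
console.log(`Running on: ${tf.getBackend()}`);
```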

#3 · about 2 minutes

Real-time face landmark detection with WebGPU

A live demo showcases how the WebGPU backend in TensorFlow.js achieves 30 frames per second for face detection, far outpacing CPU and WebGL.
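
The demo's exact code is not shown here, but with the official face-landmarks-detection package the setup looks roughly like this sketch (model and runtime choices are assumptions):

```ts
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu';
import * as faceLandmarksDetection from '@tensorflow-models/face-landmarks-detection';

await tf.setBackend('webgpu');
await tf.ready();

const detector = await faceLandmarksDetection.createDetector(
  faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh,
  { runtime: 'tfjs' }
);

const video = document.querySelector('video')!;
async function loop() {
  const faces = await detector.estimateFaces(video);
  // faces[0]?.keypoints holds 468 3D landmarks, ready to draw on a canvas.
  requestAnimationFrame(loop);
}
loop();
```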

#4 · about 1 minute

Building a browser extension for gesture control

A Chrome extension uses a hand landmark detection model to enable website navigation and interaction through pinch gestures.
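
A pinch is easy to derive from hand landmarks: thumb tip and index fingertip close together. A sketch with the hand-pose-detection package (the threshold and the click handling are illustrative):

```ts
import * as handPoseDetection from '@tensorflow-models/hand-pose-detection';

const detector = await handPoseDetection.createDetector(
  handPoseDetection.SupportedModels.MediaPipeHands,
  { runtime: 'tfjs' }
);

// MediaPipe Hands keypoints: index 4 = thumb tip, index 8 = index fingertip.
function isPinching(hand: handPoseDetection.Hand, thresholdPx = 25): boolean {
  const thumb = hand.keypoints[4];
  const index = hand.keypoints[8];
  return Math.hypot(thumb.x - index.x, thumb.y - index.y) < thresholdPx;
}

const video = document.querySelector('video')!;
const hands = await detector.estimateHands(video);
if (hands[0] && isPinching(hands[0])) {
  // In the extension, this would trigger a click at the projected cursor.
  console.log('pinch detected');
}
```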

#5 · about 2 minutes

Training a custom speech model with Teachable Machine

Teachable Machine provides a no-code interface to train a custom speech command model directly in the browser for recognizing specific words.
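
Teachable Machine exports are loadable with the speech-commands package; a sketch, with the model URL as a placeholder for your own export:

```ts
import * as speechCommands from '@tensorflow-models/speech-commands';

// Placeholder URL: Teachable Machine gives you this after training.
const base = 'https://teachablemachine.withgoogle.com/models/YOUR_MODEL_ID';
const recognizer = speechCommands.create(
  'BROWSER_FFT',
  undefined,
  `${base}/model.json`,
  `${base}/metadata.json`
);
await recognizer.ensureModelLoaded();

recognizer.listen(async result => {
  const scores = result.scores as Float32Array;
  const labels = recognizer.wordLabels();
  const best = scores.indexOf(Math.max(...scores));
  console.log(`Heard: ${labels[best]}`); // e.g. your trained filler word
}, { probabilityThreshold: 0.9 });
```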

#6 · about 2 minutes

The technical challenges of running LLMs in browsers

To run LLMs on-device, we must understand their internal workings, from tokenizers that convert text to numbers to the massive model weights.
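
The tokenizer step is easy to see in isolation; a sketch with Transformers.js, where the GPT-2 tokenizer stands in for whichever model you target:

```ts
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt2');

const ids = tokenizer.encode('On-device AI in the browser');
console.log(ids); // an array of integer token ids
console.log(tokenizer.decode(ids)); // round-trips back to the text
```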

#7 · about 2 minutes

Reducing LLM size for browser use with quantization

Quantization is a key technique for reducing the file size of LLM weights by using lower-precision numbers, making them feasible for browser deployment.
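
The arithmetic makes the motivation clear: 2B parameters at 32-bit floats is roughly 8 GB of weights, while 4-bit quantization brings that down to roughly 1 GB. A toy symmetric int8 quantizer shows the core idea (production schemes quantize per block and handle outliers):

```ts
// Store one scale plus int8 values instead of float32 (4x smaller).
function quantizeInt8(weights: Float32Array): { scale: number; q: Int8Array } {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs / 127 || 1; // guard against an all-zero tensor
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) q[i] = Math.round(weights[i] / scale);
  return { scale, q };
}

// Lossy inverse: each weight comes back within half a scale step.
function dequantizeInt8(scale: number, q: Int8Array): Float32Array {
  return Float32Array.from(q, v => v * scale);
}
```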

#8 · about 2 minutes

Running on-device models with the WebLLM library

The WebLLM library, powered by Apache TVM, simplifies the process of loading and running quantized LLMs directly within a web application.
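
In current WebLLM releases the API is OpenAI-compatible; a sketch, assuming a prebuilt Gemma build (the exact model id depends on the library version):

```ts
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Downloads the quantized weights once, then serves them from browser cache.
const engine = await CreateMLCEngine('gemma-2-2b-it-q4f16_1-MLC', {
  initProgressCallback: report => console.log(report.text),
});

const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Why run an LLM in the browser?' }],
});
console.log(reply.choices[0].message.content);
```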

#9 · about 2 minutes

A live demo of on-device text generation

A markdown editor demonstrates fast, local text generation using the Gemma 2B model, with all processing happening in the browser without cloud requests.
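
Continuing the WebLLM sketch above, streaming is what makes the editor feel fast: tokens are appended as they arrive instead of waiting for the full completion (the editor element is hypothetical):

```ts
const editor = document.querySelector('#editor')!; // hypothetical target

// stream: true yields chunks in the familiar OpenAI delta format.
const chunks = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Draft an intro about WebGPU.' }],
  stream: true,
});

for await (const chunk of chunks) {
  editor.append(chunk.choices[0]?.delta?.content ?? '');
}
```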

#10 · about 1 minute

Mitigating LLM hallucinations with RAG

Retrieval-Augmented Generation (RAG) improves LLM accuracy by providing relevant source documents alongside the user's prompt to ground the response in facts.
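
At its core, the retrieval step feeds straight into prompt assembly; a sketch of grounding a question in retrieved chunks:

```ts
// Retrieved chunks are prepended so the model answers from the source
// material rather than from whatever its weights happen to contain.
function buildRagPrompt(question: string, chunks: string[]): string {
  return [
    'Answer the question using only the context below.',
    ...chunks.map((c, i) => `Context ${i + 1}: ${c}`),
    `Question: ${question}`,
  ].join('\n\n');
}
```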

#11 · about 3 minutes

Building an on-device RAG solution for PDFs

A demo application shows how to implement a fully client-side RAG system that processes a PDF and uses vector embeddings to answer questions.
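
A client-side retrieval sketch with Transformers.js (the embedding model and chunking strategy are assumptions): embed each PDF chunk once, embed the question, and rank chunks by similarity.

```ts
import { pipeline } from '@xenova/transformers';

// A small sentence-embedding model that runs comfortably in the browser.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embedText(text: string): Promise<number[]> {
  const out = await embed(text, { pooling: 'mean', normalize: true });
  return Array.from(out.data as Float32Array);
}

// Vectors are normalized, so cosine similarity reduces to a dot product.
const dot = (a: number[], b: number[]) => a.reduce((s, v, i) => s + v * b[i], 0);

async function topChunks(question: string, chunks: string[], k = 3) {
  const q = await embedText(question);
  const scored = await Promise.all(
    chunks.map(async c => ({ c, score: dot(q, await embedText(c)) }))
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, k).map(s => s.c);
}
```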

#12 · about 1 minute

Forcing an LLM to admit when it doesn't know

By instructing the model to use only the provided context, a RAG system can reliably respond that it does not know the answer whenever the answer is missing from the source document.
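
The instruction itself can be as plain as a strict system message (the wording here is illustrative):

```ts
const messages = [
  {
    role: 'system',
    content:
      'Answer strictly from the provided context. If the context does not ' +
      'contain the answer, reply only with: "I don\'t know."',
  },
  // The user message carries the RAG prompt assembled earlier.
  { role: 'user', content: prompt },
];
```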

#13 · about 2 minutes

The future of on-device AI hardware and APIs

The performance of on-device AI is heavily hardware-dependent, but future improvements in chips (NPUs) and browser APIs like WebNN will broaden access.
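
WebNN is still a draft specification shipping behind flags, so the snippet below is an assumption based on the current draft rather than a stable API:

```ts
// Feature-detect WebNN; 'npu' asks the browser for a neural accelerator.
if ('ml' in navigator) {
  const context = await (navigator as any).ml.createContext({ deviceType: 'npu' });
  console.log('WebNN context ready', context);
} else {
  console.log('WebNN not available; fall back to WebGPU or WASM');
}
```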

#14 · about 2 minutes

Key benefits of running AI in the browser

Browser-based AI offers significant advantages: privacy by default, zero installation, high interactivity, and near-unlimited scalability, since each user supplies their own compute.
