Ayon Roy

PySpark - Combining Machine Learning & Big Data

How do you apply machine learning when your dataset is too big for a single machine? Discover PySpark's powerful, distributed ML pipelines.

#1 about 3 minutes

Combining big data and machine learning for business insights

The exponential growth of data necessitates combining big data processing with machine learning to personalize user experiences and drive revenue.

#2 about 3 minutes

An introduction to the Apache Spark analytics engine

Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs and specialized libraries like Spark SQL and MLlib.

#3 about 4 minutes

Understanding Spark's core data APIs and abstractions

Spark's data abstractions evolved from the low-level Resilient Distributed Dataset (RDD) to the more optimized and user-friendly DataFrame and Dataset APIs.

#4 about 11 minutes

How the Spark cluster architecture enables parallel processing

In Spark's architecture, a driver program requests resources from a cluster manager and schedules tasks on executors, which run on the worker nodes and process partitions of the data in parallel.
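
The driver and executor roles can be illustrated with a small analogy in plain Python. This is not Spark code and involves no cluster manager: it is a rough sketch in which a hypothetical `driver` function splits the data into partitions, a thread pool stands in for the executors, and the driver combines the partial results. All function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Work done on one partition: here, a sum of squares."""
    return sum(x * x for x in partition)


def driver(data, num_executors=4):
    """Plan the work, hand one task per partition to the pool, combine results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


total = driver(list(range(1, 11)))
print(total)  # sum of squares of 1..10 -> 385
```

In real Spark, the executors are separate JVM processes on worker nodes, and the driver ships serialized tasks to them rather than calling functions in the same process.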

#5 about 5 minutes

Using Python with Spark through the PySpark library

PySpark provides a Python API for Spark, using the Py4J library to communicate between the Python process and Spark's core JVM environment.

#6 about 5 minutes

Exploring the key features of the Spark MLlib library

Spark's MLlib offers a comprehensive toolkit for machine learning, including pre-built algorithms, featurization tools, pipelines for workflow management, and model persistence.

#7 about 4 minutes

The standard workflow for machine learning in PySpark

A typical machine learning workflow in Spark involves using DataFrames, applying Transformers for feature engineering, training a model with an Estimator, and orchestrating these steps with a Pipeline.

#8 about 3 minutes

Pre-built algorithms and utilities available in Spark MLlib

MLlib includes common, pre-built algorithms for classification, regression, and clustering, such as logistic regression, linear support vector machines (SVM), and K-means clustering.
