Spark Structured Streaming Demo

Spark Structured Streaming data pipeline that processes movie ratings data in real-time.

Consumes events from a Kafka topic in Avro, transforms and writes to an Apache Iceberg table.

The pipeline handles updates and duplicate events by merging to the destination table based on the event_id.

Late arriving events from more than 5 days ago are discarded (for performance reasons in the merge - to leverage partitioning and avoid full scans).

Data Architecture

Local setup

We spin up a local Kafka cluster with Schema Registry based on the Docker Compose file provided by Confluent.

We install a local Spark Structured Streaming app using uv.

Dependency management

Dependabot is configured to periodically upgrade repo dependencies. See dependabot.yml.

Running instructions

Run the following commands in order:

make setup to install the Spark Structured Streaming app on a local Python env.
make kafka-up to start local Kafka in Docker.
make kafka-create-topic to create the Kafka topic we will use.
make kafka-produce-test-events to start writing messages to the topic.

On a separate console, run:

make streaming-app-run to start the Spark Structured Streaming app.

On a separate console, you can check the output dataset by running:

$ make pyspark
>>> df = spark.read.table("movie_ratings")
>>> df.show()
+--------------------+--------------------+--------------------+------+-----------+----------------+-----------+
|            event_id|             user_id|            movie_id|rating|is_approved|rating_timestamp|rating_date|
+--------------------+--------------------+--------------------+------+-----------+----------------+-----------+
|a41847d0-37de-11f...|a418482a-37de-11f...|a418483e-37de-11f...|   1.8|      false|      1748008982| 2025-05-23|
|a46519c0-37de-11f...|a4651a42-37de-11f...|a4651a60-37de-11f...|   6.9|      false|      1748008982| 2025-05-23|
|a4c15a50-37de-11f...|a4c15ac8-37de-11f...|a4c15ae6-37de-11f...|   5.0|      false|      1748008983| 2025-05-23|
|a79b2b98-37de-11f...|a79b2c10-37de-11f...|a79b2c2e-37de-11f...|   4.0|      false|      1748008988| 2025-05-23|
+--------------------+--------------------+--------------------+------+-----------+----------------+-----------+

Table internal maintenance

The streaming microbatches can produce many small files and constant table snapshots.

In order to tackle these issues, the recommended Iceberg table maintenance operations can be used, see doc.

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.github		.github
checkpoint		checkpoint
movie_ratings_streaming		movie_ratings_streaming
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Makefile		Makefile
README.md		README.md
docker-compose.yaml		docker-compose.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Spark Structured Streaming Demo

Data Architecture

Local setup

Dependency management

Running instructions

Table internal maintenance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

guidok91/spark-structured-streaming-kafka

Folders and files

Latest commit

History

Repository files navigation

Spark Structured Streaming Demo

Data Architecture

Local setup

Dependency management

Running instructions

Table internal maintenance

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages