Process Streaming Data with Spark Structured Streaming

Why Spark

(Image source: Databricks)

Dealing with streaming data

Spark Streaming vs. Spark Structured Streaming
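The original DStream-based Spark Streaming API works on micro-batches exposed as RDDs, while Structured Streaming treats the stream as an unbounded table that you query with the regular DataFrame API. As a rough illustrative sketch only (the socket source on localhost:9999 and the word-count logic are assumptions, and it reuses the SparkSession spark created in the next section), the same job looks like this in each API:

# Legacy Spark Streaming (DStream API): RDD-style transformations per micro-batch
from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, batchDuration=1)
(ssc.socketTextStream("localhost", 9999)
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .pprint())

# Structured Streaming: the same word count as a query on an unbounded DataFrame
from pyspark.sql.functions import explode, split

lines = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())
wordCounts = (lines
    .select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count())

Both compute a running word count, but the Structured Streaming version benefits from the Catalyst optimizer, exactly-once semantics with supported sinks, and event-time features such as watermarks.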

Get started with the Spark Structured Streaming API

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("StructuredStreamingKafkaExample") \
    .getOrCreate()
from pyspark.sql.types import BooleanType, StructType, StringType, TimestampType

schema = (StructType()
    .add("user", StringType())
    .add("isActiveUser", BooleanType())
    .add("timestamp", TimestampType())
    .add("geocoding", StructType()
        .add("city", StringType())
        .add("country", StringType())
        .add("stateProvince", StringType())
    )
)
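To sanity-check what this schema maps to, one option (purely illustrative; the sample record below is an assumption, not data from the article) is to parse a single JSON string in batch mode with from_json:

from pyspark.sql.functions import col, from_json

sample = spark.createDataFrame(
    [('{"user": "alice", "isActiveUser": true, '
      '"timestamp": "2022-01-01T00:00:00Z", '
      '"geocoding": {"city": "Oslo", "country": "Norway", "stateProvince": null}}',)],
    ["value"])

parsed = sample.select(from_json(col("value"), schema).alias("event")).select("event.*")
parsed.printSchema()  # user, isActiveUser, timestamp, plus the nested geocoding struct

The same from_json call is what decodes the Kafka payload later on.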
To connect to Kafka, the stream reader needs:
  • The server endpoint (kafka.bootstrap.servers)
  • The topic to subscribe to: en
  • A location to log checkpoint metadata (supplied when the output query is started)
  • The source format, which here is kafka
kafkaDF = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "your.server.name:port")
    .option("subscribe", "en")
    .load()
)
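Note that the Kafka source does not expose the event fields directly: each record arrives with a binary value column alongside key, topic, partition, offset, and timestamp, as printSchema shows (the exact output may vary slightly across Spark versions):

kafkaDF.printSchema()
# root
#  |-- key: binary (nullable = true)
#  |-- value: binary (nullable = true)
#  |-- topic: string (nullable = true)
#  |-- partition: integer (nullable = true)
#  |-- offset: long (nullable = true)
#  |-- timestamp: timestamp (nullable = true)
#  |-- timestampType: integer (nullable = true)

So the JSON payload has to be decoded with the schema defined earlier before its fields can be filtered on, which is what the next step does.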
from pyspark.sql.functions import col, from_json

# Decode the binary Kafka value as JSON with the schema defined above
jsonDF = kafkaDF.select(from_json(col("value").cast("string"), schema).alias("event"))

filterDF = (jsonDF
    .filter(col("event.geocoding.country").isNotNull())
    .select("event.timestamp", "event.user", "event.geocoding.country", "event.geocoding.city")
    .groupBy("country")
    .count())

filterDF.isStreaming
# True
query = filterDF \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
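The console sink is convenient for debugging and does not require a checkpoint. For a durable sink you would also supply the checkpoint location mentioned earlier. A minimal sketch follows (the parquet format and both paths are assumptions, not from the original example); a file sink only supports append output mode, so it writes the parsed events (jsonDF from above) rather than the aggregated counts:

rawQuery = (jsonDF
    .select("event.*")
    .writeStream
    .format("parquet")
    .option("path", "/tmp/wikipedia-edits")                       # assumed output path
    .option("checkpointLocation", "/tmp/checkpoints/wiki-edits")  # assumed checkpoint path
    .outputMode("append")
    .start())

On restart, Spark uses the checkpoint metadata to resume from the last committed Kafka offsets instead of reprocessing the topic from scratch.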

Conclusion

Structured Streaming lets you express an end-to-end streaming pipeline, from reading a Kafka topic and parsing records against a schema to filtering, aggregating, and writing the results to a sink, with the same DataFrame API used for batch jobs.
