Setting up local Spark Cluster


EMR is a convenient way to spin up a cluster quickly. However, we frequently need to explore locally stored data just as quickly. This guide walks through setting up a minimal local Spark cluster fit for that purpose.

The completed project can be found on GitHub: varokas/spark-demo.

This guide covers:
  1. PySpark
  2. Jupyter Notebook connecting to PySpark
  3. Writing Iceberg Table Locally
  4. Writing Iceberg Table to AWS (TBD)
  5. Reading Parquet files from AWS (TBD)

PySpark

Install Java and Spark via SDKMAN.

$ curl -s "https://get.sdkman.io" | bash

$ sdk install java 17.0.13-zulu
$ sdk install spark 3.5.3

# Use (sdk list java) or (sdk list spark) to see all available versions

Install uv.

# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh

Jupyter Notebook

Use uv to set up the project and add the libraries:

$ uv init 
$ uv add pyspark jupyterlab

Then we can bring up the notebook using this command:

$ uv run jupyter lab

Minimally, we can connect by creating a SparkSession using:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master("local[*]") \
      .appName("spark-demo") \
      .config(map={
         ## Configuration goes here
      }) \
      .getOrCreate()
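
To confirm that the session works, we can run a quick sanity check. The tiny DataFrame below is only an illustration.

# Quick sanity check: build a tiny DataFrame and run a simple action
df = spark.createDataFrame(
    [("alice", 1), ("bob", 2), ("carol", 3)],
    ["name", "value"],
)
df.show()
print(df.count())   # 3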

Starting Spark this way creates a Web UI, by default on port 4040. We can inspect the exact address by evaluating:

spark.sparkContext.uiWebUrl

Local Iceberg Table

For a quick experiment with a local Iceberg table, we can configure Spark to store table metadata on the local filesystem. The example below shows how to set up a catalog named local, where the data files are stored at $PWD/warehouse.

import os
cwd = os.getcwd()

spark = SparkSession.builder \
      .master("local[*]") \
      .appName("spark-demo") \
      .config(map={
          "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0",          
          "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",

          "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkSessionCatalog",
          "spark.sql.catalog.local.type": "hive",
          "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
          "spark.sql.catalog.local.type": "hadoop",
          "spark.sql.catalog.local.warehouse": f"{cwd}/warehouse",
      }) \
      .getOrCreate()

After setting up the Spark session, we can create and use Iceberg tables from PySpark, as in the sketch below.
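
For example, a minimal round trip could look like this; the database and table names (db, demo) are only illustrative.

# Create an Iceberg table in the "local" catalog (names are illustrative)
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.demo (
        id BIGINT,
        name STRING
    ) USING iceberg
""")

# Write a few rows and read them back
spark.sql("INSERT INTO local.db.demo VALUES (1, 'alice'), (2, 'bob')")
spark.sql("SELECT * FROM local.db.demo").show()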

Data will be written to the designated warehouse directory.
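
To see where the files end up, we can walk the warehouse directory; exact file names vary from run to run, but each table should contain data/ and metadata/ subdirectories.

import os

# List everything Iceberg wrote under the local warehouse
for root, dirs, files in os.walk(f"{cwd}/warehouse"):
    for name in files:
        print(os.path.join(root, name))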