SuperSpark: Union of Distributed SQL Engine and Self-Serve Analytics Service


Introduction

Apache Spark is a versatile distributed in-memory framework that lets you work with its core using Scala, Python, SQL, or R. This opens the door for developers from diverse backgrounds to work with Spark directly, with only a few tweaks here and there to get started. After working with this technology for 2.5 years, I wanted to explore something different from the typical spark-submit and Databricks notebook I use to run my code, and for me that was Spark's Distributed SQL Engine feature.

The flow of a Spark application is simple. Say you have a JAR or a Python script containing your code: you initialize a Spark session, which becomes a full-blown custom application with its own configs that narrow down exactly how your code is executed, and once the code finishes, the Spark session is terminated; if you are using a Databricks Jobs cluster, the whole cluster is terminated at that point. In other words, a new Spark session has to be created every time you want to do something in Spark. The Distributed SQL Engine instead allows a permanent Spark session to sit behind a Thrift server, through which ad-hoc queries can be submitted to the cluster and executed right away. Since I have already worked with Trino and Superset, I wanted to see how well Spark SQL would do as a query engine in Superset.
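To make the contrast concrete, here is a rough sketch of the two modes; the master URL, script name, and Thrift port are placeholders, and beeline is simply the JDBC client that ships with Spark.

```bash
# One-shot flow: each run builds its own Spark session (and, on Databricks Jobs, its own
# cluster) and tears everything down when the job ends. my_job.py is a placeholder script.
spark-submit --master spark://<master-host>:7077 my_job.py

# Distributed SQL Engine flow: a long-lived session sits behind the Thrift server, and
# ad-hoc SQL can be fired at it from any JDBC client, e.g. Spark's bundled beeline.
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "SELECT 1"
```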

Apache Superset is a lightweight Flask AppBuilder (FAB) application that lets users visualize data stored in different databases. It has wide support for many query engines, and Spark is one of them. And since we are already here, let us add Delta Lake into the picture too.

Disclaimer: this is a development setup, not a process for setting up a production system!

Objective:

Get Apache Superset to talk to an Apache Spark cluster as a query engine, and create and execute queries on a Delta table.

Setup:  

Hardware: Ubuntu VM with a 6-core CPU and 6 GB of RAM

Software: Java 8, Python 3.9, Apache Spark 3.5, Apache Superset 3.1.0, and Docker

Implementation: 

Spark Cluster:  

Step 1: Download the latest version of Spark; at the time of writing, that is Spark 3.5.

Step 2: Set up Spark after installing Java 8 and Python 3. Do remember to add SPARK_HOME to your PATH variable.

Step 3: Since we are going to use a cluster-based architecture, we need to start a Spark master and two workers (see the command sketch after these steps).

Step 4: Once our Spark UI loads with both workers in ALIVE status, we can move ahead to the next step.
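A minimal sketch of the start-up commands, assuming the standalone defaults; the master hostname, the SPARK_WORKER_INSTANCES trick, and the --cores/--memory values are my assumptions sized for this 6-core / 6 GB VM.

```bash
# Start the standalone master; its web UI comes up on http://localhost:8080 by default.
$SPARK_HOME/sbin/start-master.sh

# Run two workers on the same VM by telling the start script to launch two instances.
# Core/memory flags are assumptions chosen so both workers fit on this machine.
export SPARK_WORKER_INSTANCES=2
$SPARK_HOME/sbin/start-worker.sh spark://<master-host>:7077 --cores 2 --memory 2g

# Both workers should now show up as ALIVE on the master UI.
```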

 

Delta:

Delta can be added as a plugin to the Spark session, giving us the power to create and modify Delta tables.

Step 1: Download all the necessary JARs mentioned in the official Delta documentation.

Step 2: Add the JARs to Spark's library (jars) directory.

Step 3: Restart your Spark session on the master.

You can validate the setup by checking that the Delta catalog shows up in your Spark session's configuration (the relevant settings are sketched below).
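Roughly, the plumbing looks like this; the JAR names and versions are placeholders (use the Delta release that matches your Spark version), while the two config keys come from the Delta Lake documentation.

```bash
# Drop the Delta JARs (names/versions are placeholders) into Spark's jars directory.
cp delta-spark_2.12-<version>.jar delta-storage-<version>.jar $SPARK_HOME/jars/

# Enable the Delta SQL extension and catalog for every session started from this install.
cat >> $SPARK_HOME/conf/spark-defaults.conf <<'EOF'
spark.sql.extensions              io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.delta.catalog.DeltaCatalog
EOF
```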

Thrift Server:

Spark comes bundled with a HiveServer2-compatible Thrift server baked into the framework; we are going to use it as the bridge that connects Superset to the Spark core via JDBC.

  • Start your Thrift server pointing at the Spark master address.
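A minimal sketch of that command; the master URL and the Thrift port (10000 is the HiveServer2 default) are assumptions for this setup.

```bash
# The Thrift server is just another Spark application, so it accepts spark-submit options.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master spark://<master-host>:7077 \
  --hiveconf hive.server2.thrift.port=10000
```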

Superset UI: 

Since Superset is a FAB application, it feels faster running directly in a Python environment, or at least I think so.

Step 1: Create a Python env.

Step 2: Activate the Python env and install Superset using pip.

Step 3: Create a user for accessing the Superset UI.

You can validate by heading over to the Superset UI, which should be at localhost:5000 (the commands are sketched below).
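Roughly, the commands look like this; the virtualenv name, the throwaway secret key, and port 5000 are choices for this particular setup rather than anything Superset mandates.

```bash
# Create and activate an isolated environment, then install Superset from PyPI.
python3.9 -m venv superset-venv
source superset-venv/bin/activate
pip install apache-superset==3.1.0

# Superset 3.x will not start without a secret key; a throwaway value is fine for dev.
export SUPERSET_SECRET_KEY="dev-only-not-for-production"

superset db upgrade            # initialize Superset's metadata database
superset fab create-admin      # create the UI login user (interactive prompts)
superset init                  # set up default roles and permissions
superset run -p 5000           # UI now at http://localhost:5000
```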

Database Connection:

Once you have made it this far, you are just a step away from the whole picture I have been talking about.

For the Thrift server to talk to Superset, we need to install a Python package called PyHive. PyHive acts as the driver for applications that speak to a HiveServer2-style endpoint rather than accessing tables directly; since Spark's Thrift server exposes exactly that interface, PyHive is required.
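It goes into the same environment as Superset; the extras below are what I would expect to need for the Hive dialect, so treat them as an assumption and prune if your install differs.

```bash
# Install the HiveServer2 driver that Superset's "Spark SQL" / "Hive" dialect relies on.
pip install "pyhive[hive]" thrift thrift-sasl
```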

Step 1: Head over to the Superset UI.

Step 2: Go to Settings -> Database Connections.

Step 3: If everything is done right, you should see Spark SQL in the dropdown list.

Step 4: This is a very important step, so if it is not working you have probably missed something in the setup steps above. Add the connection string (Superset's SQLAlchemy URI field) like so:
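A sketch of the URI; the host, port, and database name are placeholders for wherever your Thrift server is listening.

```
hive://localhost:10000/default
```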

You can validate that everything is working by clicking Test Connection to see whether Superset can connect to the Spark cluster.

Usage:

Once you have set up the database connection successfully, let's create a Delta table.

Step 1: Go to the SQL tab in the Superset UI and click on SQL Lab.

Step 2: Create a Delta table. I already have a dataset with me, so I am going to create a table to accommodate it.

Step 3: Insert some values into the table; I am inserting 1,000 rows.

Step 4: Finally, display the data you just inserted with a select * from the users table (the statements are sketched below).
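A rough sketch of what I ran in SQL Lab; the users table and its columns are assumptions standing in for my actual dataset.

```sql
-- Create a Delta-backed table (the columns are placeholders for my dataset's schema).
CREATE TABLE users (
  id   INT,
  name STRING,
  city STRING
) USING DELTA;

-- Load some rows; in practice I kept inserting until the table held about 1,000 rows.
INSERT INTO users VALUES
  (1, 'Alice', 'Mumbai'),
  (2, 'Bob',   'Bengaluru');

-- Read it back.
SELECT * FROM users;
```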

Conclusion:

We have successfully connected Apache Spark and Apache Superset, and we can now query Delta tables directly from a UI. We achieved all of this using only open-source technologies, with no paid services involved.


Staline Edson D'souza
