Unraveling Parquet: Exploring its Format and Comparing with Delta Parquet


Overview:

The Parquet file format is one of the most widely used file formats in modern data solution architectures because it offers a fast, cost-efficient way of reading data. Parquet was created as a collaborative effort by Cloudera and Twitter to address the shortcomings of existing storage formats, particularly for use cases involving large-scale data processing. It was designed to provide efficient storage and retrieval of data for analytics and big data workloads.

 

Parquet files achieve faster read performance for the following reasons:

  1. It stores the data in a hybrid layout (columnar + row-based).
  2. It optimizes selective scans by retrieving only the necessary data from the data pages.
  3. It carries file metadata that helps the query engine achieve faster reads.

 

Alright, let’s dive right into an example and perform a post-mortem to gain a deeper understanding of how Parquet works in practice!

Structure of Parquet File:

A parquet file's structure consists of three parts:-

  1. File Header: This contains a special 4-byte magic number ("PAR1") which identifies the file as a parquet file.
  2. Row Groups: The actual data lives inside the row groups. Each row group contains a logical hierarchy of column chunks. Each column chunk is stored separately and may contain more than one data page, where the data is actually stored in a compressed & encoded form.
  3. File Footer: This is a special section of the parquet file because it contains the file metadata as well as the row group & column chunk metadata, which helps the query engine retrieve data quickly (a small metadata-inspection sketch follows this list).
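To make the footer less abstract, here is a minimal sketch of how this metadata can be inspected. It assumes the pyarrow library and a local file called sales.parquet, both of which are placeholders of mine rather than anything from the original post:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("sales.parquet")   # hypothetical file name

    # File metadata stored in the footer: schema, number of row groups, etc.
    print(pf.metadata)
    print(pf.schema_arrow)                 # column names & data types

    # Walk the row group -> column chunk metadata hierarchy.
    for rg_index in range(pf.metadata.num_row_groups):
        rg = pf.metadata.row_group(rg_index)
        for col_index in range(rg.num_columns):
            chunk = rg.column(col_index)
            stats = chunk.statistics
            print(rg_index,
                  chunk.path_in_schema,      # column name
                  chunk.compression,         # e.g. SNAPPY
                  chunk.encodings,           # how the data pages are encoded
                  chunk.data_page_offset,    # where this chunk's pages start on disk
                  (stats.min, stats.max) if stats is not None else None)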

Overall Structure:-

 

File Header & Row Groups:-

 

 

File Footer:-

I know the structure looks a bit scary, but with an example, by connecting the dots and mapping each component of the structure, we can clearly understand how data is actually retrieved.

So let's go!

Consider the query “select country, region, sum(sales_amount) from table where country=’UK’ group by country, region”.

Now, when the query engine gets the query, it goes to the parquet file and starts the following process:-

1. Scanning the File Header section:

It first scans the file header, whose magic number tells the query engine that this is a parquet file.
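As a quick illustration (the file name is again a placeholder of mine), this header check boils down to reading the first four bytes and comparing them with the magic value PAR1, which a valid parquet file also repeats at its very end:

    # Check the parquet magic number at the start and end of the file.
    with open("sales.parquet", "rb") as f:       # placeholder file name
        header = f.read(4)
        f.seek(-4, 2)                            # jump to the last 4 bytes
        footer_magic = f.read(4)

    print(header == b"PAR1", footer_magic == b"PAR1")   # True, True for a valid file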

 

2. Scanning the File Footer section:

Next, it goes directly to the file footer to check the “File Metadata”.

File Metadata: It contains information such as the schema & data types of each column, statistics, the number of columns, the total number of row groups present in the file & the compression type of the file. This information helps the query engine in the following ways:-

  • It helps the query engine get the data types of the required columns (country, region & sales_amount) and ignore the columns that are not needed. This process of skipping unneeded columns & focusing only on the columns required is called “Projection Pushdown” (see the sketch below).
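In practice, projection pushdown is what you get when you ask a reader for specific columns only. A minimal sketch with pyarrow (the file name is a placeholder):

    import pyarrow.parquet as pq

    # Only the country, region and sales_amount column chunks are read;
    # every other column in the file is skipped entirely.
    table = pq.read_table("sales.parquet",       # placeholder file name
                          columns=["country", "region", "sales_amount"])
    print(table.schema)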

 

3. Fetching the Data from Row Groups (Column Chunks) - Hierarchical storage:

The file metadata extends into row group metadata and column chunk metadata. This step is the most significant one, as it helps the query engine perform a selective scan of the required data. How?

Once the query engine understands the required columns & their data types from the file metadata, it goes to each row group's metadata. Let's say there are three row groups: the query engine starts with a particular row group's metadata, say row group-1, and checks the “Column Chunk Metadata” - but only for the columns that are needed, thanks to Projection Pushdown. This means the column chunk metadata scan happens only for the columns in the SELECT statement - country, region & sales_amount.

Note: A column chunk means the column values for a particular column within a particular row group. To simplify: let's say I have a column “Country” with the values US, UK, INDIA, CANADA. If my parquet file has 3 row groups, those column values may be split across the three row groups, so the country column chunk for row group-1 may hold (US, UK), for row group-2 (INDIA), and for row group-3 (CANADA).

Let's continue with our example.

So for Row Group-1 –

We check the column chunk metadata for:-

Country –

  • Statistics: the min/max values of the country column chunk.
  • Compression Type: tells the query engine how to decompress the data sitting in the data pages.
  • Encoding Type: tells how to decode the data present in the data pages to fetch the required result.
  • Offset: tells the starting position of this column chunk within the row group on disk.

Similarly, we would have the same metadata for Region & Sales_Amount.

Now, what do we gain from this? The query engine goes to each row group's metadata and then to the column chunk metadata to check the statistics for the relevant columns. But why? So it can evaluate whether to skip scanning an entire row group or not!

Wait, let me elaborate. Our query has a where clause, country=’UK’, so the engine goes to each of the row groups and checks their column chunk metadata: if the min/max statistics could contain country=’UK’ it marks the row group; if not, it simply skips it. This ability to scan just the row group & column chunk metadata hierarchy and skip the row groups that don't need scanning is called “Predicate Pushdown”.
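The same idea can be expressed through a reader's filter argument: row groups whose min/max statistics cannot possibly contain country='UK' are skipped without being scanned. A sketch with pyarrow, with the file name again being a placeholder:

    import pyarrow.parquet as pq

    # Row groups whose country min/max range excludes 'UK' are never scanned.
    uk_rows = pq.read_table(
        "sales.parquet",                          # placeholder file name
        columns=["country", "region", "sales_amount"],
        filters=[("country", "=", "UK")],
    )
    print(uk_rows.num_rows)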

After marking all the column chunks of the row groups where the data could reside, it sequentially fetches the data from those column chunks.

It can do so because it knows the start position, compression method, and encoding method for each of those column chunks, which makes data retrieval efficient.

 

4. Reading the Data Pages:

Finally, within the marked column chunks sit the data pages, stored in columnar format, which makes reads much faster and more cost efficient.

After reading the required data, it performs the aggregation of sales_amount by country and region and presents the result in minimal time.
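Putting the whole walkthrough together, the original SQL can also be written as a small Spark job. The path and session setup below are my own assumptions, but projection pushdown, predicate pushdown and the final aggregation all happen exactly as described above:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    sales = spark.read.parquet("/data/sales_parquet")    # hypothetical path

    result = (sales
              .filter(F.col("country") == "UK")          # predicate pushdown
              .groupBy("country", "region")              # only 3 columns are read
              .agg(F.sum("sales_amount").alias("total_sales")))

    result.show()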

 

Some characteristics of the parquet file format:-

  1. It is based on hybrid storage: the data is stored column-wise, but those column values are split across row groups, so it takes advantage of both row-based & column-based storage formats.
  2. It allows you to append data with a different schema to a parquet dataset (folder); technically, it allows schema evolution.
  3. It is immutable, meaning you cannot rename a column or change data types in place. You are required to rewrite the entire file with the new schema.
  4. It doesn't keep up its best performance when a dataset is stored as a lot of small parquet files, because every file footer & its file metadata has to be scanned to find the data, which can be expensive & time consuming. There are ways to deal with this, such as file compaction (see the sketch after this list).
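One common way to deal with the small-file problem mentioned in point 4 is to periodically compact the dataset into fewer, larger files. A hedged sketch in Spark, where the paths and the target file count are illustrative choices of mine:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the many small parquet files and rewrite them as a handful of larger ones.
    small_files = spark.read.parquet("/data/sales_parquet")    # illustrative paths
    (small_files
     .repartition(4)                      # target number of output files
     .write.mode("overwrite")
     .parquet("/data/sales_parquet_compacted"))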

 

The industry has moved one step ahead of the parquet file with what is popularly known as the Delta Parquet format (Delta Lake). This is the open-source table format used to store data in OneLake (Fabric).

Delta stores the data itself in parquet files and records any changes applied to the table in a “transaction log” (the _delta_log folder).
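On disk, a Delta table is therefore just a folder of ordinary parquet data files plus a _delta_log folder of JSON commit files. A quick way to see this for yourself (the path is a placeholder and assumes a table already written with Delta Lake):

    import os

    # A Delta table folder typically contains part-*.parquet data files and a
    # _delta_log/ folder with commit files such as 00000000000000000000.json.
    for root, _, files in os.walk("/data/sales_delta"):   # placeholder path
        for name in files:
            print(os.path.join(root, name))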

The Delta Lake format is an advanced form of the parquet file. What separates the two?

  1. Delta Tables (aka Delta Parquet) have ACID properties, which plain parquet files lack: a half-done transaction can leave data behind in a parquet file and lead to file corruption or issues in downstream applications. Delta Tables roll back a transaction if it is stopped for any reason.
  2. Delta Tables have schema enforcement and also support schema evolution. Schema enforcement means the user predefines the schema of the table and any ingested data is checked against it; if there is a mismatch, the write to the delta table is not allowed. This leads to data consistency and integrity. A parquet file has no schema enforcement, which means that if data has a different schema than the data already present, it still lets you write it, which can lead to many data issues.
  3. Delta Tables have separate logical & physical storage, which lets users change column names or data types, because those changes happen in the logical layer - the metadata kept in the delta log. A parquet file, on the other hand, stores its metadata in the file footer itself, so the entire file has to be rewritten.

 

Schema Evolution in Parquet vs Delta Tables:-

Let's say we have a dataset with id, name, age and country as its schema:-

We store this DataFrame in parquet format. The parquet files look like this-

Four parquet files are created under the person-folder, containing the data of the DataFrame. We will now try to append another DataFrame, which will show us whether or not parquet files allow schema evolution (meaning whether a different schema can be embedded in the same parquet dataset)…

                                             Fig: (Parquet File supported schema evolution)

On appending another DataFrame with an additional column, Is_human, the write was allowed into the parquet folder.
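Since the original screenshots are not reproduced here, a minimal PySpark sketch of this experiment might look like the following. The person-folder path and the sample rows are illustrative, not the exact values used in the post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame(
        [(1, "Asha", 30, "INDIA")], ["id", "name", "age", "country"])
    df1.write.mode("overwrite").parquet("Files/person-folder")   # illustrative path

    # Append a DataFrame with an extra column -- plain parquet accepts it.
    df2 = spark.createDataFrame(
        [(2, "John", 25, "UK", True)], ["id", "name", "age", "country", "Is_human"])
    df2.write.mode("append").parquet("Files/person-folder")

    # The extra column shows up once the schemas are merged on read.
    merged = spark.read.option("mergeSchema", "true").parquet("Files/person-folder")
    merged.printSchema()
    merged.show()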

The result set for the parquet file is as follows:-

                                                                       Fig :(Result Set of Parquet File)

 

But what would happen if I did the same with a Delta table?

Let's find out!

                                Fig:-(Result – Schema Enforcement Verified for Delta Tables.)

 

When attempting to append a new DataFrame to an existing Delta table, the operation was unsuccessful because Delta tables enforce schema constraints. Consequently, the new data could not be written to the Delta table due to schema mismatches.
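Continuing the df1/df2 sketch from the parquet section, the same append against a Delta table illustrates the enforcement described above. This assumes a Spark session with Delta Lake support (for example a Fabric notebook), and the table path is again illustrative:

    # Write the first DataFrame as a Delta table.
    df1.write.format("delta").mode("overwrite").save("Tables/person_delta")   # illustrative path

    # Appending df2 (which carries the extra Is_human column) is rejected with a
    # schema-mismatch error, because Delta enforces the table's schema.
    try:
        df2.write.format("delta").mode("append").save("Tables/person_delta")
    except Exception as err:
        print("Append rejected:", err)

    # Schema evolution has to be requested explicitly for the append to succeed:
    # df2.write.format("delta").mode("append") \
    #    .option("mergeSchema", "true").save("Tables/person_delta")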

Hence, we can conclude this blog with the following highlights:-

  1. Parquet files are very fast for data retrieval, hence they are extensively used in the market, including in their advanced form, the Delta Parquet format.
  2. We demonstrated schema evolution & schema enforcement for parquet vs delta tables.
  3. We walked through the parquet file format in depth, reading the file metadata & the row group / column chunk metadata hierarchy so that the query engine selects only the chunk of the dataset it actually needs.

 

Thanks for reading!

– Manan Choudhary

 

 

