Overview:
The Parquet file format is one of the most widely used file formats in modern data architectures because it lets data be read quickly and cost-efficiently. Parquet was created as a collaborative effort by Cloudera and Twitter to address the shortcomings of existing storage formats, particularly for use cases involving large-scale data processing. It was designed to provide efficient storage and retrieval of data for analytics and big data workloads.
Parquet files excel at read performance for a few reasons: they store data column by column, they compress and encode that data efficiently, and they carry rich metadata in the file footer that lets query engines skip data they do not need (the predicate and projection pushdown we will see below).
Alright, let’s dive right into an example and perform a post-mortem to gain a deeper understanding of how Parquet works in practice!
Structure of Parquet File:
The Parquet file structure consists of three parts: the file header, the row groups (which hold the actual data), and the file footer (which holds the metadata).
Overall Structure:-
File Header & Row Groups:-
File Footer:-
I know the structure looks a bit scary, but by walking through an example and mapping each component of the structure to it, we can clearly understand how data is actually retrieved.
So let's go!
Consider the query "select country, region, sum(sales_amount) from table where country='UK' group by country, region".
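For readers who want to try this end to end, here is a minimal sketch of running that query with Spark SQL. The file name sales.parquet is an assumption for illustration; it is just any Parquet file containing country, region, and sales_amount columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Register the (hypothetical) Parquet file as a temporary view so we can query it with SQL.
spark.read.parquet("sales.parquet").createOrReplaceTempView("sales")

# The example query from above.
result = spark.sql("""
    SELECT country, region, SUM(sales_amount) AS total_sales
    FROM sales
    WHERE country = 'UK'
    GROUP BY country, region
""")
result.show()
```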
Now, when the query engine gets this query, it goes to the Parquet file and starts the following process:
1. Scanning the File Header section:
It first scans the file header, whose 4-byte magic number "PAR1" tells the query engine that the file is indeed a Parquet file.
2. Scanning the file footer section:
Next, it goes directly to the file footer to check the “File Metadata”.
File Metadata: It contains information such as the schema and data type of each column, the number of columns, the total number of row groups present in the file, and the compression used. From this, the query engine understands which columns exist, what their data types are, and how many row groups it has to consider, all before reading any actual data.
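As a rough illustration, here is how this footer metadata can be inspected with pyarrow; the file name sales.parquet is again just an assumed example.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")

print(pf.schema_arrow)             # column names and data types
print(pf.metadata.num_columns)     # number of columns
print(pf.metadata.num_row_groups)  # number of row groups in the file
print(pf.metadata.num_rows)        # total number of rows
```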
3. Fetching the Data from Row Groups (Column Chunks): Hierarchical storage:
The file metadata also contains row group metadata and column chunk metadata. This step is the most significant one, as it lets the query engine selectively scan only the required data. How?
Once the query engine understands the required columns and their data types from the file metadata, it goes to each row group's metadata. Say there are three row groups: the query engine starts with row group 1 and checks its column chunk metadata, but only for the columns that are needed. Thanks to projection pushdown, the column chunk metadata scan happens only for the columns referenced in the SELECT statement, which here are country, region, and sales_amount.
Note: A column chunk is the set of values of one column stored inside one particular row group. For example, say the column Country has the values US, UK, INDIA, CANADA and my Parquet file has three row groups. The column values may be split across those row groups, so the Country column chunk in row group 1 may hold (US, UK), the one in row group 2 may hold (INDIA), and the one in row group 3 may hold (CANADA). A small sketch of inspecting these column chunks follows.
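Here is a small sketch of walking the row group and column chunk metadata with pyarrow, again assuming the hypothetical sales.parquet file; statistics can be absent depending on the writer, so the code guards for that.

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("sales.parquet").metadata

for rg in range(meta.num_row_groups):
    # Column chunk 0 is assumed here to be the "country" column.
    chunk = meta.row_group(rg).column(0)
    stats = chunk.statistics
    if stats is not None and stats.has_min_max:
        print(
            f"row group {rg}: column={chunk.path_in_schema}, "
            f"min={stats.min}, max={stats.max}, "
            f"compression={chunk.compression}, offset={chunk.file_offset}"
        )
```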
Let's continue with our example.
So for Row Group 1, we check the column chunk metadata for Country: its min/max statistics, the compression used, and the offset where its data starts. Similarly, we would have the same information for Region and Sales_Amount.
Now what would we gain from this? The query engine goes to each row group's metadata and then to its column chunk metadata to check the statistics for those columns. But why? So that it can decide whether it can skip scanning an entire row group!
Wait, let me elaborate. Our query has the where clause country='UK', so the engine goes to each row group and checks its column chunk metadata: if the min/max statistics show the chunk could contain rows with country='UK', it marks the row group; if not, it simply skips it. This ability to scan only the row group and column chunk metadata hierarchy and skip the row groups that cannot contain matching data is called predicate pushdown.
After marking all the column chunks of the row groups where the data could reside, it sequentially fetches the data from those column chunks.
It can do so because it knows the start position of each of those column chunks, the compression method used, and the encoding used for that specific column chunk, which makes the data retrieval efficient. A sketch of this selective read follows.
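Putting projection and predicate pushdown together, a minimal sketch with pyarrow's read_table could look like this; the file name and the exact filter are assumptions carried over from the example above.

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "sales.parquet",                                # hypothetical file
    columns=["country", "region", "sales_amount"],  # projection pushdown
    filters=[("country", "=", "UK")],               # predicate pushdown via row group stats
)
print(table)
```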
4. Reading the Data Pages:
Finally, the marked column chunks contain data pages, and because the data in them is stored in columnar format, reading them is much faster and more cost efficient.
After reading the required data, it performs the aggregation of sales_amount by country and region and presents the result in very little time.
Some characteristics of the Parquet file: columnar storage, efficient compression and encoding, self-describing metadata in the footer, and support for projection and predicate pushdown.
The industry has moved one step ahead of plain Parquet with what is popularly known as the Delta (Parquet) format. It is the open-source table format used to store data in OneLake (Microsoft Fabric).
Delta stores the data itself in Parquet files and records every change applied to the table in a transaction log (the _delta_log folder).
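As a rough, hedged sketch, this is what writing a Delta table and peeking at its _delta_log looks like. It assumes the delta-spark pip package is installed, and the path and sample rows are made up purely for illustration.

```python
import os
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip  # pip install delta-spark

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Hypothetical sample rows and path.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/people_delta")

print(os.listdir("/tmp/people_delta"))             # part-*.parquet files plus _delta_log
print(os.listdir("/tmp/people_delta/_delta_log"))  # JSON commit files recording each change
```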
So the Delta Lake format is an advanced form of the Parquet file. What separates the two?
Schema Evolution in Parquet vs Delta Tables:-
Let's say we have a dataset with id, name, age, and country as its schema:
We would store this DataFrame in the Parquet format. The Parquet file looks like this:
Four Parquet files are created inside the person-folder directory, containing the data of the DataFrame. We will now try to append another DataFrame, which will show us whether or not Parquet files allow schema evolution (meaning whether a different schema can be embedded in the same Parquet folder).
Fig: (Parquet File supported schema evolution)
On appending another DataFrame with an additional column, Is_human, the write was allowed into the Parquet folder; a minimal sketch of the experiment follows.
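Here is a minimal PySpark sketch of that experiment. The folder name person-folder matches the one mentioned above, while the sample rows are invented purely for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema-demo").getOrCreate()

# Original DataFrame with the id, name, age, country schema (sample rows are made up).
people = spark.createDataFrame(
    [(1, "Alice", 30, "UK"), (2, "Bob", 25, "US")],
    ["id", "name", "age", "country"],
)
people.write.mode("overwrite").parquet("person-folder")

# A second DataFrame that carries an extra column, Is_human.
more_people = spark.createDataFrame(
    [(3, "Carol", 40, "INDIA", True)],
    ["id", "name", "age", "country", "Is_human"],
)

# The append succeeds: a plain Parquet folder does not enforce a single schema.
more_people.write.mode("append").parquet("person-folder")
```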
The result set of the Parquet file is as follows:
Fig: (Result Set of Parquet File)
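Continuing the sketch above, reading the folder back with mergeSchema enabled reconciles the two schemas, and the rows written before Is_human existed come back as null.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

merged = spark.read.option("mergeSchema", "true").parquet("person-folder")
merged.printSchema()
merged.show()  # older rows show null for Is_human
```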
But what would happen if we did the same with a Delta table?
Let's find out!
Fig:-(Result – Schema Enforcement Verified for Delta Tables.)
When attempting to append the new DataFrame to an existing Delta table, the operation failed because Delta tables enforce their schema on write. Consequently, the new data could not be written to the Delta table due to the schema mismatch. A sketch of this behaviour follows.
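Here is a hedged sketch of the same experiment against a Delta table. It reuses the people and more_people DataFrames and the Delta-enabled Spark session from the earlier sketches, and the table path is hypothetical.

```python
# Write the original DataFrame as a Delta table (hypothetical path).
people.write.format("delta").mode("overwrite").save("/tmp/person_delta")

try:
    # Delta enforces schema on write: the extra Is_human column makes the append fail.
    more_people.write.format("delta").mode("append").save("/tmp/person_delta")
except Exception as exc:
    print("Append rejected:", exc)

# Schema evolution has to be requested explicitly if it is actually wanted:
# more_people.write.format("delta").mode("append") \
#     .option("mergeSchema", "true").save("/tmp/person_delta")
```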
Hence we can conclude this blog with the following highlights: Parquet stores data column by column and keeps rich metadata in its footer, which lets query engines read only the row groups and columns they actually need; Delta Lake builds on Parquet by adding a transaction log and enforcing the schema on write, whereas a plain Parquet folder happily accepts mismatched schemas.
Thanks for reading!
– Manan Choudhary