Whether you work in data analysis, research, data export/import, big data, or ML, this article can be a great starting point for learning about the Parquet file format.
What is Parquet?
The Parquet file format was created within the Apache Software Foundation to store and process big data efficiently (initially for the Hadoop ecosystem). It borrows concepts from Google's Dremel paper to build a column-based file format. Where a CSV file stores each record as a row of comma-separated values, Parquet stores data by COLUMN, which lets large datasets compress better and be scanned more efficiently for analysis.
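To make the row-versus-column idea concrete, here is a tiny, purely illustrative Python sketch (the records are made up and no Parquet library is involved) showing the same data laid out both ways:

    # Three made-up records.
    records = [
        {'id': 1, 'name': 'alice', 'score': 90},
        {'id': 2, 'name': 'bob', 'score': 85},
        {'id': 3, 'name': 'carol', 'score': 78},
    ]

    # Row-oriented layout (how a CSV stores it): one record per line.
    rows = [[r['id'], r['name'], r['score']] for r in records]

    # Column-oriented layout (the Parquet idea): all values of a column together.
    # Similar values stored side by side compress well, and a query that only
    # needs 'score' can skip the other columns entirely.
    columns = {
        'id': [r['id'] for r in records],
        'name': [r['name'] for r in records],
        'score': [r['score'] for r in records],
    }

    print(rows)
    print(columns)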
Amazon AWS stores your RDS export backups to S3 in Parquet format. While the RDS (Create from S3) feature can read this backup and restore the database, in many cases you might still need to read these Parquet files without spinning up an RDS DB instance (encryption is not covered in this article).
Key things to know about the Parquet format:
- It is independent of any data model, data processing framework, and programming language.
- Parquet files are immutable in nature; you can't update a Parquet file in place.
- It's a column-based file format.
Let's see how we can interact with the Parquet format.
What is Apache Spark?
Apache Spark is a large-scale data analytics engine available for multiple programming languages. It offers multiple sets of tools for use cases ranging from data analysis to ML, and it makes efficient use of the Parquet format for analytics workloads.
Introducing PySpark
We will be using PySpark, the Python API of Apache Spark, which lets us interact with Parquet through the pyspark.sql module. This is just one of its features; there are many other aspects to it, but I prefer to KISS for now.
Below are some commands for a quick hands-on; the full sequence is also pulled together into a single script after the list. (Side note: make sure you are in a fresh Python virtual environment and have a Parquet file ready.)
- Activate your Python virtual environment.
- Install Spark SQL for Python:
  pip install pyspark[sql]
- Enter a Python CLI session:
  python
- Create a SparkSession (this opens a Spark session so that we can connect to and interact with your Parquet file in Python):
  from pyspark.sql import SparkSession
  spark = SparkSession.builder.getOrCreate()
- Now that you have a Spark session created, you can load the Parquet file as a Spark DataFrame to interact with:
  df = spark.read.load('PATH/TO/PARQUET_FILE.parquet')
- To show the records in the Parquet file (only the top 20 records):
  df.show()
- To show only specific columns from the Parquet file:
  df.select('column1Header', 'column2Header').show()
- To print the schema of your Parquet file:
  df.printSchema()
- To print all column names:
  df.columns
- To write the DataFrame data into CSV files (note: Spark writes part files inside the given folder rather than a single CSV named after the Parquet file):
  df.write.csv('FolderName')
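If you prefer a script over typing these into the REPL, here is a minimal sketch that strings the steps above together (the file path, column names, and output folder are placeholders to replace with your own):

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.getOrCreate()

    # Load the Parquet file as a DataFrame (path is a placeholder).
    df = spark.read.load('PATH/TO/PARQUET_FILE.parquet')

    # Inspect the data: schema, column names, and the top 20 rows.
    df.printSchema()
    print(df.columns)
    df.show()

    # Look at just a couple of (placeholder) columns.
    df.select('column1Header', 'column2Header').show()

    # Export to CSV: Spark writes part files inside this folder.
    df.write.csv('FolderName')

    spark.stop()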
Further, you can also run SQL queries against the data and create more CSV/Parquet files from their output, as sketched below.
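For instance, here is a small sketch (the view name, column name, and output folder are hypothetical, and it reuses the spark session and df from the steps above) that registers the DataFrame as a temporary view, runs a SQL query over it, and writes the result back out as Parquet:

    # Register the DataFrame as a temporary SQL view (the name is arbitrary).
    df.createOrReplaceTempView('my_table')

    # Run a SQL aggregation over the view; the column name is hypothetical.
    result = spark.sql(
        'SELECT column1Header, COUNT(*) AS row_count '
        'FROM my_table GROUP BY column1Header'
    )
    result.show()

    # Persist the result as a new Parquet dataset (folder name is a placeholder).
    result.write.parquet('OutputFolder')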
Note: DataFrames in Spark can also come from other data sources, not just your Parquet file. This gives us the ability to mix/mash, process, analyse and create our own Parquet files for further consumption, since Spark is an analysis engine. With Spark you can also merge and process multiple DataFrames, as in the sketch below. Exporting to CSV/Parquet is just one part of Spark.
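As a rough illustration of that mixing and merging (the file paths and the join column are made up for this sketch), you could join a Parquet-backed DataFrame with a CSV-backed one and save the combined result:

    # Load one DataFrame from Parquet and another from CSV (paths are placeholders).
    orders = spark.read.parquet('PATH/TO/orders.parquet')
    customers = spark.read.csv('PATH/TO/customers.csv', header=True, inferSchema=True)

    # Merge the two sources on a shared (hypothetical) column.
    combined = orders.join(customers, on='customer_id', how='inner')

    # Write the merged result out as a new Parquet dataset for further use.
    combined.write.parquet('CombinedOutputFolder')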
Have a nice day 🫐