Here's a solution for reading all the Parquet files in a folder and combining them into a single CSV file using Python and the Pandas library.
Prerequisites:
- Ensure you have Pandas installed, along with a Parquet engine such as PyArrow (pd.read_parquet() requires one). You can install both using pip install pandas pyarrow.
Implementation:
import os

import pandas as pd

# Define the folder containing the Parquet files
folder_path = 'path/to/folder/'

# Open the output CSV once; every DataFrame is appended to this handle
with open('combined_data.csv', 'w', newline='') as csv_file:
    # Track whether the header row still needs to be written
    write_header = True
    # Sort the listing so the row order is deterministic
    for file in sorted(os.listdir(folder_path)):
        # Process only Parquet files
        if file.endswith('.parquet'):
            # Read the Parquet file into a DataFrame
            df = pd.read_parquet(os.path.join(folder_path, file))
            # Append the rows; write the column header only once
            df.to_csv(csv_file, index=False, header=write_header)
            write_header = False

# Print a success message
print("All Parquet files have been combined into a single CSV file.")
Explanation:
- The folder_path variable specifies the directory where the Parquet files reside.
- We open combined_data.csv once and keep the file handle for the whole loop, so every DataFrame is written to the same file.
- We iterate through the folder's files (sorted for a stable row order) and process only those ending in .parquet.
- Each matching file is read with pd.read_parquet() into a Pandas DataFrame df.
- df.to_csv() writes df to the open handle. We set index=False to exclude the index column, and header=write_header so the column names are written only for the first file; otherwise the header would either be repeated before every file's rows or, with header=False throughout, omitted entirely.
- After processing all Parquet files, we print a success message.
Output:
This script will create a single CSV file named combined_data.csv in the current working directory (not necessarily next to the Parquet files). The CSV file will contain all the rows from the individual Parquet files, combined together.