ParquetProcessor
A class to process Parquet files and generate CSV files with custom transformations.
__init__(input_dir, output_dir)
Initialize the ParquetProcessor with input and output directories.
Parameters:
- input_dir (str | Path) – Directory containing the input Parquet files.
- output_dir (str | Path) – Directory to save the output CSV files.
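A minimal sketch of what such a constructor might do, assuming it normalizes both arguments to `pathlib.Path` and creates the output directory when it is missing. Neither behavior is confirmed by this reference, and the class name `ParquetProcessorSketch` is hypothetical:

```python
from pathlib import Path
from typing import Union


class ParquetProcessorSketch:
    """Hypothetical stand-in for ParquetProcessor, not the actual implementation."""

    def __init__(self, input_dir: Union[str, Path], output_dir: Union[str, Path]) -> None:
        # Accept either str or Path and store both as Path objects.
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        # Assumption: the output directory is created on demand.
        self.output_dir.mkdir(parents=True, exist_ok=True)
```

Storing `Path` objects lets later methods join file names with the `/` operator regardless of which type the caller passed in.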
save_dataframe_to_csv(df, csv_file_name)
Save a DataFrame or Series to a CSV file with specific formatting.
Parameters:
- df (DataFrame | Series) – The DataFrame or Series to save.
- csv_file_name (str) – Name of the output CSV file.
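The exact formatting this method applies is not documented here. The sketch below shows one plausible approach with pandas, assuming UTF-8 output, a header row, and no index column. The function name `save_to_csv_sketch` and its explicit `output_dir` argument are hypothetical:

```python
from pathlib import Path
from typing import Union

import pandas as pd


def save_to_csv_sketch(
    df: Union[pd.DataFrame, pd.Series],
    output_dir: Union[str, Path],
    csv_file_name: str,
) -> Path:
    """Write a DataFrame or Series to output_dir/csv_file_name and return the path."""
    # A Series is promoted to a one-column frame so both inputs share one code path.
    if isinstance(df, pd.Series):
        df = df.to_frame()
    target = Path(output_dir) / csv_file_name
    # Assumed formatting: UTF-8, comma-separated, header row, no index column.
    df.to_csv(target, index=False, encoding="utf-8")
    return target
```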
convert_parquet_to_csv(parquet_file_name, columns, csv_file_name)
Convert a Parquet file to a CSV file with specific columns.
Parameters:
- parquet_file_name (str) – Name of the input Parquet file.
- columns (List[str]) – List of columns to include in the output CSV.
- csv_file_name (str) – Name of the output CSV file.
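A conversion of this kind can be sketched with `pandas.read_parquet`, which accepts a `columns` argument so that only the requested columns are loaded. The standalone function below is an illustration under that assumption, not the method's actual body. Its name and the explicit directory arguments are hypothetical:

```python
from pathlib import Path
from typing import List, Union

import pandas as pd


def convert_parquet_to_csv_sketch(
    input_dir: Union[str, Path],
    output_dir: Union[str, Path],
    parquet_file_name: str,
    columns: List[str],
    csv_file_name: str,
) -> Path:
    """Read selected columns from a Parquet file and write them to a CSV file."""
    # Passing columns= avoids loading the rest of the file.
    # pd.read_parquet needs a Parquet engine such as pyarrow or fastparquet installed.
    df = pd.read_parquet(Path(input_dir) / parquet_file_name, columns=columns)
    target = Path(output_dir) / csv_file_name
    # Assumption: the CSV is written without the index column.
    df.to_csv(target, index=False)
    return target
```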
create_relationship_file(df, element_list_name, element_name, collection_name, collection_new_name, output_name)
Generate a CSV file for relationship mapping based on input DataFrame.
Parameters:
- df (DataFrame) – Input DataFrame containing relationship data.
- element_list_name (str) – Name of the column containing element lists.
- element_name (str) – Name of the element to map.
- collection_name (str) – Name of the collection column.
- collection_new_name (str) – New name for the collection in the output.
- output_name (str) – Name of the output CSV file.
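One plausible reading of these parameters is that each row holds a collection identifier plus a list of elements, and the output contains one row per (element, collection) pair. The sketch below implements that reading with `DataFrame.explode`. It is an assumption about the behavior rather than the documented implementation, the function name `build_relationship_frame` is hypothetical, and the CSV-writing step is left out:

```python
import pandas as pd


def build_relationship_frame(
    df: pd.DataFrame,
    element_list_name: str,
    element_name: str,
    collection_name: str,
    collection_new_name: str,
) -> pd.DataFrame:
    """Return one row per (element, collection) pair."""
    # Keep only the collection key and its list-valued column.
    pairs = df[[collection_name, element_list_name]]
    # explode() emits one row per list item, repeating the collection value.
    pairs = pairs.explode(element_list_name)
    # Empty or missing lists become NaN after explode; drop those rows.
    pairs = pairs.dropna(subset=[element_list_name])
    # Rename both columns to the names wanted in the output.
    pairs = pairs.rename(
        columns={element_list_name: element_name, collection_name: collection_new_name}
    )
    return pairs[[element_name, collection_new_name]].reset_index(drop=True)


# Illustrative data; the column names are made up for this example.
communities = pd.DataFrame({"id": ["c1", "c2"], "entity_ids": [["e1", "e2"], ["e3"]]})
edges = build_relationship_frame(communities, "entity_ids", "entity_id", "id", "community_id")
```

Here `edges` has the columns `entity_id` and `community_id`, with one row for each of the three elements in the input lists.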
process_parquet_files(configs)
Process a list of Parquet file configurations and convert them to CSV.
Parameters:
- configs (List[Dict[str, Any]]) – List of configuration dictionaries for processing Parquet files.
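The keys expected in each configuration dictionary are not listed in this reference. A natural guess is that they mirror the parameters of `convert_parquet_to_csv`, so that each dictionary can be unpacked into one call. The sketch below illustrates that dispatch pattern with a hypothetical stub class, `RecordingProcessor`, which records calls instead of touching files. The file names are also made up:

```python
from typing import Any, Dict, List


class RecordingProcessor:
    """Hypothetical stub that records calls instead of reading or writing files."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def convert_parquet_to_csv(
        self, parquet_file_name: str, columns: List[str], csv_file_name: str
    ) -> None:
        # A real implementation would read the Parquet file and write the CSV here.
        self.calls.append(
            {
                "parquet_file_name": parquet_file_name,
                "columns": columns,
                "csv_file_name": csv_file_name,
            }
        )

    def process_parquet_files(self, configs: List[Dict[str, Any]]) -> None:
        # Each dictionary is unpacked into keyword arguments,
        # so its keys must match the parameter names exactly.
        for config in configs:
            self.convert_parquet_to_csv(**config)


# Assumed config shape: one dictionary per conversion.
configs = [
    {"parquet_file_name": "entities.parquet", "columns": ["id", "title"], "csv_file_name": "entities.csv"},
    {"parquet_file_name": "documents.parquet", "columns": ["id"], "csv_file_name": "documents.csv"},
]

processor = RecordingProcessor()
processor.process_parquet_files(configs)
```

With this design, a misspelled or missing key surfaces immediately as a `TypeError` from the unpacked call instead of being silently ignored.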
process_relationship_files(configs)
Process a list of relationship file configurations and generate CSV files.
Parameters:
- configs (List[Dict[str, Any]]) – List of configuration dictionaries for generating relationship files.