ParquetProcessor

A class to process Parquet files and generate CSV files with custom transformations.

__init__(input_dir, output_dir)

Initialize the ParquetProcessor with input and output directories.

Parameters:

  • input_dir (str | Path) –

    Directory containing the input Parquet files.

  • output_dir (str | Path) –

    Directory to save the output CSV files.
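
Because both parameters accept either `str` or `Path`, a constructor like this would typically normalize them once. The sketch below is illustrative only and is not the actual implementation: `DirectoryPair`, `input_path`, and `output_path` are hypothetical names, and creating the output directory on construction is an assumption.

```python
from pathlib import Path
from typing import Union


class DirectoryPair:
    """Hypothetical stand-in for the constructor's path handling (not the real class)."""

    def __init__(self, input_dir: Union[str, Path], output_dir: Union[str, Path]) -> None:
        # Accept either str or Path and store both as Path objects.
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        # Assumption: the output directory is created if it does not exist yet.
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def input_path(self, file_name: str) -> Path:
        # Resolve a Parquet file name against the input directory.
        return self.input_dir / file_name

    def output_path(self, file_name: str) -> Path:
        # Resolve a CSV file name against the output directory.
        return self.output_dir / file_name
```

Normalizing to `Path` up front means later methods can take bare file names, as the `parquet_file_name` and `csv_file_name` parameters suggest, and join them with `/`.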

save_dataframe_to_csv(df, csv_file_name)

Save a DataFrame or Series to a CSV file with specific formatting.

Parameters:

  • df (DataFrame | Series) –

    The DataFrame or Series to save.

  • csv_file_name (str) –

    Name of the output CSV file.

convert_parquet_to_csv(parquet_file_name, columns, csv_file_name)

Convert a Parquet file to a CSV file with specific columns.

Parameters:

  • parquet_file_name (str) –

    Name of the input Parquet file.

  • columns (List[str]) –

    List of columns to include in the output CSV.

  • csv_file_name (str) –

    Name of the output CSV file.

create_relationship_file(df, element_list_name, element_name, collection_name, collection_new_name, output_name)

Generate a relationship-mapping CSV file from the input DataFrame.

Parameters:

  • df (DataFrame) –

    Input DataFrame containing relationship data.

  • element_list_name (str) –

    Name of the column containing element lists.

  • element_name (str) –

    Name of the element to map.

  • collection_name (str) –

    Name of the collection column.

  • collection_new_name (str) –

    New name for the collection in the output.

  • output_name (str) –

    Name of the output CSV file.
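
One plausible reading of these parameters is that each row holds a collection value plus a list of elements, and the output has one row per (collection, element) pair. The sketch below shows that reshaping under several assumptions: `df` is a pandas DataFrame, `build_relationship_table` is a hypothetical helper, the column order and the handling of empty lists are guesses, and the sample data is made up.

```python
import pandas as pd


def build_relationship_table(
    df: pd.DataFrame,
    element_list_name: str,
    element_name: str,
    collection_name: str,
    collection_new_name: str,
) -> pd.DataFrame:
    """Hypothetical helper: one output row per (collection, element) pair."""
    return (
        df[[collection_name, element_list_name]]
        .explode(element_list_name)          # one row per list item
        .dropna(subset=[element_list_name])  # assumption: skip empty or missing lists
        .rename(columns={
            collection_name: collection_new_name,
            element_list_name: element_name,
        })
        .reset_index(drop=True)
    )


# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "community": ["c1", "c2"],
    "entity_ids": [["e1", "e2"], ["e3"]],
})
table = build_relationship_table(
    sample, "entity_ids", "entity_id", "community", "community_id"
)
```

A table like this could then be written out with `save_dataframe_to_csv`, using `output_name` as the file name.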

process_parquet_files(configs)

Process a list of Parquet file configurations and convert them to CSV.

Parameters:

  • configs (List[Dict[str, Any]]) –

    List of configuration dictionaries for processing Parquet files.
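
The keys expected in each configuration dictionary are not listed here. A natural guess is that they mirror the parameters of `convert_parquet_to_csv`, so that each dictionary can be unpacked as keyword arguments. The sketch below illustrates that assumption with a hypothetical `RecordingProcessor` that records calls instead of reading files; the file and column names are made up.

```python
from typing import Any, Dict, List


class RecordingProcessor:
    """Hypothetical stand-in that records calls instead of touching files."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def convert_parquet_to_csv(
        self, parquet_file_name: str, columns: List[str], csv_file_name: str
    ) -> None:
        # The real method would read the Parquet file and write the CSV.
        self.calls.append({
            "parquet_file_name": parquet_file_name,
            "columns": columns,
            "csv_file_name": csv_file_name,
        })

    def process_parquet_files(self, configs: List[Dict[str, Any]]) -> None:
        for config in configs:
            # Assumption: config keys match the method's parameter names.
            self.convert_parquet_to_csv(**config)


# Made-up file and column names, for illustration only.
configs = [
    {"parquet_file_name": "nodes.parquet", "columns": ["id", "name"], "csv_file_name": "nodes.csv"},
    {"parquet_file_name": "edges.parquet", "columns": ["source", "target"], "csv_file_name": "edges.csv"},
]
processor = RecordingProcessor()
processor.process_parquet_files(configs)
```

With keyword unpacking, a missing or misspelled key raises a `TypeError` at the call site instead of silently producing a wrong file.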

process_relationship_files(configs)

Process a list of relationship file configurations and generate CSV files.

Parameters:

  • configs (List[Dict[str, Any]]) –

    List of configuration dictionaries for generating relationship files.