Supporting Microsoft’s GraphRAG: Part 1 - Setup and Data Preparation

Microsoft's GraphRAG is a method for building structured knowledge graphs from raw text to support Retrieval-Augmented Generation (RAG) tasks. By organizing information hierarchically, it enables more efficient data retrieval and summarization.

What You’ll Learn in This Guide

Indexing: Use Microsoft's GraphRAG to convert unstructured documents into Parquet files.
Data Preprocessing: Use the utilities provided by TigerGraphX to transform those Parquet files into CSV files compatible with TigerGraph.

Prerequisites

Before proceeding, ensure you’ve completed the installation and setup steps outlined in the Installation Guide, including:

Setting up Python and TigerGraph. For more details, refer to the Requirements section.
Installing TigerGraphX along with its development dependencies. For more details, refer to the Development Installation section.

Utilize Microsoft GraphRAG for Indexing

The indexing process transforms raw documents into structured data using Microsoft’s GraphRAG. Follow these steps to prepare your data:

Data Preparation

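For this demo, we will use applications/msft_graphrag/data as the working directory. The commands in this guide refer to it by the relative path data, so they are assumed to be run from applications/msft_graphrag.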

The input dataset, input/clapnq_dev_answerable_orig.jsonl.10.txt, is located in the working directory. It consists of the first ten records from the original dataset.

Additionally, we have another dataset, clapnq_dev_answerable.jsonl.10, for evaluation, stored in applications/resources. This dataset contains ten questions from the annotated dataset, each with corresponding context from the original dataset.
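
Before indexing, it can help to look at what the evaluation records contain. The following is a minimal sketch, assuming the file is in JSON Lines format (one JSON object per line), as its name suggests. The helper names read_jsonl and summarize are hypothetical and not part of GraphRAG or TigerGraphX.

```python
import json
from pathlib import Path


def read_jsonl(path, limit=None):
    """Return up to `limit` parsed records from a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if limit is not None and len(records) >= limit:
                break
    return records


def summarize(records):
    """Map each top-level field name to the number of records containing it.

    Assumes every record is a JSON object (a Python dict).
    """
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, summarize(read_jsonl("applications/resources/clapnq_dev_answerable.jsonl.10")) lists the field names present in the evaluation set without assuming any particular schema.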

Initialization

Initialize the indexing system in the data directory.

python3 -m graphrag init --root data

Set Up OpenAI API Key

GraphRAG requires an OpenAI API key. To configure it:

Open the .env file in the data directory:
```
vi data/.env
```
Add your API key:
```
GRAPHRAG_API_KEY=<Your OpenAI API Key>
```
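
Because indexing fails late if the key is missing, a quick check of the .env file can save a run. The sketch below assumes the file uses plain KEY=VALUE lines. The functions read_env_file and api_key_is_set are hypothetical helpers written for this guide, not part of GraphRAG.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(path, name="GRAPHRAG_API_KEY"):
    """Return True if `name` has a non-empty value that is not a <placeholder>."""
    value = read_env_file(path).get(name, "")
    return bool(value) and not (value.startswith("<") and value.endswith(">"))
```

Calling api_key_is_set("data/.env") returns False while the template text <Your OpenAI API Key> is still in place.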

Optional: Switch to a Cost-Effective Model

GraphRAG uses the gpt-4-turbo-preview model by default. To reduce costs, switch to the gpt-4o-mini model by editing the settings.yaml file in the data directory:

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini # Use a cost-effective model
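
The ${GRAPHRAG_API_KEY} token in settings.yaml is a placeholder that is filled in from the environment, which is why the key itself lives in .env. The sketch below illustrates that substitution mechanism with Python's standard string.Template. It is an illustration only and does not claim to be how GraphRAG loads its configuration. expand_placeholders is a hypothetical helper.

```python
import os
from string import Template


def expand_placeholders(text, env=None):
    """Replace ${NAME} tokens in `text` with values from `env`.

    `env` defaults to os.environ. Template.substitute raises KeyError for a
    missing variable, so an unset key surfaces early instead of the literal
    placeholder being passed along.
    """
    env = os.environ if env is None else env
    return Template(text).substitute(env)
```

With {"GRAPHRAG_API_KEY": "sk-test"} as the environment, the line api_key: ${GRAPHRAG_API_KEY} expands to api_key: sk-test.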

Indexing

Run the indexing process to convert documents into structured data. This step uses LLMs and may take several minutes depending on the dataset size and hardware.

python3 -m graphrag index --no-cache --root data
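
Once indexing finishes, it is worth confirming that Parquet files were actually written before moving on. A minimal sketch, assuming the output lands under data/output (the directory the conversion step reads from). The search is recursive in case the installed GraphRAG version nests its output in subdirectories. list_parquet_files is a hypothetical helper.

```python
from pathlib import Path


def list_parquet_files(output_dir):
    """Return (relative path, size in bytes) for every .parquet file.

    The search is recursive and the result is sorted by path.
    """
    root = Path(output_dir)
    files = sorted(root.rglob("*.parquet"))
    return [(str(p.relative_to(root)), p.stat().st_size) for p in files]
```

An empty list from list_parquet_files("data/output") suggests the indexing run did not complete, and its logs are the next place to look.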

Utilize TigerGraphX for Data Preprocessing

Transform the structured Parquet files generated by GraphRAG into CSV files that TigerGraph can import.

Convert Parquet to CSV

Run the script below to convert Parquet files into TigerGraph-compatible CSV files. The Python script is data_import/convert_parquet_to_tg_csv.py.

python3 data_import/convert_parquet_to_tg_csv.py \
--input_dir data/output \
--output_dir data/tg_csv

Export Data from LanceDB to CSV

Use the script below to export data from LanceDB into CSV files that are compatible with TigerGraph. The Python script is data_import/export_lancedb_to_csv.py.

python3 data_import/export_lancedb_to_csv.py \
--input_dir data/output/lancedb \
--output_dir data/tg_csv
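
After both scripts have run, a quick summary of data/tg_csv helps catch empty or malformed files before they reach the server. The sketch below uses only the standard library and assumes each CSV file starts with a header row. That is an assumption about the scripts' output, not something this guide confirms. summarize_csv_dir is a hypothetical helper.

```python
import csv
from pathlib import Path


def summarize_csv_dir(csv_dir):
    """Map each CSV file name to (header row, number of data rows).

    Assumes the first row of every file is a header.
    """
    summary = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # empty file -> empty header
            rows = sum(1 for _ in reader)
        summary[path.name] = (header, rows)
    return summary
```

Printing the result of summarize_csv_dir("data/tg_csv") gives one line per file with its column names and row count.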

Transfer CSV Files to TigerGraph Server

Transfer the generated CSV files to your TigerGraph server with the following command, replacing username and tigergraph-server with your login name and the server's hostname. The destination directory must already exist on the server.

scp data/tg_csv/* username@tigergraph-server:/home/tigergraph/data/graphrag
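
If the transfer is part of a larger Python workflow, the same command can be driven from a script. The sketch below assumes scp is installed and that key-based or interactive authentication is already set up. build_scp_command and transfer are hypothetical helpers. The local glob is expanded in Python because subprocess does not expand * when no shell is involved.

```python
import glob
import subprocess


def build_scp_command(local_pattern, user, host, remote_dir):
    """Expand the local glob and return the scp argument list."""
    files = sorted(glob.glob(local_pattern))
    if not files:
        raise FileNotFoundError(f"no files match {local_pattern!r}")
    return ["scp", *files, f"{user}@{host}:{remote_dir}"]


def transfer(local_pattern, user, host, remote_dir):
    """Run scp; raises CalledProcessError if it exits with a non-zero status."""
    command = build_scp_command(local_pattern, user, host, remote_dir)
    subprocess.run(command, check=True)
```

For example, transfer("data/tg_csv/*", "username", "tigergraph-server", "/home/tigergraph/data/graphrag") mirrors the command above.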

Next Steps

Supporting Microsoft’s GraphRAG: Part 2: Use a Jupyter Notebook to create the schema and load the data into TigerGraph.

Start transforming your GraphRAG workflows with the power of TigerGraphX today!