Supporting Microsoft’s GraphRAG: Part 1 - Setup and Data Preparation
Microsoft's GraphRAG is a method for creating structured knowledge graphs from raw text, enhancing Retrieval Augmented Generation (RAG) tasks. By organizing information hierarchically, it enables more efficient data retrieval and summarization.
What You’ll Learn in This Guide
- Indexing: Utilize Microsoft's GraphRAG to convert unstructured documents into Parquet files.
- Data Preprocessing: Learn how to use utility methods provided by TigerGraphX to transform Parquet files into CSV files compatible with TigerGraph.
Prerequisites
Before proceeding, ensure you’ve completed the installation and setup steps outlined in the Installation Guide, including:
- Setting up Python and TigerGraph. For more details, refer to the Requirements section.
- Installing TigerGraphX along with its development dependencies. For more details, refer to the Development Installation section.
Utilize Microsoft GraphRAG for Indexing
The indexing process transforms raw documents into structured data using Microsoft’s GraphRAG. Follow these steps to prepare your data:
Data Preparation
For this demo, we will use applications/msft_graphrag/data as the working directory.
The input dataset, input/clapnq_dev_answerable_orig.jsonl.10.txt, is located in the working directory. It consists of the first ten records from the original dataset.
Additionally, we have another dataset, clapnq_dev_answerable.jsonl.10, for evaluation, stored in applications/resources. This dataset contains ten questions from the annotated dataset, each with corresponding context from the original dataset.
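Before indexing, it can help to confirm that the evaluation file parses and to see which fields each record carries. The sketch below is one way to do that, assuming the file is in JSON Lines format as its name suggests. The relative path in EVAL_PATH and the helper name load_jsonl are assumptions for illustration, so adjust them to your checkout.

```python
import json
from pathlib import Path

# Assumed location relative to applications/msft_graphrag; adjust as needed.
EVAL_PATH = Path("../resources/clapnq_dev_answerable.jsonl.10")


def load_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    if EVAL_PATH.exists():
        records = load_jsonl(EVAL_PATH)
        print(f"{len(records)} records")
        # Print the field names of the first record without assuming a schema.
        if records and isinstance(records[0], dict):
            print("fields:", sorted(records[0]))
    else:
        print(f"{EVAL_PATH} not found; adjust EVAL_PATH to your checkout")
```

The script prints only the record count and field names, so it makes no assumption about the dataset's schema.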
Initialization
Initialize the indexing system in the data directory.
python3 -m graphrag init --root data
Set Up OpenAI API Key
GraphRAG requires an OpenAI API key. To configure it:
- Open the .env file in the data directory:
vi data/.env
- Add your API key:
GRAPHRAG_API_KEY=<Your OpenAI API Key>
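A missing or placeholder key only surfaces once indexing starts calling the LLM, so a quick check beforehand can save a failed run. The following is a minimal sketch that reads data/.env with the standard library. The helper names read_env_file and api_key_is_set are hypothetical, and the parser handles only simple KEY=VALUE lines.

```python
from pathlib import Path


def read_env_file(path):
    """Return KEY=VALUE pairs from a dotenv-style file, ignoring comments and blank lines."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(path, name="GRAPHRAG_API_KEY"):
    """True when the key is present, non-empty, and not still a <placeholder>."""
    value = read_env_file(path).get(name, "")
    return bool(value) and not value.startswith("<")


if __name__ == "__main__":
    env_path = Path("data/.env")
    if not env_path.exists():
        print(f"{env_path} not found; run the init step first")
    elif api_key_is_set(env_path):
        print("GRAPHRAG_API_KEY is set")
    else:
        print("GRAPHRAG_API_KEY is missing or still a placeholder")
```

The check never prints the key itself, so its output is safe to paste into logs or issue reports.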
Optional: Switch to a Cost-Effective Model
GraphRAG uses the gpt-4-turbo-preview model by default. To reduce costs, switch to the gpt-4o-mini model by editing the settings.yaml file in the data directory:
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini # Use a cost-effective model
Indexing
Run the indexing process to convert documents into structured data. This step uses LLMs and may take several minutes depending on the dataset size and hardware.
python3 -m graphrag index --no-cache --root data
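Once indexing finishes, a quick listing of the generated Parquet files confirms that the run produced output before you move on to conversion. The sketch below assumes the files land somewhere under data/output, the directory the conversion step reads from. It searches recursively and hard-codes no file names, since those can differ between GraphRAG versions. The helper name list_parquet_files is hypothetical.

```python
from pathlib import Path


def list_parquet_files(output_dir):
    """Return (relative path, size in bytes) for every Parquet file under output_dir."""
    root = Path(output_dir)
    return [
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in sorted(root.rglob("*.parquet"))
    ]


if __name__ == "__main__":
    output_dir = Path("data/output")  # assumed from the --input_dir used for conversion
    files = list_parquet_files(output_dir) if output_dir.is_dir() else []
    if not files:
        print(f"No Parquet files found under {output_dir}")
    for name, size in files:
        print(f"{name}\t{size} bytes")
```

An empty listing usually means the indexing run failed or wrote to a different directory.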
Utilize TigerGraphX for Data Preprocessing
Transform the structured Parquet files generated by GraphRAG into CSV files that TigerGraph can import.
Convert Parquet to CSV
Run the script below to convert Parquet files into TigerGraph-compatible CSV files. You can find the Python script here.
python3 data_import/convert_parquet_to_tg_csv.py \
--input_dir data/output \
--output_dir data/tg_csv
Export Data from LanceDB to CSV
Use the script below to export data from LanceDB into CSV files that are compatible with TigerGraph. You can access the Python script here.
python3 data_import/export_lancedb_to_csv.py \
--input_dir data/output/lancedb \
--output_dir data/tg_csv
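After both scripts have run, it is worth checking that data/tg_csv contains non-empty files before copying anything to the server. The sketch below counts rows per file with Python's csv module, which handles quoted fields that contain newlines. It treats the first row of each file as a header, which is an assumption about the generated files and not something the scripts are documented to guarantee. The helper name summarize_csv_dir is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csv_dir(csv_dir):
    """Map each CSV file name to (first row, number of remaining rows)."""
    summary = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            first_row = next(reader, [])  # assumed to be the header
            remaining = sum(1 for _ in reader)
        summary[path.name] = (first_row, remaining)
    return summary


if __name__ == "__main__":
    csv_dir = Path("data/tg_csv")
    summary = summarize_csv_dir(csv_dir) if csv_dir.is_dir() else {}
    if not summary:
        print(f"No CSV files found in {csv_dir}")
    for name, (first_row, remaining) in summary.items():
        print(f"{name}: {remaining} rows after the first; {len(first_row)} columns")
```

A file that reports zero remaining rows is worth investigating before loading, since it suggests the corresponding Parquet table or LanceDB export was empty.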
Transfer CSV Files to TigerGraph Server
Transfer the generated CSV files to your TigerGraph server. Use the following command, replacing username and tigergraph-server with your SSH username and the server's hostname:
scp data/tg_csv/* username@tigergraph-server:/home/tigergraph/data/graphrag
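If you want to confirm that the files arrived intact, one option is to compare checksums on both machines. This is an optional extra, not part of the TigerGraphX workflow. The sketch below prints a SHA-256 manifest in the format that the sha256sum tool reads, assuming that tool is available on the server. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(csv_dir):
    """Return 'hash  filename' lines, the layout that `sha256sum -c` accepts."""
    return [
        f"{sha256_of_file(path)}  {path.name}"
        for path in sorted(Path(csv_dir).glob("*.csv"))
    ]


if __name__ == "__main__":
    csv_dir = Path("data/tg_csv")
    lines = checksum_manifest(csv_dir) if csv_dir.is_dir() else []
    print("\n".join(lines) if lines else f"No CSV files found in {csv_dir}")
```

Saving the output to a file, copying it into the target directory on the server, and running sha256sum -c against it there reports any file that differs from the local copy.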
Next Steps
- Supporting Microsoft’s GraphRAG: Part 2: Use Jupyter Notebook to create the schema and load data into TigerGraph.
Start transforming your GraphRAG workflows with the power of TigerGraphX today!