
Commit 44188dd

Automatic schema extraction from text (#331)
* Add schema extraction prompt template
* Add schema from text using an LLM
* Update SimpleKGPipeline for automatic schema extraction
* Save/Read inferred schema
* Bug fixes
* Add unit tests
* Update changelog and api rst
* Update documentation
* Fix Changelog after rebase
* Ruff
* Fix mypy issues
* Remove unused imports
* Fix unit tests
* Fix component connections
* Improve default schema extraction prompt and add examples
* Rename schema from text component
* Fix remaining mypy errors
* Improve schema from text example
* Ruff
* Fix unit tests
* Handle cases where LLM outputs a valid JSON array
* Fix e2e tests
* Add examples running SimpleKGPipeline
* Add inferred schema json and yaml files example
* Improve handling LLM response
* Improve handling errors for extracted schema
* Replace warning logs with real deprecation warnings
* Ensure proper handling of schema when provided as dict
* Add custom schema extraction error
* Handle invalid format for extracted schema
1 parent b36967d commit 44188dd

18 files changed: +1644 -96 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -2,10 +2,15 @@
22

33
## Next
44

5+
### Added
6+
7+
- Added support for automatic schema extraction from text using LLMs. In the `SimpleKGPipeline`, when the user provides no schema, the automatic schema extraction is enabled by default.
8+
59
### Fixed
610

711
- Fixed a bug where `spacy` and `rapidfuzz` needed to be installed even if not using the relevant entity resolvers.
812

13+
914
## 1.7.0
1015

1116
### Added

docs/source/api.rst

Lines changed: 13 additions & 0 deletions
@@ -77,6 +77,12 @@ SchemaBuilder
7777
.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaBuilder
7878
:members: run
7979

80+
SchemaFromTextExtractor
81+
-----------------------
82+
83+
.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaFromTextExtractor
84+
:members: run
85+
8086
EntityRelationExtractor
8187
=======================
8288

@@ -362,6 +368,13 @@ ERExtractionTemplate
362368
:members:
363369
:exclude-members: format
364370

371+
SchemaExtractionTemplate
372+
------------------------
373+
374+
.. autoclass:: neo4j_graphrag.generation.prompts.SchemaExtractionTemplate
375+
:members:
376+
:exclude-members: format
377+
365378
Text2CypherTemplate
366379
--------------------
367380

docs/source/user_guide_kg_builder.rst

Lines changed: 119 additions & 68 deletions
@@ -21,7 +21,7 @@ A Knowledge Graph (KG) construction pipeline requires a few components (some of
2121
- **Data loader**: extract text from files (PDFs, ...).
2222
- **Text splitter**: split the text into smaller pieces of text (chunks), manageable by the LLM context window (token limit).
2323
- **Chunk embedder** (optional): compute the chunk embeddings.
24-
- **Schema builder**: provide a schema to ground the LLM extracted entities and relations and obtain an easily navigable KG.
24+
- **Schema builder**: provide a schema to ground the LLM-extracted entities and relations and obtain an easily navigable KG. The schema can be provided manually or extracted automatically using an LLM.
2525
- **Lexical graph builder**: build the lexical graph (Document, Chunk and their relationships) (optional).
2626
- **Entity and relation extractor**: extract relevant entities and relations from the text.
2727
- **Knowledge Graph writer**: save the identified entities and relations.
@@ -75,10 +75,11 @@ Graph Schema
7575

7676
It is possible to guide the LLM by supplying a list of entities, relationships,
7777
and instructions on how to connect them. However, note that the extracted graph
78-
may not fully adhere to these guidelines. Entities and relationships can be
79-
represented as either simple strings (for their labels) or dictionaries. If using
80-
a dictionary, it must include a label key and can optionally include description
81-
and properties keys, as shown below:
78+
may not fully adhere to these guidelines unless schema enforcement is enabled
79+
(see :ref:`Schema Enforcement Behaviour`). Entities and relationships can be represented
80+
as either simple strings (for their labels) or dictionaries. If using a dictionary,
81+
it must include a label key and can optionally include description and properties keys,
82+
as shown below:
8283

8384
.. code:: python
8485
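The paragraph in the hunk above says that an entity or relation can be given either as a plain string (its label) or as a dictionary with a required `label` key and optional `description` and `properties` keys. A minimal sketch of that rule follows. `normalize_schema_item` is a hypothetical helper written for illustration only; it is not a function of `neo4j_graphrag`, and the default values it fills in are assumptions.

```python
from typing import Any, Dict, List, Union

SchemaItem = Union[str, Dict[str, Any]]


def normalize_schema_item(item: SchemaItem) -> Dict[str, Any]:
    """Return the dictionary form of an entity or relation definition.

    Hypothetical helper: a bare string is treated as the label, and a
    dictionary must carry a 'label' key.
    """
    if isinstance(item, str):
        return {"label": item, "description": "", "properties": []}
    if "label" not in item:
        raise ValueError(f"schema item is missing the required 'label' key: {item!r}")
    return {
        "label": item["label"],
        "description": item.get("description", ""),
        "properties": list(item.get("properties", [])),
    }


entities: List[SchemaItem] = [
    "Person",
    {"label": "House", "description": "Family the person belongs to"},
]
normalized = [normalize_schema_item(entity) for entity in entities]
```

Normalizing both forms to one shape up front keeps any later validation code from having to branch on the input type.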
@@ -117,14 +118,20 @@ This schema information can be provided to the `SimpleKGBuilder` as demonstrated
117118

118119
.. code:: python
119120
121+
# Using the schema parameter (recommended approach)
120122
kg_builder = SimpleKGPipeline(
121123
# ...
122-
entities=ENTITIES,
123-
relations=RELATIONS,
124-
potential_schema=POTENTIAL_SCHEMA,
124+
schema={
125+
"entities": ENTITIES,
126+
"relations": RELATIONS,
127+
"potential_schema": POTENTIAL_SCHEMA
128+
},
125129
# ...
126130
)
127131
132+
.. note::
133+
By default, if no schema is provided to the SimpleKGPipeline, automatic schema extraction is performed using the LLM (see the :ref:`Automatic Schema Extraction with SchemaFromTextExtractor` section).
134+
128135
Extra configurations
129136
--------------------
130137

@@ -412,41 +419,44 @@ within the configuration file.
412419
"neo4j_database": "myDb",
413420
"on_error": "IGNORE",
414421
"prompt_template": "...",
415-
"entities": [
416-
"Person",
417-
{
418-
"label": "House",
419-
"description": "Family the person belongs to",
420-
"properties": [
421-
{"name": "name", "type": "STRING"}
422-
]
423-
},
424-
{
425-
"label": "Planet",
426-
"properties": [
427-
{"name": "name", "type": "STRING"},
428-
{"name": "weather", "type": "STRING"}
429-
]
430-
}
431-
],
432-
"relations": [
433-
"PARENT_OF",
434-
{
435-
"label": "HEIR_OF",
436-
"description": "Used for inheritor relationship between father and sons"
437-
},
438-
{
439-
"label": "RULES",
440-
"properties": [
441-
{"name": "fromYear", "type": "INTEGER"}
442-
]
443-
}
444-
],
445-
"potential_schema": [
446-
["Person", "PARENT_OF", "Person"],
447-
["Person", "HEIR_OF", "House"],
448-
["House", "RULES", "Planet"]
449-
],
422+
423+
"schema": {
424+
"entities": [
425+
"Person",
426+
{
427+
"label": "House",
428+
"description": "Family the person belongs to",
429+
"properties": [
430+
{"name": "name", "type": "STRING"}
431+
]
432+
},
433+
{
434+
"label": "Planet",
435+
"properties": [
436+
{"name": "name", "type": "STRING"},
437+
{"name": "weather", "type": "STRING"}
438+
]
439+
}
440+
],
441+
"relations": [
442+
"PARENT_OF",
443+
{
444+
"label": "HEIR_OF",
445+
"description": "Used for inheritor relationship between father and sons"
446+
},
447+
{
448+
"label": "RULES",
449+
"properties": [
450+
{"name": "fromYear", "type": "INTEGER"}
451+
]
452+
}
453+
],
454+
"potential_schema": [
455+
["Person", "PARENT_OF", "Person"],
456+
["Person", "HEIR_OF", "House"],
457+
["House", "RULES", "Planet"]
458+
]
459+
},
450460
"lexical_graph_config": {
451461
"chunk_node_label": "TextPart"
452462
}
@@ -462,31 +472,32 @@ or in YAML:
462472
neo4j_database: myDb
463473
on_error: IGNORE
464474
prompt_template: ...
465-
entities:
466-
- label: Person
467-
- label: House
468-
description: Family the person belongs to
469-
properties:
470-
- name: name
471-
type: STRING
472-
- label: Planet
473-
properties:
474-
- name: name
475-
type: STRING
476-
- name: weather
477-
type: STRING
478-
relations:
479-
- label: PARENT_OF
480-
- label: HEIR_OF
481-
description: Used for inheritor relationship between father and sons
482-
- label: RULES
483-
properties:
484-
- name: fromYear
485-
type: INTEGER
486-
potential_schema:
487-
- ["Person", "PARENT_OF", "Person"]
488-
- ["Person", "HEIR_OF", "House"]
489-
- ["House", "RULES", "Planet"]
475+
schema:
476+
entities:
477+
- Person
478+
- label: House
479+
description: Family the person belongs to
480+
properties:
481+
- name: name
482+
type: STRING
483+
- label: Planet
484+
properties:
485+
- name: name
486+
type: STRING
487+
- name: weather
488+
type: STRING
489+
relations:
490+
- PARENT_OF
491+
- label: HEIR_OF
492+
description: Used for inheritor relationship between father and sons
493+
- label: RULES
494+
properties:
495+
- name: fromYear
496+
type: INTEGER
497+
potential_schema:
498+
- ["Person", "PARENT_OF", "Person"]
499+
- ["Person", "HEIR_OF", "House"]
500+
- ["House", "RULES", "Planet"]
490501
lexical_graph_config:
491502
chunk_node_label: TextPart
492503
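The JSON and YAML hunks above move `entities`, `relations` and `potential_schema` from the top level of the configuration into a single `schema` block. The sketch below shows one way an existing configuration dictionary could be migrated to the new layout. `nest_schema_keys` is a hypothetical helper, not part of `neo4j_graphrag`, and it assumes it is given the dictionary that directly holds these keys.

```python
import json
from typing import Any, Dict

# Keys that the new layout groups under "schema".
SCHEMA_KEYS = ("entities", "relations", "potential_schema")


def nest_schema_keys(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with legacy schema keys moved under 'schema'."""
    migrated = {k: v for k, v in config.items() if k not in SCHEMA_KEYS}
    legacy = {k: config[k] for k in SCHEMA_KEYS if k in config}
    if legacy:
        if "schema" in config:
            # Refuse to guess which definition should win.
            raise ValueError("config mixes a 'schema' block with legacy top-level keys")
        migrated["schema"] = legacy
    return migrated


old_config = json.loads(
    """
    {
      "on_error": "IGNORE",
      "entities": ["Person"],
      "relations": ["PARENT_OF"],
      "potential_schema": [["Person", "PARENT_OF", "Person"]]
    }
    """
)
new_config = nest_schema_keys(old_config)
print(json.dumps(new_config, indent=2))
```

A configuration that already uses the `schema` block, or that defines no schema at all, passes through unchanged.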
@@ -791,6 +802,44 @@ Here is a code block illustrating these concepts:
791802
After validation, this schema is saved in a `SchemaConfig` object, whose dict representation is passed
792803
to the LLM.
793804

805+
Automatic Schema Extraction
806+
---------------------------
807+
808+
Instead of manually defining the schema, you can use the `SchemaFromTextExtractor` component to automatically extract a schema from your text using an LLM:
809+
810+
.. code:: python
811+
812+
from neo4j_graphrag.experimental.components.schema import SchemaFromTextExtractor
813+
from neo4j_graphrag.llm import OpenAILLM
814+
815+
# Instantiate the automatic schema extractor component
816+
schema_extractor = SchemaFromTextExtractor(
817+
llm=OpenAILLM(
818+
model_name="gpt-4o",
819+
model_params={
820+
"max_tokens": 2000,
821+
"response_format": {"type": "json_object"},
822+
},
823+
)
824+
)
825+
826+
# Extract the schema from the text
827+
extracted_schema = await schema_extractor.run(text="Some text")
828+
829+
The `SchemaFromTextExtractor` component analyzes the text and identifies entity types, relationship types, and their property types. It creates a complete `SchemaConfig` object that can be used in the same way as a manually defined schema.
830+
831+
You can also save and reload the extracted schema:
832+
833+
.. code:: python
834+
835+
# Save the schema to JSON or YAML files
836+
extracted_schema.store_as_json("my_schema.json")
837+
extracted_schema.store_as_yaml("my_schema.yaml")
838+
839+
# Later, reload the schema from file
840+
from neo4j_graphrag.experimental.components.schema import SchemaConfig
841+
restored_schema = SchemaConfig.from_file("my_schema.json") # or my_schema.yaml
842+
794843
795844
Entity and Relation Extractor
796845
=============================
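The commit message mentions handling an LLM response that is a valid JSON array and rejecting an invalid extracted-schema format, but this part of the diff does not show how. The sketch below illustrates the general idea under stated assumptions: `parse_schema_response` and `SchemaParsingError` are hypothetical names, and unwrapping the first element of an array is an assumed policy, not necessarily what `SchemaFromTextExtractor` does.

```python
import json
from typing import Any, Dict


class SchemaParsingError(ValueError):
    """Hypothetical error raised when the LLM output cannot be used as a schema."""


def parse_schema_response(raw: str) -> Dict[str, Any]:
    """Parse raw LLM output into a dict with the three schema keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaParsingError(f"LLM response is not valid JSON: {exc}") from exc

    # Assumption: if the model wraps the schema in an array, use its first item.
    if isinstance(data, list):
        if not data or not isinstance(data[0], dict):
            raise SchemaParsingError("expected a JSON object or an array of objects")
        data = data[0]

    if not isinstance(data, dict):
        raise SchemaParsingError("expected a JSON object at the top level")

    # Missing keys default to empty lists so callers always see the same shape.
    return {
        "entities": data.get("entities", []),
        "relations": data.get("relations", []),
        "potential_schema": data.get("potential_schema", []),
    }


schema = parse_schema_response('[{"entities": ["Person"], "relations": ["PARENT_OF"]}]')
```

Raising a dedicated error type lets a caller tell a malformed LLM response apart from other failures.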
@@ -832,6 +881,8 @@ The LLM to use can be customized, the only constraint is that it obeys the :ref:
832881

833882
Schema Enforcement Behaviour
834883
----------------------------
884+
.. _schema-enforcement-behaviour:
885+
835886
By default, even if a schema is provided to guide the LLM in the entity and relation extraction, the LLM response is not validated against that schema.
836887
This behaviour can be changed by using the `enforce_schema` flag in the `LLMEntityRelationExtractor` constructor:
837888
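The context lines above say that the LLM response is not validated against the schema unless `enforce_schema` is set. As an illustration of what such validation can mean, the sketch below drops extracted nodes and relationships whose labels or types are not in the schema. The function name, the dictionary shapes and the filtering rules are all assumptions made for this example; the library's actual enforcement rules are not shown in this diff.

```python
from typing import Any, Dict, List, Set, Tuple

Node = Dict[str, Any]
Relationship = Dict[str, Any]


def filter_to_schema(
    nodes: List[Node],
    relationships: List[Relationship],
    allowed_labels: Set[str],
    allowed_types: Set[str],
) -> Tuple[List[Node], List[Relationship]]:
    """Keep only the nodes and relationships that match the schema."""
    kept_nodes = [n for n in nodes if n["label"] in allowed_labels]
    kept_ids = {n["id"] for n in kept_nodes}
    # A relationship survives only if its type is allowed and both ends survived.
    kept_rels = [
        r
        for r in relationships
        if r["type"] in allowed_types
        and r["start"] in kept_ids
        and r["end"] in kept_ids
    ]
    return kept_nodes, kept_rels


nodes = [{"id": "1", "label": "Person"}, {"id": "2", "label": "Wand"}]
rels = [
    {"type": "OWNS", "start": "1", "end": "2"},
    {"type": "PARENT_OF", "start": "1", "end": "1"},
]
kept_nodes, kept_rels = filter_to_schema(nodes, rels, {"Person"}, {"PARENT_OF", "OWNS"})
```

In this example the `Wand` node is removed because its label is not allowed, and the `OWNS` relationship goes with it because one of its ends no longer exists.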

examples/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
This folder contains usage examples for the different features
44
supported by the `neo4j-graphrag` package:
55

6+
- [Automatic Schema Extraction](#schema-extraction) from PDF or text
67
- [Build Knowledge Graph](#build-knowledge-graph) from PDF or text
78
- [Retrieve](#retrieve) information from the graph
89
- [Question Answering](#answer-graphrag) (Q&A)
@@ -122,6 +123,7 @@ are listed in [the last section of this file](#customize).
122123
- [Chunk embedder]()
123124
- Schema Builder:
124125
- [User-defined](./customize/build_graph/components/schema_builders/schema.py)
126+
- [Automatic schema extraction](./automatic_schema_extraction/schema_from_text.py)
125127
- Entity Relation Extractor:
126128
- [LLM-based](./customize/build_graph/components/extractors/llm_entity_relation_extractor.py)
127129
- [LLM-based with custom prompt](./customize/build_graph/components/extractors/llm_entity_relation_extractor_with_custom_prompt.py)
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
"""This example demonstrates how to use SimpleKGPipeline with automatic schema extraction
2+
from a PDF file. When no schema is provided to SimpleKGPipeline, automatic schema extraction
3+
is performed using the LLM.
4+
5+
Note: This example requires an OpenAI API key to be set in the .env file.
6+
"""
7+
8+
import asyncio
9+
import logging
10+
import os
11+
from pathlib import Path
12+
from dotenv import load_dotenv
13+
import neo4j
14+
15+
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
16+
from neo4j_graphrag.llm import OpenAILLM
17+
from neo4j_graphrag.embeddings import OpenAIEmbeddings
18+
19+
# Load environment variables from .env file
20+
load_dotenv()
21+
22+
# Configure logging
23+
logging.basicConfig()
24+
logging.getLogger("neo4j_graphrag").setLevel(logging.INFO)
25+
26+
# PDF file path
27+
root_dir = Path(__file__).parents[2]
28+
PDF_FILE = str(
29+
root_dir / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf"
30+
)
31+
32+
33+
async def run_kg_pipeline_with_auto_schema() -> None:
34+
"""Run the SimpleKGPipeline with automatic schema extraction from a PDF file."""
35+
36+
# Define Neo4j connection
37+
uri = os.getenv("NEO4J_URI", "neo4j://localhost:7687")
38+
user = os.getenv("NEO4J_USER", "neo4j")
39+
password = os.getenv("NEO4J_PASSWORD", "password")
40+
41+
# Define LLM parameters
42+
llm_model_params = {
43+
"max_tokens": 2000,
44+
"response_format": {"type": "json_object"},
45+
"temperature": 0, # Lower temperature for more consistent output
46+
}
47+
48+
# Initialize the Neo4j driver
49+
driver = neo4j.GraphDatabase.driver(uri, auth=(user, password))
50+
51+
# Create the LLM instance
52+
llm = OpenAILLM(
53+
model_name="gpt-4o",
54+
model_params=llm_model_params,
55+
)
56+
57+
# Create the embedder instance
58+
embedder = OpenAIEmbeddings()
59+
60+
try:
61+
# Create a SimpleKGPipeline instance without providing a schema
62+
# This will trigger automatic schema extraction
63+
kg_builder = SimpleKGPipeline(
64+
llm=llm,
65+
driver=driver,
66+
embedder=embedder,
67+
from_pdf=True,
68+
)
69+
70+
print(f"Processing PDF file: {PDF_FILE}")
71+
# Run the pipeline on the PDF file
72+
await kg_builder.run_async(file_path=PDF_FILE)
73+
74+
finally:
75+
# Close connections
76+
await llm.async_client.close()
77+
driver.close()
78+
79+
80+
async def main() -> None:
81+
"""Run the example."""
82+
# Create data directory if it doesn't exist
83+
data_dir = root_dir / "data"
84+
data_dir.mkdir(exist_ok=True)
85+
86+
# Check if the PDF file exists
87+
if not Path(PDF_FILE).exists():
88+
print(f"Warning: PDF file not found at {PDF_FILE}")
89+
print("Please replace with a valid PDF file path.")
90+
return
91+
92+
# Run the pipeline
93+
await run_kg_pipeline_with_auto_schema()
94+
95+
96+
if __name__ == "__main__":
97+
asyncio.run(main())
