Commit 8b3af81

Refactor SchemaConfig -> GraphSchema (#340)
* WIP: get rid of SchemaConfig
* Replace SchemaConfig WIP
* Update tests - postpone frozen schema
* Use tuples instead of lists for immutability (and make the _*_index always consistent)
* Ruff
* Update docstrings and CHANGELOG
* Mypy and tests
* Update examples
* Ruff again
* Renaming fields after internal discussion
* Fix remaining SchemaConfig
* Mypy
* Fix bad test update
* Fix README/doc/bug
* Mypy
* Ruff
1 parent 9b8b8e8 commit 8b3af81
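
For callers that still build a schema dictionary with the old keys, the field renames recorded in this commit's CHANGELOG.md entry (`entities` to `node_types`, `relations` to `relationship_types`, `potential_schema` to `patterns`) can be applied mechanically. The following is a minimal sketch of that idea, not part of `neo4j_graphrag`: `RENAMED_KEYS` and `migrate_schema_dict` are hypothetical names introduced here for illustration, and the sketch assumes the schema is held as a plain `dict`.

```python
from typing import Any

# Old -> new field names, as listed in the CHANGELOG entry of this commit.
RENAMED_KEYS = {
    "entities": "node_types",
    "relations": "relationship_types",
    "potential_schema": "patterns",
}


def migrate_schema_dict(old_schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `old_schema` with pre-rename keys replaced.

    Hypothetical helper for illustration only. Keys that are not in
    RENAMED_KEYS are copied through unchanged.
    """
    new_schema: dict[str, Any] = {}
    for key, value in old_schema.items():
        new_key = RENAMED_KEYS.get(key, key)
        if new_key in new_schema:
            # Refuse to guess when both spellings of a field are present.
            raise ValueError(f"both old and new spelling given for {new_key!r}")
        new_schema[new_key] = value
    return new_schema


old = {
    "entities": ["Person", "House", "Planet"],
    "relations": ["PARENT_OF", "HEIR_OF", "RULES"],
    "potential_schema": [("Person", "PARENT_OF", "Person")],
}
new = migrate_schema_dict(old)
```

The resulting `new` dictionary has the shape that the updated README passes as `schema=` to `SimpleKGPipeline`. Raising on a mixed old/new dictionary is a design choice of this sketch, so that a half-migrated config is noticed rather than silently merged.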

28 files changed: +996 -1063 lines changed

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -12,8 +12,17 @@

 ### Changed

+#### Strict mode
+
 - Strict mode in `SimpleKGPipeline`: now properties and relationships are pruned only if they are defined in the input schema.

+#### Schema definition
+
+- The `SchemaEntity` model has been renamed `NodeType`.
+- The `SchemaRelation` model has been renamed `RelationshipType`.
+- The `SchemaProperty` model has been renamed `PropertyType`.
+- `SchemaConfig` has been removed in favor of `GraphSchema` (used in the `SchemaBuilder` and `EntityRelationExtractor` classes). `entities`, `relations` and `potential_schema` fields have also been renamed `node_types`, `relationship_types` and `patterns` respectively.
+

 ## 1.7.0

README.md

Lines changed: 9 additions & 6 deletions
@@ -102,9 +102,9 @@ NEO4J_PASSWORD = "password"
 driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

 # List the entities and relations the LLM should look for in the text
-entities = ["Person", "House", "Planet"]
-relations = ["PARENT_OF", "HEIR_OF", "RULES"]
-potential_schema = [
+node_types = ["Person", "House", "Planet"]
+relationship_types = ["PARENT_OF", "HEIR_OF", "RULES"]
+patterns = [
 ("Person", "PARENT_OF", "Person"),
 ("Person", "HEIR_OF", "House"),
 ("House", "RULES", "Planet"),
@@ -128,8 +128,11 @@ kg_builder = SimpleKGPipeline(
 llm=llm,
 driver=driver,
 embedder=embedder,
-entities=entities,
-relations=relations,
+schema={
+"node_types": node_types,
+"relationship_types": relationship_types,
+"patterns": patterns,
+},
 on_error="IGNORE",
 from_pdf=False,
 )
@@ -365,7 +368,7 @@ When you're finished with your changes, create a pull request (PR) using the fol

 ## 🧪 Tests

-To be able to run all tests, all extra packages needs to be installed.
+To be able to run all tests, all extra packages needs to be installed.
 This is achieved by:

 ```bash

docs/source/types.rst

Lines changed: 12 additions & 12 deletions
@@ -75,25 +75,25 @@ KGWriterModel

 .. autoclass:: neo4j_graphrag.experimental.components.kg_writer.KGWriterModel

-SchemaProperty
-==============
+PropertyType
+============

-.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaProperty
+.. autoclass:: neo4j_graphrag.experimental.components.schema.PropertyType

-SchemaEntity
-============
+NodeType
+========

-.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaEntity
+.. autoclass:: neo4j_graphrag.experimental.components.schema.NodeType

-SchemaRelation
-==============
+RelationshipType
+================

-.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaRelation
+.. autoclass:: neo4j_graphrag.experimental.components.schema.RelationshipType

-SchemaConfig
-============
+GraphSchema
+===========

-.. autoclass:: neo4j_graphrag.experimental.components.schema.SchemaConfig
+.. autoclass:: neo4j_graphrag.experimental.components.schema.GraphSchema

 LexicalGraphConfig
 ===================

docs/source/user_guide_kg_builder.rst

Lines changed: 42 additions & 44 deletions
@@ -21,7 +21,7 @@ A Knowledge Graph (KG) construction pipeline requires a few components (some of
 - **Data loader**: extract text from files (PDFs, ...).
 - **Text splitter**: split the text into smaller pieces of text (chunks), manageable by the LLM context window (token limit).
 - **Chunk embedder** (optional): compute the chunk embeddings.
-- **Schema builder**: provide a schema to ground the LLM extracted entities and relations and obtain an easily navigable KG. Schema can be provided manually or extracted automatically using LLMs.
+- **Schema builder**: provide a schema to ground the LLM extracted node and relationship types and obtain an easily navigable KG. Schema can be provided manually or extracted automatically using LLMs.
 - **Lexical graph builder**: build the lexical graph (Document, Chunk and their relationships) (optional).
 - **Entity and relation extractor**: extract relevant entities and relations from the text.
 - **Knowledge Graph writer**: save the identified entities and relations.
@@ -73,18 +73,18 @@ Customizing the SimpleKGPipeline
 Graph Schema
 ------------

-It is possible to guide the LLM by supplying a list of entities, relationships,
-and instructions on how to connect them. However, note that the extracted graph
-may not fully adhere to these guidelines unless schema enforcement is enabled
-(see :ref:`Schema Enforcement Behaviour`). Entities and relationships can be represented
+It is possible to guide the LLM by supplying a list of node and relationship types,
+and instructions on how to connect them (patterns). However, note that the extracted graph
+may not fully adhere to these guidelines unless schema enforcement is enabled
+(see :ref:`Schema Enforcement Behaviour`). Node and relationship types can be represented
 as either simple strings (for their labels) or dictionaries. If using a dictionary,
 it must include a label key and can optionally include description and properties keys,
 as shown below:

 .. code:: python

-ENTITIES = [
-# entities can be defined with a simple label...
+NODE_TYPES = [
+# node types can be defined with a simple label...
 "Person",
 # ... or with a dict if more details are needed,
 # such as a description:
@@ -93,7 +93,7 @@ as shown below:
 {"label": "Planet", "properties": [{"name": "weather", "type": "STRING"}]},
 ]
 # same thing for relationships:
-RELATIONS = [
+RELATIONSHIP_TYPES = [
 "PARENT_OF",
 {
 "label": "HEIR_OF",
@@ -102,13 +102,13 @@ as shown below:
 {"label": "RULES", "properties": [{"name": "fromYear", "type": "INTEGER"}]},
 ]

-The `potential_schema` is defined by a list of triplet in the format:
+The `patterns` are defined by a list of triplet in the format:
 `(source_node_label, relationship_label, target_node_label)`. For instance:


 .. code:: python

-POTENTIAL_SCHEMA = [
+PATTERNS = [
 ("Person", "PARENT_OF", "Person"),
 ("Person", "HEIR_OF", "House"),
 ("House", "RULES", "Planet"),
@@ -122,15 +122,15 @@ This schema information can be provided to the `SimpleKGBuilder` as demonstrated
 kg_builder = SimpleKGPipeline(
 # ...
 schema={
-"entities": ENTITIES,
-"relations": RELATIONS,
-"potential_schema": POTENTIAL_SCHEMA
+"node_types": NODE_TYPES,
+"relationship_types": RELATIONSHIP_TYPES,
+"patterns": PATTERNS
 },
 # ...
 )

 .. note::
-By default, if no schema is provided to the SimpleKGPipeline, automatic schema extraction will be performed using the LLM (See the :ref:`Automatic Schema Extraction with SchemaFromTextExtractor`).
+By default, if no schema is provided to the SimpleKGPipeline, automatic schema extraction will be performed using the LLM (See the :ref:`Automatic Schema Extraction`).

 Extra configurations
 --------------------
@@ -419,9 +419,8 @@ within the configuration file.
 "neo4j_database": "myDb",
 "on_error": "IGNORE",
 "prompt_template": "...",
-
 "schema": {
-"entities": [
+"node_types": [
 "Person",
 {
 "label": "House",
@@ -438,7 +437,7 @@ within the configuration file.
 ]
 }
 ],
-"relations": [
+"relationship_types": [
 "PARENT_OF",
 {
 "label": "HEIR_OF",
@@ -451,7 +450,7 @@ within the configuration file.
 ]
 }
 ],
-"potential_schema": [
+"patterns": [
 ["Person", "PARENT_OF", "Person"],
 ["Person", "HEIR_OF", "House"],
 ["House", "RULES", "Planet"]
@@ -473,7 +472,7 @@ or in YAML:
 on_error: IGNORE
 prompt_template: ...
 schema:
-entities:
+node_types:
 - Person
 - label: House
 description: Family the person belongs to
@@ -486,15 +485,15 @@ or in YAML:
 type: STRING
 - name: weather
 type: STRING
-relations:
+relationship_types:
 - PARENT_OF
 - label: HEIR_OF
 description: Used for inheritor relationship between father and sons
 - label: RULES
 properties:
 - name: fromYear
 type: INTEGER
-potential_schema:
+patterns:
 - ["Person", "PARENT_OF", "Person"]
 - ["Person", "HEIR_OF", "House"]
 - ["House", "RULES", "Planet"]
@@ -747,62 +746,62 @@ Optionally, the document and chunk node labels can be configured using a `Lexica
 Schema Builder
 ==============

-The schema is used to try and ground the LLM to a list of possible entities and relations of interest.
+The schema is used to try and ground the LLM to a list of possible node and relationship types of interest.
 So far, schema must be manually created by specifying:

-- **Entities** the LLM should look for in the text, including their properties (name and type).
-- **Relations** of interest between these entities, including the relation properties (name and type).
-- **Triplets** to define the start (source) and end (target) entity types for each relation.
+- **Node types** the LLM should look for in the text, including their properties (name and type).
+- **Relationship types** of interest between these node types, including the relationship properties (name and type).
+- **Patterns** (triplets) to define the start (source) and end (target) entity types for each relationship.

 Here is a code block illustrating these concepts:

 .. code:: python

 from neo4j_graphrag.experimental.components.schema import (
 SchemaBuilder,
-SchemaEntity,
-SchemaProperty,
-SchemaRelation,
+NodeType,
+PropertyType,
+RelationshipType,
 )

 schema_builder = SchemaBuilder()

 await schema_builder.run(
-entities=[
-SchemaEntity(
+node_types=[
+NodeType(
 label="Person",
 properties=[
 SchemaProperty(name="name", type="STRING"),
 SchemaProperty(name="place_of_birth", type="STRING"),
 SchemaProperty(name="date_of_birth", type="DATE"),
 ],
 ),
-SchemaEntity(
+NodeType(
 label="Organization",
 properties=[
 SchemaProperty(name="name", type="STRING"),
 SchemaProperty(name="country", type="STRING"),
 ],
 ),
 ],
-relations=[
-SchemaRelation(
+relationship_types=[
+RelationshipType(
 label="WORKED_ON",
 ),
-SchemaRelation(
+RelationshipType(
 label="WORKED_FOR",
 ),
 ],
-possible_schema=[
+patterns=[
 ("Person", "WORKED_ON", "Field"),
 ("Person", "WORKED_FOR", "Organization"),
 ],
 )

-After validation, this schema is saved in a `SchemaConfig` object, whose dict representation is passed
+After validation, this schema is saved in a `GraphSchema` object, whose dict representation is passed
 to the LLM.

-Automatic Schema Extraction
+Automatic Schema Extraction
 ---------------------------

 Instead of manually defining the schema, you can use the `SchemaFromTextExtractor` component to automatically extract a schema from your text using an LLM:
@@ -826,19 +825,19 @@ Instead of manually defining the schema, you can use the `SchemaFromTextExtracto
 # Extract the schema from the text
 extracted_schema = await schema_extractor.run(text="Some text")

-The `SchemaFromTextExtractor` component analyzes the text and identifies entity types, relationship types, and their property types. It creates a complete `SchemaConfig` object that can be used in the same way as a manually defined schema.
+The `SchemaFromTextExtractor` component analyzes the text and identifies entity types, relationship types, and their property types. It creates a complete `GraphSchema` object that can be used in the same way as a manually defined schema.

 You can also save and reload the extracted schema:

 .. code:: python

 # Save the schema to JSON or YAML files
-schema_config.store_as_json("my_schema.json")
-schema_config.store_as_yaml("my_schema.yaml")
-
+extracted_schema.store_as_json("my_schema.json")
+extracted_schema.store_as_yaml("my_schema.yaml")
+
 # Later, reload the schema from file
-from neo4j_graphrag.experimental.components.schema import SchemaConfig
-restored_schema = SchemaConfig.from_file("my_schema.json") # or my_schema.yaml
+from neo4j_graphrag.experimental.components.schema import GraphSchema
+restored_schema = GraphSchema.from_file("my_schema.json") # or my_schema.yaml


 Entity and Relation Extractor
@@ -993,7 +992,6 @@ If more customization is needed, it is possible to subclass the `EntityRelationE

 from pydantic import validate_call
 from neo4j_graphrag.experimental.components.entity_relation_extractor import EntityRelationExtractor
-from neo4j_graphrag.experimental.components.schema import SchemaConfig
 from neo4j_graphrag.experimental.components.types import (
 Neo4jGraph,
 Neo4jNode,

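The user guide describes each pattern as a `(source_node_label, relationship_label, target_node_label)` triplet, with node and relationship types given either as plain label strings or as dictionaries with a `label` key. One way to sanity-check a schema before handing it to a pipeline is to confirm that every label used in a pattern is also declared. The sketch below is a standalone illustration under that assumption; it does not use or reproduce the library's own validation, and `find_undeclared_labels` is a hypothetical name.

```python
from typing import Union

LabelOrDict = Union[str, dict]
Pattern = tuple[str, str, str]


def _label(item: LabelOrDict) -> str:
    """Types may be given as a bare label or as a dict with a "label" key."""
    return item if isinstance(item, str) else item["label"]


def find_undeclared_labels(
    node_types: list[LabelOrDict],
    relationship_types: list[LabelOrDict],
    patterns: list,
) -> list[tuple[Pattern, str]]:
    """Return (pattern, label) pairs for labels used in a pattern but never declared.

    Hypothetical helper for illustration; not part of neo4j_graphrag.
    """
    node_labels = {_label(n) for n in node_types}
    rel_labels = {_label(r) for r in relationship_types}
    problems = []
    for pattern in patterns:
        source, rel, target = pattern
        checks = ((source, node_labels), (rel, rel_labels), (target, node_labels))
        for label, declared in checks:
            if label not in declared:
                # tuple() lets JSON/YAML-style list patterns be reported uniformly.
                problems.append((tuple(pattern), label))
    return problems


# Labels taken from the Schema Builder example in the user guide.
problems = find_undeclared_labels(
    node_types=["Person", {"label": "Organization"}],
    relationship_types=["WORKED_ON", "WORKED_FOR"],
    patterns=[
        ("Person", "WORKED_ON", "Field"),
        ("Person", "WORKED_FOR", "Organization"),
    ],
)
print(problems)  # [(('Person', 'WORKED_ON', 'Field'), 'Field')]
```

With the labels from the guide's Schema Builder example, this check reports `Field`, which appears in a pattern but not among the declared node types. Whether the library itself accepts or rejects such a pattern is not stated in this diff.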
examples/build_graph/simple_kg_builder_from_pdf.py

Lines changed: 9 additions & 7 deletions
@@ -27,11 +27,11 @@
 file_path = root_dir / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf"


-# Instantiate Entity and Relation objects. This defines the
+# Instantiate NodeType and RelationshipType objects. This defines the
 # entities and relations the LLM will be looking for in the text.
-ENTITIES = ["Person", "Organization", "Location"]
-RELATIONS = ["SITUATED_AT", "INTERACTS", "LED_BY"]
-POTENTIAL_SCHEMA = [
+NODE_TYPES = ["Person", "Organization", "Location"]
+RELATIONSHIP_TYPES = ["SITUATED_AT", "INTERACTS", "LED_BY"]
+PATTERNS = [
 ("Person", "SITUATED_AT", "Location"),
 ("Person", "INTERACTS", "Person"),
 ("Organization", "LED_BY", "Person"),
@@ -47,9 +47,11 @@ async def define_and_run_pipeline(
 llm=llm,
 driver=neo4j_driver,
 embedder=OpenAIEmbeddings(),
-entities=ENTITIES,
-relations=RELATIONS,
-potential_schema=POTENTIAL_SCHEMA,
+schema={
+"node_types": NODE_TYPES,
+"relationship_types": RELATIONSHIP_TYPES,
+"patterns": PATTERNS,
+},
 neo4j_database=DATABASE,
 )
 return await kg_builder.run_async(file_path=str(file_path))
