Commit 60a42b9

Merge pull request #32 from doccano/feature/token-based-tagger
Support token based
2 parents 7b5ccf6 + bdaf62e commit 60a42b9

19 files changed: 1440 additions, 1113 deletions
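
The README paragraph removed in this commit contrasts character-based matching with token-based matching, noting that the element symbol `na` also matches inside `name`. The sketch below illustrates that difference in plain Python. It is only an illustration under stated assumptions: the helper names `char_matches` and `token_matches` are hypothetical, and the regex tokenizer is a stand-in, not spaCy's tokenizer or this library's code.

```python
import re


def char_matches(text, pattern):
    """Return (start, end) spans of every case-insensitive substring match."""
    return [
        m.span()
        for m in re.finditer(re.escape(pattern), text, flags=re.IGNORECASE)
    ]


def token_matches(text, pattern):
    """Return (start, end) spans of whole tokens equal to the pattern."""
    # Hypothetical tokenizer: runs of word characters or single punctuation marks.
    tokens = re.finditer(r"\w+|[^\w\s]", text)
    return [m.span() for m in tokens if m.group().lower() == pattern.lower()]


text = "The name of the element Na is sodium."
print(char_matches(text, "na"))   # substring search also hits "na" inside "name"
print(token_matches(text, "na"))  # only the standalone token "Na" is reported
```

Substring search reports two spans here, one of them inside `name`, while the token-level comparison reports only the standalone `Na`.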

README.md

Lines changed: 8 additions & 46 deletions
@@ -8,19 +8,17 @@ This is a library to build a CRF tagger for a partially annotated dataset in spa
 
 ## Dataset Preparation
 
-Prepare spaCy binary format file. This library expects tokenization is character-based.
-For more detail about spaCy binary format, see [this page](https://spacy.io/api/data-formats#training).
+Prepare spaCy binary format file to train your tagger.
+If you are not familiar with spaCy binary format, see [this page](https://spacy.io/api/data-formats#training).
 
 You can prepare your own dataset with [spaCy's entity ruler](https://spacy.io/usage/rule-based-matching#entityruler) as follows:
 
 ```py
 import spacy
 from spacy.tokens import DocBin
-from spacy_partial_tagger.tokenizer import CharacterTokenizer
 
 
 nlp = spacy.blank("en")
-nlp.tokenizer = CharacterTokenizer(nlp.vocab) # Use a character-based tokenizer
 
 patterns = [{"label": "LOC", "pattern": "Tokyo"}, {"label": "LOC", "pattern": "Japan"}]
 ruler = nlp.add_pipe("entity_ruler")
@@ -35,60 +33,24 @@ doc_bin.add(doc)
 doc_bin.to_disk("/path/to/data.spacy")
 ```
 
-In the example above, entities are extracted by character-based matching. However, in some cases, character-based matching may not be suitable (e.g., the element symbol `na` for sodium matches name). In such cases, token-based matching can be used as follows:
-
-```py
-import spacy
-from spacy.tokens import DocBin
-from spacy_partial_tagger.tokenizer import CharacterTokenizer
-
-text = "Selegiline - induced postural hypotension in Parkinson's disease: a longitudinal study on the effects of drug withdrawal."
-patterns = [
-    {"label": "Chemical", "pattern": [{"LOWER": "selegiline"}]},
-    {"label": "Disease", "pattern": [{"LOWER": "hypotension"}]},
-    {
-        "label": "Disease",
-        "pattern": [{"LOWER": "parkinson"}, {"LOWER": "'s"}, {"LOWER": "disease"}],
-    },
-]
-
-# Add an entity ruler to the pipeline.
-nlp = spacy.blank("en")
-ruler = nlp.add_pipe("entity_ruler")
-ruler.add_patterns(patterns)
-
-# Extract entities from the text.
-doc = nlp(text)
-entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
-
-# Create a DocBin object.
-nlp = spacy.blank("en")
-nlp.tokenizer = CharacterTokenizer(nlp.vocab)
-doc_bin = DocBin()
-doc = nlp.make_doc(text)
-doc.ents = [
-    doc.char_span(start, end, label=label) for start, end, label in entities
-]
-doc_bin.add(doc)
-doc_bin.to_disk("/path/to/data.spacy")
-```
-
 ## Training
 
-Train your model as follows:
+Train your tagger as follows:
 
 ```sh
 python -m spacy train config.cfg --output outputs --paths.train /path/to/train.spacy --paths.dev /path/to/dev.spacy --gpu-id 0
 ```
 
-You could download `config.cfg` [here](https://github.com/tech-sketch/spacy-partial-tagger/blob/main/config.cfg).
-Or you could setup your own. This library would train models through spaCy. If you are not familiar with spaCy's config file format, please check the [documentation](https://spacy.io/usage/training#config).
+This library is implemented as [a trainable component](https://spacy.io/usage/layers-architectures#components) in spaCy,
+so you could control the training setting via spaCy's configuration system.
+We provide you the default configuration file [here](https://github.com/tech-sketch/spacy-partial-tagger/blob/main/config.cfg).
+Or you could setup your own. If you are not familiar with spaCy's config file format, please check the [documentation](https://spacy.io/usage/training#config).
 
 Don't forget to replace `/path/to/train.spacy` and `/path/to/dev.spacy` with your own.
 
 ## Evaluation
 
-Evaluate your model as follows:
+Evaluate your tagger as follows:
 
 ```sh
 python -m spacy evaluate outputs/model-best /path/to/test.spacy --gpu-id 0

config.cfg

Lines changed: 3 additions & 2 deletions
@@ -9,10 +9,12 @@ vectors = null
 [corpora.train]
 @readers = "spacy.Corpus.v1"
 path = ${paths.train}
+gold_preproc = true
 
 [corpora.dev]
 @readers = "spacy.Corpus.v1"
 path = ${paths.dev}
+gold_preproc = true
 
 [system]
 gpu_allocator = null
@@ -21,7 +23,6 @@ seed = 0
 [nlp]
 lang = "en"
 pipeline = ["partial_ner"]
-tokenizer = {"@tokenizers": "character_tokenizer.v1"}
 disabled = []
 before_creation = null
 after_creation = null
@@ -60,7 +61,7 @@ nO = null
 dropout = 0.2
 
 [components.partial_ner.model.decoder]
-@architectures = "spacy-partial-tagger.ConstrainedViterbiDecoder.v1"
+@architectures = "spacy-partial-tagger.ViterbiDecoder.v1"
 padding_index = ${components.partial_ner.loss.padding_index}
 
 [training]
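
The decoder entry in `config.cfg` names a Viterbi architecture. For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of Viterbi decoding over per-token emission scores and tag-to-tag transition scores. It is a generic illustration, not spacy-partial-tagger's implementation: the function name `viterbi`, the score layout, and the toy numbers are all assumptions.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   list of per-token score lists, shape [num_tokens][num_tags]
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []            # per step: best previous tag for each tag

    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximizes the path score into tag j.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            step_back.append(best_i)
            step_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
        scores = step_scores
        backpointers.append(step_back)

    # Trace the best path backwards from the best final tag.
    best = max(range(num_tags), key=lambda i: scores[i])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    path.reverse()
    return path


# Toy input (assumed values): two tags, three tokens.
emissions = [[2.0, 0.0], [0.0, 0.6], [2.0, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]
print(viterbi(emissions, transitions))  # → [0, 0, 0]
```

In the toy input, the middle token slightly prefers tag 1 in isolation, but the transition scores penalize switching tags, so the best path stays on tag 0 throughout.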
