doccano
diff --git a/README.md b/README.md
Lines changed: 8 additions & 46 deletions
diff --git a/config.cfg b/config.cfg
Lines changed: 3 additions & 2 deletions
@@ -8,19 +8,17 @@ This is a library to build a CRF tagger for a partially annotated dataset in spa
 
 ## Dataset Preparation
 
-Prepare spaCy binary format file. This library expects tokenization is character-based.
-For more detail about spaCy binary format, see [this page](https://spacy.io/api/data-formats#training).
+Prepare a file in spaCy's binary format to train your tagger.
+If you are not familiar with spaCy's binary format, see [this page](https://spacy.io/api/data-formats#training).
 
 You can prepare your own dataset with [spaCy's entity ruler](https://spacy.io/usage/rule-based-matching#entityruler) as follows:
 
 ```py
 import spacy
 from spacy.tokens import DocBin
-from spacy_partial_tagger.tokenizer import CharacterTokenizer
 
 
 nlp = spacy.blank("en")
-nlp.tokenizer = CharacterTokenizer(nlp.vocab)  # Use a character-based tokenizer
 
 patterns = [{"label": "LOC", "pattern": "Tokyo"}, {"label": "LOC", "pattern": "Japan"}]
 ruler = nlp.add_pipe("entity_ruler")
@@ -35,60 +33,24 @@ doc_bin.add(doc)
 doc_bin.to_disk("/path/to/data.spacy")
 ```
 
-In the example above, entities are extracted by character-based matching. However, in some cases, character-based matching may not be suitable (e.g., the element symbol `na` for sodium matches name). In such cases, token-based matching can be used as follows:
-
-```py
-import spacy
-from spacy.tokens import DocBin
-from spacy_partial_tagger.tokenizer import CharacterTokenizer
-
-text = "Selegiline - induced postural hypotension in Parkinson's disease: a longitudinal study on the effects of drug withdrawal."
-patterns = [
-    {"label": "Chemical", "pattern": [{"LOWER": "selegiline"}]},
-    {"label": "Disease", "pattern": [{"LOWER": "hypotension"}]},
-    {
-        "label": "Disease",
-        "pattern": [{"LOWER": "parkinson"}, {"LOWER": "'s"}, {"LOWER": "disease"}],
-    },
-]
-
-# Add an entity ruler to the pipeline.
-nlp = spacy.blank("en")
-ruler = nlp.add_pipe("entity_ruler")
-ruler.add_patterns(patterns)
-
-# Extract entities from the text.
-doc = nlp(text)
-entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
-
-# Create a DocBin object.
-nlp = spacy.blank("en")
-nlp.tokenizer = CharacterTokenizer(nlp.vocab)
-doc_bin = DocBin()
-doc = nlp.make_doc(text)
-doc.ents = [
-    doc.char_span(start, end, label=label) for start, end, label in entities
-]
-doc_bin.add(doc)
-doc_bin.to_disk("/path/to/data.spacy")
-```
-
 ## Training
 
-Train your model as follows:
+Train your tagger as follows:
 
 ```sh
 python -m spacy train config.cfg --output outputs --paths.train /path/to/train.spacy --paths.dev /path/to/dev.spacy --gpu-id 0
 ```
 
-You could download `config.cfg` [here](https://github.com/tech-sketch/spacy-partial-tagger/blob/main/config.cfg).
-Or you could setup your own. This library would train models through spaCy. If you are not familiar with spaCy's config file format, please check the [documentation](https://spacy.io/usage/training#config).
+This library is implemented as [a trainable component](https://spacy.io/usage/layers-architectures#components) in spaCy,
+so you can control the training settings via spaCy's configuration system.
+A default configuration file is provided [here](https://github.com/tech-sketch/spacy-partial-tagger/blob/main/config.cfg).
+Alternatively, you can set up your own. If you are not familiar with spaCy's config file format, please check the [documentation](https://spacy.io/usage/training#config).
 
 Don't forget to replace `/path/to/train.spacy` and `/path/to/dev.spacy` with your own.
 
 ## Evaluation
 
-Evaluate your model as follows:
+Evaluate your tagger as follows:
 
 ```sh
 python -m spacy evaluate outputs/model-best /path/to/test.spacy --gpu-id 0
 
@@ -9,10 +9,12 @@ vectors = null
 [corpora.train]
 @readers = "spacy.Corpus.v1"
 path = ${paths.train}
+gold_preproc = true
 
 [corpora.dev]
 @readers = "spacy.Corpus.v1"
 path = ${paths.dev}
+gold_preproc = true
 
 [system]
 gpu_allocator = null
@@ -21,7 +23,6 @@ seed = 0
 [nlp]
 lang = "en"
 pipeline = ["partial_ner"]
-tokenizer = {"@tokenizers": "character_tokenizer.v1"}
 disabled = []
 before_creation = null
 after_creation = null
@@ -60,7 +61,7 @@ nO = null
 dropout = 0.2
 
 [components.partial_ner.model.decoder]
-@architectures = "spacy-partial-tagger.ConstrainedViterbiDecoder.v1"
+@architectures = "spacy-partial-tagger.ViterbiDecoder.v1"
 padding_index = ${components.partial_ner.loss.padding_index}
 
 [training]
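
Note on the decoder change: the last `config.cfg` hunk swaps `ConstrainedViterbiDecoder.v1` for `ViterbiDecoder.v1`. The sketch below is not the library's implementation; it only illustrates what the two names suggest, under the assumption that "constrained" means forbidding some tag transitions. The function `viterbi`, the `allowed` set, the tag names, and all scores are hypothetical and chosen for the example.

```python
from typing import Dict, List, Optional, Set, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
    allowed: Optional[Set[Tuple[str, str]]] = None,
) -> List[str]:
    """Return the highest-scoring tag path.

    If `allowed` is given, only (prev, cur) transitions in that set are used.
    This is a hypothetical stand-in for a "constrained" decoder.
    """
    if not emissions:
        return []
    neg_inf = float("-inf")
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, Optional[str]]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, Optional[str]] = {}
        for cur in tags:
            best_prev, best_score = None, neg_inf
            for prev in tags:
                if allowed is not None and (prev, cur) not in allowed:
                    continue  # skip forbidden transitions
                score = best[-1][prev] + transitions.get((prev, cur), 0.0)
                if score > best_score:
                    best_prev, best_score = prev, score
            scores[cur] = neg_inf if best_prev is None else best_score + emissions[t][cur]
            pointers[cur] = best_prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-LOC", "I-LOC"]
# Made-up emission scores for a two-token input.
emissions = [
    {"O": 1.0, "B-LOC": 0.8, "I-LOC": 0.0},
    {"O": 0.0, "B-LOC": 0.0, "I-LOC": 1.0},
]
# Allow every transition except O -> I-LOC (an entity cannot continue from O).
allowed = {(p, c) for p in tags for c in tags} - {("O", "I-LOC")}

print(viterbi(emissions, {}, tags))           # ['O', 'I-LOC']
print(viterbi(emissions, {}, tags, allowed))  # ['B-LOC', 'I-LOC']
```

With the same scores, the unconstrained call returns the ill-formed sequence `O, I-LOC`, while the constrained call falls back to the best path that respects the mask. Whether dropping the constraint matters in practice for this tagger depends on the trained scores and is not something this diff states.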