Commit 956dfda

Use tokenizer.vocab_size() instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)
There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
1 parent 113e685 · commit 956dfda
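
Background: the converter loads the SentencePiece tokenizer that ships with the model, and vocab_size() reports however many pieces that tokenizer model actually defines. A minimal sketch of the idea, assuming the standard sentencepiece Python bindings and a hypothetical models/tokenizer.model path:

    from sentencepiece import SentencePieceProcessor

    # Hypothetical path; the converter takes the real one from the model directory.
    tokenizer = SentencePieceProcessor("models/tokenizer.model")

    # The base LLaMA tokenizer reports 32000 pieces here, but a tokenizer
    # extended with extra special tokens would report more, which is why the
    # count should not be hardcoded.
    print(tokenizer.vocab_size())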

File tree

1 file changed: +1, -1
convert-pth-to-ggml.py

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ def get_n_parts(dim):
 fout.write(struct.pack("i", ftype))
 
 # Is this correct??
-for i in range(32000):
+for i in range(tokenizer.vocab_size()):
     if tokenizer.is_unknown(i):
         # "<unk>" token (translated as ??)
         text = " \u2047 ".encode("utf-8")
