Commit 35bf42b

Dec 2023 Update
1 parent 8fa2f43 commit 35bf42b

5 files changed: 14 additions, 9 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -102,3 +102,7 @@ venv.bak/
 
 # mypy
 .mypy_cache/
+
+etymology.csv
+etymology.csv.gz
+etymology.parquet

README.md

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
 # etymology-db
-**Downloads:** (Last generated 2021-11-14)
+**Downloads:** (Last generated 2023-12-05)
 [**Gzipped CSV**](https://1drv.ms/u/s!AtpEocFNRNBWhAe7co0JFvac-OfA?e=wnJe4r)
 [**Parquet**](https://1drv.ms/u/s!AtpEocFNRNBWhhP6w5D9XfdtPH9I?e=jWRwnI)
 
 A structured, comprehensive, and multilingual etymology dataset created by parsing Wiktionary's etymology sections. Key features:
-* 3.8+ million etymological relationships between 1.8+ million terms in 2900+ languages/dialects
+* 4.2+ million etymological relationships between 2.0+ million terms in 3300+ languages/dialects
 * 31 different types of etymological relations, distinguishing between inheritance, borrowing, etc.
 * Hierarchical data that preserves relationship structures, such as the evolution of a term across languages
 
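To sanity-check a regenerated download, something like the following could be used to peek at the gzipped CSV. This is a minimal sketch, not part of the commit: it assumes the file has been saved locally as `etymology.csv.gz` (the name listed in `.gitignore`), that it is UTF-8 encoded, and that its first row is a header. The helper name `peek_csv_gz` is hypothetical, and no column names are assumed.

```python
import csv
import gzip
from pathlib import Path
from typing import List, Tuple


def peek_csv_gz(path: Path, n_rows: int = 5) -> Tuple[List[str], List[List[str]]]:
    """Return the first row (assumed to be a header) and up to n_rows data rows."""
    # Text mode with newline="" lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # zip() stops after n_rows without reading the rest of the file.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Example call (assumes the download exists at this local path):
# header, rows = peek_csv_gz(Path("etymology.csv.gz"))
```

Reading only the first few rows keeps the check cheap even though the full dataset holds millions of relationships.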

main.py

Lines changed: 4 additions & 3 deletions
@@ -3,7 +3,7 @@
 import csv
 import logging
 import re
-from multiprocessing import Pool
+from multiprocessing import Pool, freeze_support
 from datetime import datetime, timedelta
 from pathlib import Path
 from typing import Generator, List, Tuple
@@ -64,8 +64,9 @@ def write_all():
 elapsed = (datetime.now() - time)
 if elapsed.total_seconds() > 1:
 elapsed -= timedelta(microseconds=elapsed.microseconds)
-print(f"Entries parsed: {entries_parsed} Time elapsed: {elapsed} "
-f"Entries per second: {entries_parsed // elapsed.total_seconds()}{' ' * 10}", end="\r", flush=True)
+if entries_parsed % 1000 == 0:
+print(f"Entries parsed: {entries_parsed} Time elapsed: {elapsed} "
+f"Entries per second: {entries_parsed // elapsed.total_seconds()}{' ' * 10}", end="\r", flush=True)
 
 
 def stream_terms() -> Generator[Tuple[str, str], None, None]:
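The second hunk throttles the progress line so it is printed only when the counter is a multiple of 1000. The throttling logic can be sketched in isolation as below. The helper names `should_report` and `format_progress` are hypothetical and do not appear in `main.py`, and the zero-interval guard is an addition for this sketch rather than something the diff shows.

```python
from datetime import timedelta


def should_report(entries_parsed: int, every: int = 1000) -> bool:
    """True on every `every`-th entry, mirroring `entries_parsed % 1000 == 0`."""
    return entries_parsed % every == 0


def format_progress(entries_parsed: int, elapsed: timedelta) -> str:
    """Build the status line; microseconds are dropped once more than 1 s has passed."""
    if elapsed.total_seconds() > 1:
        elapsed -= timedelta(microseconds=elapsed.microseconds)
    seconds = elapsed.total_seconds()
    # Guard against a zero-length interval (an addition for this sketch).
    rate = entries_parsed // seconds if seconds > 0 else 0.0
    return (f"Entries parsed: {entries_parsed} Time elapsed: {elapsed} "
            f"Entries per second: {rate}")
```

A caller would print `format_progress(...)` with `end="\r", flush=True` only when `should_report(...)` is true. One property of an exact-multiple check is worth keeping in mind: if the counter ever advances by more than one per iteration, it can step over a multiple of 1000 and skip that report.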

requirements.txt

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-lxml==4.9.1
-requests==2.26.0
-mwparserfromhell==0.6.3
+lxml==4.9.3
+requests==2.31.0
+mwparserfromhell==0.6.5

templates.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 from elements import Etymology
 
 
-unparsed_templates = Manager().dict()
+unparsed_templates = dict()
 
 
 class RelType(Enum):
     Inherited = "inherited_from"
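Replacing `Manager().dict()` with a plain `dict()` removes the manager proxy, but a plain module-level dict is local to each process: anything a `Pool` worker writes to it stays in that worker. If a combined tally is still wanted, one option is to have each task return its own counts and merge them in the parent. The sketch below illustrates that pattern only. It assumes `unparsed_templates` serves as a tally of template names the parser does not handle, which the diff does not confirm, and `parse_entry`, `merge_unparsed`, and the `known` set are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple


def parse_entry(templates: List[str], known: FrozenSet[str]) -> Tuple[List[str], Counter]:
    """Hypothetical per-task worker: split templates into handled and unhandled."""
    parsed = [t for t in templates if t in known]
    unparsed = Counter(t for t in templates if t not in known)
    return parsed, unparsed


def merge_unparsed(results: Iterable[Tuple[List[str], Counter]]) -> Counter:
    """Combine the per-task tallies in the parent process."""
    total: Counter = Counter()
    for _, unparsed in results:
        total.update(unparsed)
    return total


# With a pool, `results` would be the iterable returned by e.g. pool.imap_unordered(...).
```

Returning tallies with the task results avoids shared state entirely, at the cost of pickling a small `Counter` per task.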
