@@ -12,26 +11,26 @@ Collection of python code to re-use across python-based scrapers
 
 # Usage
 
-* This library is meant to be installed via PyPI ([`zimscraperlib`](https://pypi.org/project/zimscraperlib/)).
-* Make sure to reference it using a version code as the API is subject to frequent changes.
-* API should remain the same only within the same *minor* version.
+- This library is meant to be installed via PyPI ([`zimscraperlib`](https://pypi.org/project/zimscraperlib/)).
+- Make sure to reference it using a version code as the API is subject to frequent changes.
+- API should remain the same only within the same _minor_ version.
 
 Example usage:
 
-```pip
+```pip
 zimscraperlib>=1.1,<1.2
 ```
 
 See [functional architecture](docs/functional_architecture.md), [software architecture](docs/software_architecture.md) and [technical architecture](docs/technical_architecture.md) for more details on scraperlib (not all aspects are covered yet, this is a WIP).
 
 # Dependencies
 
-* libmagic
-* wget
-* libzim (auto-installed, not available on Windows)
-* Pillow
-* FFmpeg
-* gifsicle (>=1.92)
+- libmagic
+- wget
+- libzim (auto-installed, not available on Windows)
As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored.
docs/software_architecture.md (+1 -1)
@@ -22,4 +22,4 @@ Static JS rewriting is simply a matter of pure textual manipulation with regular
 
 Dynamic JS rewriting is done with [wombat JS library](https://github.com/webrecorder/wombat). The same fuzzy rules that are used for static rewritting are injected into wombat configuration. Code to rewrite URLs is an adapted version of the code used to compute ZIM paths.
 
-For wombat setup, including the URL rewriting part, we need to pass wombat configuration info. This code is developed in the `javascript` folder. For URL parsing, it relies on the [uri-js library](https://www.npmjs.com/package/uri-js). This javascript code is bundled into a single `wombatSetup.js` file with [rollup bundler](https://rollupjs.org), the same bundler used by webrecorder team to bundle wombat.
+For wombat setup, including the URL rewriting part, we need to pass wombat configuration info. This code is developed in the `javascript` folder. For URL parsing, it relies on the [uri-js library](https://www.npmjs.com/package/uri-js). This javascript code is bundled into a single `wombatSetup.js` file with [rollup bundler](https://rollupjs.org), the same bundler used by webrecorder team to bundle wombat.
docs/technical_architecture.md (+3 -1)
@@ -5,6 +5,7 @@
 Fuzzy rules are stored in `rules/rules.yaml`. This configuration file is then used by `rules/generateRules.py` to generate Python and JS code.
 
 Should you update these fuzzy rules, you hence have to:
+
 - regenerate Python and JS files by running `python rules/generateRules.py`
 - bundle again Javascript `wombatSetup.js` (see below).
 
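
The hunk above describes a single rules file from which both Python and JS code are generated. The following is only a rough sketch of that one-source, two-targets idea. The rule shape (`pattern` / `replace`), the sample rule, and the names `render_python` and `render_js` are all hypothetical; the actual schema of `rules/rules.yaml` and the behaviour of `rules/generateRules.py` are not shown in this diff.

```python
import json

# Hypothetical rule shape: the real schema of rules/rules.yaml is not shown here.
RULES = [
    {
        "pattern": r"^example\.com/watch\?.*(v=[^&]+).*$",
        "replace": r"example.com/watch?\1",
    },
]


def render_python(rules):
    """Render the rules as Python source text.

    JSON output made only of lists, dicts and strings is also a valid
    Python literal, which is all this sketch relies on.
    """
    return "FUZZY_RULES = " + json.dumps(rules, indent=4) + "\n"


def render_js(rules):
    """Render the same rules as a JavaScript (ES module) source text."""
    return "export const fuzzyRules = " + json.dumps(rules, indent=4) + ";\n"


if __name__ == "__main__":
    # Both outputs come from the same in-memory rules, so they cannot drift apart.
    print(render_python(RULES))
    print(render_js(RULES))
```

Generating both files from one source is what makes the "regenerate, then bundle again" steps necessary: the bundled JS only picks up a rule change after it has been regenerated.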
@@ -25,6 +26,7 @@ WARC record stores the items URL inside a header named "WARC-Target-URI". The va
 It has been decided (by convention) that we will drop the scheme, the port, the username and password from the URL. Headers are also not considered in this computation.
 
 Computation of the ZIM path is hence mostly straightforward:
+
 - decode the hostname which is puny-encoded
 - decode the path and query parameter which might be url-encoded
 
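
The convention quoted in this hunk (drop scheme, port, username and password; decode the punycode hostname; URL-decode the path and query) could be sketched with the Python standard library as below. This is an illustration under assumptions, not scraperlib's implementation: the function name `sketch_zim_path` is hypothetical, the way the query is re-attached with `?` is a guess, and only the steps visible in this hunk are covered.

```python
from urllib.parse import unquote, urlsplit


def sketch_zim_path(url: str) -> str:
    """Illustrative only: derive a ZIM-like path from a URL."""
    parts = urlsplit(url)

    # `hostname` already excludes the username, password and port.
    host = parts.hostname or ""

    # Decode punycode labels ("xn--...") back to Unicode.
    # Assumes the hostname is ASCII, as a puny-encoded one is.
    host = host.encode("ascii").decode("idna")

    # Path and query may be percent-encoded.
    path = unquote(parts.path)
    result = host + path
    if parts.query:
        # Assumption: the decoded query is kept after a "?" separator.
        result += "?" + unquote(parts.query)
    return result


if __name__ == "__main__":
    print(sketch_zim_path("https://user:pw@xn--bcher-kva.de:8080/caf%C3%A9?q=a%20b"))
```

The scheme is dropped simply by never reading `parts.scheme`, which mirrors the convention that two URLs differing only by scheme or port map to the same path.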
@@ -49,4 +51,4 @@ JS Rewriting is a bit special because rules to apply are different wether we are
 
 Detection of Javascript modules starts at the HTML level where we have a `<script type="module" src="...">` tag. This tells us that file at src location is a Javascript module. From there we now that its subresources are also Javascript module.
 
-Currently this detection is done on-the-fly, based on the fact that WARC items are processed in the same order that they have been fetched by the browser, and we hence do not need a multi-pass approach. Meaning that HTML will be processed first, then parent JS, then its dependencies, ... **This is a strong assumption**.
+Currently this detection is done on-the-fly, based on the fact that WARC items are processed in the same order that they have been fetched by the browser, and we hence do not need a multi-pass approach. Meaning that HTML will be processed first, then parent JS, then its dependencies, ... **This is a strong assumption**.
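
To make the single-pass assumption concrete, here is a minimal sketch of on-the-fly module detection. It is not scraperlib's code: the record shape `(url, kind, text)`, the names `ModuleScriptFinder`, `find_module_scripts` and `classify_scripts`, and the naive import regex are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Naive: catches `import "x"` and `... from "x"`, nothing more (no dynamic imports).
IMPORT_RE = re.compile(r'(?:import|from)\s*["\']([^"\']+)["\']')


class ModuleScriptFinder(HTMLParser):
    """Collect absolute URLs of <script type="module" src="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.module_urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "module" and attr.get("src"):
            self.module_urls.add(urljoin(self.base_url, attr["src"]))


def find_module_scripts(html, base_url):
    finder = ModuleScriptFinder(base_url)
    finder.feed(html)
    return finder.module_urls


def classify_scripts(records):
    """Single pass over (url, kind, text) records, in fetch order.

    Returns {js_url: is_module}. A JS file is seen as a module only if
    something processed earlier declared or imported it.
    """
    known_modules = set()
    result = {}
    for url, kind, text in records:
        if kind == "html":
            known_modules |= find_module_scripts(text, url)
        elif kind == "js":
            result[url] = url in known_modules
            if result[url]:
                # Sub-resources of a module are modules too.
                for spec in IMPORT_RE.findall(text):
                    known_modules.add(urljoin(url, spec))
    return result


if __name__ == "__main__":
    records = [
        ("https://example.com/index.html", "html",
         '<script type="module" src="/app.js"></script>'),
        ("https://example.com/app.js", "js", 'import { f } from "./dep.js";'),
        ("https://example.com/dep.js", "js", "export const f = 1;"),
    ]
    print(classify_scripts(records))
    # Reversed order: the same files are no longer recognised as modules.
    print(classify_scripts(list(reversed(records))))
```

Running the same records in reverse order classifies both scripts as non-modules, which is exactly why the ordering assumption is called a strong one.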