Commit 17c8ec8

Fix formatting - prettier
1 parent c0f4e19 commit 17c8ec8

5 files changed, +38 -35 lines changed

README.md

+24-24
@@ -1,5 +1,4 @@
-zimscraperlib
-=============
+# zimscraperlib
 
 [![Build Status](https://github.com/openzim/python-scraperlib/workflows/CI/badge.svg?query=branch%3Amain)](https://github.com/openzim/python-scraperlib/actions?query=branch%3Amain)
 [![CodeFactor](https://www.codefactor.io/repository/github/openzim/python-scraperlib/badge)](https://www.codefactor.io/repository/github/openzim/python-scraperlib)
@@ -12,26 +11,26 @@ Collection of python code to re-use across python-based scrapers
 
 # Usage
 
-* This library is meant to be installed via PyPI ([`zimscraperlib`](https://pypi.org/project/zimscraperlib/)).
-* Make sure to reference it using a version code as the API is subject to frequent changes.
-* API should remain the same only within the same *minor* version.
+- This library is meant to be installed via PyPI ([`zimscraperlib`](https://pypi.org/project/zimscraperlib/)).
+- Make sure to reference it using a version code as the API is subject to frequent changes.
+- API should remain the same only within the same _minor_ version.
 
 Example usage:
 
-``` pip
+```pip
 zimscraperlib>=1.1,<1.2
 ```
 
 See [functional architecture](docs/functional_architecture.md), [software architecture](docs/software_architecture.md) and [technical architecture](docs/technical_architecture.md) for more details on scraperlib (not all aspects are covered yet, this is a WIP).
 
 # Dependencies
 
-* libmagic
-* wget
-* libzim (auto-installed, not available on Windows)
-* Pillow
-* FFmpeg
-* gifsicle (>=1.92)
+- libmagic
+- wget
+- libzim (auto-installed, not available on Windows)
+- Pillow
+- FFmpeg
+- gifsicle (>=1.92)
 
 ## macOS
 
@@ -49,6 +48,7 @@ sudo apt install libmagic1 wget ffmpeg \
 ```
 
 ## Alpine
+
 ```
 apk add ffmpeg gifsicle libmagic wget libjpeg
 ```
@@ -71,15 +71,15 @@ invoke coverage
 
 Non-exhaustive list of scrapers using it (check status when updating API):
 
-* [openzim/freecodecamp](https://github.com/openzim/freecodecamp)
-* [openzim/gutenberg](https://github.com/openzim/gutenberg)
-* [openzim/ifixit](https://github.com/openzim/ifixit)
-* [openzim/kolibri](https://github.com/openzim/kolibri)
-* [openzim/nautilus](https://github.com/openzim/nautilus)
-* [openzim/nautilus](https://github.com/openzim/nautilus)
-* [openzim/openedx](https://github.com/openzim/openedx)
-* [openzim/sotoki](https://github.com/openzim/sotoki)
-* [openzim/ted](https://github.com/openzim/ted)
-* [openzim/warc2zim](https://github.com/openzim/warc2zim)
-* [openzim/wikihow](https://github.com/openzim/wikihow)
-* [openzim/youtube](https://github.com/openzim/youtube)
+- [openzim/freecodecamp](https://github.com/openzim/freecodecamp)
+- [openzim/gutenberg](https://github.com/openzim/gutenberg)
+- [openzim/ifixit](https://github.com/openzim/ifixit)
+- [openzim/kolibri](https://github.com/openzim/kolibri)
+- [openzim/nautilus](https://github.com/openzim/nautilus)
+- [openzim/nautilus](https://github.com/openzim/nautilus)
+- [openzim/openedx](https://github.com/openzim/openedx)
+- [openzim/sotoki](https://github.com/openzim/sotoki)
+- [openzim/ted](https://github.com/openzim/ted)
+- [openzim/warc2zim](https://github.com/openzim/warc2zim)
+- [openzim/wikihow](https://github.com/openzim/wikihow)
+- [openzim/youtube](https://github.com/openzim/youtube)

docs/functional_architecture.md

+5-5
@@ -54,11 +54,11 @@ For proper reader operation, openZIM prohibits using absolute URLs, so this has
 
 The table below gives some examples of what the rewritten URL is going to be, depending on the URL of the rewritten document.
 
-| HTML document URL | image URL rewritten for usage inside the ZIM |
-|--|--|
-| `https://en.wikipedia.org/wiki/Kiwix` | `./File:Kiwix_logo_v3.svg` |
-| `https://en.wikipedia.org/wiki` | `./wiki/File:Kiwix_logo_v3.svg` |
-| `https://en.wikipedia.org/waka/Kiwix` | `../wiki/File:Kiwix_logo_v3.svg` |
+| HTML document URL                     | image URL rewritten for usage inside the ZIM         |
+| ------------------------------------- | ---------------------------------------------------- |
+| `https://en.wikipedia.org/wiki/Kiwix` | `./File:Kiwix_logo_v3.svg`                           |
+| `https://en.wikipedia.org/wiki`       | `./wiki/File:Kiwix_logo_v3.svg`                      |
+| `https://en.wikipedia.org/waka/Kiwix` | `../wiki/File:Kiwix_logo_v3.svg`                     |
 | `https://fr.wikipedia.org/wiki/Kiwix` | `../../en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` |
 
 As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored.
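
The relative rewriting illustrated by the table in this hunk can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not scraperlib's actual code: `zim_path` and `rewrite_url` are hypothetical helpers, the ZIM path is assumed to be simply hostname plus path, and the image is assumed to be served from `https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` (the table does not state its original URL).

```python
import posixpath
from urllib.parse import urlsplit


def zim_path(url: str) -> str:
    """Hypothetical convention: the ZIM path is the hostname followed by the URL path."""
    parts = urlsplit(url)
    return (parts.hostname or "") + parts.path


def rewrite_url(document_url: str, resource_url: str) -> str:
    """Return resource_url as a path relative to the document's location in the ZIM."""
    # The document's "directory" is everything before its last path segment.
    document_dir = posixpath.dirname(zim_path(document_url))
    relative = posixpath.relpath(zim_path(resource_url), start=document_dir)
    # Prefix with "./" unless the path already climbs with "../".
    return relative if relative.startswith("../") else "./" + relative


# Assumed original location of the image (not given in the table).
IMAGE_URL = "https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"

for document_url in (
    "https://en.wikipedia.org/wiki/Kiwix",
    "https://en.wikipedia.org/wiki",
    "https://en.wikipedia.org/waka/Kiwix",
    "https://fr.wikipedia.org/wiki/Kiwix",
):
    print(document_url, "->", rewrite_url(document_url, IMAGE_URL))
```

Because the result is always relative, it never contains a scheme or a leading slash, which is consistent with the prohibition of absolute URLs mentioned in the hunk header.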

docs/software_architecture.md

+1-1
@@ -22,4 +22,4 @@ Static JS rewriting is simply a matter of pure textual manipulation with regular
 
 Dynamic JS rewriting is done with [wombat JS library](https://github.com/webrecorder/wombat). The same fuzzy rules that are used for static rewritting are injected into wombat configuration. Code to rewrite URLs is an adapted version of the code used to compute ZIM paths.
 
-For wombat setup, including the URL rewriting part, we need to pass wombat configuration info. This code is developed in the `javascript` folder. For URL parsing, it relies on the [uri-js library](https://www.npmjs.com/package/uri-js). This javascript code is bundled into a single `wombatSetup.js` file with [rollup bundler](https://rollupjs.org), the same bundler used by webrecorder team to bundle wombat.
+For wombat setup, including the URL rewriting part, we need to pass wombat configuration info. This code is developed in the `javascript` folder. For URL parsing, it relies on the [uri-js library](https://www.npmjs.com/package/uri-js). This javascript code is bundled into a single `wombatSetup.js` file with [rollup bundler](https://rollupjs.org), the same bundler used by webrecorder team to bundle wombat.

docs/technical_architecture.md

+3-1
@@ -5,6 +5,7 @@
 Fuzzy rules are stored in `rules/rules.yaml`. This configuration file is then used by `rules/generateRules.py` to generate Python and JS code.
 
 Should you update these fuzzy rules, you hence have to:
+
 - regenerate Python and JS files by running `python rules/generateRules.py`
 - bundle again Javascript `wombatSetup.js` (see below).
 

@@ -25,6 +26,7 @@ WARC record stores the items URL inside a header named "WARC-Target-URI". The va
 It has been decided (by convention) that we will drop the scheme, the port, the username and password from the URL. Headers are also not considered in this computation.
 
 Computation of the ZIM path is hence mostly straightforward:
+
 - decode the hostname which is puny-encoded
 - decode the path and query parameter which might be url-encoded
 
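The convention described in this hunk (drop scheme, port, username and password; decode the puny-encoded hostname; decode the url-encoded path and query) could look roughly like the sketch below. `compute_zim_path` is a hypothetical name, and joining the decoded query with `?` is an assumption made for illustration; the hunk does not show how scraperlib actually assembles the final path, and fuzzy rules are ignored here.

```python
from urllib.parse import unquote, urlsplit


def compute_zim_path(url: str) -> str:
    """Sketch of a ZIM path computation: decoded hostname + decoded path (+ decoded query)."""
    parts = urlsplit(url)
    # `hostname` already excludes the username, password and port.
    hostname = parts.hostname or ""
    # Decode punycode labels (e.g. "xn--bcher-kva") with Python's built-in idna codec.
    # Assumes the hostname is ASCII, as it normally is in a raw URL.
    hostname = hostname.encode("ascii").decode("idna")
    # Decode percent-encoded characters in the path and query.
    path = unquote(parts.path)
    if parts.query:
        # Assumption: the query is kept, separated by "?".
        path += "?" + unquote(parts.query)
    return hostname + path


print(compute_zim_path("https://user:pw@xn--bcher-kva.example:8443/caf%C3%A9?q=a%20b"))
```

The scheme is dropped implicitly, since only the hostname, path and query components of the parsed URL are used.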
@@ -49,4 +51,4 @@ JS Rewriting is a bit special because rules to apply are different wether we are
 
 Detection of Javascript modules starts at the HTML level where we have a `<script type="module" src="...">` tag. This tells us that file at src location is a Javascript module. From there we now that its subresources are also Javascript module.
 
-Currently this detection is done on-the-fly, based on the fact that WARC items are processed in the same order that they have been fetched by the browser, and we hence do not need a multi-pass approach. Meaning that HTML will be processed first, then parent JS, then its dependencies, ... **This is a strong assumption**.
+Currently this detection is done on-the-fly, based on the fact that WARC items are processed in the same order that they have been fetched by the browser, and we hence do not need a multi-pass approach. Meaning that HTML will be processed first, then parent JS, then its dependencies, ... **This is a strong assumption**.
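
The single-pass module detection described in this hunk can be sketched as follows. This is an illustration only, not the library's implementation: `ModuleScriptFinder`, `IMPORT_RE` and `detect_modules` are hypothetical names, records are assumed to arrive as `(url, kind, text)` tuples in fetch order, and static imports are found with a deliberately crude regular expression that ignores dynamic `import()` calls.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class ModuleScriptFinder(HTMLParser):
    """Collects absolute URLs of <script type="module" src="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.modules: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("type") == "module" and attributes.get("src"):
            self.modules.add(urljoin(self.base_url, attributes["src"]))


# Crude matcher for `import ... from "x"` and `import "x"` statements.
IMPORT_RE = re.compile(r"""\bimport\s+(?:[^'"]*?\sfrom\s+)?['"]([^'"]+)['"]""")


def detect_modules(records) -> set[str]:
    """Single pass over (url, kind, text) records, assumed to be in fetch order."""
    modules: set[str] = set()
    for url, kind, text in records:
        if kind == "html":
            finder = ModuleScriptFinder(url)
            finder.feed(text)
            modules |= finder.modules
        elif kind == "js" and url in modules:
            # A known module: everything it imports is a module too.
            for specifier in IMPORT_RE.findall(text):
                modules.add(urljoin(url, specifier))
    return modules


records = [
    ("https://example.com/index.html", "html", '<script type="module" src="/js/main.js"></script>'),
    ("https://example.com/js/main.js", "js", 'import { helper } from "./helper.js";'),
    ("https://example.com/js/helper.js", "js", "export const helper = 1;"),
]
print(sorted(detect_modules(records)))
```

If a dependency were processed before the HTML page or parent module that references it, this single pass would not yet know it is a module, which is why the ordering assumption matters.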
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,9 @@
 This folder must contain two files which are not under Git version control:
+
 - wombat.js, a webrecorder software
 - wombatSetup.js, a custom configuration script for wombat.js, which is built in this
-project from files in the javascript folder
+project from files in the javascript folder
 
-If you install zimscraperlib from sdist or wheel, we've pre-packaged these files for
-convenience and also so that your version of wombatSetup.js ais "aligned" (i.e. if you
-install zimscraperlib x.y.z, we are sure which version wombatSetup.js you have).
+If you install zimscraperlib from sdist or wheel, we've pre-packaged these files for
+convenience and also so that your version of wombatSetup.js ais "aligned" (i.e. if you
+install zimscraperlib x.y.z, we are sure which version wombatSetup.js you have).
