Commit bed3960

fixup! Add documentation, especially warc2zim doc about rewriting
1 parent 0d0a10f commit bed3960

4 files changed, +12 -8 lines changed

docs/functional_architecture.md

+7-7
@@ -16,7 +16,7 @@ zimscraperlib has primitives to manipulate pictures with some operations which a
 
 zimscraperlib also contains primitives to rewrite HTML, CSS and JS fetched online, to proper operate within a ZIM without heavy modifications. While originaly developped for warc2zim, some of these primitives are now also used for mindtouch scraper and others might follow, so they are shared in zimscraperlib. See `rewriting` module.
 
-## ZIM storage
+### ZIM storage
 
 While storing web resources in a ZIM is mostly straightforward (we just transfer the raw bytes, after some modification for URL rewriting if needed), the decision of the path where the resource will be stored is very important.
 
@@ -26,7 +26,7 @@ This function is responsible to compute the ZIM path where a given web resource
 
 While the URL is the only driver of this computation for now, zimscraperlib might have to consider other contextual data in the future. E.g. the resource to serve might by dynamic, depending not only on URL query parameters but also header(s) value(s).
 
-## Fuzzy rules
+### Fuzzy rules
 
 Unfortunately, it is not always possible / desirable to store the resource with a simple transformation.
 
@@ -36,11 +36,11 @@ When running again the same javascript code inside the ZIM, the URL will hence b
 
 zimscraperlib hence relies on fuzzy rules to transform/simplify some URLs when computing the ZIM path.
 
-## URL Rewriting
+### URL Rewriting
 
 zimscraperlib transforms (rewrites) URLs found in documents (HTML, CSS, JS, ...) so that they are usable inside the ZIM.
 
-### General case
+#### General case
 
 One simple example is that we might have following code in an HTML document to load an image with an absolute URL:
 
@@ -63,7 +63,7 @@ The table below gives some examples of what the rewritten URL is going to be, de
 
 As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored.
 
-### Dynamic case
+#### Dynamic case
 
 The explanation above more or less assumed that the transformations can be done statically, i.e zimscraperlib can open every known document, find existing URLs and replace them with their counterpart inside the ZIM.
 
@@ -75,13 +75,13 @@ A specific function is hence needed to rewrite URL **live in client browser**, i
 
 _Spoiler: this is where we will rely on wombat.js from webrecorder team, since this dynamic interception is quite complex and already done quite neatly by them_
 
-### Fuzzy rules
+#### Fuzzy rules
 
 The same fuzzy rules that have been used to compute the ZIM path from a resource URL have to be applied again when rewriting URLs.
 
 While this is expected to serve mostly for the dynamic case, we still applies them on both side (staticaly and dynamicaly) for coherency.
 
-## Documents rewriten statically
+### Documents rewriten statically
 
 For now zimscraperlib rewrites HTML, CSS and JS documents. For CSS and JS, this mainly consists in replacing URLs. For HTML, we also have more specific rewritting necessary (e.g. to handle base href or redirects with meta).
 
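
For readers of this diff, the fuzzy-rule idea described in `docs/functional_architecture.md` can be pictured as a regex plus a replacement, applied to the URL before the ZIM path is derived and reused when rewriting URLs found in documents. The sketch below is only an illustration under assumptions: `FUZZY_RULES`, `apply_fuzzy_rules`, `url_to_zim_path` and the example rule are hypothetical names, not zimscraperlib's API or its actual rule set.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule (assumption): drop a cache-busting "_=<digits>" query string.
FUZZY_RULES = [
    (re.compile(r"^(?P<base>[^?]+)\?_=\d+$"), r"\g<base>"),
]


def apply_fuzzy_rules(url: str) -> str:
    """Return the URL simplified by the first matching rule, or unchanged."""
    for pattern, replacement in FUZZY_RULES:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url


def url_to_zim_path(url: str) -> str:
    """Illustrative path convention (assumption): host + path, plus query if any."""
    parts = urlsplit(apply_fuzzy_rules(url))
    path = parts.netloc + parts.path
    if parts.query:
        path += "?" + parts.query
    return path
```

Because the same `apply_fuzzy_rules` function would run both when storing a resource and when rewriting a link to it, the two sides agree on the path, which is the coherency argument made in the documentation.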

docs/software_architecture.md

+2
@@ -1,5 +1,7 @@
 # Software architecture
 
+Currently only HTML, CSS and JS rewriting is described in this document.
+
 ## HTML rewriting
 
 HTML rewriting is purely static (i.e. before resources are written to the ZIM). HTML code is parsed with the [HTML parser from Python standard library](https://docs.python.org/3/library/html.parser.html).
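
As a rough illustration of static rewriting with the standard-library parser mentioned above, the sketch below re-serializes a document while passing `src` and `href` values through a caller-supplied function. `AttrRewriter`, `rewrite_html` and the `rewrite_url` callback are hypothetical names for this example; this is not zimscraperlib's implementation, and comments and doctype declarations are dropped to keep it short.

```python
from html import escape
from html.parser import HTMLParser


class AttrRewriter(HTMLParser):
    """Re-emit HTML, rewriting src/href attribute values (illustrative only)."""

    def __init__(self, rewrite_url):
        # Keep character references as-is so text content round-trips.
        super().__init__(convert_charrefs=False)
        self.rewrite_url = rewrite_url
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        rendered = []
        for name, value in attrs:
            if value is None:  # boolean attribute such as "disabled"
                rendered.append(f" {name}")
                continue
            if name in ("src", "href"):
                value = self.rewrite_url(value)
            # The parser unescapes attribute values, so escape them again.
            rendered.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(rendered)}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def rewrite_html(html, rewrite_url):
    parser = AttrRewriter(rewrite_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A caller would pass whatever URL-to-ZIM-path mapping applies, for example `rewrite_html(doc, lambda u: "../" + u.split("://", 1)[1])` for absolute URLs.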

docs/technical_architecture.md

+2
@@ -1,5 +1,7 @@
 # Technical architecture
 
+Currently only HTML, CSS and JS rewriting is described in this document.
+
 ## Fuzzy rules
 
 Fuzzy rules are stored in `rules/rules.yaml`. This configuration file is then used by `rules/generateRules.py` to generate Python and JS code.

src/zimscraperlib/rewriting/statics/README.md

+1-1
@@ -5,5 +5,5 @@ This folder must contain two files which are not under Git version control:
 project from files in the javascript folder
 
 If you install zimscraperlib from sdist or wheel, we've pre-packaged these files for
-convenience and also so that your version of wombatSetup.js ais "aligned" (i.e. if you
+convenience and also so that your version of wombatSetup.js is "aligned" (i.e. if you
 install zimscraperlib x.y.z, we are sure which version wombatSetup.js you have).
