docs/functional_architecture.md (+7 -7)
@@ -16,7 +16,7 @@ zimscraperlib has primitives to manipulate pictures with some operations which a
zimscraperlib also contains primitives to rewrite HTML, CSS and JS fetched online, so that they operate properly within a ZIM without heavy modifications. While originally developed for warc2zim, some of these primitives are now also used by the mindtouch scraper, and others might follow, so they are shared in zimscraperlib. See the `rewriting` module.
-## ZIM storage
+### ZIM storage
While storing web resources in a ZIM is mostly straightforward (we just transfer the raw bytes, after some modification for URL rewriting if needed), the decision of the path where the resource will be stored is very important.
@@ -26,7 +26,7 @@ This function is responsible to compute the ZIM path where a given web resource
While the URL is the only driver of this computation for now, zimscraperlib might have to consider other contextual data in the future. E.g. the resource to serve might be dynamic, depending not only on URL query parameters but also on header values.
-## Fuzzy rules
+### Fuzzy rules
Unfortunately, it is not always possible / desirable to store the resource with a simple transformation.
@@ -36,11 +36,11 @@ When running again the same javascript code inside the ZIM, the URL will hence b
zimscraperlib hence relies on fuzzy rules to transform/simplify some URLs when computing the ZIM path.
-## URL Rewriting
+### URL Rewriting
zimscraperlib transforms (rewrites) URLs found in documents (HTML, CSS, JS, ...) so that they are usable inside the ZIM.
-### General case
+#### General case
One simple example is that we might have the following code in an HTML document to load an image with an absolute URL:
@@ -63,7 +63,7 @@ The table below gives some examples of what the rewritten URL is going to be, de
As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored.
-### Dynamic case
+#### Dynamic case
The explanation above more or less assumed that the transformations can be done statically, i.e. zimscraperlib can open every known document, find existing URLs and replace them with their counterpart inside the ZIM.
@@ -75,13 +75,13 @@ A specific function is hence needed to rewrite URL **live in client browser**, i
_Spoiler: this is where we will rely on wombat.js from webrecorder team, since this dynamic interception is quite complex and already done quite neatly by them_
-### Fuzzy rules
+#### Fuzzy rules
The same fuzzy rules that have been used to compute the ZIM path from a resource URL have to be applied again when rewriting URLs.
While this is expected to serve mostly for the dynamic case, we still apply them on both sides (statically and dynamically) for consistency.
-## Documents rewritten statically
+### Documents rewritten statically
For now zimscraperlib rewrites HTML, CSS and JS documents. For CSS and JS, this mainly consists of replacing URLs. For HTML, more specific rewriting is also necessary (e.g. to handle base href or redirects with meta).
docs/software_architecture.md (+2)
@@ -1,5 +1,7 @@
# Software architecture
+Currently only HTML, CSS and JS rewriting is described in this document.
+
## HTML rewriting
HTML rewriting is purely static (i.e. before resources are written to the ZIM). HTML code is parsed with the [HTML parser from the Python standard library](https://docs.python.org/3/library/html.parser.html).
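
To make the static approach concrete, here is a minimal sketch of rewriting URL attributes with `html.parser.HTMLParser`. It is not zimscraperlib's actual code: `AttributeRewriter`, `rewrite_html`, `demo_rewrite`, the set of attributes considered, and the "/host/path" target convention are all assumptions made for illustration. Comments, doctypes and self-closing syntax are deliberately ignored to keep it short.

```python
import html
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urljoin, urlsplit

URL_ATTRIBUTES = {"src", "href"}  # assumption: only these attributes carry URLs
RAW_TEXT_TAGS = {"script", "style"}  # their content must not be HTML-escaped


class AttributeRewriter(HTMLParser):
    """Re-emit parsed HTML, passing URL attributes through `rewrite_url`."""

    def __init__(self, document_url: str, rewrite_url: Callable[[str], str]) -> None:
        super().__init__()
        self.document_url = document_url
        self.rewrite_url = rewrite_url
        self.output: list[str] = []
        self._in_raw_text = False

    def handle_starttag(self, tag, attrs):
        rendered = []
        for name, value in attrs:
            if value is None:  # boolean attribute such as `disabled`
                rendered.append(f" {name}")
                continue
            if name in URL_ATTRIBUTES:
                # Resolve against the document URL, then map into the ZIM.
                value = self.rewrite_url(urljoin(self.document_url, value))
            # The parser unescaped the value, so escape it again on output.
            rendered.append(f' {name}="{html.escape(value, quote=True)}"')
        self.output.append(f"<{tag}{''.join(rendered)}>")
        self._in_raw_text = tag in RAW_TEXT_TAGS

    def handle_endtag(self, tag):
        if tag in RAW_TEXT_TAGS:
            self._in_raw_text = False
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        # Script and style content is passed through untouched.
        self.output.append(data if self._in_raw_text else html.escape(data, quote=False))


def rewrite_html(content: str, document_url: str, rewrite_url: Callable[[str], str]) -> str:
    parser = AttributeRewriter(document_url, rewrite_url)
    parser.feed(content)
    parser.close()  # flush any buffered text
    return "".join(parser.output)


def demo_rewrite(absolute_url: str) -> str:
    """Hypothetical convention: '/<host><path>' (not the real ZIM path rules)."""
    parts = urlsplit(absolute_url)
    return f"/{parts.netloc}{parts.path}"


print(rewrite_html('<p>Hi <img src="img/logo.png"></p>', "https://example.com/a/page.html", demo_rewrite))
```

Passing the rewriting rule in as a callable keeps the parsing concern separate from the ZIM path convention, which the functional architecture document treats as a topic of its own.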