Commit 97e41c6

[Examples] Add CLI tool for starting a WordPress site from static files (#5)
Adds a CLI tool to import static files into a new Playground site. For example, this command:

```bash
> npx create-wp-site.tar.gz crawler https://adamadam.blog --output-dir=./site
Importing static files from https://adamadam.blog
Starting the import
Import finished! See your imported content at: http://127.0.0.1:9400/adam-s-perspective-wordpress-playground-and-more
```

Crawls my blog, imports the content to a new WordPress site stored in `./site` directory, and starts a local server for that site.

## Supported files and data sources

* Supported formats: Markdown, HTML, XHTML, WXR, EPUB
* Supported data sources: Local drive, URL, remote Git repository, a remote website

## Implementation

This PR ships a WordPress plugin with a CLI script that runs the import. The plugin and its dependencies are installed via Blueprints and WordPress Playground CLI (`@wp-playground/cli`).

The import is done via the streaming parsers and readers provided in this repo. They follow the [Data Liberation](WordPress/data-liberation#78) assumptions and are PHP-only, dependency-free, streaming, re-entrant, report progress, parallelize downloads, etc. etc. This code would work on any WordPress host (except for maybe PHP 7 support – but it's easy to add).

Finally, this PR also ships a "Playground runtime protocol" – a wrapper over `post_message_to_js` that allows the PHP script to control the Playground runtime.

This PR effectively ships the following import pipeline as a CLI script:

![CleanShot 2025-03-12 at 15 12 11@2x](https://github.com/user-attachments/assets/2203b29d-d1b1-43eb-bc60-df3d93002793)

The same logic could be easily reused in the browser via Playground.

## Usage examples

Import a subset of Gutenberg documentation focused on data basics:

```bash
bun index.js \
    git https://github.com/WordPress/gutenberg.git \
    --branch=trunk \
    --path-in-repo=docs/how-to-guides/data-basics/ \
    --media-url=https://developer.wordpress.org/files/ \
    --media-url=https://raw.githubusercontent.com/WordPress/gutenberg/HEAD/docs/ \
    --source-site-url=https://developer.wordpress.org/block-editor/how-to-guides/data-basics/ \
    --additional-site-urls=https://developer.wordpress.org/docs/how-to-guides/data-basics/
```

Import the complete Gutenberg documentation:

```bash
bun index.js \
    git https://github.com/WordPress/gutenberg.git \
    --branch=trunk \
    --path-in-repo=docs/ \
    --media-url=https://developer.wordpress.org/files/ \
    --media-url=https://raw.githubusercontent.com/WordPress/gutenberg/HEAD/docs/ \
    --source-site-url=https://developer.wordpress.org/block-editor/ \
    --additional-site-urls=https://developer.wordpress.org/docs/
```

Import content from Adam's blog using a crawler:

```bash
bun index.js crawler https://adamadam.blog
```

Import accessibility testing content from a WordPress XML export:

```bash
bun index.js wxr https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml
```

Import content from an EPUB file:

```bash
bun index.js epub https://github.com/IDPF/epub3-samples/releases/download/20230704/childrens-literature.epub
```

Import Gutenberg documentation from a local repository checkout:

```bash
bun index.js path ../../../gutenberg/docs/how-to-guides/
```

Import documentation from the WordPress Playground project:

```bash
bun index.js \
    git https://github.com/WordPress/wordpress-playground.git \
    --branch=trunk \
    --path-in-repo=packages/docs/site/docs/blueprints/ \
    --media-url=https://wordpress.github.io/wordpress-playground/ \
    --source-site-url=https://wordpress.github.io/
```

Import Bootstrap 5.3 documentation:

```bash
bun index.js \
    git https://github.com/twbs/bootstrap.git \
    --branch=gh-pages \
    --path-in-repo=docs/5.3/ \
    --media-url=https://getbootstrap.com/docs/5.3/ \
    --source-site-url=https://getbootstrap.com/docs/5.3/ \
    --additional-site-urls=https://getbootstrap.com/docs/
```

Import Laravel documentation:

```bash
bun index.js \
    git https://github.com/laravel/docs.git \
    --path-in-repo=/ \
    --branch=12.x \
    --source-site-url=https://laravel.com/docs/
```

Import internal documentation from the CPython repository:

```bash
bun index.js \
    git https://github.com/python/cpython.git \
    --branch=main \
    --path-in-repo=InternalDocs/ \
    --source-site-url=https://raw.githubusercontent.com/python/cpython/refs/heads/main/InternalDocs/
```

Import content from the Fullstack GraphQL book repository:

```bash
bun index.js \
    git https://github.com/GraphQLCollege/fullstack-graphql.git \
    --branch=master \
    --path-in-repo=manuscript/ \
    --source-site-url=https://raw.githubusercontent.com/GraphQLCollege/fullstack-graphql/refs/heads/master/manuscript/
```

Import content from the CPP WASM book repository:

```bash
bun index.js \
    git https://github.com/3dgen/cppwasm-book.git \
    --branch=master \
    --path-in-repo=en/ \
    --source-site-url=https://raw.githubusercontent.com/3dgen/cppwasm-book/refs/heads/master/en/
```
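The commit message describes the "Playground runtime protocol" only as a wrapper over `post_message_to_js`. As a rough sketch of what such a wrapper might look like, the example below sends a JSON envelope through an injectable transport. The `playground_send()` name, the `command`/`args` envelope, and the `open-url` command are all hypothetical and are not taken from this PR.

```php
<?php
/**
 * Hypothetical sketch: sends a JSON-encoded command to the Playground runtime.
 * The envelope shape is an assumption made for illustration.
 *
 * @param string        $command   Command name, e.g. 'open-url' (made up).
 * @param array         $args      Command arguments.
 * @param callable|null $transport Delivers the message. Defaults to
 *                                 post_message_to_js() when it is available.
 * @return mixed Whatever the transport returns, or null without a transport.
 */
function playground_send( string $command, array $args = array(), ?callable $transport = null ) {
    if ( null === $transport && function_exists( 'post_message_to_js' ) ) {
        $transport = 'post_message_to_js';
    }
    if ( null === $transport ) {
        // Not running inside Playground: there is nothing to talk to.
        return null;
    }
    $message = json_encode(
        array(
            'command' => $command,
            'args'    => $args,
        ),
        JSON_UNESCAPED_SLASHES
    );
    return $transport( $message );
}

// Usage outside Playground: inject a fake transport to inspect the message.
$sent = array();
playground_send(
    'open-url',
    array( 'url' => '/hello-world/' ),
    function ( $message ) use ( &$sent ) {
        $sent[] = $message;
        return 'ok';
    }
);
echo $sent[0], "\n";
```

Passing the transport as a parameter keeps the sketch runnable on a regular PHP install, where `post_message_to_js` is not defined.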
1 parent b02ee43 commit 97e41c6

62 files changed: +5842 -278 lines changed

.github/workflows/release.yml

Lines changed: 10 additions & 0 deletions
@@ -80,3 +80,13 @@ jobs:
           asset_path: ./dist/plugins/static-files-editor.zip
           asset_name: static-files-editor.zip
           asset_content_type: application/zip
+
+      - name: Upload Import Static Files Example Asset
+        uses: actions/upload-release-asset@v1
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        with:
+          upload_url: ${{ steps.create_release.outputs.upload_url }}
+          asset_path: ./dist/examples/create-wp-site.tar.gz
+          asset_name: create-wp-site.tar.gz
+          asset_content_type: application/gzip

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,8 @@
 vendor/
 **/*/vendor/
 opfs/
+node_modules
+.cursor
 rest/
 dist/
 outdir/
@@ -26,4 +28,5 @@ buntest.php
 blueprint-url-updater.json
 components/DataLiberation/Tests/test-output-html/
 components/DataLiberation/Tests/test-output-md
-blueprint-dev.json
+blueprint-dev.json
+examples/create-wp-site/data-liberation.zip

bin/build-all.sh

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 
 bash bin/build-libraries-phar.sh
 bash bin/build-plugins.sh
+bash bin/build-examples.sh

bin/build-examples.sh

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+PROJECT_DIR=$SCRIPT_DIR/..
+
+rm -rf $PROJECT_DIR/dist/examples
+mkdir -p $PROJECT_DIR/dist/examples
+
+cp $PROJECT_DIR/dist/plugins/data-liberation.zip $PROJECT_DIR/examples/create-wp-site/data-liberation.zip
+cp -r $PROJECT_DIR/examples/create-wp-site/ $PROJECT_DIR/dist/package
+cd $PROJECT_DIR/dist
+tar -czvf examples/create-wp-site.tar.gz package/{*.js,*.json,*.php,*.zip,cli,playground-protocol,README.md}
+rm -rf package
File renamed without changes.

components/DataLiberation/BlockMarkup/BlockMarkupUrlProcessor.php

Lines changed: 52 additions & 27 deletions
@@ -140,6 +140,7 @@ private function next_url_attribute() {
 
             if ( is_string( $url_maybe ) ) {
                 $parsed_url = WPURL::parse( $url_maybe, $this->base_url_string );
+
                 if ( false === $parsed_url ) {
                     return false;
                 }
@@ -177,31 +178,41 @@ private function next_url_block_attribute() {
         return false;
     }
 
-    public function set_raw_url( $new_url ) {
+    /**
+     * Replaces the currently matched URL with a new one.
+     *
+     * @param string $raw_url The raw URL.
+     * @param URL $parsed_url The parsed version of the raw URL. It is required
+     *                        as $raw_url might be a relative URL pointing to a different
+     *                        host than this processor's base URL.
+     * @return bool True if the URL was set, false otherwise.
+     */
+    public function set_url( $raw_url, $parsed_url ) {
         if ( null === $this->raw_url ) {
             return false;
         }
-        $this->raw_url = $new_url;
+        $this->raw_url = $raw_url;
+        $this->parsed_url = $parsed_url;
         switch ( parent::get_token_type() ) {
             case '#tag':
                 $attr = $this->get_inspected_attribute_name();
                 if ( false === $attr ) {
                     return false;
                 }
-                $this->set_attribute( $attr, $new_url );
+                $this->set_attribute( $attr, $raw_url );
 
                 return true;
 
             case '#block-comment':
-                return $this->set_block_attribute_value( $new_url );
+                return $this->set_block_attribute_value( $raw_url );
 
             case '#text':
                 if ( null === $this->url_in_text_processor ) {
                     return false;
                 }
                 $this->url_in_text_node_updated = true;
 
-                return $this->url_in_text_processor->set_raw_url( $new_url );
+                return $this->url_in_text_processor->set_raw_url( $raw_url );
         }
     }
 
@@ -216,7 +227,7 @@ public function set_raw_url( $new_url ) {
      * relative URLs in text nodes. On the other hand, the detection is performed
      * by this WPURL_In_Text_Processor class so maybe the two do go hand in hand?
      */
-    public function replace_base_url( URL $to_url ) {
+    public function replace_base_url( URL $to_url, ?URL $base_url = null ) {
         $updated_url = clone $this->get_parsed_url();
 
         $updated_url->hostname = $to_url->hostname;
@@ -227,19 +238,22 @@ public function replace_base_url( URL $to_url ) {
         $from_url = $this->get_parsed_url();
         $from_pathname = $from_url->pathname;
         $to_pathname = $to_url->pathname;
-        if ( $this->base_url_object->pathname !== $to_pathname ) {
-            $base_pathname_with_trailing_slash = rtrim( $this->base_url_object->pathname, '/' ) . '/';
+
+        $base_url = $base_url ?? $this->base_url_object;
+        if ( $base_url->pathname !== $to_pathname ) {
+            $base_pathname_with_trailing_slash = rtrim( $base_url->pathname, '/' ) . '/';
             $decoded_matched_pathname = urldecode_n(
                 $from_pathname,
                 strlen( $base_pathname_with_trailing_slash )
             );
             $to_pathname_with_trailing_slash = rtrim( $to_pathname, '/' ) . '/';
-            $updated_url->pathname =
-                $to_pathname_with_trailing_slash .
-                substr(
-                    $decoded_matched_pathname,
-                    strlen( $base_pathname_with_trailing_slash )
-                );
+            $remaining_pathname =
+                substr(
+                    $decoded_matched_pathname,
+                    strlen( $base_pathname_with_trailing_slash )
+                );
+
+            $updated_url->pathname = $to_pathname_with_trailing_slash . $remaining_pathname;
         }
 
         /*
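The pathname rewrite in the hunk above strips the base path prefix from the matched URL's path and prepends the target path. Here is a simplified sketch of that idea, assuming plain string paths instead of `URL` objects and leaving out the `urldecode_n()` step. The `swap_base_path()` name is hypothetical.

```php
<?php
/**
 * Simplified illustration of the prefix swap in replace_base_url().
 * Assumes $from_pathname starts with the base path. Percent-decoding
 * (urldecode_n() in the real code) is deliberately left out.
 */
function swap_base_path( string $from_pathname, string $base_pathname, string $to_pathname ): string {
    if ( $base_pathname === $to_pathname ) {
        // Same path on both sides: nothing to rewrite.
        return $from_pathname;
    }
    $base_with_slash = rtrim( $base_pathname, '/' ) . '/';
    $to_with_slash   = rtrim( $to_pathname, '/' ) . '/';

    // Everything after the base prefix is carried over unchanged.
    $remaining = (string) substr( $from_pathname, strlen( $base_with_slash ) );

    return $to_with_slash . $remaining;
}

echo swap_base_path( '/docs/5.3/layout/grid/', '/docs/5.3/', '/bootstrap/' ), "\n";
```

In the sketch the base path is always an explicit argument. In the diff, the new optional `$base_url` parameter plays that role and falls back to the processor's own base URL when omitted.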
@@ -263,18 +277,8 @@ public function replace_base_url( URL $to_url ) {
             return false;
         }
 
-        $is_relative = (
-            // The URL-rewriting specific logic. We make an assumption that only
-            // absolute URLs are detected in text nodes.
-            // @TODO: Verify this assumption, evaluate whether this is the right
-            //        place to place this logic. Perhaps this *method* could be
-            //        decoupled into two separate *functions*?
-            $this->get_token_type() !== '#text' &&
-            ! str_starts_with( $this->get_raw_url(), 'http://' ) &&
-            ! str_starts_with( $this->get_raw_url(), 'https://' )
-        );
-        if ( ! $is_relative ) {
-            $this->set_raw_url( $new_raw_url );
+        if ( ! $this->is_url_relative() ) {
+            $this->set_url( $new_raw_url, $updated_url );
             return true;
         }
 
@@ -286,10 +290,31 @@ public function replace_base_url( URL $to_url ) {
             $new_relative_url .= $updated_url->hash;
         }
 
-        $this->set_raw_url( $new_relative_url );
+        $this->set_url( $new_relative_url, $updated_url );
         return true;
     }
 
+    /**
+     * Returns true if the currently matched URL is relative.
+     *
+     * @return bool Whether the currently matched URL is relative.
+     */
+    public function is_url_relative() {
+        return (
+            ! WPURL::can_parse( $this->get_raw_url() ) &&
+            // only absolute URLs are detected in text nodes.
+            '#text' !== $this->get_token_type()
+        );
+    }
+    /**
+     * Returns true if the currently matched URL is absolute.
+     *
+     * @return bool Whether the currently matched URL is absolute.
+     */
+    public function is_url_absolute() {
+        return WPURL::can_parse( $this->get_raw_url() );
+    }
+
     public function get_inspected_attribute_name() {
         if ( '#tag' !== $this->get_token_type() ) {
             return false;
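`is_url_relative()` and `is_url_absolute()` both depend on whether the raw URL parses on its own, without a base URL. `WPURL::can_parse()` is not reproduced here. As a rough stand-in, the sketch below treats a URL as absolute when PHP's `parse_url()` finds a scheme, which is an assumption for illustration and not the library's behavior. The function names are hypothetical.

```php
<?php
// Stand-in for WPURL::can_parse(): "absolute" here means "has a scheme".
function looks_absolute( string $raw_url ): bool {
    $scheme = parse_url( $raw_url, PHP_URL_SCHEME );
    return is_string( $scheme ) && '' !== $scheme;
}

function looks_relative( string $raw_url, string $token_type ): bool {
    // Mirrors the diff: URLs found in text nodes are never treated as relative.
    return ! looks_absolute( $raw_url ) && '#text' !== $token_type;
}

var_dump( looks_absolute( 'https://example.com/docs/' ) );
var_dump( looks_relative( '../img/logo.png', '#tag' ) );
var_dump( looks_relative( '../img/logo.png', '#text' ) );
```

As in the diff, the two checks are not exact complements: a URL in a `#text` node that does not parse on its own is reported as neither relative nor absolute.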

components/DataLiberation/DataFormatConsumer/MarkupProcessorConsumer.php

Lines changed: 27 additions & 3 deletions
@@ -30,7 +30,7 @@
  * array(
  *     'post_title' => array( 'My first post' ),
  * )
- *
+ *
  * @TODO: Satisfy the same test suite as the Gutenberg paste/raw handler.
  *        Look for the test examples (input/output) in the Gutenberg repo.
  * @TODO: Consider option for "presentation" vs "semantics" mode. We can preserve
@@ -59,6 +59,9 @@ public function consume() {
                         break;
                     }
                     $this->append_rich_text( htmlspecialchars( $this->markup_processor->get_modifiable_text() ) );
+                    if ( in_array( $this->markup_processor->get_tag(), array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' ) ) ) {
+                        $this->on_title_candidate( $this->markup_processor->get_modifiable_text() );
+                    }
                     break;
                 case '#tag':
                     $this->handle_tag();
@@ -92,11 +95,16 @@ private function handle_tag() {
         $is_void_tag = ! $html->expects_closer() && ! $html->is_tag_closer();
         if ( $is_void_tag ) {
             switch ( $tag ) {
+                case 'TITLE':
+                    $this->on_title_candidate( $html->get_modifiable_text() );
+                    break;
                 case 'META':
                     $key = $html->get_attribute( 'name' );
                     $value = $html->get_attribute( 'content' );
                     if ( ! array_key_exists( $key, $this->metadata ) ) {
-                        $this->metadata[ $key ] = array();
+                        if ( $key ) {
+                            $this->metadata[ $key ] = array();
+                        }
                     }
                     switch ( $html->get_attribute( 'type' ) ) {
                         case 'integer':
@@ -107,7 +115,9 @@ private function handle_tag() {
                             break;
                         // @TODO: Discuss what would support for other types look like.
                     }
-                    $this->metadata[ $key ][] = $value;
+                    if ( $key ) {
+                        $this->metadata[ $key ][] = $value;
+                    }
                     break;
                 case 'IMG':
                     $template = new \WP_HTML_Tag_Processor( '<img>' );
@@ -286,6 +296,20 @@ private function handle_tag() {
         }
     }
 
+    private function on_title_candidate( $text ) {
+        if ( ! array_key_exists( 'post_title', $this->metadata ) ) {
+            $this->metadata['post_title'] = array(
+                $text,
+            );
+        }
+        if ( ! array_key_exists( 'post_name', $this->metadata ) ) {
+            $this->metadata['post_name'] = array(
+                // @TODO: Slugify
+                $text,
+            );
+        }
+    }
+
     /**
      * Checks whether the given tag is an inline formatting element
      * that we want to preserve when parsing rich text. For example,
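`on_title_candidate()` follows a first-candidate-wins rule: once `post_title` or `post_name` is set, later headings or `<title>` tags do not overwrite it. The sketch below shows that rule on a plain array and fills the `@TODO: Slugify` gap with a naive, ASCII-only slug helper. Both function names and the slug rules are assumptions for illustration. Inside WordPress, `sanitize_title()` would be the more natural choice.

```php
<?php
// Hypothetical helper: lowercases and collapses non-alphanumeric runs into dashes.
function naive_slugify( string $text ): string {
    $slug = strtolower( trim( $text ) );
    $slug = preg_replace( '/[^a-z0-9]+/', '-', $slug );
    return trim( $slug, '-' );
}

// Records a title candidate only if no earlier candidate was stored.
function record_title_candidate( array $metadata, string $text ): array {
    if ( ! array_key_exists( 'post_title', $metadata ) ) {
        $metadata['post_title'] = array( $text );
    }
    if ( ! array_key_exists( 'post_name', $metadata ) ) {
        $metadata['post_name'] = array( naive_slugify( $text ) );
    }
    return $metadata;
}

$metadata = array();
$metadata = record_title_candidate( $metadata, 'Getting Started: Data Basics' );
$metadata = record_title_candidate( $metadata, 'A later heading' );
print_r( $metadata );
```

The second call leaves both keys untouched, which matches the `array_key_exists()` guards in the diff.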

components/DataLiberation/EntityReader/BlocksWithMetadataEntityReader.php

Lines changed: 4 additions & 2 deletions
@@ -46,6 +46,7 @@ public function next_entity() {
         $all_metadata   = $this->metadata;
         $post_fields    = array();
         $other_metadata = array();
+
         foreach ( $all_metadata as $key => $values ) {
             if ( in_array( $key, ImportEntity::POST_FIELDS, true ) ) {
                 $post_fields[ $key ] = $values[0];
@@ -54,8 +55,9 @@ public function next_entity() {
             }
         }
 
-        $post_fields['post_id']      = $this->post_id;
-        $post_fields['post_content'] = $this->block_markup;
+        $post_fields['post_id']         = $this->post_id;
+        $post_fields['post_content']    = $this->block_markup;
+        $post_fields['parsed_metadata'] = $all_metadata;
 
         // In Markdown, the frontmatter title can be a worse title candidate than
         // the first H1 block. In block markup exports, it will be the opposite.
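The loop in `next_entity()` splits the parsed metadata into post fields and everything else, and this commit additionally keeps the full set under `parsed_metadata`. Below is a self-contained sketch of that split. It assumes a shortened, hypothetical list of field names in place of `ImportEntity::POST_FIELDS`, and a hypothetical `split_metadata()` helper. The `else` branch is not visible in the hunk, so storing the full value list there is a guess.

```php
<?php
// Hypothetical subset; the real list lives in ImportEntity::POST_FIELDS.
const EXAMPLE_POST_FIELDS = array( 'post_title', 'post_name', 'post_date' );

function split_metadata( array $all_metadata, int $post_id, string $block_markup ): array {
    $post_fields    = array();
    $other_metadata = array();

    foreach ( $all_metadata as $key => $values ) {
        if ( in_array( $key, EXAMPLE_POST_FIELDS, true ) ) {
            // Post fields are single-valued: keep the first candidate.
            $post_fields[ $key ] = $values[0];
        } else {
            // Assumption: other keys keep their full list of values.
            $other_metadata[ $key ] = $values;
        }
    }

    $post_fields['post_id']         = $post_id;
    $post_fields['post_content']    = $block_markup;
    $post_fields['parsed_metadata'] = $all_metadata;

    return array( $post_fields, $other_metadata );
}

list( $post_fields, $other ) = split_metadata(
    array(
        'post_title' => array( 'My first post' ),
        'keywords'   => array( 'php', 'import' ),
    ),
    1,
    '<!-- wp:paragraph --><p>Hello</p><!-- /wp:paragraph -->'
);
print_r( $post_fields );
```

Keeping the untouched `$all_metadata` next to the flattened fields lets later steps see every candidate value, not only the first one picked for each post field.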
