Converting an HTML Page to an Article
An HTML document can be converted to an article with the parseArticle
command.
The example below walks through a complete query, one stage at a time.
Example 1
Full Query
open https://www.crul.com/blog/2023-03-07-tales-hn-front-page --html
|| filter "(nodeName == 'HTML')"
|| parseArticle outerHTML
Stage 1: Open a web page
open https://www.crul.com/blog/2023-03-07-tales-hn-front-page --html
Open the web page in a browser and wait for all JavaScript and external assets to load. The --html
flag includes the HTML source of the rendered web page for each returned element.
NOTE: The --html
flag slows the query down, because it includes both the outerHTML and innerHTML for every element.
Stage 2: Filtering for the HTML document
...
|| filter "(nodeName == 'HTML')"
The filter keeps only the rows whose nodeName is HTML, that is, the root HTML
element. The outerHTML of that element holds the markup of the whole document.
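To make the filtering step concrete, here is a minimal Python sketch of the same idea. It is not how crul implements filter; the sample rows and the filter_rows helper are hypothetical and only illustrate keeping the rows whose nodeName equals 'HTML'.

```python
# Hypothetical rows, loosely modelled on one-row-per-element output
# with a nodeName and an outerHTML column. Not real crul output.
rows = [
    {"nodeName": "HTML", "outerHTML": "<html><body><p>Hi</p></body></html>"},
    {"nodeName": "BODY", "outerHTML": "<body><p>Hi</p></body>"},
    {"nodeName": "P", "outerHTML": "<p>Hi</p>"},
]


def filter_rows(rows, node_name):
    """Keep only the rows whose nodeName equals node_name (hypothetical helper)."""
    return [row for row in rows if row.get("nodeName") == node_name]


html_rows = filter_rows(rows, "HTML")
```

Only the root element survives the filter, so the next stage receives a single row containing the full document.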
Stage 3: Parse the HTML document into an article data set
...
|| parseArticle outerHTML
parseArticle parses the HTML held in the outerHTML column and converts it into a data set that contains multiple article columns and a hash of the content.
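As a rough illustration of what "article columns plus a content hash" can look like, the Python sketch below extracts a title and paragraph text from an HTML string and hashes the text. It is an assumption-laden stand-in, not crul's parseArticle: the column names (title, content, hash), the choice of SHA-256, and the parse_article_sketch function are all hypothetical, and real article extraction is considerably more involved.

```python
import hashlib
from html.parser import HTMLParser


class ArticleSketch(HTMLParser):
    """Collect the <title> text and the text inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_p:
            self.paragraphs[-1] += data  # inline tags inside <p> land here too


def parse_article_sketch(outer_html):
    """Return one row with hypothetical article columns and a content hash."""
    parser = ArticleSketch()
    parser.feed(outer_html)
    parser.close()
    content = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "title": "".join(parser.title_parts).strip(),
        "content": content,
        # SHA-256 is an assumption; the source does not say which hash is used.
        "hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }


row = parse_article_sketch(
    "<html><head><title>Tales</title></head>"
    "<body><p>First.</p><p>Second <b>bold</b>.</p></body></html>"
)
```

Hashing the extracted text rather than the raw markup means that two fetches of the same article produce the same hash even if surrounding page chrome changes, which is one plausible reason to include such a column.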