Paginating a blog

There are several ways to paginate web content in crul, but the simplest relies on understanding the URLs of the paginated content. Often, a blog directory or other paginated listing identifies each page with a /page/1, /page/2, etc. pattern in the URL. We can combine this structure with crul's expanding stages to paginate through a directory of blog posts.

Paginating web content usually involves:

Getting a range of pages OR a list of links
Accessing each page in that range OR each link in that list

Getting the range of pages can be either manual (checking how many pages exist) or dynamic (reading the directory to find out how many pages exist). Getting a list of links usually involves filtering, typically with || filter "nodeName == 'A'" combined with other, more specific filters.
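To make the filtering idea concrete, here is a rough mental model in plain Python (not crul). It assumes each page element arrives as a row with a nodeName column and some information about its parent. The sample rows, the parentNodeName key, and the example.com links are all hypothetical.

```python
# Hypothetical rows standing in for elements parsed from a blog directory page.
elements = [
    {"nodeName": "A", "parentNodeName": "H5", "href": "https://example.com/blog/first-post", "text": "First post"},
    {"nodeName": "A", "parentNodeName": "NAV", "href": "https://example.com/about", "text": "About"},
    {"nodeName": "P", "parentNodeName": "DIV", "href": None, "text": "Some teaser text"},
    {"nodeName": "A", "parentNodeName": "H5", "href": "https://example.com/blog/second-post", "text": "Second post"},
]

# Broad filter: keep only anchor elements (the equivalent of nodeName == 'A').
anchors = [row for row in elements if row["nodeName"] == "A"]

# More specific filter: keep only anchors that sit inside an H5 heading,
# which in this made-up page layout are the blog post titles.
post_links = [row for row in anchors if row["parentNodeName"] == "H5"]

for row in post_links:
    print(row["text"], row["href"])
```

The broad filter alone also returns navigation links, which is why it is usually paired with a narrower condition.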

Example: Company blog

Full Query

range 1 5 page
|| open https://cribl.io/blog/page/$page$
|| filter "(nodeName == 'A') and (parentElement.nodeName == 'H5')"

Stage 1: Creating a range

range 1 5 page

This stage uses the range command to construct a dataset with a single page column and five rows, holding the numbers 1 through 5, one per row.
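As a mental model only, the resulting dataset can be pictured as a list of single-column rows. The sketch below is plain Python, not crul, and the helper name make_range is hypothetical. It assumes both endpoints are included, as the description of five rows implies.

```python
def make_range(start, end, column):
    """Build one row per integer from start to end (inclusive) under the given column name."""
    return [{column: n} for n in range(start, end + 1)]

# Mirrors "range 1 5 page": five rows, each with a single "page" value.
rows = make_range(1, 5, "page")
for row in rows:
    print(row)
```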

Stages 2-3: Open each page and process the results

...
|| open https://cribl.io/blog/page/$page$
|| filter "(nodeName == 'A') and (parentElement.nodeName == 'H5')"

The next stages open five pages by replacing the $page$ token with the value of the page column in each of the five rows generated by the previous range command. We then filter the results to extract just the blog links and titles.
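The token substitution can be pictured with a short Python sketch. This is an illustration of the idea, not crul's implementation, and the expand helper is hypothetical. It only builds the URLs that would be opened and does not fetch anything.

```python
TEMPLATE = "https://cribl.io/blog/page/$page$"

def expand(template, rows):
    """Return one URL per row, replacing each $column$ token with that row's value."""
    urls = []
    for row in rows:
        url = template
        for column, value in row.items():
            url = url.replace(f"${column}$", str(value))
        urls.append(url)
    return urls

# Rows shaped like the output of the range stage: page = 1 through 5.
rows = [{"page": n} for n in range(1, 6)]
urls = expand(TEMPLATE, rows)
for url in urls:
    print(url)
```

Each input row yields exactly one URL, which is why five rows from the range stage produce five page requests.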
