Paginating a blog
There are several approaches to paginating web content in crul, but the simplest relies on understanding the URLs of the paginated content. Often, a blog directory or other paginated content indicates each page with a /page/1, /page/2, etc. pattern in the URL. We can use this structure in combination with crul's expanding stages to paginate through a directory of blog posts.
Paginating web content usually involves:
- Getting a range of pages OR list of links
- Accessing each page in that range OR list of links
Getting the range of pages can be either manual (checking how many pages exist) or dynamic (reading the directory to find out how many pages exist). Getting a list of links usually involves filtering, particularly using || filter "nodeName == 'A'" in combination with other, more specific filters.
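The dynamic approach to finding the page range can be sketched outside of crul. The Python snippet below is only an illustration, not crul syntax: it assumes the directory page contains pagination links of the form /page/N, and the function name last_page_number and the sample HTML are hypothetical.

```python
import re


def last_page_number(html: str) -> int:
    """Return the highest N found in any /page/N link, or 1 if none is found."""
    numbers = [int(n) for n in re.findall(r"/page/(\d+)", html)]
    return max(numbers, default=1)


# Hypothetical snippet of a blog directory's pagination links.
sample_html = """
<a href="/blog/page/2">2</a>
<a href="/blog/page/3">3</a>
<a href="/blog/page/12">Last</a>
"""

last = last_page_number(sample_html)
print(last)
```

The resulting number would take the place of a manually chosen upper bound for the page range.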
Example: Company blog
Full Query
range 1 5 page
|| open https://cribl.io/blog/page/$page$
|| filter "(nodeName == 'A') and (parentElement.nodeName == 'H5')"
Stage 1: Creating a range
range 1 5 page
This stage uses the range command to construct a dataset with a single page column and five rows, holding the numbers 1 through 5, one per row.
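As a rough mental model of the dataset this stage produces, the same shape can be built in Python. This is a sketch only: make_range is a hypothetical helper, and treating the end value as inclusive is an assumption based on the five rows described above.

```python
def make_range(start: int, end: int, column: str) -> list[dict]:
    """Build one row per number from start to end, inclusive."""
    # Python's range() excludes its upper bound, so add 1 to include `end`.
    return [{column: n} for n in range(start, end + 1)]


rows = make_range(1, 5, "page")
for row in rows:
    print(row)
```

Each dictionary stands in for one row of the dataset, keyed by the column name.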
Stages 2-3: Open each page and process the results
...
|| open https://cribl.io/blog/page/$page$
|| filter "(nodeName == 'A') and (parentElement.nodeName == 'H5')"
The next stages open five pages by substituting the $page$ token with the page column value from each of the 5 rows generated by the preceding range command. The filter stage then narrows the data down to just the blog links and titles.
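The token substitution and filtering described here can be approximated in Python to show what each row goes through. This is an illustrative sketch, not how crul is implemented: expand_tokens and is_post_link are hypothetical helpers, the mock nodes are made-up data, and representing the parent element as a flat "parentElement.nodeName" key is an assumption.

```python
import re


def expand_tokens(template: str, row: dict) -> str:
    """Replace each $name$ token in the template with that row's column value."""
    return re.sub(r"\$(\w+)\$", lambda m: str(row[m.group(1)]), template)


def is_post_link(node: dict) -> bool:
    """Mirror the filter: an A element whose parent is an H5 element."""
    return (
        node.get("nodeName") == "A"
        and node.get("parentElement.nodeName") == "H5"
    )


# One row per page number, as produced by the range stage.
rows = [{"page": n} for n in range(1, 6)]
urls = [expand_tokens("https://cribl.io/blog/page/$page$", row) for row in rows]

# Made-up page elements standing in for the results of opening a page.
mock_nodes = [
    {"nodeName": "A", "parentElement.nodeName": "H5", "innerText": "Post title"},
    {"nodeName": "A", "parentElement.nodeName": "LI", "innerText": "Nav link"},
    {"nodeName": "H5", "parentElement.nodeName": "DIV", "innerText": "Post title"},
]
links = [node for node in mock_nodes if is_post_link(node)]

print(urls)
print(links)
```

Only the first mock node survives the filter, since it is the only anchor whose parent is an H5 heading.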