📄️ Quickstart 1: Finding Expired GitHub GPG Keys
Use crul to find expired GitHub GPG keys. A recorded video version of this quickstart is also available.
📄️ Quickstart 2: Retrieving comments from top posts on Hacker News
Use crul to retrieve the comments from top posts on Hacker News.
📄️ Building a data feed of GitHub issues to Cribl
Building a data feed from a query is a common use case of crul. In this example, we'll deploy a crul query that builds a dynamic data set from a REST API and delivers it to Cribl on a scheduled interval. We'll also explore how the diff command can be used to construct data feeds that only send new results.
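A feed query's skeleton might look something like the sketch below (the issues endpoint follows GitHub's real REST pattern, but the exact query, the diff arguments, and the Cribl destination setup are assumptions; the guide covers the full configuration):

```
api get https://api.github.com/repos/<owner>/<repo>/issues
|| diff
```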
📄️ Building a synthetic data feed to AWS S3
Building a data feed from a query is a common use case of crul. In this example, we'll deploy a crul query that builds a synthetic data set, using a REST API to seed some values and fully synthesizing personally identifying or otherwise sensitive ones. We'll configure this data feed to write to AWS S3 on a scheduled interval.
📄️ Building a data feed of vector embeddings from Monday.com to Pinecone
Building a data feed from a query is a common use case of crul. In this example, we'll deploy a crul query that transforms REST API responses from Monday.com into vector embeddings, which are then loaded into a Pinecone vector database on a set schedule.
📄️ OpenAI assisted Spotify playlist curation
Let's take a look at how we can use the Spotify API to categorize songs into different types of playlists with the help of OpenAI's evaluation of a track's lyrics and audio features.
📄️ Expanding links from a webpage (Hacker News)
A common use case for the crul query language is taking advantage of expanding stages to open many pages from a single webpage and return the results as a consolidated data set. For example, we may have a recipe site with many recipes listed in a recipe directory. We can use crul to get links to all the recipes in the directory, then expand each of those links and filter for recipe ingredients or another use case.
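As a rough sketch of the pattern on Hacker News front-page links (the filter expression is illustrative and may need adjusting to the page's current markup):

```
open https://news.ycombinator.com
|| filter "(nodeName == 'A' and parentElement.attributes.class == 'titleline')"
|| open $attributes.href$
```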
📄️ Expanding links from an API (Hacker News)
In this example, we will get back the last 10 items from the Hacker News API. Comments, stories, and other content are all considered items, each with a unique id and metadata defining its type. We will first fetch the largest (most recent) item id and use it to construct a range of item ids to query, then use that range in an expanding stage that makes an API request for each item.
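The expanding stage at the core of this example looks roughly like the following, assuming a prior stage has produced an id column containing the range of item ids (the Hacker News item endpoint is real; the column name is an assumption):

```
api get https://hacker-news.firebaseio.com/v0/item/$id$.json
```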
📄️ Fetch a Zip Archive and Scan Extract
ZIP archives can be fetched remotely via the api command, then scanned for entry metadata or extracted and converted to datasets.
📄️ Getting product prices from a webpage (Shopify)
This query extracts product prices from a Shopify-powered site.
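One hedged sketch of the idea: most Shopify storefronts expose a public products.json endpoint, so prices can be pulled with an api request plus flattening (the store URL is a placeholder, and the guide's actual query may parse the rendered page instead):

```
api get https://<your-store>.myshopify.com/products.json
|| normalize products
```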
📄️ Importing deals to Outseta fast
We recently decided to use the deals feature in Outseta (our CRM here at crul) to manage our customer interview lifecycle. We found crul to be the perfect tool for getting our data into the import format expected by Outseta in less than 5 minutes.
📄️ Querying an authenticated API (Twitter)
Many APIs require some form of authentication. This can be a token, an OAuth flow, or another mechanism. The api command can send requests with custom headers and data payloads, and includes a --bearer flag among other auth-related flags to support many forms of API authentication.
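For instance, a bearer-token request might take this shape (the Twitter endpoint is real; supply your own token in place of the placeholder):

```
api get https://api.twitter.com/2/users/me --bearer <your-token>
```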
📄️ Querying an asynchronous API (Splunk Query)
Many services, such as query engines like GCP BigQuery, AWS Athena, and Splunk, have asynchronous dispatch APIs for running queries. This means you can dispatch a query against the service and get back a job id, which you can then poll for status/completion before accessing the results. This is a common API pattern, and it is supported by the crul api command syntax.
📄️ Querying an API Through a Proxy
An API request can be easily routed through a proxy. Supported proxy protocols include http, https, socks4, socks5 and pac.
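A sketch of the shape this takes (the flag name and proxy URL are illustrative assumptions; see the guide for the exact supported syntax):

```
api get https://api.github.com/zen --proxy "socks5://localhost:1080"
```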
📄️ Querying a Web Page Through a Proxy
A web page request can be easily routed through a proxy. Supported proxy protocols include http, https, socks4 and socks5.
📄️ Paginating a blog
There are several approaches to paginating web content in crul, but the simplest involves understanding the URLs of the paginated content. Oftentimes a blog directory or other paginated content will indicate new pages through a /page/1, /page/2, etc. pattern in the URL. We can use this structure in combination with crul's expanding stages to paginate through a directory of blog posts.
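The expanding stage at the heart of this pattern looks roughly like this, assuming an earlier stage has produced a page column with values 1 through N (the blog URL is a placeholder):

```
open https://example.com/blog/page/$page$
```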
📄️ Paginating an API (pokeapi)
Many APIs return paginated responses: not all results are available in a single request; instead, each response includes a pointer to the next set of results. This pointer could be a hash value, a page number, an offset, or an explicit link. The crul api command can handle many types of pagination using the --pagination.* set of flags.
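As a sketch, pokeapi returns an explicit next link in each response, so the query takes roughly this shape (the exact --pagination.* flag name is an assumption; check the api command reference):

```
api get https://pokeapi.co/api/v2/pokemon --pagination.next "next"
|| normalize results
```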
📄️ Exporting results
Exporting results to 3rd party data lakes or other destinations is straightforward with the help of the freeze command. With the freeze command, we can easily push the results of our query directly to a preconfigured 3rd party store, or save the results locally to a file.
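A final freeze stage might look something like this (the store name is a hypothetical preconfigured destination; freeze's exact arguments are covered in its command reference):

```
open https://example.com
|| freeze "mystore"
```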
📄️ Capturing network requests
In addition to rendering web content and processing web pages into tabular data sets, crul can also capture the network lifecycle of loading a webpage using the requests command. This command lets us capture request/response content and timing for third-party network requests, along with a rich assortment of metadata for performance and security monitoring. It's also incredibly easy to use.
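In its simplest form, the command just takes a URL (the target here is a placeholder):

```
requests https://example.com
```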
📄️ How to find filters
Using the open or api command often generates a large data set that can be tricky to find filters for. For example, if we are trying to get all headlines from a news site, how do we know what filters describe a headline?
📄️ How to use branching
Branching is a powerful way to split a query mid-execution and operate on a stage in two different ways before joining the results back together. For example, you could apply one filter in one branch and a totally different one in another, further transform each set of filtered data, and then join them back together into a single consolidated data set.
📄️ Converting an HTML Table to a Dataset
HTML tables allow web developers to arrange data into rows and columns of cells, which can be easily converted to a dataset using the parseHTMLTable command.
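A sketch of the pattern, assuming parseHTMLTable can be applied directly to an opened page (the Wikipedia URL is just an example of a table-heavy page):

```
open https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
|| parseHTMLTable
```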
📄️ Converting an HTML Page to an Article
An HTML document can be easily converted to an article using the parseArticle command.
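For example (the URL is a placeholder, and whether parseArticle takes additional arguments is left to its command reference):

```
open https://blog.example.com/some-post
|| parseArticle
```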