Synthetic Data Generation
The synthesize command creates synthetic data sets by combining values sampled from real data with fully synthesized values.
You can upload CSV/JSON/NDJSON files to the cellar and thaw those files as seeds for the synthesize command.
You can also provide natural language prompts to the synthesize command describing the synthetic data sets you would like to generate.
Note: The synthesize command requires authentication when using the --prompt flag, so you'll first need to configure a credential named openai that contains your OpenAI API key.
How it works​
The synthesize command takes zero or more arguments, each of which must refer to columns in the previous stage's results. The synthesize command uses the values in these columns as the pool to select from when synthesizing rows.
Columns can be linked together by passing them as a comma-delimited list within a single argument. If columns are linked, then all values synthesized for the linked columns in a given row will come from the same seed row.
Let's take a look at an example.
Query​
seed '[
{
"product": "phone",
"price": 800,
"country": "US",
"user": "Nemo",
"user_id": "zxcv-asdf-1234-5678"
},
{
"product": "computer",
"price": 1000,
"country": "FR",
"user": "Dory",
"user_id": "1234-5678-asdf-zxcv"
},
{
"country": "CA",
"user": "Marlin",
"user_id": "0978-6543-asdf-yuio"
},
{
"country": "GB"
}
]'
|| synthesize "product,price" "user,user_id" "country"
--prompt "add a timestamp (named ts), a guid (named tx_id)"
--count 100
Synthesize Arguments​
By providing the argument "product,price", we are telling the synthesize command that the product and price columns are linked, meaning that there is a 1:1 relationship between the two.
A synthesized row with a given product value will always have the price that accompanied it in the previous set of results used to seed the synthesize command.
The same logic applies to the "user,user_id" argument: user and user_id are linked to each other, independently of product and price.
The "country" argument is standalone in this example. This means that any seed value for country can be selected for a row, regardless of which values were chosen for the other columns.
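The linking rules above can be pictured with a short Python sketch. This is only an illustration of the described behaviour, not the command's implementation: the function name synthesize_rows is hypothetical, and it assumes that only seed rows containing every column of a linked group are candidates for that group.

```python
import random

def synthesize_rows(seed_rows, groups, count, rng=None):
    """Sketch of linked-column sampling.

    groups holds one tuple of column names per synthesize argument,
    e.g. ("product", "price") for the argument "product,price".
    """
    rng = rng or random.Random()
    # Assumption: a seed row is a candidate for a group only if it
    # defines every column in that group.
    candidates = {
        group: [row for row in seed_rows if all(col in row for col in group)]
        for group in groups
    }
    out = []
    for _ in range(count):
        new_row = {}
        for group in groups:
            source = rng.choice(candidates[group])  # one seed row per group
            for col in group:
                new_row[col] = source[col]  # linked values share that row
        out.append(new_row)
    return out

SEED = [
    {"product": "phone", "price": 800, "country": "US",
     "user": "Nemo", "user_id": "zxcv-asdf-1234-5678"},
    {"product": "computer", "price": 1000, "country": "FR",
     "user": "Dory", "user_id": "1234-5678-asdf-zxcv"},
    {"country": "CA", "user": "Marlin", "user_id": "0978-6543-asdf-yuio"},
    {"country": "GB"},
]
GROUPS = [("product", "price"), ("user", "user_id"), ("country",)]

rows = synthesize_rows(SEED, GROUPS, count=100, rng=random.Random(0))
```

Each group draws its own seed row, so a phone always costs 800 and Dory always keeps her own user_id, while country varies freely across all four seed values.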
Synthesize Flags​
The --prompt flag allows us to provide a natural language prompt detailing what fully synthetic values to include in the result.
The --count flag determines how many results will be synthesized.
Results​
This query will generate 100 results resembling the below:
product | price | user | user_id | country | ts | tx_id |
---|---|---|---|---|---|---|
phone | 800 | Dory | 1234-5678-asdf-zxcv | FR | 2022-06-29T13:38:37.413Z | e00024d9-3969-4243-8388-53ec0b76e31f |
.. | .. | .. | .. | .. | .. | .. |
computer | 1000 | Dory | 1234-5678-asdf-zxcv | GB | 2021-06-17T02:24:32.357Z | v98gh36f-e2f3-9867-9e11-f54hj7asd221 |
computer | 1000 | Marlin | 0978-6543-asdf-yuio | US | 2020-05-14T02:24:56.857Z | b83be58d-e8e4-4952-9e99-f20bd4ece530 |
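If you export synthesized results and want to confirm that linked columns stayed together, a small check along these lines may help. It is a sketch written for this page: linked_violations is a hypothetical helper, not part of the product, and it assumes both the seed and the output are available as lists of dictionaries.

```python
def linked_violations(seed_rows, synthesized_rows, group):
    """Return synthesized rows whose values for `group` never occur
    together in a single seed row."""
    allowed = {
        tuple(row[col] for col in group)
        for row in seed_rows
        if all(col in row for col in group)
    }
    return [
        row for row in synthesized_rows
        if tuple(row.get(col) for col in group) not in allowed
    ]

seed = [
    {"user": "Nemo", "user_id": "zxcv-asdf-1234-5678"},
    {"user": "Dory", "user_id": "1234-5678-asdf-zxcv"},
    {"user": "Marlin", "user_id": "0978-6543-asdf-yuio"},
]
sample = [
    {"user": "Dory", "user_id": "1234-5678-asdf-zxcv"},  # matches the seed
    {"user": "Dory", "user_id": "0978-6543-asdf-yuio"},  # Dory with Marlin's id
]

bad = linked_violations(seed, sample, ("user", "user_id"))
```

Here bad contains only the second sample row, because that user and user_id never appear together in the seed.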
Additional Examples​
Prompt only​
To generate synthetic results that don't use any real data, we can simply use the --prompt flag and provide a natural language prompt describing the data set we would like to synthesize. Notice that zero arguments are provided.
Query​
synthesize
--prompt "add a random product (named product) which is either a tablet, a phone, or a computer, add a random price (named price), add a timestamp (named ts), a guid (named tx_id)"
--count 100
Results​
This will generate 100 results resembling the below:
product | price | ts | tx_id |
---|---|---|---|
phone | 206 | 2022-06-29T13:38:37.413Z | e00024d9-3969-4243-8388-53ec0b76e31f |
.. | .. | .. | .. |
computer | 600 | 2020-05-14T02:24:56.857Z | b83be58d-e8e4-4952-9e99-f20bd4ece530 |
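For readers who want a feel for the shape of these rows outside the product, the following Python sketch produces comparable records locally. It is a stand-in only: it does not call the synthesize command or OpenAI, the function name fake_rows is hypothetical, and the price range and date window are arbitrary assumptions.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

def fake_rows(count, rng=None):
    """Generate rows shaped like the prompt-only results above."""
    rng = rng or random.Random()
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)  # assumed window start
    window_seconds = 3 * 365 * 24 * 3600               # assumed 3-year window
    rows = []
    for _ in range(count):
        ts = start + timedelta(seconds=rng.randrange(window_seconds))
        rows.append({
            "product": rng.choice(["tablet", "phone", "computer"]),
            "price": rng.randint(100, 1500),  # arbitrary price range
            # ISO 8601 timestamp in UTC with a trailing Z
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # random version 4 GUID drawn from the same generator
            "tx_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        })
    return rows

rows = fake_rows(100, random.Random(1))
```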
Combination with thaw​
We can also use the thaw command to load uploaded or frozen files as the data set that synthesize selects values from.
First, let's freeze a sample data set.
seed '[
{
"product": "phone",
"price": 800,
"country": "US",
"user": "Nemo",
"user_id": "zxcv-asdf-1234-5678"
},
{
"product": "computer",
"price": 1000,
"country": "FR",
"user": "Dory",
"user_id": "1234-5678-asdf-zxcv"
},
{
"country": "CA",
"user": "Marlin",
"user_id": "0978-6543-asdf-yuio"
},
{
"country": "GB"
}
]'
|| freeze synthetic-demo
This data set now exists in the cellar with the name synthetic-demo and can be accessed with the thaw command.
thaw synthetic-demo
|| synthesize "product,price" "user,user_id" "country"
--prompt "add a timestamp (named ts), a guid (named tx_id)"
--count 100
Results​
This query will generate 100 results resembling the below:
product | price | user | user_id | country | ts | tx_id |
---|---|---|---|---|---|---|
phone | 800 | Dory | 1234-5678-asdf-zxcv | FR | 2022-06-29T13:38:37.413Z | e00024d9-3969-4243-8388-53ec0b76e31f |
.. | .. | .. | .. | .. | .. | .. |
computer | 1000 | Dory | 1234-5678-asdf-zxcv | GB | 2021-06-17T02:24:32.357Z | v98gh36f-e2f3-9867-9e11-f54hj7asd221 |
computer | 1000 | Marlin | 0978-6543-asdf-yuio | US | 2020-05-14T02:24:56.857Z | b83be58d-e8e4-4952-9e99-f20bd4ece530 |
Combination with api/open​
As with thaw, we can take the output of a query that uses the api or open commands and use the synthesize command to generate synthetic data from those results.
Query​
api get https://pokeapi.co/api/v2/pokemon
|| normalize results
|| synthesize "name"
--prompt "add a first name (named first_name), and a guid (named tx_id)"
--count 100
Results​
This query will generate 100 results resembling the below:
name | first_name | tx_id |
---|---|---|
charizard | Steve | e00024d9-3969-4243-8388-53ec0b76e31f |
.. | .. | .. |
bulbasaur | Linda | b83be58d-e8e4-4952-9e99-f20bd4ece530 |
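To make the data flow in this pipeline concrete, here is a rough Python sketch of the two steps before the prompt is applied. It assumes that normalize results turns each entry of the response's results array into its own row, which is an interpretation of the query above rather than documented behaviour. The helpers normalize_results and sample_column are hypothetical, and the payload is an abbreviated sample in the shape of the PokeAPI list response, not real output.

```python
import random

def normalize_results(payload):
    """Assumed behaviour: one row per entry in the `results` array."""
    return [dict(item) for item in payload.get("results", [])]

def sample_column(rows, column, count, rng=None):
    """Pick `count` values for a single standalone column."""
    rng = rng or random.Random()
    values = [row[column] for row in rows if column in row]
    return [{column: rng.choice(values)} for _ in range(count)]

# Abbreviated sample payload; a live response lists more entries.
payload = {
    "count": 2,
    "next": None,
    "previous": None,
    "results": [
        {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
        {"name": "ivysaur", "url": "https://pokeapi.co/api/v2/pokemon/2/"},
    ],
}

rows = normalize_results(payload)
picks = sample_column(rows, "name", 5, random.Random(0))
```

Every entry in picks has a single name key whose value is one of the names in the payload, which is the column the synthesize argument refers to.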