Welcome to my first blog post! I hope this helps anyone looking for a tool to generate data via BigQuery.
Background
I often want to generate mock data in BigQuery to test the logic that fits with various types of data, whether it be a small or large amount.
Method
There are various ways to achieve this, including using a Python script with some random library to generate the data, inserting data directly by CSV/JSON new line, or using data from another table. In this article, I will focus on a method to generate data in BigQuery using table profiles and then configuring rules on top of that. After running this method, it will generate data that fits the schema of the table.
Here’s how it works
- Start by declaring the variables we will use in the script
- Read the schema from the metadata table.
- Set the expected number of records.
- Loop through each column and check the column type, then generate a random value that is related to the data type.
- Generate SQL that can be a random number/string/date, up to the data type that is specific in each “if” clause.
- Execute with UNNEST(GENERATE_ARRAY…) to generate the desired number of records.
- The result should be generated at the console.
Code
Do not forget to replace following variable before run the script
project_id
: This is the ID of the Google Cloud Platform project that contains the BigQuery dataset and table where you want to generate mock data. You will need to provide the project ID so that BigQuery knows which project to access in order to generate the data.
dataset_id
: This is the ID of the BigQuery dataset within the specified project where you want to generate mock data. You will need to provide the dataset ID so that BigQuery knows which dataset to access within the project.
table_id
: This is the ID of the BigQuery table within the specified dataset for which you want to generate mock data. Once you have provided the project ID and dataset ID, you can then specify the table ID in order to generate the mock data within that specific table.
expected_record
: This variable determines the number of records that you want to generate as mock data for the specified BigQuery table. Please set the value of expected_record to the desired number of records that you need to generate.
-- Set default values
DECLARE col_name ARRAY<STRING>;
DECLARE col_type ARRAY<STRING>;
DECLARE expected_record INT64 DEFAULT 10;
DECLARE column_cursor INT64 DEFAULT 0;
DECLARE query STRING DEFAULT "SELECT ";
DECLARE full_query STRING DEFAULT "";
-- Read schema from metadata table
SET col_name = (SELECT ARRAY_AGG(column_name) FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.COLUMNS` WHERE table_name = '{table_id}');
SET col_type = (SELECT ARRAY_AGG(data_type) FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.COLUMNS` WHERE table_name = '{table_id}');
-- set number of expected record here
SET expected_record = 120;
-- Loop through each column, check the column type, and generate a random value that is related to the data type
WHILE column_cursor < ARRAY_LENGTH(col_name) DO
IF (col_type[OFFSET(column_cursor)] = 'INT64') THEN
SET query = CONCAT(query, "CAST(ROUND(RAND() * 100) AS INT64)", " AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
IF (col_type[OFFSET(column_cursor)] = 'FLOAT64') THEN
SET query = CONCAT(query, "RAND() * 100", " AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
IF (col_type[OFFSET(column_cursor)] = 'NUMERIC') THEN
SET query = CONCAT(query, "CAST(ROUND(RAND() * 100, 2) AS NUMERIC)", " AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
IF (col_type[OFFSET(column_cursor)] = 'STRING') THEN
SET query = CONCAT(query, "SUBSTR(CONCAT('string_', CAST(ROUND(RAND() * 100) AS STRING)), 1, 10)"," AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
IF (col_type[OFFSET(column_cursor)] = 'DATE') THEN
SET query = CONCAT(query, "DATE_SUB(CURRENT_DATE(), INTERVAL CAST(ROUND(RAND() * 365) AS INT64) DAY)", " AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
IF (col_type[OFFSET(column_cursor)] = 'TIMESTAMP') THEN
SET query = CONCAT(query, "DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL CAST(ROUND(RAND() * 365) AS INT64) DAY)", " AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
IF (col_type[OFFSET(column_cursor)] = 'BOOL') THEN
SET query = CONCAT(query, "MOD(CAST(10*RAND() AS INT64), 2) = 0", " AS ", col_name[OFFSET(column_cursor)], ",\n") ;
END IF;
SET column_cursor = column_cursor + 1;
END WHILE;
-- Generate up to the expected_record variable
SET full_query = CONCAT(query, " FROM UNNEST(GENERATE_ARRAY(1, ", expected_record, ")) AS id");
EXECUTE IMMEDIATE format(full_query);
Limitation
Limitations There are some limitations to this method, including the fact that some data types are not supported, and you may need to adjust the code yourself if your data type is different from those listed in the script. Additionally, this method is not suitable for generating very large amounts of data due to the UNNEST limit.