Machine Learning in SQL using BigQuery

A beginner's guide to training machine learning models with SQL using Google's BigQuery ML

Machine Learning in SQL using BigQuery

Machine learning is a rapidly growing field that has seen a lot of interest from businesses and organizations. With the growing amount of data that businesses collect, it has become essential to use machine learning to extract insights from that data. One of the best tools for this is BigQuery, a fully managed, cloud-native data warehouse that makes it easy to analyze large amounts of data quickly. In this article, we will explore how to perform machine learning with SQL using BigQuery.

But before we start, it is essential to note that machine learning with SQL is not the same as traditional machine learning methods such as regression, clustering, or classification. Instead, it focuses on using SQL to analyze and transform data into the format required for machine learning. With that said, let's get started.

Setting up BigQuery

Before we start with machine learning, we need to set up BigQuery. If you haven't done this before, you can follow these simple steps:

  1. Go to the BigQuery console and sign up for a Google Cloud account if you haven't already.

  2. Create a new project or select an existing one.

  3. Click on "Create Dataset" to create a new dataset to store your data.

  4. Upload your data to the dataset using the "Create Table" button.

Explore the Data

Once you have uploaded your data, it is essential to understand it. In this tutorial, we will use the Iris dataset, which is a popular dataset for classification tasks. It contains 150 instances of three classes, with each class having 50 instances.

To explore the data, we can use SQL queries to select and visualize the data. The following code snippet shows how to select the first five rows of the dataset:

SELECT *
FROM `<project_id>.<dataset_id>.<table-name>`
LIMIT 5

This will display the first five rows of the dataset. We can use the same query to select specific columns, such as the sepal length and sepal width:

SELECT sepal_length, sepal_width
FROM `<project_id>.<dataset_id>.iris`
LIMIT 5

We can also use SQL to calculate summary statistics of the data, such as the mean and standard deviation:

SELECT
  AVG(sepal_length) AS avg_sepal_length,
  STDDEV(sepal_length) AS stddev_sepal_length
FROM `<project_id>.<dataset_id>.iris`

Transform the Data

Before we can perform machine learning on the data, we need to transform it into the required format. In the case of the Iris dataset, we need to convert the categorical target variable into a numerical value. We can do this using a SQL CASE statement:

SELECT
  CASE class
    WHEN 'Iris-setosa' THEN 1
    WHEN 'Iris-versicolor' THEN 2
    WHEN 'Iris-virginica' THEN 3
  END AS target,
  sepal_length,
  sepal_width,
  petal_length,
  petal_width
FROM `<project_id>.<dataset_id>.iris`

This query will create a new column called target with the numerical value of the class.

Train a Machine Learning Model

Now that we have transformed the data, we can train a machine learning model using SQL. In BigQuery, we can use the CREATE MODEL statement to create a model. In this tutorial, we will use logistic regression to classify the Iris dataset

To train a logistic regression model, we can use the following SQL query:

CREATE MODEL `<project_id>.<dataset_id>.iris_model`
OPTIONS
  (model_type='logistic_reg',
  input_label_cols=['target'],
   max_iteration=50,
   l1_reg=1,
   l2_reg=0.1) AS
SELECT
  CASE class
    WHEN 'Iris-setosa' THEN 1
    WHEN 'Iris-versicolor' THEN 2
    WHEN 'Iris-virginica' THEN 3
  END AS target,
  sepal_length,
  sepal_width,
  petal_length,
  petal_width
FROM `<project_id>.<dataset_id>.iris`

This query will create a new model called "iris_model" and train it using logistic regression. We can specify options such as the maximum number of iterations and regularization parameters.

Click on GO TO MODEL button to see the training details of the model as shown below:

Evaluate the Model

Once the model is trained, we can evaluate its performance using SQL. In BigQuery, we can use the ML.EVALUATE function to evaluate the model. The following query shows how to evaluate the model on the training data:

SELECT
  *
FROM ML.EVALUATE(MODEL `<project_id>.<dataset_id>.iris_model`,
                (
                 SELECT
                   CASE species
                     WHEN 'Iris-setosa' THEN 1
                     WHEN 'Iris-versicolor' THEN 2
                     WHEN 'Iris-virginica' THEN 3
                   END AS target,
                   sepal_length,
                   sepal_width,
                   petal_length,
                   petal_width
                 FROM `<project_id>.<dataset_id>.iris`
                ))

This query will return metrics such as accuracy, precision, and recall.

Precision - 0.9534
Recall - 0.9533
Accuracy - 0.9533
F1 score - 0.9533
Log loss - 0.1231
ROC AUC - 0.9990

Make Predictions

Finally, we can use the trained model to make predictions on new data. In BigQuery, we can use the ML.PREDICT function to make predictions. The following query shows how to make predictions on new data:

SELECT
  predicted_target,
  predicted_target_probs
FROM ML.PREDICT(MODEL `<project_id>.<dataset_id>.iris_model`,
                (
                 SELECT
                   sepal_length,
                   sepal_width,
                   petal_length,
                   petal_width
                 FROM `<project_id>.<dataset_id>.new_data`
                ))

This query will return the predicted target value and the probability of each class for each row in the new data.

Conclusion

In this article, we have explored how to perform machine learning with SQL using BigQuery. We started by setting up BigQuery and exploring the data using SQL queries. We then transformed the data into the required format, trained a logistic regression model, and evaluated its performance. Finally, we made predictions on new data using the trained model. With BigQuery, it is possible to perform powerful machine learning tasks without leaving the SQL environment.