Lab 13: Regression II: Gaussian Process and Cross Validation#

For the class on Monday, March 18th

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

A. Gaussian Process Regression#

In this example, we have a simple data set with only one feature and one target. A simple linear regression does not fit this data set. We will test if Gaussian Process regression works well in this case.

A1. Compare Gaussian Process and Linear Regression#

df = pd.read_csv("https://yymao.github.io/phys7730/data/lab13a.csv")
df.head()
feature1 target
0 1.790917 0.411666
1 0.351093 -0.555546
2 0.755384 1.215104
3 0.923896 0.793547
4 0.275408 -1.181132
train_features, test_features, train_target, test_target = train_test_split(df[["feature1"]], df["target"], test_size=0.25, random_state=456)
# Note that, despite the difference between `reg_lr` and `reg_gp`,
# The code to fit the data and make prediction is almost identical.

reg_lr = LinearRegression()
reg_lr.fit(train_features, train_target)
train_predict_lr = reg_lr.predict(train_features)
test_predict_lr = reg_lr.predict(test_features)

reg_gp = GaussianProcessRegressor(normalize_y=True)
reg_gp.fit(train_features, train_target)
train_predict_gp = reg_gp.predict(train_features)
test_predict_gp = reg_gp.predict(test_features)
fig, ax = plt.subplots(ncols=2, figsize=(12, 5))

ax[0].scatter(train_predict_lr, train_target, c="C0", s=20, label="Training")
ax[0].scatter(test_predict_lr, test_target, c="C2", s=25, marker="^", label="Test")
ax[0].set_title("Linear regression")

ax[1].scatter(train_predict_gp, train_target, c="C0", s=20, label="Training")
ax[1].scatter(test_predict_gp, test_target, c="C2", s=25, marker="^", label="Test")
ax[1].set_title("Gaussian Process")

corners = [-7, 3]
for ax_this in ax:
    ax_this.plot(corners, corners, c="C1", ls="--", label="$y=x$")
    ax_this.set_xlim(*corners)
    ax_this.set_ylim(*corners)

    ax_this.set_xlabel("Predicted values")
    ax_this.set_ylabel("True target value")
    ax_this.legend(loc="lower right");
../_images/b173c0d9dea38b2e02f866c916448332de570c7a4087ae99237710583ac442dd.png

Part A1, Question 1: Based on the plots above, how would you compare the performace of the linear regression and the Gaussian Process regression?


// Write your answer here


x = np.linspace(0, 7, 201)
prediction_lr = reg_lr.predict(pd.DataFrame({"feature1": x}))
mean_prediction_gp, std_prediction_gp = reg_gp.predict(pd.DataFrame({"feature1": x}), return_std=True)

plt.scatter(train_features["feature1"], train_target, label="Observations")
plt.plot(x, prediction_lr, label="Linear model prediction", ls=":", c="C4")
plt.plot(x, mean_prediction_gp, label="GP mean prediction", c="C3")
plt.fill_between(
    x.ravel(),
    mean_prediction_gp - 1.96 * std_prediction_gp,
    mean_prediction_gp + 1.96 * std_prediction_gp,
    alpha=0.25,
    color="C3",
    label=r"GP 95% confidence interval",
    lw=0,
)
plt.legend(loc="lower right")

plt.ylim(-7, 6)
plt.xlabel("feature1")
plt.ylabel("target");
../_images/6f3dd207b1583bffdf5cc9aca4f981114c9e469650a2b28d0e6d4701c0d3e89b.png

Part A1, Questions 2-5:

  1. Based on the plot above, does your evaluation of the linear regression and the Gaussian Process regression change?

  2. Why don’t we see the GP 95% confidence interval on this plot?

  3. Which method do you think would give a better prediction when the feature1 value is around 1?

  4. Which method do you think would give a better prediction when the feature1 value is around 7?


// Write your answer here


A2. Including data uncertainty#

  1. Repeat A1 (including making those plots), but change GaussianProcessRegressor(normalize_y=True) to GaussianProcessRegressor(normalize_y=True, alpha=0.1).

    Tip

    What is alpha? Roughly speaking it’s the assumed uncertainty of data points (in target). See the documentation of GaussianProcessRegressor for detail.

  2. Manually try a few different values of alpha (no need to refactor or optimize your code, as you’ll see a better way in Part B). What value seems to give you the best result?

# Add your implementation to A2 here.

Part A2, Question 1: During Step 2, what value of alpha seems to work well? How did you define “work well”?


// Write your answer here


B. Cross Validation#

In Part B, we will use cross validation to find the optimal values for the hyperparameter in our ML models.

B1. Gaussian Process#

cross_val_score(GaussianProcessRegressor(normalize_y=True), train_features, train_target)
array([ 0.94466647,  0.8790451 ,  0.91188481, -7.1283826 ,  0.72405611])

Part B1, Question 1: What are these five values? Why are there 5 values but not other number of values? Check out the documentation of cross_val_score if you are not sure.


// Write your answer here


reg_cv = GridSearchCV(GaussianProcessRegressor(normalize_y=True), {"alpha": np.logspace(-8, 1, 31)})
reg_cv.fit(train_features, train_target);
plt.semilogx(reg_cv.cv_results_["param_alpha"].data, reg_cv.cv_results_["mean_test_score"])
plt.xlabel("alpha")
plt.ylabel("mean test score");
../_images/92d071878ae43a1cbc8a934d3e183c504352372b7f30ca4583735da4eb36a02c.png

Part B1, Question 2: Based on the plot above, what value of alpha give the Gaussian Process model the best performance on this data set? Is this value similar to what you found in Part A2?


// Write your answer here


# Add code to repeat A2, but with the `alpha` value you found here.

B2. Revisit Lab 12 Part B#

In Lab 12 Part B, we compare the logistic regression method with and without regularization. When regularization is included (the default option for LogisticRegression), we can in fact further adjust the regularization strength by changing the value supplied to the argument C. See the documentation of LogisticRegression for details.

In this part, we will use the cross validation method to justify our discussion in Lab 12 Part B.

df_12b = pd.read_csv("https://yymao.github.io/phys7730/data/lab12b.csv")
train_12b_features, test_12b_features, train_12b_target, test_12b_target = train_test_split(df_12b.iloc[:,:3], df_12b["target"], test_size=0.25, random_state=123)
reg_cv = GridSearchCV(LogisticRegression(), {"C": np.logspace(-3, 3, 31)})
reg_cv.fit(train_12b_features, train_12b_target);
plt.semilogx(reg_cv.cv_results_["param_C"].data, reg_cv.cv_results_["mean_test_score"])
plt.xlabel("C")
plt.ylabel("mean test score");
../_images/6f6a652d266fe60676e9033238c02d5b77dc708f056b55255605b0ab2cb72d81.png

Part B2, Question 1: Based on the plot above, do you think including regularization is helpful for this particular data set? Note that the conclusion here is not a general statement! What do you think is special about this data set that results in your conclusion?


// Write your answer here


Tip

How to submit this notebook on Canvas?

  1. Make sure all your answers, code, and desired results are properly displayed in the notebook.

  2. Save the notebook (press Ctrl+s or Cmd+s). The grey dot on the filename tab (indicating “unsaved”) should disappear.

  3. Run the following cell.

  4. Upload the resulting HTML file to Canvas under the corresponding assignment.

! jupyter nbconvert --to html ./13.ipynb