Lab 11: Introduction to scikit-learn

For the class on Monday, March 11th

Tip

If you don’t have scikit-learn installed, install it first. Run the following in a terminal after you activate the environment you use for this course. (If running in a notebook, add ! at the very beginning of the line.)

conda install -y scikit-learn
# Run this cell to check if scikit-learn is installed
# scikit-learn version should be at least 1.3.0

import sklearn
print(sklearn.__version__)
1.4.2

A. Data Representation and Exploration

A1. Basic information about a data set

For each of the three datasets loaded below (data_digits, data_solarflare, data_transfusion), find out:

  • number of instances (samples)

  • number of features

  • number of targets

  • number of classes (for each target)

  • what question(s) this data set can help answer given the features and targets


// Write your answers to Part A1 here


from sklearn import datasets
# https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset
data_digits = datasets.load_digits(as_frame=True)
# https://openml.org/search?type=data&status=active&sort=runs&id=41489
data_solarflare = datasets.fetch_openml("sf2", version=2)
# https://openml.org/search?type=data&status=active&id=1464
data_transfusion = datasets.fetch_openml("blood-transfusion-service-center", version=1)
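The objects returned by the loaders above expose the features as .data and the target(s) as .target. As a minimal sketch of how the quantities listed in Part A1 might be read off, the block below uses load_iris as a stand-in (an assumption chosen for illustration; it is not one of the three lab data sets). The same pattern should carry over when .data and .target are pandas objects, but check the type of each attribute before relying on it.

```python
from sklearn import datasets

# Stand-in data set, used only for illustration (not one of the lab data sets).
data_iris = datasets.load_iris(as_frame=True)

features = data_iris.data    # pandas DataFrame: one row per instance, one column per feature
target = data_iris.target    # pandas Series here; may be a DataFrame when there are several targets

n_instances, n_features = features.shape
# A 1-D target means a single target; a 2-D target has one column per target.
n_targets = 1 if target.ndim == 1 else target.shape[1]
# Number of distinct values taken by the (single) target.
n_classes = target.nunique()

print("instances:", n_instances)
print("features: ", n_features)
print("targets:  ", n_targets)
print("classes:  ", n_classes)
```

For a data set with several targets, calling nunique() on a DataFrame target returns one count per target column.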

A2. Exploring a data set

Explore the data_transfusion data set a little bit.

Based on our discussion in class, do some calculations or make some plots to gain a better understanding of this data set.

Suggestion: Pick 3 calculations/plots to do/make. Write down what you plan to do, implement them, and briefly discuss what you observe.
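As one possible starting point, the sketch below runs three simple calculations (class balance, summary statistics, and per-class feature means) on a stand-in binary classification data set. load_breast_cancer is an assumption chosen for illustration only; the calculations you pick for data_transfusion may well differ.

```python
from sklearn import datasets

# Stand-in binary classification data set, used only for illustration.
data_bc = datasets.load_breast_cancer(as_frame=True)
features, target = data_bc.data, data_bc.target

# 1. Class balance: fraction of instances in each class.
class_fraction = target.value_counts(normalize=True)

# 2. Summary statistics (count, mean, std, quartiles, ...) for every feature.
summary = features.describe()

# 3. Mean of every feature within each class, to spot features that separate the classes.
class_means = features.groupby(target).mean()

print(class_fraction)
print(summary.iloc[:, :3])      # first three features only, to keep the output short
print(class_means.iloc[:, :3])
```

Each of these has a natural plot counterpart, for example a bar chart of the class fractions or per-class histograms of a single feature.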


// Write your answers to Part A2 here


# Add your implementation to Part A2 here

B. scikit-learn API basics

  1. Based on the example code below, repeat the classification task but with RidgeClassifier and the data_transfusion data set. (You don’t need to know how the method works.)

  2. When using the RidgeClassifier method on the data_transfusion data set, what are the values of recall, precision, specificity, and accuracy? Here you can treat “2” (donating blood) as “True”, and “1” (not donating blood) as “False”.
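The four metrics in question 2 can all be derived from the counts in a binary confusion matrix. The sketch below uses made-up toy labels (not results from data_transfusion) and treats 2 as the positive class and 1 as the negative class. It assumes integer labels; in the actual data set the labels may be stored as strings, in which case the labels argument would need to match.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 2 = positive ("True"), 1 = negative ("False").
y_true = np.array([2, 2, 2, 1, 1, 1, 1, 1, 2, 1])
y_pred = np.array([2, 1, 2, 1, 1, 2, 1, 1, 1, 1])

# labels=[1, 2] puts the negative class first, so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 2]).ravel()

recall = tp / (tp + fn)                     # fraction of actual positives that were found
precision = tp / (tp + fp)                  # fraction of predicted positives that are correct
specificity = tn / (tn + fp)                # fraction of actual negatives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions that are correct

print("recall      =", recall)
print("precision   =", precision)
print("specificity =", specificity)
print("accuracy    =", accuracy)
```

Passing labels explicitly fixes the row and column order of the matrix, so the unpacking does not depend on how the labels happen to sort.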


// Write your answers to Part B here


import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier
train_features, test_features, train_target, test_target = train_test_split(data_digits.data, data_digits.target, test_size=0.25, random_state=123)
clf = SVC()
clf.fit(train_features, train_target)
test_predicted = clf.predict(test_features)
print("% of correct prediction on the test set")
print("Manual calculation =", np.count_nonzero(test_predicted == test_target) / len(test_target))
print("W/ .score() method =", clf.score(test_features, test_target))
% of correct prediction on the test set
Manual calculation = 0.9844444444444445
W/ .score() method = 0.9844444444444445
# Add your implementation to Part B here

Tip

How to submit this notebook on Canvas?

  1. Make sure all your answers, code, and desired results are properly displayed in the notebook.

  2. Save the notebook (press Ctrl+s or Cmd+s). The grey dot on the filename tab (indicating “unsaved”) should disappear.

  3. Run the following cell.

  4. Upload the resulting HTML file to Canvas under the corresponding assignment.

! jupyter nbconvert --to html ./11.ipynb