Lab 11: Introduction to scikit-learn

For the class on Monday, March 11th


If you don’t have scikit-learn installed, install it first. Run the following in a terminal after activating the environment you use for this course. (If running in a notebook, add ! at the very beginning of the line.)

conda install -y scikit-learn

# Run this cell to check if scikit-learn is installed
# scikit-learn version should be at least 1.3.0

import sklearn
print(sklearn.__version__)

A. Data Representation and Exploration

A1. Basic information about a data set

For each of the three datasets loaded below (data_digits, data_solarflare, data_transfusion), find out:

  • number of instances (samples)

  • number of features

  • number of targets

  • number of classes (for each target)

  • what question(s) this data set can help answer given the features and targets

// Write your answers to Part A1 here

from sklearn import datasets
data_digits = datasets.load_digits(as_frame=True)
data_solarflare = datasets.fetch_openml("sf2", version=2)
data_transfusion = datasets.fetch_openml("blood-transfusion-service-center", version=1)
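
As a rough sketch of how the quantities listed above can be read off a loaded data set, the example below inspects the built-in iris data set. Iris is an assumption chosen only for illustration because it needs no download, and it is not one of the three data sets in this part. The sketch also assumes the features and target come back as pandas objects, which `as_frame=True` requests.

```python
from sklearn import datasets

# Iris is a small built-in data set, used here only for illustration
data_iris = datasets.load_iris(as_frame=True)

features = data_iris.data    # pandas DataFrame, one row per instance
target = data_iris.target    # pandas Series (or DataFrame for several targets)

# Rows are instances (samples), columns are features
n_instances, n_features = features.shape

# A Series holds a single target; a DataFrame holds one target per column
n_targets = 1 if target.ndim == 1 else target.shape[1]

# nunique() counts distinct classes (per column if target is a DataFrame)
n_classes = target.nunique()

print("instances:", n_instances)
print("features: ", n_features)
print("targets:  ", n_targets)
print("classes:  ", n_classes)
```

The same attributes (`.data`, `.target`) should be available on the objects loaded above, although their exact types may depend on the loader's defaults.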

A2. Exploring a data set

Explore the data_transfusion data set a little bit.

Based on our discussion in class, do some calculations or make some plots to gain a better understanding of this data set.

Suggestion: Pick three calculations or plots. Write down what you plan to do, implement them, and briefly discuss what you observe.
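
To give a feel for what such calculations might look like, here is a minimal sketch of three simple ones on the built-in breast cancer data set. That data set is an assumption chosen only because it is also a two-class problem and needs no download. The three calculations are examples, not a recommended set.

```python
from sklearn import datasets

# Built-in two-class data set, used here only for illustration
data_cancer = datasets.load_breast_cancer(as_frame=True)
df = data_cancer.frame    # all features plus a "target" column

# 1. Summary statistics (count, mean, std, quartiles) for the first few features
summary = df.iloc[:, :3].describe()
print(summary)

# 2. Class balance: fraction of instances in each class
class_fraction = df["target"].value_counts(normalize=True)
print(class_fraction)

# 3. Mean of every feature within each class, to see which features differ
group_means = df.groupby("target").mean()
print(group_means.iloc[:, :3])
```

A per-feature histogram (for example with pandas' `DataFrame.hist`, which requires matplotlib) would be a natural plotting counterpart to the first calculation.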

// Write your answers to Part A2 here

# Add your implementation to Part A2 here

B. scikit-learn API basics

  1. Based on the example code below, repeat the classification task but with RidgeClassifier and the data_transfusion data set. (You don’t need to know how the method works.)

  2. When using the RidgeClassifier method on the data_transfusion data set, what are the values of recall, precision, specificity, and accuracy? Here you can treat “2” (donating blood) as “True”, and “1” (not donating blood) as “False”.
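
The four metrics in the second question all follow from the entries of a confusion matrix. The sketch below computes them for a small set of made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not taken from any data set), treating "2" as the positive class and "1" as the negative class as described above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
y_true = np.array(["2", "2", "2", "1", "1", "1", "1", "1"])
y_pred = np.array(["2", "2", "1", "1", "1", "1", "2", "1"])

# With labels ordered [negative, positive], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["1", "2"]).ravel()

recall = tp / (tp + fn)                      # true positives among actual positives
precision = tp / (tp + fp)                   # true positives among predicted positives
specificity = tn / (tn + fp)                 # true negatives among actual negatives
accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions among all

print("recall      =", recall)
print("precision   =", precision)
print("specificity =", specificity)
print("accuracy    =", accuracy)

# Cross-check two of the values against scikit-learn's own metric functions
print(recall_score(y_true, y_pred, pos_label="2"))
print(precision_score(y_true, y_pred, pos_label="2"))
```

Specificity has no dedicated function in `sklearn.metrics`, but it equals the recall of the negative class, so `recall_score(y_true, y_pred, pos_label="1")` should give the same value.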

// Write your answers to Part B here

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier

# Split the digits data set (assumed here) into training and test sets
train_features, test_features, train_target, test_target = train_test_split(
    data_digits.data, data_digits.target, test_size=0.25, random_state=123
)
# Fit a support vector classifier on the training set
clf = SVC()
clf.fit(train_features, train_target)
# Predict on the held-out test set and compare with the true labels
test_predicted = clf.predict(test_features)
print("Fraction of correct predictions on the test set")
print("Manual calculation =", np.count_nonzero(test_predicted == test_target) / len(test_target))
print("W/ .score() method =", clf.score(test_features, test_target))

Fraction of correct predictions on the test set
Manual calculation = 0.9844444444444445
W/ .score() method = 0.9844444444444445

# Add your implementation to Part B here
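
One reason the example above carries over to other methods is that scikit-learn classifiers share the same `fit`, `predict`, and `score` methods. The following sketch illustrates this with a different estimator and data set. Both `KNeighborsClassifier` and the built-in wine data set are assumptions picked only for illustration, not the ones asked for in this part.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in data set, used here only for illustration
data_wine = datasets.load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data_wine.data, data_wine.target, test_size=0.25, random_state=123
)

# Only this line names the method; the steps after it are the same for any classifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy computed by hand and with the estimator's score() method
manual_accuracy = np.count_nonzero(y_pred == y_test) / len(y_test)
score_accuracy = clf.score(X_test, y_test)
print("Manual calculation =", manual_accuracy)
print("W/ .score() method =", score_accuracy)
```

For classifiers, `score` returns the mean accuracy, which is why the two printed values should agree.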


