Lab 11: Introduction to scikit-learn
For the class on Monday, March 11th
Tip
If you don’t have scikit-learn installed, you’ll want to install it first.
Run the following in a terminal after you activate the environment you use for this course.
(If running in a notebook, add ! at the very beginning of the line.)
conda install -y scikit-learn
# Run this cell to check if scikit-learn is installed
# scikit-learn version should be at least 1.3.0
import sklearn
print(sklearn.__version__)
1.4.2
A. Data Representation and Exploration
A1. Basic information about a data set
For each of the three datasets loaded below (data_digits, data_solarflare, data_transfusion), find out:
number of instances (samples)
number of features
number of targets
number of classes (for each target)
what question(s) this data set can help answer given the features and targets
// Write your answers to Part A1 here
from sklearn import datasets
# https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset
data_digits = datasets.load_digits(as_frame=True)
# https://openml.org/search?type=data&status=active&sort=runs&id=41489
data_solarflare = datasets.fetch_openml("sf2", version=2)
# https://openml.org/search?type=data&status=active&id=1464
data_transfusion = datasets.fetch_openml("blood-transfusion-service-center", version=1)
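The quantities listed in Part A1 can usually be read off the object that a scikit-learn loader returns. Below is a minimal sketch on a stand-in data set: iris, chosen here only because it ships with scikit-learn and needs no download. The variable names (data_iris, n_instances, and so on) are illustrative and not part of the lab.

```python
from sklearn import datasets

# Stand-in data set (an assumption for illustration, not one of the lab's three).
data_iris = datasets.load_iris(as_frame=True)

features = data_iris.data    # pandas DataFrame: one row per instance, one column per feature
target = data_iris.target    # pandas Series: one value per instance

n_instances, n_features = features.shape

# A 1-D target means a single target; a 2-D target has one column per target.
n_targets = 1 if target.ndim == 1 else target.shape[1]

# Number of distinct values taken by the (single) target.
n_classes = target.nunique()

print("instances:", n_instances)
print("features: ", n_features)
print("targets:  ", n_targets)
print("classes:  ", n_classes)
```

When a data set has several targets, the target is a DataFrame, and calling nunique() on it counts the classes column by column. The DESCR attribute of the returned object often holds a text description that helps with the last question in the list.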
A2. Exploring a data set
Explore the data_transfusion data set a little bit.
Based on our discussion in class, do some calculations or make some plots to gain a better understanding of this data set.
Suggestion: Pick 3 calculations or plots. Write down what you plan to do, implement them, and briefly discuss what you observe.
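As one possible shape for such an exploration, the sketch below runs three common calculations on a stand-in data set (wine, bundled with scikit-learn): class balance, per-feature summary statistics, and per-class feature means. The choice of data set and of these three calculations is an assumption for illustration; other calculations or plots may suit data_transfusion better.

```python
from sklearn import datasets

# Stand-in data set (an assumption for illustration); bundled with scikit-learn.
data_wine = datasets.load_wine(as_frame=True)
df = data_wine.frame  # features plus a "target" column in one DataFrame

# 1. Class balance: how many instances fall in each class?
class_counts = df["target"].value_counts().sort_index()

# 2. Summary statistics (count, mean, std, quartiles, ...) for each feature.
summary = df.drop(columns="target").describe()

# 3. Mean of each feature within each class, to see which features separate the classes.
class_means = df.groupby("target").mean()

print(class_counts)
print(summary)
print(class_means)
```

Strongly unbalanced class counts or features on very different scales are the kind of observations worth noting, since they can affect how a classifier's scores should be interpreted.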
// Write your answers to Part A2 here
# Add your implementation to Part A2 here
B. scikit-learn API basics
Based on the example code below, repeat the classification task but with RidgeClassifier and the data_transfusion data set. (You don’t need to know how the method works.)
When using the RidgeClassifier method on the data_transfusion data set, what are the values of recall, precision, specificity, and accuracy? Here you can treat “2” (donating blood) as “True”, and “1” (not donating blood) as “False”.
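To make the four metrics concrete, here is a small sketch on made-up labels (not the transfusion data), treating "2" as the positive class and "1" as the negative class. It uses functions from sklearn.metrics. scikit-learn has no dedicated specificity function, so specificity is computed here as the recall of the negative class; the label arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration only: "2" = positive ("True"), "1" = negative ("False").
y_true = np.array(["2", "2", "2", "1", "1", "1", "1", "1", "2", "1"])
y_pred = np.array(["2", "1", "2", "1", "1", "2", "1", "1", "1", "1"])

# With labels=["1", "2"], the flattened 2x2 matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["1", "2"]).ravel()

recall = recall_score(y_true, y_pred, pos_label="2")        # TP / (TP + FN)
precision = precision_score(y_true, y_pred, pos_label="2")  # TP / (TP + FP)
specificity = recall_score(y_true, y_pred, pos_label="1")   # TN / (TN + FP)
accuracy = accuracy_score(y_true, y_pred)                   # (TP + TN) / total

print("TN, FP, FN, TP =", tn, fp, fn, tp)
print("recall      =", recall)
print("precision   =", precision)
print("specificity =", specificity)
print("accuracy    =", accuracy)
```

Passing pos_label explicitly matters when the class labels are strings such as "1" and "2", because the scoring functions otherwise assume the positive class is the integer 1.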
// Write your answers to Part B here
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier
train_features, test_features, train_target, test_target = train_test_split(data_digits.data, data_digits.target, test_size=0.25, random_state=123)
clf = SVC()
clf.fit(train_features, train_target)
test_predicted = clf.predict(test_features)
print("% of correct prediction on the test set")
print("Manual calculation =", np.count_nonzero(test_predicted == test_target) / len(test_target))
print("W/ .score() method =", clf.score(test_features, test_target))
% of correct prediction on the test set
Manual calculation = 0.9844444444444445
W/ .score() method = 0.9844444444444445
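The fit / predict / score pattern in the example is shared across scikit-learn classifiers, so swapping in a different estimator usually changes only the line that constructs it. The sketch below illustrates this with KNeighborsClassifier on the digits data; the choice of estimator is an assumption for illustration and is not the one asked for in Part B.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=123
)

# Only this line differs from the SVC example; the rest of the workflow is unchanged.
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_predicted = knn.predict(X_test)

# Fraction of test instances whose predicted label matches the true label.
manual_accuracy = np.count_nonzero(y_predicted == y_test) / len(y_test)
score_accuracy = knn.score(X_test, y_test)

print("Manual calculation =", manual_accuracy)
print("W/ .score() method =", score_accuracy)
```

For classifiers, .score() returns the mean accuracy on the given data, which is why it agrees with the manual calculation.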
# Add your implementation to Part B here
Tip
How to submit this notebook on Canvas?
Make sure all your answers, code, and desired results are properly displayed in the notebook.
Save the notebook (press Ctrl+s or Cmd+s). The grey dot on the filename tab (indicating “unsaved”) should disappear.
Run the following cell.
Upload the resulting HTML file to Canvas under the corresponding assignment.
! jupyter nbconvert --to html ./11.ipynb