Lab 11: Introduction to scikit-learn
For the class on Monday, March 3rd
# Run this cell to confirm that scikit-learn is properly installed
# scikit-learn version should be at least 1.6.0
import sklearn
print(sklearn.__version__)
1.6.1
A. Data Representation and Exploration
A1. Basic information about a data set
Tasks and Questions:
For each of the three data sets loaded below (data_digits, data_solarflare, data_transfusion), find out:
number of instances (samples)
number of features
number of targets
number of classes (for each target)
what question(s) this data set can help answer given the features and targets
from sklearn import datasets
# https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset
data_digits = datasets.load_digits(as_frame=True)
# https://openml.org/search?type=data&status=active&sort=runs&id=41489
data_solarflare = datasets.fetch_openml("sf2", version=2)
# https://openml.org/search?type=data&status=active&id=1464
data_transfusion = datasets.fetch_openml("blood-transfusion-service-center", version=1)
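As a starting point, the quantities listed above can be read off the loaded objects roughly as follows. This is a minimal sketch, shown only for the digits data because it ships with scikit-learn. `summarize` is a hypothetical helper written for this illustration (it is not part of scikit-learn), and it assumes the data set was loaded as pandas objects, which `fetch_openml` typically does by default for dense data.

```python
import pandas as pd
from sklearn import datasets


def summarize(bunch):
    """Return basic size information for a data set loaded as pandas objects.

    `summarize` is a hypothetical helper for this sketch, not a scikit-learn API.
    """
    features = bunch.data
    targets = bunch.target
    # A single target comes back as a Series; wrap it so that the
    # single-target and multi-target cases can be handled the same way.
    if isinstance(targets, pd.Series):
        targets = targets.to_frame()
    return {
        "n_instances": features.shape[0],
        "n_features": features.shape[1],
        "n_targets": targets.shape[1],
        "n_classes": {name: targets[name].nunique() for name in targets.columns},
    }


digits = datasets.load_digits(as_frame=True)
digits_summary = summarize(digits)
print(digits_summary)
```

The same call should work on the other two data sets if they were loaded as pandas objects. The last question (what the data can help answer) still requires reading each data set's description.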
// Write your answers to Part A1 here
A2. Exploring a data set
Explore the data_transfusion data set a little bit.
Based on our discussion in class, do some calculations or make some plots to gain a better understanding of this data set.
Tasks and Questions:
Pick 3 calculations/plots to do/make. For each calculation or plot:
write down what you plan to do,
implement it, and
briefly discuss what you observe.
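To illustrate the "plan, implement, observe" pattern, here is a minimal sketch of three possible calculations. It uses the wine data set bundled with scikit-learn as a stand-in, which is an assumption of this sketch. The choice of calculations is only one option among many, and plots would work equally well.

```python
from sklearn import datasets

# Stand-in data set bundled with scikit-learn (an assumption of this sketch);
# for the lab, use data_transfusion.data and data_transfusion.target instead.
wine = datasets.load_wine(as_frame=True)
features, target = wine.data, wine.target

# 1. Class balance: what fraction of instances falls in each class?
class_fractions = target.value_counts(normalize=True).sort_index()
print(class_fractions)

# 2. Summary statistics (count, mean, std, quartiles) for every feature.
feature_summary = features.describe()
print(feature_summary)

# 3. Per-class feature means: do the classes differ in any feature?
class_means = features.groupby(target).mean()
print(class_means)
```

Each result should then be paired with a sentence or two on what it suggests, for example whether the classes are imbalanced or whether some feature separates the classes.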
// Write your answers to Part A2 here
# Add your implementation (3 calculations/plots) to Part A2 here
B. scikit-learn API basics
In this part we will run a classification task on the data sets we obtained above using scikit-learn.
The purpose of this part is merely to familiarize you with the scikit-learn API. You don't need to know how these classification methods work; we will revisit them later this semester.
Below, you will first find example code that runs Support Vector Classification (SVC) on the digits data.
Tasks and Questions:
Following the example code below, repeat the classification task but with RidgeClassifier and the data_transfusion data set.
When using the RidgeClassifier method on the data_transfusion data set, what are the values of recall, precision, specificity, and accuracy? Here you can treat "2" (donating blood) as "True" and "1" (not donating blood) as "False".
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier
# Hold out 25% of the digits data for testing; random_state fixes the split
train_features, test_features, train_target, test_target = train_test_split(
    data_digits.data, data_digits.target, test_size=0.25, random_state=123
)
# Fit a Support Vector Classifier with default settings on the training set
clf = SVC()
clf.fit(train_features, train_target)
test_predicted = clf.predict(test_features)
print("% of correct prediction on the test set")
print("Manual calculation =", np.count_nonzero(test_predicted == test_target) / len(test_target))
print("W/ .score() method =", clf.score(test_features, test_target))
% of correct prediction on the test set
Manual calculation = 0.9844444444444445
W/ .score() method = 0.9844444444444445
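The four metrics asked about in this part can all be derived from the counts of true/false positives and negatives. The sketch below shows one way to do that with `sklearn.metrics.confusion_matrix`. The label arrays are made up for illustration only and are not results from any data set in this lab.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only: "2" is the positive ("True") class
# and "1" is the negative ("False") class.
y_true = np.array(["2", "1", "1", "2", "1", "1", "2", "1"])
y_pred = np.array(["2", "1", "2", "1", "1", "1", "2", "1"])

# With labels=["1", "2"], rows are true classes and columns are predictions,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["1", "2"]).ravel()

recall = tp / (tp + fn)        # fraction of actual positives that were found
precision = tp / (tp + fp)     # fraction of predicted positives that are correct
specificity = tn / (tn + fp)   # fraction of actual negatives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("recall =", recall)
print("precision =", precision)
print("specificity =", specificity)
print("accuracy =", accuracy)
```

Passing `labels` explicitly keeps the ordering of the matrix unambiguous, which matters when the class labels are strings such as "1" and "2".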
# Add your implementation to Part B here
// Write your answers to Part B here
Tip
Submit your notebook
Follow these steps when you complete this lab and are ready to submit your work to Canvas:
Check that your text answers, plots, and code are all properly displayed in this notebook.
Run the cell below.
Download the resulting HTML file 11.html and then upload it to the corresponding assignment on Canvas.
!jupyter nbconvert --to html --embed-images 11.ipynb