Lab 11: Introduction to scikit-learn

For the class on Monday, March 3rd

# Run this cell to confirm that scikit-learn is properly installed
# scikit-learn version should be at least 1.6.0

import sklearn
print(sklearn.__version__)
1.6.1

A. Data Representation and Exploration

A1. Basic information about a data set

🚀 Tasks and Questions:

For each of the three datasets loaded below (data_digits, data_solarflare, data_transfusion), find out:

  • number of instances (samples)

  • number of features

  • number of targets

  • number of classes (for each target)

  • what question(s) this data set can help answer given the features and targets

from sklearn import datasets
# https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset
data_digits = datasets.load_digits(as_frame=True)
# https://openml.org/search?type=data&status=active&sort=runs&id=41489
data_solarflare = datasets.fetch_openml("sf2", version=2)
# https://openml.org/search?type=data&status=active&id=1464
data_transfusion = datasets.fetch_openml("blood-transfusion-service-center", version=1)
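
As a rough sketch of where these numbers live in a scikit-learn data object, the snippet below inspects a different bundled data set (iris, which is not one of the three above). It assumes the data were loaded as pandas objects (`as_frame=True`). The variable names are illustrative only.

```python
from sklearn import datasets

# Illustration on a different bundled data set (iris), not one of the lab's three.
iris = datasets.load_iris(as_frame=True)

# Rows are instances (samples); columns are features.
n_samples, n_features = iris.data.shape

# The target is a Series when there is one target,
# and a DataFrame (one column per target) when there are several.
target = iris.target
n_targets = 1 if target.ndim == 1 else target.shape[1]

# Number of distinct classes; for a DataFrame this gives one count per column.
n_classes = target.nunique()

print("instances:", n_samples)
print("features: ", n_features)
print("targets:  ", n_targets)
print("classes:  ", n_classes)

# The description text often states what the data set was collected for.
print(iris.DESCR[:300])
```

The same attributes (`.data`, `.target`, `.DESCR`) should be available on the objects returned by `fetch_openml`, although the target may have more than one column there.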

// Write your answers to Part A1 here

A2. Exploring a data set

Explore the data_transfusion data set a little bit.

Based on our discussion in class, do some calculations or make some plots to gain a better understanding of this data set.

🚀 Tasks and Questions:

  • Pick 3 calculations/plots to do/make. For each calculation or plot:

    • write down what you plan to do,

    • implement it, and

    • briefly discuss what you observe.
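
To give a sense of what such calculations might look like, here is a minimal sketch on the bundled iris data set (not data_transfusion). The three calculations chosen here are only examples; other choices may be more informative for your data.

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
df = iris.frame  # features and target together in one DataFrame

# 1. Summary statistics: range, center, and spread of every feature.
summary = df.drop(columns="target").describe()

# 2. Class balance: how many instances belong to each class.
class_counts = df["target"].value_counts().sort_index()

# 3. Per-class feature means: which features differ between classes.
class_means = df.groupby("target").mean()

print(summary)
print(class_counts)
print(class_means)
```

If matplotlib is installed, a histogram per feature (for example via `df.hist()`) is a natural visual counterpart to the summary statistics.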

// Write your answers to Part A2 here

# Add your implementation (3 calculations/plots) to Part A2 here

B. scikit-learn API basics

In this part we will run a classification task on the data sets we obtained above using scikit-learn.

The purpose of this part is only to familiarize you with the scikit-learn API. You don't need to know how these classification methods work; we will revisit them later this semester.

Below, you will first find example code that runs Support Vector Classification (SVC) on the digits data.

🚀 Tasks and Questions:

  1. Following the example code below, repeat the classification task but with RidgeClassifier and the data_transfusion data set.

  2. When using the RidgeClassifier method on the data_transfusion data set, what are the values of recall, precision, specificity, and accuracy? Here you can treat “2” (donating blood) as “True”, and “1” (not donating blood) as “False”.
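
As a reminder of how the four metrics relate to the confusion matrix, here is a small sketch on made-up labels (not on data_transfusion). It assumes the labels are the strings "1" and "2", with "2" as the positive class. Check the actual label type in your data before reusing it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only; "2" is positive, "1" is negative.
y_true = np.array(["2", "2", "2", "1", "1", "1", "1", "1", "1", "2"])
y_pred = np.array(["2", "1", "2", "1", "1", "2", "1", "1", "1", "1"])

# With labels=["1", "2"], the flattened 2x2 matrix reads: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["1", "2"]).ravel()

recall = tp / (tp + fn)            # fraction of actual positives that were found
precision = tp / (tp + fp)         # fraction of predicted positives that are correct
specificity = tn / (tn + fp)       # fraction of actual negatives that were found
accuracy = (tp + tn) / len(y_true) # fraction of all predictions that are correct

print("recall      =", recall)
print("precision   =", precision)
print("specificity =", specificity)
print("accuracy    =", accuracy)
```

Passing `labels` explicitly fixes the row and column order, so the unpacking into TN, FP, FN, TP does not depend on how the labels happen to sort.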

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier
train_features, test_features, train_target, test_target = train_test_split(data_digits.data, data_digits.target, test_size=0.25, random_state=123)
clf = SVC()
clf.fit(train_features, train_target)
test_predicted = clf.predict(test_features)
print("% of correct prediction on the test set")
print("Manual calculation =", np.count_nonzero(test_predicted == test_target) / len(test_target))
print("W/ .score() method =", clf.score(test_features, test_target))
% of correct prediction on the test set
Manual calculation = 0.9844444444444445
W/ .score() method = 0.9844444444444445
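
The example above uses three calls: the constructor, `.fit()`, and `.score()` (or `.predict()`). scikit-learn classifiers generally share this interface, so swapping the method usually means changing only the constructor. The sketch below illustrates this with a hypothetical helper, `fit_and_score`, and a second classifier (`KNeighborsClassifier`) chosen purely for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def fit_and_score(estimator, features, target, random_state=123):
    """Hypothetical helper: split, fit on the training part, score on the test part."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.25, random_state=random_state
    )
    estimator.fit(X_train, y_train)
    # For classifiers, .score() returns the fraction of correct predictions.
    return estimator.score(X_test, y_test)


digits = datasets.load_digits()

# The same helper works for both estimators because they share the same API.
scores = {}
for est in (SVC(), KNeighborsClassifier()):
    scores[type(est).__name__] = fit_and_score(est, digits.data, digits.target)

print(scores)
```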
# Add your implementation to Part B here

// Write your answers to Part B here

Tip

Submit your notebook

Follow these steps when you complete this lab and are ready to submit your work to Canvas:

  1. Check that your text answers, plots, and code are all properly displayed in this notebook.

  2. Run the cell below.

  3. Download the resulting HTML file 11.html and then upload it to the corresponding assignment on Canvas.

!jupyter nbconvert --to html --embed-images 11.ipynb