Task 1

7.1.1. Task 1#

Learning Objectives#

Learn how a K-Nearest Neighbors (KNN) Classifier works and build one from scratch
Load a dataset and split it into training, validation, and testing sets
Understand what hyperparameters are and how to tune them using a validation set
Apply feature scaling (standardization) to improve model performance.

Introduction to a KNN Classifier#

A K-Nearest Neighbors (KNN) classifier is a simple machine learning algorithm used for classification tasks. It works by finding the ‘k’ closest examples in the ‘feature space’ of the training data to a new data point and assigning the most common class among those ‘k’ neighbors as the predicted class for that data point.

The distance between data points is typically measured using Euclidean distance, but other distance metrics can also be used.

Imagine distinguishing a tiger from a cat based on features like weight and height. Given a new animal’s weight and height, KNN looks at the ‘k’ closest animals in the training data (based on those two features) and predicts tiger or cat according to the majority class of those neighbors. For example, if the new animal weighs 10 lb and is 1 ft tall, KNN finds the ‘k’ closest animals in the memorized training data by weight and height, and if most of those neighbors are cats, it predicts that the new animal is also a cat.
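The tiger-versus-cat idea can be sketched in a few lines of NumPy. This is only an illustration: the weights, heights, and the choice of k = 3 are made-up values, not data from this course.

```python
import numpy as np

# Made-up training data: [weight in lb, height in ft]. Illustrative values only.
X_train = np.array([
    [8.0, 0.8],     # cat
    [11.0, 1.0],    # cat
    [9.5, 0.9],     # cat
    [400.0, 3.5],   # tiger
    [450.0, 3.8],   # tiger
])
y_train = np.array(["cat", "cat", "cat", "tiger", "tiger"])

new_animal = np.array([10.0, 1.0])  # 10 lb, 1 ft tall
k = 3

# Euclidean distance from the new animal to every training example.
distances = np.sqrt(((X_train - new_animal) ** 2).sum(axis=1))

# Labels of the k closest training examples.
nearest_labels = y_train[np.argsort(distances)[:k]]

# Majority vote among those neighbors.
labels, counts = np.unique(nearest_labels, return_counts=True)
prediction = labels[np.argmax(counts)]
print(prediction)
```

Because weight spans a much larger numeric range than height here, weight dominates the distance. That is one reason the task asks you to scale features before running KNN.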

Task Instructions#

Deliverable Reminder

Create a flowchart of your algorithm and save it as tp3_team_1_teamnumber.pdf. Save your program as tp3_team_1_teamnumber.py.

In this task, you’ll build a K-Nearest Neighbors (KNN) classifier. Your script will load a CSV dataset file that you created previously, split it into training, validation, and testing sets, and tune the model hyperparameters using the validation dataset. Remember to use the dataset containing the features you created in Section 6.2.3.

Note

For the functions described below, the order of arguments and outputs matters. Please ask TAs for guidance and explanation.

Step 1: Load Dataset Function#

Create a function named load_dataset.

Arguments:

file_path (str): Path to the CSV file containing image features.
feature_cols (list of str): Names of columns to use as features.
label_col (str): Name of the column containing binary labels ($0$ or $1$).
shuffle (bool, default True): Whether to shuffle the data.
seed (int, default $42$): Seed for reproducibility.

Returns:

X (2D NumPy array of floats, shape (n_samples, n_features)): Feature matrix.
y (1D NumPy array of ints, shape (n_samples,)): Label vector.

Note

You will need the scale_features function developed in Section 6.2.3. Copy it into your program.

The function must load the CSV file using pandas.read_csv(). Then, scale the specified features using the scale_features function you created in the previous task.

Next, extract the feature columns (feature_cols) into array X, and the label column (label_col) into array y. To convert a subset of a pandas DataFrame into a NumPy array, use the .to_numpy() method. If shuffle is True, set the seed using np.random.seed(seed) and generate a shuffled index using np.random.permutation(len(y)). Use this same shuffled index to reorder both X and y so that each row of features stays paired with its label.

Code Snippet: The code snippet below demonstrates how to load the dataset. (We are giving it away, but please understand what it does!)

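import numpy as np
import pandas as pd


def load_dataset(file_path, feature_cols, label_col, shuffle=True, seed=42):
    """
    Loads a dataset from a CSV file, separates features and labels,
    and optionally shuffles the data.
    """
    df = pd.read_csv(file_path)

    # Scale the feature columns with your existing scale_features function
    # (copied into this program). The return value is not used here, so
    # scale_features is expected to modify df in place.
    scale_features(df, feature_cols)

    # Convert the selected columns to NumPy arrays
    X = df[feature_cols].to_numpy()
    y = df[label_col].to_numpy()

    if shuffle:
        # Seed the generator so the shuffled order is reproducible
        np.random.seed(seed)
        shuffled_indices = np.random.permutation(len(y))
        # Apply the same order to X and y so rows stay paired with labels
        X = X[shuffled_indices]
        y = y[shuffled_indices]

    return X, y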
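To see what the shuffling step does on its own, here is a small sketch using a toy array in place of the real feature dataset. The array values are invented for illustration. It shows that one permutation index keeps features and labels aligned, and that re-seeding with the same value reproduces the same order.

```python
import numpy as np

# Toy data (illustrative only): 4 rows, 2 features each.
# The first feature is the original row number, so pairing is easy to check.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 1, 0, 1])

np.random.seed(42)
shuffled_indices = np.random.permutation(len(y))

# Index both arrays with the SAME permutation.
X_shuffled = X[shuffled_indices]
y_shuffled = y[shuffled_indices]

# Re-seeding with the same value gives the same permutation again.
np.random.seed(42)
same_indices = np.random.permutation(len(y))

print(shuffled_indices)
print(np.array_equal(shuffled_indices, same_indices))
```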

Step 2: Train-Validation-Test Split Function#

Create a function named train_val_test_split that splits the data into an $80:10:10$ ratio for training, validation, and testing.

Note

What is the purpose of splitting the data?

The purpose of splitting the data is to allow for evaluating the performance of a machine learning model on unseen data.

The training set is used to train the model (or in the case of KNN, memorize the training data). The validation set is used to tune hyperparameters (such as the value of k). The test set is used to assess the final performance of the model after all tuning is complete. Keeping the test set separate gives an honest estimate of how well the model generalizes to new, unseen data and helps reveal overfitting.

Arguments:

X (NumPy array): an array of image features
y (NumPy array): an array of image labels
train_ratio (float): a decimal value indicating the proportion of data for training (default=$0.8$)
val_ratio (float): a decimal value indicating the proportion of data for validation (default=$0.1$)
test_ratio (float): a decimal value indicating the proportion of data for testing (default=$0.1$)

Returns:

X_train (2D NumPy array of floats shape (n_train, n_features)): Training features.
y_train (1D NumPy array of ints, shape (n_train,)): Training labels.
X_val (2D NumPy array of floats, shape (n_val, n_features)): Validation features.
y_val (1D NumPy array of ints, shape (n_val,)): Validation labels.
X_test (2D NumPy array of floats, shape (n_test, n_features)): Test features.
y_test (1D NumPy array of ints, shape (n_test,)): Test labels.

This function should ensure that the three ratios sum up to $1$. Use the ratios to allocate rows of features and labels. For example, if the default ratios are used, allocate the first 80% of rows to the training set, the next 10% to the validation set, and the final 10% to the test set.

Note

Ensure you do not skip any rows and that the number of rows in train, validation, and test sum up to the number of rows in the original dataset.
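One way to avoid skipped or double-counted rows is to compute the training and validation sizes from the ratios and give the test set whatever remains. The sketch below illustrates that arithmetic. The helper name split_sizes and the use of rounding are assumptions for illustration, since the task does not specify a rounding rule. It also compares the ratio sum with math.isclose rather than ==, because floating-point sums are not always exact.

```python
import math
import numpy as np


def split_sizes(n_samples, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    """Hypothetical helper: return (n_train, n_val, n_test) summing to n_samples."""
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("train, validation, and test ratios must sum to 1")
    n_train = int(round(train_ratio * n_samples))  # rounding rule is an assumption
    n_val = int(round(val_ratio * n_samples))
    n_test = n_samples - n_train - n_val           # remainder, so no row is lost
    return n_train, n_val, n_test


# Toy data (illustrative only): 10 rows, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10) % 2

n_train, n_val, n_test = split_sizes(len(y))

# Consecutive slices: each one starts where the previous one ended.
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]

print(len(X_train), len(X_val), len(X_test))
```

With 1080 rows (the total implied by the sample outputs, 864 + 108 + 108), the same arithmetic yields 864, 108, and 108.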

Step 3: KNN Individual Prediction#

Create a function named knn_single_prediction that predicts the class for a single data point given its features.

Arguments:

new_example (1D NumPy array of floats, shape (n_features,)): Features of one new image.
X_train (2D NumPy array of floats, shape (n_train, n_features)): Training feature matrix.
y_train (1D NumPy array of ints, shape (n_train,)): Training labels.
k (int): Number of nearest neighbors.

Returns:

predicted_label (int): $0$ for deer warning sign, $1$ for left turn ahead sign

The function should calculate the Euclidean distance from new_example to every example in X_train using np.linalg.norm(). Find the k training examples with the smallest distances to new_example (the “nearest neighbors”). Determine the most frequent label among those nearest neighbors (this is the “majority vote”). Return this label.

Note

We highly recommend asking a TA for help with this function as there are many ways to reduce the runtime of your code by making this function run efficiently.

Hints on sorting and keeping track of classes

Consider creating a list of tuples of the form (distance, ClassId), one for each training image. The list’s .sort() method sorts tuples by their first element, so this list will be ordered by distance.

You can then determine the prediction for the new example by selecting the majority class appearing in the first k elements of the sorted list.
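As an alternative to the list-of-tuples hint, the same steps can be written with vectorized NumPy calls. The sketch below is one possible approach, not the required solution. The function name, the toy data, and the tie-breaking behavior (the smallest label wins when vote counts tie) are assumptions of this example, and it relies on labels being non-negative integers such as 0 and 1.

```python
import numpy as np


def knn_single_prediction_sketch(new_example, X_train, y_train, k):
    # One Euclidean distance per training row (axis=1 takes the norm of each row).
    distances = np.linalg.norm(X_train - new_example, axis=1)
    # Indices of the k smallest distances; a stable sort keeps ties in row order.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Count how often each non-negative integer label occurs among the neighbors.
    votes = np.bincount(y_train[nearest])
    # argmax returns the smallest label if counts tie (an assumption of this sketch).
    return int(np.argmax(votes))


# Toy data (illustrative only): two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_single_prediction_sketch(np.array([0.2, 0.1]), X_train, y_train, k=3)
near_far = knn_single_prediction_sketch(np.array([5.5, 5.5]), X_train, y_train, k=3)
print(near_origin, near_far)
```

Computing all distances in a single np.linalg.norm call avoids a Python loop over training rows, which is usually where most of the runtime goes.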

Step 4: KNN Prediction Function#

Create a function named predict_labels_knn that predicts the class for every example in a given dataset of features.

Arguments:

X_new (2D NumPy array of floats, shape (n_new, n_features)): New feature data to classify.
X_train (2D NumPy array of floats, shape (n_train, n_features)): Training feature matrix.
y_train (1D NumPy array of ints, shape (n_train,)): Training labels.
k (int): Number of nearest neighbors.

Returns:

predicted_labels (1D NumPy array of ints, shape (n_new,)): Predicted labels for rows of X_new.

The function should iterate over the rows in X_new and use knn_single_prediction to get the predicted label of each row. Collect these predictions and return them as an array.
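The loop described above might look like the following sketch. To keep the example self-contained, it includes a compact stand-in for knn_single_prediction; your own Step 3 function would take its place. The toy data is invented for illustration.

```python
import numpy as np


def knn_single_prediction(new_example, X_train, y_train, k):
    # Compact stand-in for your Step 3 function (assumes labels are 0 or 1).
    distances = np.linalg.norm(X_train - new_example, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    return int(np.argmax(np.bincount(y_train[nearest])))


def predict_labels_knn(X_new, X_train, y_train, k):
    predictions = []
    # Iterating over a 2D array yields one row (one example) at a time.
    for row in X_new:
        predictions.append(knn_single_prediction(row, X_train, y_train, k))
    # Convert the collected predictions to a 1D integer array.
    return np.array(predictions, dtype=int)


# Toy data (illustrative only).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[0.1, 0.2], [5.2, 5.1], [0.0, 0.9]])

predicted_labels = predict_labels_knn(X_new, X_train, y_train, k=1)
print(predicted_labels)
```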

Step 5: Metrics Function#

Create a function named calculate_metrics that evaluates your model’s predictions by returning the accuracy and error.

Arguments:

predicted_labels (1D NumPy array of ints, shape (n_samples,)): Predicted ClassId values.
true_labels (1D NumPy array of ints, shape (n_samples,)): Ground truth ClassId values.

Returns:

accuracy (float): the proportion of correct predictions
error (float): the proportion of incorrect predictions

The function should compare the predicted labels to the true labels. Accuracy is the fraction of correct predictions. The error rate is $1 - \text{accuracy}$.
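A minimal sketch of this calculation, assuming both inputs are equal-length 1D arrays: comparing the arrays element by element gives an array of True/False values, and the mean of that array is the fraction of correct predictions. The example labels are invented.

```python
import numpy as np


def calculate_metrics(predicted_labels, true_labels):
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    # Element-wise comparison: True where the prediction is correct.
    correct = predicted_labels == true_labels
    # The mean of a boolean array is the fraction of True values.
    accuracy = float(np.mean(correct))
    error = 1.0 - accuracy
    return accuracy, error


# Toy labels (illustrative only): 3 of 4 predictions are correct.
accuracy, error = calculate_metrics(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))
print(accuracy, error)
```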

Step 6: Main Function#

Create a main function that collects the following inputs from the user:

The path of the CSV file containing the feature dataset
The choice for shuffling the dataset
The seed for shuffling the dataset (will be ignored if the user chooses not to shuffle)
The value of k to use

The function should then:

Load the dataset and shuffle (if requested by the user) using the function from Step 1: Load Dataset Function. You will need to pass in the feature column names and the label column. You can find the names of the columns by viewing the feature dataset CSV file in a text editor or spreadsheet program.
Split the dataset into train, validation and test sets. A ‘set’ is a pair of X (feature array) and corresponding y (the labels). Use the function you created in Step 2: Train-Validation-Test Split Function.
Display the sizes of the Training, Validation, and Test sets.
Use the predict_labels_knn function to predict the labels of the training set samples using the value of k that the user specified. Then, calculate and display the accuracy and error on the training set using the calculate_metrics function.
Repeat the previous step for the validation and test sets.
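The sample outputs report accuracy and error rate with four decimal places. One way to produce that format is an f-string with the .4f format specifier, as sketched below. The helper name format_metrics is hypothetical, and the counts (102 correct out of 108) are illustrative numbers chosen to be consistent with a displayed accuracy of 0.9444. The exact line breaks and spacing your program prints should follow the sample outputs.

```python
def format_metrics(set_name, accuracy, error):
    """Hypothetical helper: build the three report lines for one dataset."""
    return [
        f"--- Model Performance on {set_name} Set ---",
        f"{set_name} Set Accuracy: {accuracy:.4f}",    # .4f -> four decimal places
        f"{set_name} Set Error Rate: {error:.4f}",
    ]


# Illustrative values: 102 correct and 6 incorrect predictions out of 108.
lines = format_metrics("Validation", 102 / 108, 6 / 108)
for line in lines:
    print(line)
```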

Bonus

Why do you think the accuracy on the training set is 100% for k = 1?

Should we shuffle the dataset to judge the model’s performance? Why?

Ensure that your code output matches the sample outputs provided below.

Table 7.2 Feature Dataset#
File Name	Description
`img_features.csv`	Image feature dataset (from previous checkpoint)

Sample Output#

Use the values in Table 7.3 below to test your program.

Table 7.3 Test Cases#
Case	dataset	shuffle	seed	k
1	img_features.csv	no	70	1
2	img_features.csv	yes	70	5
3	img_features.csv	no	70	5
4	img_features.csv	no	70	15
5	img_features.csv	no	70	50

Ensure your program’s output matches the provided samples exactly. This includes all characters, white space, and punctuation. In the samples, user input is highlighted like this for clarity, but your program should not highlight user input in this way.

Case 1 Sample Output

$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 1

Data loaded and split into

Training set: size: 864
Validation set: size: 108
Test set: size: 108

Evaluating KNN model on Training set with k = 1...

--- Model Performance on Training Set ---
Training Set Accuracy: 1.0000
Training Set Error Rate: 0.0000

Evaluating KNN model on validation set with k = 1...

--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.9444
Validation Set Error Rate: 0.0556

Evaluating KNN model on test set with k = 1...

--- Model Performance on Test Set ---
Test Set Accuracy: 0.9630
Test Set Error Rate: 0.0370

Case 2 Sample Output

$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): yes
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 5

Data loaded and split into

Training set: size: 864
Validation set: size: 108
Test set: size: 108

Evaluating KNN model on Training set with k = 5...

--- Model Performance on Training Set ---
Training Set Accuracy: 0.9965
Training Set Error Rate: 0.0035

Evaluating KNN model on validation set with k = 5...

--- Model Performance on Validation Set ---
Validation Set Accuracy: 1.0000
Validation Set Error Rate: 0.0000

Evaluating KNN model on test set with k = 5...

--- Model Performance on Test Set ---
Test Set Accuracy: 1.0000
Test Set Error Rate: 0.0000

Case 3 Sample Output

$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 5

Data loaded and split into

Training set: size: 864
Validation set: size: 108
Test set: size: 108

Evaluating KNN model on Training set with k = 5...

--- Model Performance on Training Set ---
Training Set Accuracy: 0.9942
Training Set Error Rate: 0.0058

Evaluating KNN model on validation set with k = 5...

--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.8796
Validation Set Error Rate: 0.1204

Evaluating KNN model on test set with k = 5...

--- Model Performance on Test Set ---
Test Set Accuracy: 0.9167
Test Set Error Rate: 0.0833

Case 4 Sample Output

$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 15

Data loaded and split into

Training set: size: 864
Validation set: size: 108
Test set: size: 108

Evaluating KNN model on Training set with k = 15...

--- Model Performance on Training Set ---
Training Set Accuracy: 0.9931
Training Set Error Rate: 0.0069

Evaluating KNN model on validation set with k = 15...

--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.8519
Validation Set Error Rate: 0.1481

Evaluating KNN model on test set with k = 15...

--- Model Performance on Test Set ---
Test Set Accuracy: 0.9074
Test Set Error Rate: 0.0926

Case 5 Sample Output

$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 50

Data loaded and split into

Training set: size: 864
Validation set: size: 108
Test set: size: 108

Evaluating KNN model on Training set with k = 50...

--- Model Performance on Training Set ---
Training Set Accuracy: 0.9711
Training Set Error Rate: 0.0289

Evaluating KNN model on validation set with k = 50...

--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.8241
Validation Set Error Rate: 0.1759

Evaluating KNN model on test set with k = 50...

--- Model Performance on Test Set ---
Test Set Accuracy: 0.8889
Test Set Error Rate: 0.1111

Table 7.4 Deliverables#
Deliverables	Description
tp3_team_1_teamnumber.pdf	Flowchart(s) for this task.
tp3_team_1_teamnumber.py	Your completed Python code.
