Apr 28, 2026 | 1464 words | 15 min read
7.1.1. Task 1
Learning Objectives
Learn how a K-Nearest Neighbors (KNN) Classifier works and build one from scratch
Load a dataset and split it into training, validation, and testing sets
Understand what hyperparameters are and how to tune them using a validation set
Apply feature scaling (standardization) to improve model performance
Introduction to a KNN Classifier
A K-Nearest Neighbors (KNN) classifier is a simple machine learning algorithm used for classification tasks. It works by finding the ‘k’ closest examples in the ‘feature space’ of the training data to a new data point and assigning the most common class among those ‘k’ neighbors as the predicted class for that data point.
The distance between data points is typically measured using Euclidean distance, but other distance metrics can also be used.
Imagine distinguishing a tiger from a cat based on features like weight and height. Given a new animal’s weight and height, KNN looks at the ‘k’ closest animals in your training data (based on those two features) and predicts tiger or cat according to the majority class of those neighbors. For example, if the new animal weighs \(\qty{10}{\pound}\) and is \(\qty{1}{\foot}\) tall, KNN finds the ‘k’ closest animals in the memorized training data based on weight and height, and if most of those neighbors are cats, it predicts that the new animal is also a cat.
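The tiger-versus-cat idea above can be sketched in a few lines. This is only an illustration: the five training animals and their measurements are made-up values, and `training_animals`, `new_animal`, and `prediction` are hypothetical names, not part of the task.

```python
import math
from collections import Counter

# Hypothetical training data: (weight in pounds, height in feet, label).
training_animals = [
    (9.0, 0.9, "cat"),
    (11.0, 1.0, "cat"),
    (12.0, 1.1, "cat"),
    (300.0, 3.2, "tiger"),
    (420.0, 3.6, "tiger"),
]

new_animal = (10.0, 1.0)  # the 10 lb, 1 ft animal from the text
k = 3

# Euclidean distance from the new animal to every training animal.
distances = [
    (math.dist(new_animal, (weight, height)), label)
    for weight, height, label in training_animals
]

# Keep the k closest neighbors and take a majority vote over their labels.
nearest = sorted(distances)[:k]
votes = Counter(label for _, label in nearest)
prediction = votes.most_common(1)[0][0]

print(nearest)
print(prediction)
```

Note that the weight differences here are far larger than the height differences, so weight dominates the distance. That imbalance is one reason this task standardizes the features before running KNN.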
Task Instructions
Deliverable Reminder
Create a flowchart of your algorithm and save it as tp3_team_1_teamnumber.pdf. Save your program as tp3_team_1_teamnumber.py.
In this task, you’ll build a K-Nearest Neighbors (KNN) classifier. Your script will load a CSV dataset file that you created previously, split it into training, validation, and testing sets, and tune the model hyperparameters using the validation dataset. Remember to use the dataset containing the features you created in Section 6.2.3.
Note
For the functions described below, the order of arguments and outputs matters. Please ask TAs for guidance and explanation.
Step 1: Load Dataset Function
Create a function named load_dataset.
Arguments:
file_path (str): Path to the CSV file containing image features.
feature_cols (list of str): Names of columns to use as features.
label_col (str): Name of the column containing binary labels (\(0\) or \(1\)).
shuffle (bool, default True): Whether to shuffle the data.
seed (int, default \(42\)): Seed for reproducibility.
Returns:
X (2D NumPy array of floats, shape (n_samples, n_features)): Feature matrix.
y (1D NumPy array of ints, shape (n_samples,)): Label vector.
Note
You will need the scale_features function developed in
Section 6.2.3. Copy it into your program.
The function must load the CSV file using pandas.read_csv(). Then, scale the
specified features using the scale_features function you created in the
previous task.
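For readers who want a reminder of what standardization does, here is a minimal sketch. It is not the scale_features function from Section 6.2.3: `standardize_columns`, the toy column names, and the choice of the population standard deviation are all assumptions made for illustration. The sketch modifies the DataFrame in place because the snippet in this task calls scale_features(df, feature_cols) without using a return value.

```python
import pandas as pd


def standardize_columns(df, cols):
    """Hypothetical stand-in for scale_features: z-score the columns in place."""
    for col in cols:
        mean = df[col].mean()
        std = df[col].std(ddof=0)  # population standard deviation (an assumption)
        df[col] = (df[col] - mean) / std


# Toy data with two features on very different scales.
toy = pd.DataFrame(
    {"weight": [9.0, 11.0, 300.0, 420.0], "height": [0.9, 1.0, 3.2, 3.6]}
)
standardize_columns(toy, ["weight", "height"])

# After standardization each column has mean ~0 and standard deviation ~1.
print(toy.mean().round(6))
print(toy.std(ddof=0).round(6))
```

Once both columns are on the same scale, neither feature dominates the Euclidean distance simply because of its units.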
Next, extract the feature columns (feature_cols) into array X, and the label
column (label_col) into array y. To convert extracted subsets of a Pandas
DataFrame into a Numpy array, you can use the .to_numpy() method. If shuffle
is True, set the seed using np.random.seed(seed) and generate a
shuffled index using np.random.permutation(len(y)). You can use this shuffled
index to reorder X and y.
Code Snippet: The code snippet below demonstrates how to load the dataset. (We are giving it away, but please understand what it does!)
import numpy as np
import pandas as pd

def load_dataset(file_path, feature_cols, label_col, shuffle=True, seed=42):
    """
    Loads a dataset from a CSV file, separates features and labels,
    and optionally shuffles the data.
    """
    df = pd.read_csv(file_path)
    # Use the existing feature scaling function from Checkpoint 2 Task 3
    # (it must be copied into this file before load_dataset is called)
    scale_features(df, feature_cols)
    # Convert the selected DataFrame columns to NumPy arrays
    X = df[feature_cols].to_numpy()
    y = df[label_col].to_numpy()
    if shuffle:
        # Seed the generator so the same seed always gives the same order
        np.random.seed(seed)
        shuffled_indices = np.random.permutation(len(y))
        # Reorder X and y with the same indices so rows and labels stay paired
        X = X[shuffled_indices]
        y = y[shuffled_indices]
    return X, y
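To see why the same shuffled index is applied to both X and y, the following sketch uses a tiny made-up dataset in which the label of each row equals its row number. The data and variable names are hypothetical; only np.random.seed and np.random.permutation come from the task description.

```python
import numpy as np

# Toy data: the first feature of row i is i, and its "label" is also i,
# so it is easy to check that rows and labels stay paired after shuffling.
X = np.array([[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]])
y = np.array([0, 1, 2, 3])

np.random.seed(70)
order = np.random.permutation(len(y))
X_shuffled, y_shuffled = X[order], y[order]

# Each shuffled row still carries its original label.
print(np.array_equal(X_shuffled[:, 0].astype(int), y_shuffled))

# Re-seeding with the same value reproduces the same order.
np.random.seed(70)
print(np.array_equal(order, np.random.permutation(len(y))))
```

Shuffling X and y with two independent permutations would break the pairing between features and labels, which is why a single index array is reused.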
Step 2: Train-Validation-Test Split Function
Create a function named train_val_test_split that splits the data into
an \(80:10:10\) ratio for training, validation, and testing.
Note
What is the purpose of splitting the data?
The purpose of splitting the data is to allow for evaluating the performance of a machine learning model on unseen data.
The training set is used to train the model (or, in the case of KNN, to memorize the training data). The validation set is used to tune hyperparameters (such as the value of k). The test set is used to assess the final performance of the model after all tuning is complete. Keeping the three sets separate helps confirm that the model generalizes to new, unseen data and makes overfitting easier to detect.
Arguments:
X (NumPy array): an array of image features
y (NumPy array): an array of image labels
train_ratio (float): a decimal value indicating the proportion of data for training (default=\(0.8\))
val_ratio (float): a decimal value indicating the proportion of data for validation (default=\(0.1\))
test_ratio (float): a decimal value indicating the proportion of data for testing (default=\(0.1\))
Returns:
X_train (2D NumPy array of floats, shape (n_train, n_features)): Training features.
y_train (1D NumPy array of ints, shape (n_train,)): Training labels.
X_val (2D NumPy array of floats, shape (n_val, n_features)): Validation features.
y_val (1D NumPy array of ints, shape (n_val,)): Validation labels.
X_test (2D NumPy array of floats, shape (n_test, n_features)): Test features.
y_test (1D NumPy array of ints, shape (n_test,)): Test labels.
This function should ensure that the three ratios sum up to \(1\). Use the ratios to allocate rows of features and labels. For example, if the default ratios are used, allocate the first \(\qty{80}{\percent}\) of rows to the training set, the next \(\qty{10}{\percent}\) to the validation set, and the final \(\qty{10}{\percent}\) to the test set.
Note
Ensure you do not skip any rows and that the number of rows in train, validation, and test sum up to the number of rows in the original dataset.
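The row bookkeeping described above can be sketched on a toy array. This is one possible approach, not the required one: the 20-row dataset is made up, and using round() for the boundaries is an assumption (the expected rounding rule may matter for datasets whose size is not a multiple of 10). The ratio check uses math.isclose because sums of floats are not always exactly equal to 1.

```python
import math

import numpy as np

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

# Compare floats with a tolerance instead of ==.
if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
    raise ValueError("Ratios must sum to 1")

X = np.arange(40, dtype=float).reshape(20, 2)  # 20 toy rows, 2 features
y = np.arange(20) % 2                          # toy binary labels

n = len(y)
train_end = round(n * train_ratio)             # assumption: round to nearest row
val_end = train_end + round(n * val_ratio)

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]      # remainder, so no row is skipped

print(len(y_train), len(y_val), len(y_test))
```

Giving the test set whatever rows remain after the first two slices guarantees that the three sizes add up to the original number of rows.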
Step 3: KNN Individual Prediction
Create a function named knn_single_prediction that predicts the class
for a single data point given its features.
Arguments:
new_example (1D NumPy array of floats, shape (n_features,)): Features of one new image.
X_train (2D NumPy array of floats, shape (n_train, n_features)): Training feature matrix.
y_train (1D NumPy array of ints, shape (n_train,)): Training labels.
k (int): Number of nearest neighbors.
Returns:
predicted_label (int): \(0\) for deer warning sign, \(1\) for left turn ahead sign
The function should calculate the Euclidean distance from new_example to every
example in X_train using np.linalg.norm(). Find the k training
examples with the smallest distances (the “nearest neighbors”) to
new_example. Determine the most frequent label among those nearest
neighbors (this is the “majority vote”) and return it.
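As a rough illustration of the distance step only, np.linalg.norm can compute all distances at once when given axis=1. The four training points below are made up, and using np.argsort to pick the nearest rows is just one option among several.

```python
import numpy as np

# Hypothetical training features (4 examples, 2 features each).
X_train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]])
new_example = np.array([0.0, 0.0])

# Broadcasting subtracts new_example from every row;
# axis=1 then returns one Euclidean norm per row.
distances = np.linalg.norm(X_train - new_example, axis=1)
print(distances)

# Indices of the k smallest distances, closest first.
k = 2
nearest_indices = np.argsort(distances)[:k]
print(nearest_indices)
```

Computing every distance in a single vectorized call is usually much faster than calling np.linalg.norm inside a Python loop over the training rows.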
Note
We highly recommend asking a TA for help with this function as there are many ways to reduce the runtime of your code by making this function run efficiently.
Hints on sorting and keeping track of classes
Consider creating a list named distances that holds a tuple of (distance, ClassId) for each
training image. The list’s .sort() method sorts these tuples by distance,
because tuples are compared by their first element first.
You can then determine the prediction for the new example by selecting the majority class among the first k elements of the sorted distances list.
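The hint can be tried out on a handful of made-up (distance, ClassId) pairs. The helper name `majority_among_nearest` and the use of collections.Counter for the vote are assumptions for illustration; any way of counting labels works.

```python
from collections import Counter


def majority_among_nearest(pairs, k):
    """Hypothetical helper: majority ClassId among the k smallest distances."""
    nearest = sorted(pairs)[:k]  # tuples sort by distance first
    labels = [class_id for _, class_id in nearest]
    return Counter(labels).most_common(1)[0][0]


# Made-up (distance, ClassId) pairs for six training images.
distances = [(2.5, 1), (0.4, 0), (1.1, 1), (0.9, 1), (3.0, 0), (0.7, 0)]

print(majority_among_nearest(distances, 3))  # neighbors' labels: 0, 0, 1
print(majority_among_nearest(distances, 5))  # neighbors' labels: 0, 0, 1, 1, 1
```

With two classes and an even k, the vote can end in a tie. The task text does not say how to break ties, so check the expected behavior with a TA if it matters for your value of k.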
Step 4: KNN Prediction Function
Create a function named predict_labels_knn that predicts the class for
every example in a given dataset of features.
Arguments:
X_new (2D NumPy array of floats, shape (n_new, n_features)): New feature data to classify.
X_train (2D NumPy array of floats, shape (n_train, n_features)): Training feature matrix.
y_train (1D NumPy array of ints, shape (n_train,)): Training labels.
k (int): Number of nearest neighbors.
Returns:
predicted_labels (1D NumPy array of ints, shape (n_new,)): Predicted labels for rows of X_new.
The function should iterate over the rows in X_new and use
knn_single_prediction to get the predicted label of each row. Collect these
predictions and return them as an array.
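The row-by-row pattern might look like the sketch below. To keep it self-contained, it uses a hypothetical stand-in predictor, `toy_single_prediction`, rather than a real KNN prediction; in your program the call inside the loop would be to knn_single_prediction with its four arguments.

```python
import numpy as np


def toy_single_prediction(row):
    """Hypothetical stand-in for knn_single_prediction:
    predicts 1 when the first feature is positive, otherwise 0."""
    return int(row[0] > 0)


X_new = np.array([[0.5, 1.0], [-1.2, 0.3], [2.0, -0.7]])

# Iterating over a 2D array yields one row (a 1D array) at a time.
predicted_labels = np.array([toy_single_prediction(row) for row in X_new])

print(predicted_labels, predicted_labels.shape)
```

Collecting the predictions in a list and converting once at the end gives a 1D array with one label per row of X_new.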
Step 5: Metrics Function
Create a function named calculate_metrics that evaluates your model’s
predictions by returning the accuracy and error.
Arguments:
predicted_labels (1D NumPy array of ints, shape (n_samples,)): Predicted ClassId values.
true_labels (1D NumPy array of ints, shape (n_samples,)): Ground truth ClassId values.
Returns:
accuracy (float): the proportion of correct predictions
error (float): the proportion of incorrect predictions
The function should compare the predicted labels to the true labels. Accuracy is the fraction of correct predictions. The error rate is \(1 - \text{accuracy}\).
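As a small worked example of these two definitions, the sketch below compares eight made-up predictions with eight made-up true labels. Using np.mean on the element-wise comparison is one convenient way to get the fraction of matches; it is not the only valid approach.

```python
import numpy as np

# Made-up labels: 6 of the 8 predictions match the ground truth.
predicted_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])
true_labels = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# The comparison gives a boolean array; its mean is the fraction of True values.
accuracy = float(np.mean(predicted_labels == true_labels))
error = 1.0 - accuracy

print(f"{accuracy:.4f} {error:.4f}")
```

Here the accuracy is 6/8 and the error is 2/8, so the two values always add up to 1.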
Step 6: Main Function
Create a main function that collects the following inputs from the user:
The path of the CSV file containing the feature dataset
The choice for shuffling the dataset
The seed for shuffling the dataset (will be ignored if the user chooses not to shuffle)
The value of k to use
The function should then:
Load the dataset and shuffle (if requested by the user) using the function from Step 1: Load Dataset Function. You will need to pass in the feature column names and the label column. You can find the names of the columns by viewing the feature dataset CSV file in a text editor or spreadsheet program.
Split the dataset into train, validation, and test sets. A ‘set’ is a pair of X (the feature array) and the corresponding y (the labels). Use the function you created in Step 2: Train-Validation-Test Split Function.
Display the sizes of the Training, Validation, and Test sets.
Use the predict_labels_knn function to predict the labels of the training set samples using the value of k that the user specified. Then, calculate and display the accuracy and error on the training set using the calculate_metrics function.
Repeat the previous step for the validation and test sets.
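Two small pieces of the main function are easy to get wrong: turning the yes/no answer into a boolean and printing metrics with four decimal places. The sketch below shows one way to do both. The helper names `parse_yes_no` and `format_metrics` are hypothetical, and the exact line layout should be checked against the sample outputs, since their original whitespace may not be fully preserved on this page.

```python
def parse_yes_no(answer):
    """Hypothetical helper: map the user's 'yes'/'no' text to a bool."""
    return answer.strip().lower() == "yes"


def format_metrics(set_name, accuracy, error):
    """Hypothetical helper: build the three report lines for one set."""
    return [
        f"--- Model Performance on {set_name} Set ---",
        f"{set_name} Set Accuracy: {accuracy:.4f}",
        f"{set_name} Set Error Rate: {error:.4f}",
    ]


shuffle = parse_yes_no("no")

# Example: 102 correct out of 108 rounds to the 0.9444 shown in Case 1.
accuracy = 102 / 108
lines = format_metrics("Validation", accuracy, 1 - accuracy)
print("\n".join(lines))
```

The `:.4f` format specifier rounds to four decimal places and pads with zeros, which is what produces values such as 1.0000.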
Bonus
Why do you think the accuracy on the training set is 100% for k = 1?
Should we shuffle the dataset to judge the model’s performance? Why?
Ensure that your code output matches the sample outputs provided below.
| File Name | Description |
|---|---|
| | Image feature dataset (from previous checkpoint) |
Sample Output
Use the values in Table 7.3 below to test your program.
| Case | dataset | shuffle | seed | k |
|---|---|---|---|---|
| 1 | img_features.csv | no | 70 | 1 |
| 2 | img_features.csv | yes | 70 | 5 |
| 3 | img_features.csv | no | 70 | 5 |
| 4 | img_features.csv | no | 70 | 15 |
| 5 | img_features.csv | no | 70 | 50 |
Ensure your program’s output matches the provided samples exactly. This includes all characters, white space, and punctuation. In the samples, user input is highlighted like this for clarity, but your program should not highlight user input in this way.
Case 1 Sample Output
$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 1
Data loaded and split into
Training set: size: 864
Validation set: size: 108
Test set: size: 108
Evaluating KNN model on Training set with k = 1...
--- Model Performance on Training Set ---
Training Set Accuracy: 1.0000
Training Set Error Rate: 0.0000
Evaluating KNN model on validation set with k = 1...
--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.9444
Validation Set Error Rate: 0.0556
Evaluating KNN model on test set with k = 1...
--- Model Performance on Test Set ---
Test Set Accuracy: 0.9630
Test Set Error Rate: 0.0370
Case 2 Sample Output
$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): yes
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 5
Data loaded and split into
Training set: size: 864
Validation set: size: 108
Test set: size: 108
Evaluating KNN model on Training set with k = 5...
--- Model Performance on Training Set ---
Training Set Accuracy: 0.9965
Training Set Error Rate: 0.0035
Evaluating KNN model on validation set with k = 5...
--- Model Performance on Validation Set ---
Validation Set Accuracy: 1.0000
Validation Set Error Rate: 0.0000
Evaluating KNN model on test set with k = 5...
--- Model Performance on Test Set ---
Test Set Accuracy: 1.0000
Test Set Error Rate: 0.0000
Case 3 Sample Output
$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 5
Data loaded and split into
Training set: size: 864
Validation set: size: 108
Test set: size: 108
Evaluating KNN model on Training set with k = 5...
--- Model Performance on Training Set ---
Training Set Accuracy: 0.9942
Training Set Error Rate: 0.0058
Evaluating KNN model on validation set with k = 5...
--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.8796
Validation Set Error Rate: 0.1204
Evaluating KNN model on test set with k = 5...
--- Model Performance on Test Set ---
Test Set Accuracy: 0.9167
Test Set Error Rate: 0.0833
Case 4 Sample Output
$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 15
Data loaded and split into
Training set: size: 864
Validation set: size: 108
Test set: size: 108
Evaluating KNN model on Training set with k = 15...
--- Model Performance on Training Set ---
Training Set Accuracy: 0.9931
Training Set Error Rate: 0.0069
Evaluating KNN model on validation set with k = 15...
--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.8519
Validation Set Error Rate: 0.1481
Evaluating KNN model on test set with k = 15...
--- Model Performance on Test Set ---
Test Set Accuracy: 0.9074
Test Set Error Rate: 0.0926
Case 5 Sample Output
$ python3 tp3_team_1_teamnumber.py
Enter the path to the feature dataset: img_features.csv
Shuffle the dataset? (yes/no): no
Enter a seed for loading the dataset: 70
Enter the value of k to use for KNN: 50
Data loaded and split into
Training set: size: 864
Validation set: size: 108
Test set: size: 108
Evaluating KNN model on Training set with k = 50...
--- Model Performance on Training Set ---
Training Set Accuracy: 0.9711
Training Set Error Rate: 0.0289
Evaluating KNN model on validation set with k = 50...
--- Model Performance on Validation Set ---
Validation Set Accuracy: 0.8241
Validation Set Error Rate: 0.1759
Evaluating KNN model on test set with k = 50...
--- Model Performance on Test Set ---
Test Set Accuracy: 0.8889
Test Set Error Rate: 0.1111
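The unshuffled sample cases (1, 3, 4, and 5) also illustrate what tuning a hyperparameter with a validation set means: compare the validation accuracy for each candidate k and keep the best one. The sketch below simply copies the validation accuracies from those sample outputs into a dictionary; the variable names are hypothetical.

```python
# Validation accuracies reported in the unshuffled sample cases (1, 3, 4, 5).
validation_accuracy_by_k = {1: 0.9444, 5: 0.8796, 15: 0.8519, 50: 0.8241}

# Pick the k whose validation accuracy is highest.
best_k = max(validation_accuracy_by_k, key=validation_accuracy_by_k.get)

print(best_k)
```

Only after k has been chosen this way should the test set accuracy be read as the final estimate of performance, since the test set played no part in the choice.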
| Deliverables | Description |
|---|---|
| tp3_team_1_teamnumber.pdf | Flowchart(s) for this task. |
| tp3_team_1_teamnumber.py | Your completed Python code. |