PROJECT

kNN Classification for Breast Cancer Diagnosis

Built a custom k-nearest neighbors classifier to distinguish malignant and benign breast cancer samples using normalized diagnostic features.

Project Overview

This project applies a k-nearest neighbors classification workflow to the Wisconsin Diagnostic Breast Cancer dataset. The analysis focuses on building the model from the ground up, selecting an optimal k value, and evaluating performance using cross-validation, ROC/AUC, and final test-set metrics.

Key Results

95.6%

Final test accuracy using the selected k value.

0.979

AUC score on the test set, showing strong class separation.

k = 3

Optimal k selected using repeated 10-fold cross-validation.

Leak-Free Scaling

Z-score normalization was fit only on training data and applied to test data.
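As a minimal sketch of what leak-free scaling can look like in R, the snippet below learns the mean and standard deviation from training rows only and reuses them for the test rows. The toy matrices and variable names are hypothetical and are not taken from the project code.

```r
# Hypothetical toy data: rows are samples, columns are numeric features.
train_x <- matrix(c(1, 2, 3, 4,
                    10, 20, 30, 40), ncol = 2)  # filled column by column
test_x  <- matrix(c(2.5, 25), ncol = 2)

# Learn the scaling parameters from the training rows only.
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# Apply the same training-derived parameters to both sets.
train_z <- scale(train_x, center = train_mean, scale = train_sd)
test_z  <- scale(test_x,  center = train_mean, scale = train_sd)
```

Because the test row is never used to compute `train_mean` or `train_sd`, no information from the test set reaches the model before evaluation.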

Basic Workflow Diagram

What I Built

Implemented core kNN helper functions for distance calculation, neighbor selection, label prediction, and batch prediction.
Used z-score normalization based only on the training set to avoid data leakage.
Evaluated odd k values from 1 to 51 and confirmed model choice using repeated 10-fold cross-validation.
Calculated ROC curve and AUC to measure model discrimination beyond accuracy alone.
Reported final confusion-matrix metrics including true positives, true negatives, false positives, and false negatives.
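The helper functions listed above could be sketched roughly as follows. The function names, argument order, and toy data are assumptions made for illustration; the project's actual helpers may be organized differently.

```r
# Euclidean distance from one query row to every training row.
euclidean_distances <- function(train_x, query) {
  sqrt(rowSums(sweep(train_x, 2, query)^2))
}

# Indices of the k closest training rows.
nearest_neighbors <- function(train_x, query, k) {
  order(euclidean_distances(train_x, query))[seq_len(k)]
}

# Majority vote among the labels of the k nearest neighbors.
predict_one <- function(train_x, train_y, query, k) {
  votes <- table(train_y[nearest_neighbors(train_x, query, k)])
  names(votes)[which.max(votes)]
}

# Predict a label for every row of a test matrix.
predict_batch <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) predict_one(train_x, train_y, row, k))
}

# Hypothetical toy data: three benign-like points near the origin,
# three malignant-like points near (5, 5).
train_x <- matrix(c(0, 0,  0, 1,  1, 0,
                    5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- c("B", "B", "B", "M", "M", "M")
test_x  <- matrix(c(0.2, 0.1,
                    5.5, 5.5), ncol = 2, byrow = TRUE)

preds <- predict_batch(train_x, train_y, test_x, k = 3)
```

With two classes and an odd k, the vote cannot tie, which is one reason to restrict the search to odd k values.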

Visualization

Accuracy vs. k

Compared model performance across odd k values from 1 to 51.

ROC Curve

Evaluated classification performance using probability-like malignant-neighbor scores.
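One plausible reading of a "malignant-neighbor score" is the share of the k nearest neighbors labeled malignant; that interpretation is an assumption here. The sketch below computes such a score and an AUC in base R using the rank-sum identity, standing in for a package routine such as the one in pROC. All names and values are hypothetical.

```r
# Score = share of the k nearest neighbors labeled malignant
# (an assumed definition, not confirmed by the project write-up).
malignant_score <- function(neighbor_labels, positive = "M") {
  mean(neighbor_labels == positive)
}

# AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks.
auc_from_scores <- function(scores, is_positive) {
  r     <- rank(scores)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores for six test samples with k = 3.
scores <- c(1, 2/3, 2/3, 1/3, 0, 0)
truth  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)  # TRUE = malignant

auc <- auc_from_scores(scores, truth)
```

With k = 3 the score can take only four values (0, 1/3, 2/3, 1), so the resulting ROC curve has few distinct thresholds and ties are common.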

Feature Distributions

Reviewed numeric feature distributions before model training.

Final Prediction Summary

	Predicted Malignant	Predicted Benign
Actual Malignant	49	4
Actual Benign	1	60

Accuracy: 95.6%; AUC: 0.979; Optimal k: 3
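The reported accuracy can be checked directly from the confusion matrix above. The short sketch below rebuilds the table in R and derives accuracy, sensitivity, and specificity, assuming malignant is treated as the positive class.

```r
# Confusion matrix from the summary table (rows = actual, columns = predicted).
cm <- matrix(c(49,  4,
                1, 60), nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Malignant", "Benign"),
                             predicted = c("Malignant", "Benign")))

# Malignant is assumed to be the positive class.
tp <- cm["Malignant", "Malignant"]
fn <- cm["Malignant", "Benign"]
fp <- cm["Benign", "Malignant"]
tn <- cm["Benign", "Benign"]

accuracy    <- (tp + tn) / sum(cm)   # 109 of 114 samples correct
sensitivity <- tp / (tp + fn)        # share of malignant cases detected
specificity <- tn / (tn + fp)        # share of benign cases cleared

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

Of the 5 errors, 4 are malignant samples predicted as benign, so sensitivity (49 of 53) is lower than specificity (60 of 61).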

Modeling Details

Distance Metric

Euclidean distance was used to measure similarity between normalized samples.

Target Classes

Diagnosis was modeled as a two-class outcome: malignant and benign.

Model Selection

Repeated 10-fold cross-validation was used to select k, giving a more reliable choice than picking the k with the highest accuracy on a single split.
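A rough sketch of repeated 10-fold cross-validation over odd k values is shown below. The simulated data, the number of repeats (3), the seed, and all function names are assumptions for illustration. Refitting the z-score scaling inside each fold is omitted for brevity.

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Hypothetical two-class toy data standing in for the normalized features.
n <- 100
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 3), ncol = 2))
y <- rep(c("B", "M"), each = n / 2)

# Minimal kNN prediction for one query row (Euclidean distance, majority vote).
knn_predict <- function(train_x, train_y, query, k) {
  d     <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  votes <- table(train_y[order(d)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Mean held-out accuracy for one k over `repeats` random 10-fold partitions.
repeated_cv_accuracy <- function(x, y, k, folds = 10, repeats = 3) {
  acc <- numeric(0)
  for (r in seq_len(repeats)) {
    # Fresh random fold assignment for each repeat.
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
    for (f in seq_len(folds)) {
      held_out <- fold_id == f
      train_x  <- x[!held_out, , drop = FALSE]
      train_y  <- y[!held_out]
      pred <- apply(x[held_out, , drop = FALSE], 1,
                    function(row) knn_predict(train_x, train_y, row, k))
      acc <- c(acc, mean(pred == y[held_out]))
    }
  }
  mean(acc)
}

k_grid <- seq(1, 51, by = 2)  # odd k from 1 to 51, as in the project
cv_acc <- sapply(k_grid, function(k) repeated_cv_accuracy(x, y, k))
best_k <- k_grid[which.max(cv_acc)]
```

Averaging over several random partitions reduces the dependence of the chosen k on any one way of splitting the data.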

Evaluation

Performance was evaluated using accuracy, confusion-matrix counts, ROC curve, and AUC.

Interpretation

The final model classified the test set well, with 95.6% accuracy and an AUC of 0.979. Repeated 10-fold cross-validation selected k = 3, a more reliable choice than simply taking the k with the highest test-set accuracy from a single split. Of the 5 misclassified test samples, 4 were malignant cases predicted as benign and 1 was a benign case predicted as malignant. The project covers the full machine learning workflow, from preprocessing and model construction to validation and performance interpretation.

TOOLS USED

Programming & Analysis

R, R Markdown

Machine Learning & Evaluation

caret, pROC, Custom kNN, Cross-Validation, ROC / AUC

Data Handling & Reporting

dplyr, knitr, Z-score Normalization, Train/Test Split
