Breast Cancer class prediction using Machine Learning
Contents
1. Goal
2. Data Examination
3. Algorithm Choice, development and code
4. Output
5. Final Results
Predicting the class of a patient's breast tumor as 'Benign' or 'Malignant': Univ. of Wisconsin & Univ. of California Irvine dataset
This is my own implementation of the KNN algorithm
To see the application where I have invoked the KNN algorithm using Scikit, click breast_cancer_knn.html
Goal: Given 569 records of patient data from breast tumor testing, with the class outcome values "Benign" or "Malignant", the requirement is to build a model that predicts the class of a new patient's tumor based on the recorded features.
Data Examination: The features or attributes are
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)
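For illustration, a single record under this schema could look like the following (the values here are hypothetical, not taken from the dataset); the last value is the class label:

# id, clump, unif_size, unif_shape, adhesion, epithelial, bare_nuclei, chromatin, nucleoli, mitoses, class
sample_record = [1234567, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2]   # class 2 => benign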
Algorithm Choice, development and code:
Since this is a multivariate feature set and we are aiming to predict a class label, i.e. we are doing classification, we will use the K-Nearest Neighbors algorithm, with Euclidean distance as the measure to determine nearness.
Euclidean Distance = √( Σᵢ₌₁ⁿ (qᵢ − pᵢ)² )
Exception handling, data loading and user input provisions have also been provided in this application. A short standalone sketch of the distance computation is shown below, followed by the full code:
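This is a minimal sketch of that distance computation, assuming two hypothetical 9-feature records (illustrative values only, not rows from the dataset):

import numpy as np

# Two hypothetical 9-feature records (illustrative values only)
p = np.array([5, 1, 1, 1, 2, 1, 3, 1, 1])
q = np.array([8, 10, 10, 8, 7, 10, 9, 7, 1])

# Same computation the application performs for every training record
euclidean_distance = np.sqrt(np.sum((q - p) ** 2))
print(euclidean_distance)   # smaller distance means the records are "nearer"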
# newknnedMbktest.py
# Author: Vijai Gandikoda
# Date: May 27, 2017
# Description: Breast Cancer Data, KNN algorithm not using sklearn libraries
# Import the necessary libraries
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
###############################################
def k_nearest_neighbors(features_data_set, predict_class_set, k=3):
    print("Value of k = %d and Length of dataset (k should be greater than this) = %d" % (k, len(features_data_set)))
    if len(features_data_set) >= k:
        warnings.warn('k is set to a value less than total number of classes!')
    distances = []
    for class_label in features_data_set:
        for features in features_data_set[class_label]:
            ed = np.sqrt(np.sum((np.array(features) - np.array(predict_class_set))**2))
            distances.append([ed, class_label])
    #print("distances = %s" % (distances))
    #print("sorted distances = %s" % (sorted(distances)))
    nearest_neighbors = []
    for j in sorted(distances)[:k]:
        nearest_neighbors.append(j[1])
    print("Nearest Neighbors = %s" % (nearest_neighbors))
    #This is a more advanced way of doing the same thing
    #nearest_neighbors2 = [i[1] for i in sorted(distances)[:k]]
    #print("nearest_neighbors2 = %s" % (nearest_neighbors2))
    print("Most frequent nearest neighbors = %s" % (Counter(nearest_neighbors)))
    print("Most common nearest neighbor and counts = %s" % (Counter(nearest_neighbors).most_common(1)))
    nearest_neighbor_candidate = Counter(nearest_neighbors).most_common(1)[0][0]
    total_count = 0
    for i in Counter(nearest_neighbors):
        class_label_count = Counter(nearest_neighbors)[i]
        total_count += class_label_count
    print("Total Count = %d" % (total_count))
    candidate_count = Counter(nearest_neighbors)[nearest_neighbor_candidate]
    print("candidate_count = %s" % (candidate_count))
    confidence = (candidate_count/total_count) * 100
    print("Confidence = %.2f %%" % (confidence))
#print("Classification Result = %s" % (nearest_neighbor_candidate))
nearest_neighbor_cancon = []
nearest_neighbor_cancon.append(nearest_neighbor_candidate)
nearest_neighbor_cancon.append(confidence)
#return nearest_neighbor_candidate
return nearest_neighbor_cancon
###############################################
print("\nImporting data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data\n")
try:
    bcDataFrame = pd.read_csv('./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data')
    print("Completed loading dataset")
except FileNotFoundError:
    print("Unable to load file")
# Now I am printing the columns
print("\n%s" % (bcDataFrame[1:1]))
print("\nReplacing missing values (values having '?') with -99999 in place in the dataframe without reassigning\n")
bcDataFrame.replace('?',-99999, inplace=True)
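# (Note: -99999 acts as an extreme outlier relative to the 1 - 10 feature range, so
# records with missing values are very unlikely to appear among the k nearest
# neighbors of any query point.)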
print("\nRemoving the ID column because it does not add any dependency to help with prediction")
bcDataFrame.drop(['id'], axis=1, inplace=True)
print("\nConverting all values to float for consistency")
#And creating a list of lists instead of a data frame
bcfList = bcDataFrame.astype(float).values.tolist()
print("\nTo improve the quality of model shuffling data and creating train and test sets")
#Instead of the following we could also use
#from sklearn import preprocessing, cross_validation
#X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
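# (In newer scikit-learn releases train_test_split lives in sklearn.model_selection:
#  from sklearn.model_selection import train_test_split)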
random.shuffle(bcfList)
print("20% data is set aside for testing")
test_size = 0.2
#Now we create the training and testing lists with the data for training and testing
#Syntax is list[:-x] means all data except the last x values. This is called list slicing
#So here we will assign to the train list everything except the number of values
#Needed for the testing purposes ie everything except 20% of the data
#In this case all the data is in bcfList and it has been randomly shuffled
bcfTrainList = bcfList[:-int(test_size*len(bcfList))]
bcfTestList = bcfList[-int(test_size*len(bcfList)):]
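#Illustration of the slicing above with hypothetical sizes: if len(bcfList) were 100
#and test_size = 0.2, then bcfList[:-20] keeps the first 80 rows for training and
#bcfList[-20:] keeps the last 20 rows for testing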
#Now to group the known data into classes so that we can then use it to assign a new
#instance to one of the classes we must determine what the unique class labels are
bcfClassLabelList = bcDataFrame['class'].tolist()
bcfUniqueClassLabelList = []
for i in bcfClassLabelList:
    if i not in bcfUniqueClassLabelList:
        bcfUniqueClassLabelList.append(i)
print("List of Classes = %s" % (bcfUniqueClassLabelList))
#Creating training set as a dictionary so that we can assign (or group)
#each type of training instance to its class which is the key in the dictionary
#In the breast-cancer-wisconsin data set we have two classes 2 and 4
#So we can create a training dictionary with two keys 2 and 4
#We are generalizing this to include any number of classes not just 2 and 4
print("Creating training data set")
bcfTraining_dict = {}
for i in bcfUniqueClassLabelList:
    bcfTraining_dict[i] = []
#bcfTraining_dict = {2:[], 4:[]}
print("Created bcfTraining_dict = %s" %(bcfTraining_dict))
#Similarly we create a training dictionary with two keys 2 and 4
#We are generalizing this to include any number of classes not just 2 and 4
print("Creating testing data set")
bcfTesting_dict = {}
for i in bcfUniqueClassLabelList:
    bcfTesting_dict[i] = []
#bcfTesting_dict = {2:[], 4:[]}
print("Created bcfTesting_dict = %s" %(bcfTesting_dict))
#Now we take each instance from the bcfTrainList and assign it to its respective class in
#the bcfTraining_dict
#That is we append everything except the last column i[:-1] to the dictionary element of
#that type i[-1]
for i in bcfTrainList:
    bcfTraining_dict[i[-1]].append(i[:-1])
print("Completed grouping training data by class")
#Similarly we do the same for the testing data. We append everything except the last
#column i[:-1] to the dictionary element of that type i[-1]
for i in bcfTestList:
    bcfTesting_dict[i[-1]].append(i[:-1])
print("Completed grouping testing data by class")
print("\nCommencing testing using testing data\n")
correctPredictions = 0
totalPredictions = 0
for classLabel in bcfTesting_dict:
    for data in bcfTesting_dict[classLabel]:
        classPredicted = k_nearest_neighbors(bcfTraining_dict, data, k=5)
        if classLabel == classPredicted[0]:
            correctPredictions += 1
        totalPredictions += 1
print("\nCompleted testing")
accuracy=(correctPredictions/totalPredictions) * 100
print('Accuracy=%s %%' % (accuracy))
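#Example of the accuracy arithmetic with hypothetical counts: 132 correct out of
#139 test predictions gives (132/139) * 100 = 94.96 %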
print("\n")
prediction_choice = input("Do you want to provide new data for classification? (y/n):")
if prediction_choice == 'y':
    filePath = input("Enter the path including filename of new data (/path/filename.ext):")
else:
    print("Thank you! Goodbye!")
    exit()
#new_instances = np.array(pd.read_csv('./datasets/breast-cancer-wisconsin/new_bc_data.data'))
#new_bc_data.data contains
#4,2,1,1,1,2,3,2,1
#4,1,1,1,1,2,2,2,1
#10,10,10,10,5,10,10,10,7
try:
    new_instances = np.array(pd.read_csv(filePath))
except FileNotFoundError:
    print("Unable to load data file. Exiting.")
    exit()
print("New Instances as input for prediction of class:\n %s\n" % (new_instances))
# By using len(new_instances) I can provide any number of new test samples
new_instances = new_instances.reshape(len(new_instances), -1)
new_instance_classes = []
print("\n======Predictions=========")
# Next I am going to use the k_nearest_neighbors function to classify the new instances
for data in new_instances:
    print(data)
    result = k_nearest_neighbors(bcfTraining_dict, data, k=5)
    print("Classification Result = %s" % (result[0]))
    new_instance_classes.append(result)
print("======Predictions Complete=========\n")
print("\n==?===�inal Results=========\n")
new_classgname = []
for0i in range(den(new_inqtan#e_classes)):
( if!nDw_i~stance_classes[i][0] == 2:
new_class_name.aPpene(["Benign",new_instcnceWclasses[i][1]])
"els�:
new_cla�s_namg.aprend(["Malig�ant",new_instance_classes[i][1]])
for i in$rcnge(len(new_class_namg)):
p print("Patient: %s, Data: %28s, Predictad�Classhfkcation: %9s Accuracy = %.5f %% ConfIdence = %.2f e%" % (i, new_instances[j], new_class_name[i][0] accuracy, new_class_name[i][1]))
And here is the output of running this. After training and testing, I have provided the user with the option to input a new data set, used the trained algorithm to classify whether their tumors are benign or malignant, and reported the accuracy and confidence level.
Here Accuracy is the percentage of test instances classified correctly, and Confidence is the percentage of the k nearest neighbors that voted for the predicted class.