Breast Cancer class prediction using Machine Learning
Contents
1. Goal
2. Data Examination
3. Algorithm Choice, development and code
4. Output
5. Final Results
Predicting the class of a patient's breast tumor as 'Benign' or 'Malignant': Univ. of Wisconsin & Univ. of California Irvine dataset
This is my own implementation of the KNN algorithm
To see the application where I have invoked the KNN algorithm using Scikit, click breast_cancer_knn.html
Goal: Given 569 records of patient data from breast tumor testing, with the class outcome values "Benign" or "Malignant", the requirement is to build a model that predicts the class of a new patient's tumor based on the recorded features.
Data Examination: The features or attributes are
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)
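For illustration, a single record under this schema could look like the following (the values here are hypothetical, not taken from the dataset); the last value is the class label:

# id, clump, unif_size, unif_shape, adhesion, epithelial, bare_nuclei, chromatin, nucleoli, mitoses, class
sample_record = [1234567, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2]   # class 2 => benign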
Algorithm Choice, development and code:
Since this is a multivariate feature set and we are aiming to predict a class label, i.e. we are doing classification, we will use the K-Nearest Neighbors algorithm, with Euclidean distance as the measure to determine nearness.
Euclidean Distance = √( Σᵢ₌₁ⁿ (qᵢ − pᵢ)² )
Exception handling, data loading and user input provisions have also been provided in this application. A short standalone sketch of the distance computation is shown below, followed by the full code:
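This is a minimal sketch of that distance computation, assuming two hypothetical 9-feature records (illustrative values only, not rows from the dataset):

import numpy as np

# Two hypothetical 9-feature records (illustrative values only)
p = np.array([5, 1, 1, 1, 2, 1, 3, 1, 1])
q = np.array([8, 10, 10, 8, 7, 10, 9, 7, 1])

# Same computation the application performs for every training record
euclidean_distance = np.sqrt(np.sum((q - p) ** 2))
print(euclidean_distance)   # smaller distance means the records are "nearer"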
# newknnedMbktest.py
# Author: Vijai Gandikoda
# Date: May 27, 2017
# Description: Breast Cancer Data, KNN algorithm not using sklearn libraries
# Import the necessary libraries
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
###############################################
def k_nearest_neighbors(features_data_set, predict_class_set, k=3):
    print("Value of k = %d and Length of dataset (k should be greater than this) = %d" % (k, len(features_data_set)))
    if len(features_data_set) >= k:
        warnings.warn('k is set to a value less than total number of classes!')
    distances = []
    for class_label in features_data_set:
        for features in features_data_set[class_label]:
            ed = np.sqrt(np.sum((np.array(features) - np.array(predict_class_set))**2))
            distances.append([ed, class_label])
    #print("distances = %s" % (distances))
    #print("sorted distances = %s" % (sorted(distances)))
    nearest_neighbors = []
    for j in sorted(distances)[:k]:
        nearest_neighbors.append(j[1])
    print("Nearest Neighbors = %s" % (nearest_neighbors))
    #This is a more advanced way of doing the same thing
    #nearest_neighbors2 = [i[1] for i in sorted(distances)[:k]]
    #print("nearest_neighbors2 = %s" % (nearest_neighbors2))
    print("Most frequent nearest neighbors = %s" % (Counter(nearest_neighbors)))
    print("Most common nearest neighbor and counts = %s" % (Counter(nearest_neighbors).most_common(1)))
    nearest_neighbor_candidate = Counter(nearest_neighbors).most_common(1)[0][0]
    total_count = 0
    for i in Counter(nearest_neighbors):
        class_label_count = Counter(nearest_neighbors)[i]
        total_count += class_label_count
    print("Total Count = %d" % (total_count))
    candidate_count = Counter(nearest_neighbors)[nearest_neighbor_candidate]
    print("candidate_count = %s" % (candidate_count))
    confidence = (candidate_count/total_count) * 100
    print("Confidence = %.2f %%" % (confidence))
#print("Classification Result = %s" % (nearest_neighbor_candidate))
nearest_neighbor_cancon = []
nearest_neighbor_cancon.append(nearest_neighbor_candidate)
nearest_neighbor_cancon.append(confidence)
#return nearest_neighbor_candidate
return nearest_neighbor_cancon
###############################################
print("\nImporting data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data\n")
try:
    bcDataFrame = pd.read_csv('./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data')
    print("Completed loading dataset")
except FileNotFoundError:
    print("Unable to load file")
# Now I am printing the columns
print("\n%s" % (bcDataFrame[1:1]))
print("\nReplacing missing values (values having '?') with -99999 in place in the dataframe without reassigning\n")
bcDataFrame.replace('?',-99999, inplace=True)
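# (Note: -99999 acts as an extreme outlier relative to the 1 - 10 feature range, so
# records with missing values are very unlikely to appear among the k nearest
# neighbors of any query point.)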
print("\nRemoving the ID column because it does not add any dependency to help with prediction")
bcDataFrame.drop(['id'], axis=1, inplace=True)
print("\nConverting all values to float for consistency")
#And creating a list of lists instead of a data frame
bcfList = bcDataFrame.astype(float).values.tolist()
print("\nTo improve the quality of model shuffling data and creating train and test sets")
#Instead of the following we could also use
#from sklearn import preprocessing, cross_validation
#X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
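# (In newer scikit-learn releases train_test_split lives in sklearn.model_selection:
#  from sklearn.model_selection import train_test_split)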
random.shuffle(bcfList)
print("20% data is set aside for testing")
test_size = 0.2
#Now we create the training and testing lists with the data for training and testing
#Syntax is list[:-x] means all data except the last x values. This is called list slicing
#So here we will assign to the train list everything except the number of values
#Needed for the testing purposes ie everything except 20% of the data
#In this case all the data is in bcfList and it has been randomly shuffled
bcfTrainList = bcfList[:-int(test_size*len(bcfList))]
bcfTestList = bcfList[-int(test_size*len(bcfList)):]
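#Illustration of the slicing above with hypothetical sizes: if len(bcfList) were 100
#and test_size = 0.2, then bcfList[:-20] keeps the first 80 rows for training and
#bcfList[-20:] keeps the last 20 rows for testing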
#Now to group the known data into classes so that we can then use it to assign a new
#instance to one of the classes we must determine what the unique class labels are
bcfClassLabelList = bcDataFrame['class'].tolist()
bcfUniqueClassLabelList = []
for i in bcfClassLabelList:
    if i not in bcfUniqueClassLabelList:
        bcfUniqueClassLabelList.append(i)
print("List of Classes = %s" % (bcfUniqueClassLabelList))
#Creating training set as a dictionary so that we can assign (or group)
#each type of training instance to its class which is the key in the dictionary
#In the breast-cancer-wisconsin data set we have two classes 2 and 4
#So we can create a training dictionary with two keys 2 and 4
#We are generalizing this to include any number of classes not just 2 and 4
print("Creating training data set")
bcfTraining_dict = {}
for i in bcfUniqueClassLabelList:
    bcfTraining_dict[i] = []
#bcfTraining_dict = {2:[], 4:[]}
print("Created bcfTraining_dict = %s" %(bcfTraining_dict))
#Similarly we create a training dictionary with two keys 2 and 4
#We are generalizing this to include any number of classes not just 2 and 4
print("Creating testing data set")
bcfTesting_dict = {}
for i in bcfUniqueClassLabelList:
    bcfTesting_dict[i] = []
#bcfTesting_dict = {2:[], 4:[]}
print("Created bcfTesting_dict = %s" %(bcfTesting_dict))
#Now we take each instance from the bcfTrainList and assign it to its respective class in
#the bcfTraining_dict
#That is we append everything except the last column i[:-1] to the dictionary element of
#that type i[-1]
for i in bcfTrainList:
    bcfTraining_dict[i[-1]].append(i[:-1])
print("Completed grouping training data by class")
#Similarly we do the same for the testing data. We append everything except the last
#column i[:-1] to the dictionary element of that type i[-1]
for i in bcfTestList:
    bcfTesting_dict[i[-1]].append(i[:-1])
print("Completed grouping testing data by class")
print("\nCommencing testing using testing data\n")
correctPredictions = 0
totalPredictions = 0
for classLabel in bcfTesting_dict:
    for data in bcfTesting_dict[classLabel]:
        classPredicted = k_nearest_neighbors(bcfTraining_dict, data, k=5)
        if classLabel == classPredicted[0]:
            correctPredictions += 1
        totalPredictions += 1
print("\nCompleted testing")
accuracy=(correctPredictions/totalPredictions) * 100
print('Accuracy=%s %%' % (accuracy))
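#Example of the accuracy arithmetic with hypothetical counts: 132 correct out of
#139 test predictions gives (132/139) * 100 = 94.96 %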
print("\n")
prediction_choice = input("Do you want to provide new data for classification? (y/n):")
if prediction_choice == 'y':
    filePath = input("Enter the path including filename of new data (/path/filename.ext):")
else:
    print("Thank you! Goodbye!")
    exit()
#new_instances = np.array(pd.read_csv('./datasets/breast-cancer-wisconsin/new_bc_data.data'))
#new_bc_data.data contains
#4,2,1,1,1,2,3,2,1
#4,1,1,1,1,2,2,2,1
#10,10,10,10,5,10,10,10,7
try:
    new_instances = np.array(pd.read_csv(filePath))
except FileNotFoundError:
    print("Unable to load data file. Exiting.")
    exit()
print("New Instances as input for prediction of class:\n %s\n" % (new_instances))
# By using len(new_instances) I can provide any number of new test samples
new_instances = new_instances.reshape(len(new_instances), -1)
new_instance_classes = []
print("\n======Predictions=========")
# Next I am going to use the k_nearest_neighbors function to classify the new instances
for data in new_instances:
    print(data)
    result = k_nearest_neighbors(bcfTraining_dict, data, k=5)
    print("Classification Result = %s" % (result[0]))
    new_instance_classes.append(result)
print("======Predictions Complete=========\n")
print("\n==?===�inal Results=========\n")
new_classgname = []
for0i in range(den(new_inqtan#e_classes)):
( if!nDw_i~stance_classes[i][0] == 2:
new_class_name.aPpene(["Benign",new_instcnceWclasses[i][1]])
"els�:
new_cla�s_namg.aprend(["Malig�ant",new_instance_classes[i][1]])
for i in$rcnge(len(new_class_namg)):
p print("Patient: %s, Data: %28s, Predictad�Classhfkcation: %9s Accuracy = %.5f %% ConfIdence = %.2f e%" % (i, new_instances[j], new_class_name[i][0] accuracy, new_class_name[i][1]))
And here is the output of running this. After training and testing, I have provided the user with the option to input a new data set, used the trained algorithm to classify whether their tumors are benign or malignant, and reported the accuracy and confidence level.
Here Accuracy is the percentage of test instances classified correctly, and Confidence is the percentage of the k nearest neighbors that voted for the predicted class.