         : - a generic data version of the above code (not for Breast Cancer data)
  • Breast Cancer class prediction using Machine Learning


    1. Goal
    2. Data Examination
    3. Algorithm Choice, development and code
    4. Output
    5. Final Results

    Predicting the class of a patient's breast tumor as 'Benign' or 'Malignant': Univ. of Wisconsin & Univ. of California Irvine dataset

    This is my own implementation of the KNN algorithm.

    To see the application where I have instead invoked the KNN algorithm from Scikit, click breast_cancer_knn.html

    Goal: Given 569 records of patient data on breast tumor testing, each with the class outcome value "Benign" or "Malignant", the requirement is to build a model that predicts the class of a new patient's tumor based on the recorded features.
    Data Examination: The features or attributes are:

    1. Sample code number (id number)
    2. Clump Thickness 1 - 10
    3. Uniformity of Cell Size 1 - 10
    4. Uniformity of Cell Shape 1 - 10
    5. Marginal Adhesion 1 - 10
    6. Single Epithelial Cell Size 1 - 10
    7. Bare Nuclei 1 - 10
    8. Bland Chromatin 1 - 10
    9. Normal Nucleoli 1 - 10
    10. Mitoses 1 - 10
    11. Class: (2 for benign, 4 for malignant)
    Algorithm Choice, development and code: Since this is a multivariate feature set and we are aiming to predict a class label (that is, we are doing classification), we will use the K-Nearest Neighbors algorithm.

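    Euclidean Distance is used as the measure to determine nearness.
    Euclidean Distance = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}, where q and p are the feature vectors of the two samples being compared and n is the number of features. Exception handling, data loading and user input provisions have also been provided in this application. The complete code is listed below.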
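    Before the full listing, here is a quick check of the distance formula as a minimal sketch using only the standard library. The function name euclidean_distance is an assumption made for illustration (the application computes the same quantity with NumPy), and the two sample rows are borrowed from the example inputs quoted in the code's comments.

```python
from math import sqrt


def euclidean_distance(q, p):
    """Return sqrt(sum((q_i - p_i)^2)) for two equal-length feature vectors."""
    if len(q) != len(p):
        raise ValueError("feature vectors must have the same length")
    return sqrt(sum((qi - pi) ** 2 for qi, pi in zip(q, p)))


# Two rows of the nine 1-10 features (id and class columns removed),
# taken from the example inputs quoted in the application's comments.
sample_a = [4, 2, 1, 1, 1, 2, 3, 2, 1]
sample_b = [10, 10, 10, 10, 5, 10, 10, 10, 7]

print(euclidean_distance(sample_a, sample_b))
```

    The smaller the printed distance, the "nearer" two patients are in feature space; KNN ranks all training rows by this value and keeps the k smallest.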
    Code Snippet

        #
        # Author: Vijai Gandikoda
        # Date: May 27, 2017
        # Description: Breast Cancer Data, KNN algorithm not using sklearn libraries

        # Import the necessary libraries
        import numpy as np
        from math import sqrt
        import matplotlib.pyplot as plt
        from matplotlib import style
        from collections import Counter
        import warnings
        import pandas as pd
        import random
        style.use('fivethirtyeight')

        ###############################################
        def k_nearest_neighbors(features_data_set, predict_class_set, k=3):
            # features_data_set is a dictionary {class_label: [feature rows]}
            print("Value of k = %d and Length of dataset (k should be greater than this)= %d" % (k, len(features_data_set)))
            if len(features_data_set) >= k:
                warnings.warn('k is set to a value less than total number of classes!')
            distances = []
            for class_label in features_data_set:
                for features in features_data_set[class_label]:
                    # Euclidean distance between a training row and the row to classify
                    ed = np.sqrt(np.sum((np.array(features) - np.array(predict_class_set))**2))
                    distances.append([ed, class_label])
            #print("distances = %s" % (distances))
            #print("sorted distances = %s" % (sorted(distances)))
            nearest_neighbors = []
            for j in sorted(distances)[:k]:
                nearest_neighbors.append(j[1])
            print("Nearest Neighbors= %s" % (nearest_neighbors))
            #This is a more advanced way of doing the same thing
            #nearest_neighbors2 = [i[1] for i in sorted(distances)[:k]]
            #print("nearestneighbors2 = %s" % (nearest_neighbors2))
            print("Most frequent nearest neighbors= %s" % (Counter(nearest_neighbors)))
            print("Most common nearest neighbor and counts = %s" % (Counter(nearest_neighbors).most_common(1)))
            nearest_neighbor_candidate = Counter(nearest_neighbors).most_common(1)[0][0]
            total_count = 0
            for i in Counter(nearest_neighbors):
                class_label_count = Counter(nearest_neighbors)[i]
                total_count += class_label_count
            print("Total Count = %d" % (total_count))
            candidate_count = Counter(nearest_neighbors)[nearest_neighbor_candidate]
            print("candidate_count = %s" % (candidate_count))
            confidence = (candidate_count/total_count) * 100
            print("Confidence = %.2f %%" % (confidence))
            #print("Classification Result = %s" % (nearest_neighbor_candidate))
            nearest_neighbor_cancon = []
            nearest_neighbor_cancon.append(nearest_neighbor_candidate)
            nearest_neighbor_cancon.append(confidence)
            #return nearest_neighbor_candidate
            return nearest_neighbor_cancon
        ###############################################

        print("\nImporting data from ./datasets/breast-cancer-wisconsin/\n")
        try:
            bcDataFrame = pd.read_csv('./datasets/breast-cancer-wisconsin/')
            print("Completed loading dataset")
        except FileNotFoundError:
            print("Unable to load file")
            exit()

        # Now I am printing the columns
        print("\n%s" % (bcDataFrame[1:1]))

        print("\nReplacing missing values (values having '?') with -99999 in place in the dataframe without reassigning\n")
        bcDataFrame.replace('?', -99999, inplace=True)

        print("\nRemoving the ID column because it does not add any dependency to help with prediction")
        bcDataFrame.drop(['id'], axis=1, inplace=True)

        print("\nConverting all values to float for consistency")
        #And creating a list of lists instead of a data frame
        bcfList = bcDataFrame.astype(float).values.tolist()

        print("\nTo improve the quality of model shuffling data and creating train and test sets")
        #Instead of the following we could also use
        #from sklearn.model_selection import train_test_split
        #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        random.shuffle(bcfList)
        print("20% data is set aside for testing")
        test_size = 0.2

        #Now we create the training and testing lists with the data for training and testing
        #Syntax is list[:-x] means all data except the last x values. This is called list slicing
        #So here we will assign to the train list everything except the number of values
        #Needed for the testing purposes ie everything except 20% of the data
        #In this case all the data is in bcfList and it has been randomly shuffled
        bcfTrainList = bcfList[:-int(test_size*len(bcfList))]
        bcfTestList = bcfList[-int(test_size*len(bcfList)):]

        #Now to group the known data into classes so that we can then use it to assign a new
        #instance to one of the classes we must determine what the unique class labels are
        bcfClassLabelList = bcDataFrame['class'].tolist()
        bcfUniqueClassLabelList = []
        for i in bcfClassLabelList:
            if i not in bcfUniqueClassLabelList:
                bcfUniqueClassLabelList.append(i)
        print("List of Classes = %s" % (bcfUniqueClassLabelList))

        #Creating training set as a dictionary so that we can assign (or group)
        #each type of training instance to its class which is the key in the dictionary
        #In the breast-cancer-wisconsin data set we have two classes 2 and 4
        #So we can create a training dictionary with two keys 2 and 4
        #We are generalizing this to include any number of classes not just 2 and 4
        print("Creating training data set")
        bcfTraining_dict = {}
        for i in bcfUniqueClassLabelList:
            bcfTraining_dict[i] = []
        #bcfTraining_dict = {2:[], 4:[]}
        print("Created bcfTraining_dict = %s" % (bcfTraining_dict))

        #Similarly we create a testing dictionary with two keys 2 and 4
        #We are generalizing this to include any number of classes not just 2 and 4
        print("Creating testing data set")
        bcfTesting_dict = {}
        for i in bcfUniqueClassLabelList:
            bcfTesting_dict[i] = []
        #bcfTesting_dict = {2:[], 4:[]}
        print("Created bcfTesting_dict = %s" % (bcfTesting_dict))

        #Now we take each instance from the bcfTrainList and assign it to its respective class in
        #the bcfTraining_dict
        #That is we append everything except the last column i[:-1] to the dictionary element of
        #that type i[-1]
        for i in bcfTrainList:
            bcfTraining_dict[i[-1]].append(i[:-1])
        print("Completed grouping training data by class")

        #Similarly we do the same for the testing data. We append everything except the last
        #column i[:-1] to the dictionary element of that type i[-1]
        for i in bcfTestList:
            bcfTesting_dict[i[-1]].append(i[:-1])
        print("Completed grouping testing data by class")

        print("\nCommencing testing using testing data\n")
        correctPredictions = 0
        totalPredictions = 0
        for classLabel in bcfTesting_dict:
            for data in bcfTesting_dict[classLabel]:
                classPredicted = k_nearest_neighbors(bcfTraining_dict, data, k=5)
                if classLabel == classPredicted[0]:
                    correctPredictions += 1
                totalPredictions += 1
        print("\nCompleted testing")
        accuracy = (correctPredictions/totalPredictions) * 100
        print('Accuracy=%s %%' % (accuracy))
        print("\n")

        prediction_choice = input("Do you want to provide new data for classification? (y/n):")
        if prediction_choice == 'y':
            filePath = input("Enter the path including filename of new data (/path/filename.ext):")
        else:
            print("Thank you! Goodbye!")
            exit()

        #new_instances = np.array(pd.read_csv('./datasets/breast-cancer-wisconsin/')) contains
        #4,2,1,1,1,2,3,2,1
        #4,1,1,1,1,2,2,2,1
        #10,10,10,10,5,10,10,10,7
        try:
            new_instances = np.array(pd.read_csv(filePath))
        except FileNotFoundError:
            print("Unable to load data file. Exiting.")
            exit()
        print("New Instances as input for prediction of class:\n %s\n" % (new_instances))

        # By using len(new_instances) I can provide any number of new test samples
        new_instances = new_instances.reshape(len(new_instances), -1)
        new_instance_classes = []
        print("\n======Predictions=========")
        # Next I am going to use the k_nearest_neighbors function to classify the new instances
        for data in new_instances:
            print(data)
            result = k_nearest_neighbors(bcfTraining_dict, data, k=5)
            print("Classification Result = %s" % (result[0]))
            new_instance_classes.append(result)
        print("======Predictions Complete=========\n")

        print("\n======Final Results=========\n")
        new_class_name = []
        for i in range(len(new_instance_classes)):
            if new_instance_classes[i][0] == 2:
                new_class_name.append(["Benign", new_instance_classes[i][1]])
            else:
                new_class_name.append(["Malignant", new_instance_classes[i][1]])
        for i in range(len(new_class_name)):
            print("Patient: %s, Data: %28s, Predicted Classification: %9s Accuracy = %.5f %% Confidence = %.2f %%" % (i, new_instances[i], new_class_name[i][0], accuracy, new_class_name[i][1]))
    And here is the output of running this. After training and testing I have provided the user with the option to input a new data set, used the trained algorithm to classify whether their tumors are benign or malignant, and specified the accuracy and confidence level.
    Here Accuracy is the % of test samples that were correctly classified, and Confidence is the number of nearest neighbors that match the final predicted class over the total number of neighbors considered, i.e. k.
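    The two measures just defined can be illustrated with a small sketch. The helper names vote_confidence and accuracy_percent and the label lists are hypothetical, chosen only for this example; they are not taken from the application, which computes the same ratios inline.

```python
from collections import Counter


def vote_confidence(neighbor_labels):
    """Return (winning label, confidence %) for the labels of the k nearest neighbors."""
    winner, votes = Counter(neighbor_labels).most_common(1)[0]
    # Confidence = neighbors agreeing with the winner / total neighbors (k)
    return winner, votes * 100 / len(neighbor_labels)


def accuracy_percent(true_labels, predicted_labels):
    """Return the % of test samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct * 100 / len(true_labels)


# Hypothetical labels: 2 = benign, 4 = malignant
print(vote_confidence([2, 2, 4, 2, 2]))            # (2, 80.0)
print(accuracy_percent([2, 4, 2, 4], [2, 4, 4, 4]))  # 75.0
```

    Note that Accuracy describes the model as a whole (one value from the test set), while Confidence is reported separately for each new patient.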


Get in touch

Click on one of the following to connect with me on social media