Code : knnclassify_bc.py
  • Breast Cancer class prediction using Machine Learning

    Predicting the class of a patient's breast tumor as 'Benign' or 'Malignant': University of Wisconsin dataset, via the UC Irvine Machine Learning Repository

    This version uses scikit-learn's neighbors module. For my own implementation of the KNN algorithm using a Euclidean distance measure, see breast_cancer_knn_vijai.html

    Contents

    1. Goal
    2. Data Examination
    3. Algorithm Choice, and code
    4. Output
    5. Final Results

    Goal: Given 699 records of patient data from breast tumor testing, each labeled with the class outcome "Benign" or "Malignant", the requirement is to build a model that predicts the class of a new patient's tumor from the recorded features.
    Data Examination: The features or attributes are listed below; a sketch of loading the data file with these column names follows the list.

       1. Sample code number            id number
       2. Clump Thickness               1 - 10
       3. Uniformity of Cell Size       1 - 10
       4. Uniformity of Cell Shape      1 - 10
       5. Marginal Adhesion             1 - 10
       6. Single Epithelial Cell Size   1 - 10
       7. Bare Nuclei                   1 - 10
       8. Bland Chromatin               1 - 10
       9. Normal Nucleoli               1 - 10
      10. Mitoses                       1 - 10
      11. Class                         2 for benign, 4 for malignant
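
    The raw file from the UCI repository ships without a header row. The copy used in the run below appears to already have one added (the plain pd.read_csv call in the code works as-is); starting from the raw UCI file, a minimal sketch for supplying the column names (taken from the dataframe printout in the Output section) would be:

    import pandas as pd

    # The raw UCI .data file has no header row, so we name the columns ourselves.
    # These names match the dataframe printout shown in the Output section.
    cols = ['id', 'clump_thickness', 'uniform_cell_size', 'uniform_cell_shape',
            'marginal_adhesion', 'single_epi_cell_size', 'bare_nuclei',
            'bland_chromation', 'normal_nucleoli', 'mitoses', 'class']
    bc = pd.read_csv('./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                     names=cols)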

    Algorithm choice and code: Since this is a multivariate feature set and we are predicting a class label (that is, doing classification), we will use the K-Nearest Neighbors algorithm.
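
    Before the scikit-learn version, here is a toy sketch of my own of the distance-and-vote core of KNN (plain numpy, not the library implementation used below):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=5):
        # Euclidean distance from the new instance to every known instance
        distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # Class labels of the k nearest known instances
        nearest_labels = y_train[np.argsort(distances)[:k]]
        # Majority vote among those k labels decides the predicted class
        return Counter(nearest_labels).most_common(1)[0][0]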

    Here I am using the K-Nearest Neighbors algorithm from Scikit-Learn. Here is the code:

    Code
    """
    Model Creation, Testing and Prediction of Breast Cancer Data using K Nearest Neighbor Algorithm
    
    INSTANCE BASED LEARNING
    
    In this type of machine learning algorithm, rather than constructing a set of rules as an intermediate stage,
    the instances (or features or experiments) are themselves directly employed. We don't infer a rule set or a
    decision tree. The work of classification is done at classification time, not when training is done.
    This can therefore be resource intensive, both in terms of speed and storage.
    
    Knowledge representation structures (like trees or rules) are not created in instance-based learning.
    
    Humans normally use something called 'rote learning', where we commit a set of learning examples
    to memory and group similar items together as a group or a class. Then, when a new example comes along,
    we classify it as one of the groups or classes depending on how closely it resembles a class.
    The K-Nearest Neighbor algorithm is one such approach: it calculates the distance from the new instance
    to the previously known instances and, depending on which class holds the majority among the k instances
    closest to the new instance, assigns that class.
    
    The distance calculated is a Euclidean distance measure. It can then be reported as a confidence measure.
    The requirement is that the attributes are normalized and of equal importance. When one attribute
    is deemed more important than another, a suitable weighting can be employed when calculating the distance
    measure.
    
    """
    
    print(__doc__)
    import numpy as np
    # Note: sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20;
    # train_test_split now lives in sklearn.model_selection (hence the DeprecationWarning in the output below)
    from sklearn import cross_validation, neighbors
    import pandas as pd
    
    
    
    print("\nWe are using the breast-cancer-wisconsin data set. I am reading a csv formatted file into a panda dataframe called bc.")
    
    bc = pd.read_csv('./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data')
    
    print("\nLoaded data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
    
    # On examination of the data set we find some missing data items which are marked as ?
    # We will replace them with -99999
    bc.replace('?',-99999, inplace=True)
    print("Handled missing data by replacing ?s with -99999")
    
    # Here the ID column does not impact how the individual experiments are classified so we will remove it
    bc.drop(['id'], axis=1, inplace=True)
    print("Dropped ID column as it doesn't contribute any useful information to help with model creation")
    
    # Printing an empty slice (bc[1:1]) shows just the column names
    print("\n%s" % (bc[1:1]))
    
    # Now I have to define my X and y labels
    
    # X is the features data, so I am assigning the entire data array but dropping the class column
    # Here I am using the pandas dataframe .drop function to drop a column. I specify that it's a column
    # by using axis=1, and I specify the column name "class"
    # the syntax is df = df.drop('column_name', axis=<0 for rows, 1 for columns>)
    # Also note that the numpy.array function converts other Python structures, in this case
    # a pandas dataframe, into a numpy array so I can use numpy processing functions on it
    
    print("\nCreated X features array having all columns except class")
    X = np.array(bc.drop(['class'], axis=1))
    
    print("Created y known class array by assigning it the class column")
    # Here I am just assigning the class column only to y
    y = np.array(bc['class'])
    
    # Now I create my training and test samples
    # I will use my training set to create my model
    # I will use my test set to test its accuracy of classification
    # For this I am using sklearn's cross_validation.train_test_split function
    # This function takes the following arguments:
    # *arrays : sequence of indexables with same length / shape[0]
    # Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
    # So for that we are using the features numpy arrays X, the class y
    # test_size : float, int, or None (default is None)
    # If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
    # If int, represents the absolute number of test samples.
    # If None, the value is automatically set to the complement of the train size.
    # If train size is also None, test size is set to 0.25.
    
    print("\nSplitting of data set into test and train sets completed")
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
    
    # Now considering that this is a multivariate data set and I need to classify one column I will use
    # an algorithm that is useful for such data sets and classifications : K Nearest Neighbors or KNN
    
    myClassifier = neighbors.KNeighborsClassifier()
    print("\nCreated KNN Classifier object")
    
    # Next I need to use the training data to train the classifier
    # It takes the X_train numpy array and the y_train numpy array to train the classifier
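    # For KNN there is no rule set or tree to build; in line with the instance-based
    # learning description above, fitting essentially stores the training instances
    # for use at prediction time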
    myClassifier.fit(X_train, y_train)
    print("\nTraining Complete")
    
    # Next I will run a test and score the accuracy using the test data that I set aside X_test and y_test
    
    print("Testing Accuracy")
    testAccuracy = myClassifier.score(X_test, y_test)
    
    print("\nAccuracy = %s\n" % (testAccuracy))
    
    # OK, now my model is trained and ready, and I can use it to classify new incoming data
    # Let's define some example new data
    # But the array has to be passed to predict as a 2D array, so we call .reshape on it below
    
    new_exp = np.array([[4,2,1,1,1,2,3,2,1],[4,1,1,1,1,2,2,2,1]])
    print("New Instances as input for prediction of class:\n %s" % (new_exp))
    
    # By using len(new_exp) I can provide any number of new test samples
    new_exp = new_exp.reshape(len(new_exp), -1)
    
    # Next I am going to use the predict function to classify the new instances
    new_class = myClassifier.predict(new_exp)
    
    print("\nPredicted Classes of the instances:%s \n" % (new_class))
    
    new_class_name = []
    for c in new_class:
        if c == 2:
            new_class_name.append("Benign")
        else:
            new_class_name.append("Malignant")
    
    for i, name in enumerate(new_class_name):
        print("Patient: %s, Predicted Classification: %s" % (i, name))
    

    And here is the output of running this. After training and testing, I provided data for two new patients and used the trained model to classify whether their tumors are benign or malignant.
    Here the new, unclassified data for prediction is hard-coded, but with a small modification it can also be loaded from a .csv file.
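
    For example, a minimal sketch of that modification, assuming a hypothetical new_patients.csv holding one row per patient with the same nine feature columns used in training (no id and no class column):

    import numpy as np
    import pandas as pd

    # new_patients.csv is a hypothetical file: one row per patient,
    # nine feature columns, no id and no class column
    new_df = pd.read_csv('./datasets/new_patients.csv')
    new_exp = np.array(new_df).reshape(len(new_df), -1)
    new_class = myClassifier.predict(new_exp)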
    Output
    
    $ python3.6 knnclassify_bc.py 
    
    Model Creation, Testing and Prediction of Breast Cancer Data using K Nearest Neighbor Algorithm
    
    INSTANCE BASED LEARNING
    
    In this type of machine learning algorithm, rather than constructing a set of rules as an intermediate stage,
    the instances (or features or experiments) are themselves directly employed. We don't infer a rule set or a
    decision tree. The work of classification is done at classification time, not when training is done.
    This can therefore be resource intensive, both in terms of speed and storage.
    
    Knowledge representation structures (like trees or rules) are not created in instance-based learning.
    
    Humans normally use something called 'rote learning', where we commit a set of learning examples
    to memory and group similar items together as a group or a class. Then, when a new example comes along,
    we classify it as one of the groups or classes depending on how closely it resembles a class.
    The K-Nearest Neighbor algorithm is one such approach: it calculates the distance from the new instance
    to the previously known instances and, depending on which class holds the majority among the k instances
    closest to the new instance, assigns that class.
    
    The distance calculated is a Euclidean distance measure. It can then be reported as a confidence measure.
    The requirement is that the attributes are normalized and of equal importance. When one attribute
    is deemed more important than another, a suitable weighting can be employed when calculating the distance
    measure.
    
    
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    
    We are using the breast-cancer-wisconsin data set. I am reading a csv formatted file into a pandas dataframe called bc.
    
    Loaded data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data
    Handled missing data by replacing ?s with -99999
    Dropped ID column as it doesn't contribute any useful information to help with model creation
    
    Empty DataFrame
    Columns: [clump_thickness, uniform_cell_size, uniform_cell_shape, marginal_adhesion, single_epi_cell_size, bare_nuclei, bland_chromation, normal_nucleoli, mitoses, class]
    Index: []
    
    Created X features array having all columns except class
    Created y known class array by assigning it the class column
    
    Splitting of data set into test and train sets completed
    
    Created KNN Classifier object
    
    Training Complete
    Testing Accuracy
    
    Accuracy = 0.957142857143
    
    New Instances as input for prediction of class:
     [[4 2 1 1 1 2 3 2 1]
     [4 1 1 1 1 2 2 2 1]]
    
    Predicted Classes of the instances:[2 2] 
    
    Patient: 0, Predicted Classification: Benign
    Patient: 1, Predicted Classification: Benign
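
    Note on the DeprecationWarning above: as the warning itself says, sklearn.cross_validation was folded into sklearn.model_selection and removed in scikit-learn 0.20. On a current scikit-learn the only change needed is the import, for example:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)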