Code: treeclassify_bc.py
  • Breast Cancer class prediction using Machine Learning

    Contents

    1. Goal
    2. Data Examination
    3. Algorithm Choice, development and code
    4. Decision Tree Generated
    5. Output
    6. Final Results

    Predicting the class of a patient's breast tumor as 'Benign' or 'Malignant': Univ. of Wisconsin & Univ. of California Irvine dataset

    This uses the decision tree algorithm for prediction.

    Goal: Given 569 records of patient breast tumor test data, each labeled with the class outcome "Benign" or "Malignant", build a model that predicts the class of a new patient's tumor from the recorded features.
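    Data Examination: The code below calls load_breast_cancer(), which loads the Diagnostic version of the Wisconsin dataset: 569 records with 30 real-valued features, and a class encoded as 0 (malignant) or 1 (benign). For reference, the features or attributes of the original Wisconsin Breast Cancer dataset are

    1. Sample code number (id number)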
    2. Clump Thickness 1 - 10
    3. Uniformity of Cell Size 1 - 10
    4. Uniformity of Cell Shape 1 - 10
    5. Marginal Adhesion 1 - 10
    6. Single Epithelial Cell Size 1 - 10
    7. Bare Nuclei 1 - 10
    8. Bland Chromatin 1 - 10
    9. Normal Nucleoli 1 - 10
    10. Mitoses 1 - 10
    11. Class: (2 for benign, 4 for malignant)
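A quick way to examine the data is to inspect what load_breast_cancer() actually returns. The following is a minimal sketch, assuming scikit-learn is installed; it only prints the number of records and features, a few feature names, the class labels and the per-class record counts.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

bc_data = load_breast_cancer()

# Number of records and number of features per record
n_records, n_features = bc_data.data.shape
print("records:", n_records, "features:", n_features)

# Names of the first five features and the class labels
print("first features:", list(bc_data.feature_names[:5]))
print("classes:", list(bc_data.target_names))

# How many records fall in each class, keyed by class name
class_counts = Counter(str(bc_data.target_names[t]) for t in bc_data.target)
print("class counts:", dict(class_counts))
```

Checking the class counts before training shows whether the two classes are balanced, which matters when reading an accuracy figure later.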
    Algorithm Choice, development and code: Since the data has multiple features and the aim is to predict a class label, this is a classification problem, and we will use the Decision Tree algorithm.
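To make the choice more concrete, a decision tree classifies by repeatedly picking a feature and a threshold that separate the classes as cleanly as possible. The sketch below illustrates that idea for a single feature using Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The function names and the toy data are hypothetical and exist only for illustration; real implementations differ in details such as where between two observed values the threshold is placed.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    Returns (None, gini(labels)) when no split improves on the unsplit data.
    """
    n = len(labels)
    best = (None, gini(labels))
    # Try every observed value except the largest as a candidate threshold
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: one feature scored 1 - 10 and a class label per record
clump_thickness = [1, 2, 3, 7, 8, 9]
tumor_class = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

threshold, impurity = best_threshold(clump_thickness, tumor_class)
print("split at <=", threshold, "weighted Gini:", impurity)
```

A full tree applies the same search over all features at each node and then recurses into the two resulting groups until the groups are pure or a stopping rule applies.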

    Code
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn import tree
    import pydotplus  # rendering the tree also needs Graphviz installed
    from IPython.display import Image
    
    print("Loading data")
    bc_data = load_breast_cancer()
    
    print("Selecting 3 records [0, 50, 100] as test samples")
    test_idx = [0, 50, 100]
    
    #training data
    print("Creating train target data set by removing the 3 test samples")
    train_target = np.delete(bc_data.target, test_idx)
    print("Creating train data data set by removing the 3 test samples")
    train_data = np.delete(bc_data.data, test_idx, axis=0)
    
    #testing data
    print("Creating test target data set with the three samples") 
    test_target = bc_data.target[test_idx]
    print("Creating test data dataset with the three samples")
    test_data = bc_data.data[test_idx]
    
    print("Creating Decision Tree Classifier")
    my_decision_tree = tree.DecisionTreeClassifier()
    print("Training the decision tree classifier")
    my_decision_tree.fit(train_data, train_target)
    
    predicted_values = my_decision_tree.predict(test_data)
    print("\nTESTING:\n")
    print("TEST TARGETS: %s" % (test_target))
    print("PREDICTED VALUES: %s\n" % (predicted_values))
    
    # Accuracy = correctly classified test samples / total test samples, as a percentage
    total_count = len(test_idx)
    correct_count = 0
    for i in range(total_count):
        if test_target[i] == predicted_values[i]:
            correct_count += 1

    accuracy = (correct_count / total_count) * 100
    
    print("Accuracy = %.5f %%" % (accuracy))
    
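    # Export the tree in Graphviz DOT format and save it as a PDF
    dot_data = tree.export_graphviz(my_decision_tree, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf("bc_tree.pdf")

    # Export again with feature and class names, with nodes colored by majority class
    dot_data = tree.export_graphviz(my_decision_tree, out_file=None,
                             feature_names=bc_data.feature_names,
                             class_names=bc_data.target_names,
                             filled=True, rounded=True,
                             special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Displays the tree inline in a Jupyter notebook; has no visible effect in a plain script
    Image(graph.create_png())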
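When Graphviz or pydotplus is not available, the learned rules can also be printed as plain text. The sketch below is an alternative to the PDF export above, not part of the original script. It assumes a scikit-learn version that provides tree.export_text (0.21 or later), and it limits the tree to max_depth=2 purely to keep the printout short.

```python
from sklearn import tree
from sklearn.datasets import load_breast_cancer

bc_data = load_breast_cancer()

# Shallow tree for readability (max_depth=2 is an assumption for this example)
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(bc_data.data, bc_data.target)

# One line per node: the feature test on internal nodes, the predicted class on leaves
rules = tree.export_text(clf, feature_names=list(bc_data.feature_names))
print(rules)
```

In this printout the class on each leaf is the integer target value, so with this dataset 0 stands for malignant and 1 for benign.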
    
    
    

    The Decision Tree Generated



    Here is the output. Three samples were set aside for testing, and the accuracy was computed by comparing their predicted classes with their known classes.
    Accuracy here is the number of test samples classified correctly divided by the total number of test samples, expressed as a percentage.
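The accuracy definition above can be written as a small helper. This is a minimal sketch; the function name accuracy_percent is hypothetical and does not appear in the script.

```python
def accuracy_percent(targets, predictions):
    """Percentage of positions where the prediction equals the known class."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if len(targets) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(targets, predictions) if t == p)
    return 100.0 * correct / len(targets)


# All three predictions match the known classes
print("Accuracy = %.5f %%" % accuracy_percent([0, 1, 0], [0, 1, 0]))
# One of the three predictions is wrong
print("Accuracy = %.5f %%" % accuracy_percent([0, 1, 0], [0, 1, 1]))
```

With only three test samples the result can take just four values (0, 1, 2 or 3 correct), so a single misclassification changes the figure by a third.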
    Output bc_tree.pdf: The Decision Tree Model Generated
    $ python3.6 ./treeclassify_bc.py 
    Loading data
    Selecting 3 records [0, 50, 100] as test samples
    Creating train target data set by removing the 3 test samples
    Creating train data data set by removing the 3 test samples
    Creating test target data set with the three samples
    Creating test data dataset with the three samples
    Creating Decision Tree Classifier
    Training the decision tree classifier
    
    TESTING:
    
    TEST TARGETS: [0 1 0]
    PREDICTED VALUES: [0 1 0]
    
    Accuracy = 100.00000 %
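Three test samples give only a rough indication of how well the model generalizes. A larger hold-out set gives a steadier estimate. The sketch below is one way to do that, assuming scikit-learn's train_test_split and accuracy_score; the 25% test size and random_state=0 are arbitrary choices for illustration, not values from the original script.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bc_data = load_breast_cancer()

# Hold out 25% of the records for testing; stratify keeps the class proportions
train_data, test_data, train_target, test_target = train_test_split(
    bc_data.data, bc_data.target,
    test_size=0.25, stratify=bc_data.target, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)

# accuracy_score returns a fraction in [0, 1]; scale it to a percentage
accuracy = accuracy_score(test_target, clf.predict(test_data)) * 100
print("Test samples: %d" % len(test_target))
print("Accuracy = %.5f %%" % accuracy)
```

The figure obtained this way will usually be below 100 % and will vary with the split, which is expected for an honest estimate.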