Data Science Practice – Classifying Heart Disease

January 19, 2015

This post details a casual exploratory project I did over a few days to teach myself more about classifiers. I downloaded the Heart Disease dataset from the UCI Machine Learning repository and thought of a few different ways to approach classifying the provided data.

"MANUAL" APPROACH USING EXCEL

First, I started out by seeing if I could create a scoring model in Excel that could be used to classify the patients. I started with the Cleveland data set that was already processed, i.e. narrowed down to the most commonly used fields. I did some simple exploratory data analysis on each field using pivot tables and percentage calculations, decided on a value (or values) for each column that appeared to correlate with the final finding of heart disease or no heart disease, then used those results to add up points for each patient. For instance, I found that 73% of patients with Chest Pain Type 4 ended up being diagnosed with heart disease, whereas no more than 30% of patients with any other Chest Pain Type ended up with that result. So, the people that had a 4 in the "chest pain type" column got 2 points. For Resting Blood Pressure, I grouped the values into 10-point buckets and found that patients with a resting blood pressure below 150 had a 33-46% chance of being diagnosed with heart disease, whereas those with 150 and up had a 55-100% chance. So, patients with 150 or above in that column got an additional point.

Points ended up being scored for the following 12 categories:

- Age group >= 60
- Sex = male
- Chest Pain Type = 4 [2 points]
- Resting Blood Pressure group >= 150
- Cholesterol >= 275
- Resting ECG > 0
- Max Heart Rate <= 120 [2 points], between 130 and 150 [1 point]
- Exercise-induced Angina = yes [2 points]
- ST depression group >= 1 [1 point], >= 2 [2 points]
- Slope of ST segment > 1
- Major Vessels colored by Fluoroscopy = 1 [1 point], > 1 [2 points]
- thal = 6 [1 point], thal = 7 [2 points]

This was a very "casual" approach which I would refine dramatically for actual medical diagnosis, but it was just an exercise to see if I could create my own "hand-made" classifier that could actually predict anything.

At this point I had to decide how many points merited a "positive" (though not positive in the patients' eyes) diagnosis of heart disease. I tried score thresholds between 6 and 9 points, and also tried a percentage-based scoring system; the threshold with the most correct classifications was 8 points (8+ points classified the patient as likely having heart disease). However, the 8-point threshold had a higher false negative rate than the lower thresholds, so if you wanted to make sure you were not telling someone they didn't have heart disease when they in fact did, you would use a lower threshold.

The final results with the 8-point threshold were:

true positive     112   37%
false positive     18    6%
true negative     146   48%
false negative     27    9%
correct class     258   85%

I remembered that the symptoms of heart disease in males and females can look very different, so I also looked at the "correct class" percentages for each sex. It turns out that my classifier classified 82% of males correctly, and 93% of females. Because of how long it took to create this classifier "the manual way", I decided not to redo this step to create a separate classification scheme for each sex, but I did decide to test out training separate models for men and women when I try this again using Python.
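To make the point system concrete, here is a minimal Python sketch of the same scoring rules. This is not the Excel workbook; the field names in the patient dictionary are hypothetical stand-ins for the UCI columns, and the cutoffs and point values are simply the ones listed above.

# Minimal sketch of the hand-built point system described above.
# The dictionary keys are hypothetical stand-ins for the UCI fields;
# the cutoffs and point values mirror the rules listed in the post.

def heart_disease_score(patient):
    """Add up points from the hand-picked rules; a higher score suggests heart disease."""
    score = 0
    score += 1 if patient["age"] >= 60 else 0
    score += 1 if patient["sex"] == "male" else 0
    score += 2 if patient["chest_pain_type"] == 4 else 0
    score += 1 if patient["resting_bp"] >= 150 else 0
    score += 1 if patient["cholesterol"] >= 275 else 0
    score += 1 if patient["resting_ecg"] > 0 else 0
    # Max heart rate: 2 points at <= 120, 1 point between 130 and 150
    if patient["max_heart_rate"] <= 120:
        score += 2
    elif 130 <= patient["max_heart_rate"] <= 150:
        score += 1
    score += 2 if patient["exercise_angina"] else 0
    # ST depression: 2 points at >= 2, otherwise 1 point at >= 1
    if patient["st_depression"] >= 2:
        score += 2
    elif patient["st_depression"] >= 1:
        score += 1
    score += 1 if patient["st_slope"] > 1 else 0
    # Major vessels colored by fluoroscopy: 1 point for exactly 1, 2 points for more
    if patient["num_vessels"] == 1:
        score += 1
    elif patient["num_vessels"] > 1:
        score += 2
    # thal: 6 -> 1 point, 7 -> 2 points
    if patient["thal"] == 6:
        score += 1
    elif patient["thal"] == 7:
        score += 2
    return score

def classify(patient, threshold=8):
    """Flag as likely heart disease when the score meets the chosen threshold."""
    return heart_disease_score(patient) >= threshold

# Hypothetical example record (values made up for illustration):
example = {"age": 63, "sex": "male", "chest_pain_type": 4, "resting_bp": 145,
           "cholesterol": 233, "resting_ecg": 2, "max_heart_rate": 150,
           "exercise_angina": False, "st_depression": 2.3, "st_slope": 3,
           "num_vessels": 0, "thal": 6}
print(heart_disease_score(example), classify(example))  # total score and 8-point decision

The classify() helper uses the 8-point cutoff discussed above; lowering the threshold argument trades false negatives for false positives, as described below.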
TESTING THE MODEL

I made the mistake of not checking out the other data sets before dividing the data into training and test sets. Luckily, the Cleveland dataset I used for training was fairly complete; the VA dataset, however, had a lot of missing data. I used my scoring model on the VA data, but ended up having to change the score threshold for classifying, because the Cleveland (training) dataset had rows with an average of almost 11 data points out of 12, while the VA (test) dataset averaged only about 8. So, I lowered the threshold to 6 points and got these results:

true positive     140   70%
false positive     38   19%
true negative      13    7%
false negative      9    5%
correct class     153   77%

This time, separating out the sexes, the model classified only 67% of the females correctly, but 77% of the males. However, there were only 6 females (out of 200 records) in the dataset, so that result doesn't mean much. I think classifying...
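For reference, here is a small, hypothetical sketch of the bookkeeping behind these tables (not the original spreadsheet formulas): given each patient's point total and true diagnosis, it counts the four confusion-matrix cells for a chosen threshold, which makes it easy to compare thresholds the way the post describes.

# Hypothetical sketch of the confusion-matrix tallies above.
# scores: per-patient point totals; has_disease: true diagnoses (booleans).

def confusion_counts(scores, has_disease, threshold):
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, has_disease):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and not truth:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / float(total)
    return tp, fp, tn, fn, accuracy

# e.g. compare candidate thresholds on placeholder lists `scores` / `has_disease`:
# for t in range(6, 10):
#     print(t, confusion_counts(scores, has_disease, t))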

Read More

Codecademy Python Course: Completed

September 21, 2014

I can cross off another item on my Goals list, since I finally jumped back into the Codecademy "Python Fundamentals" course and completed the final topics this afternoon. I think the course would be good for people who have had at least an introductory programming course in the past. I didn't have much trouble with the tasks (though a few were pretty tricky), but I have programming experience (and taught myself some advanced Python outside of the course for my Machine Learning class), and I can imagine that someone who had never programmed before and was unfamiliar with basic concepts might get totally stuck at points in the course. I think they need two levels of "hints" per topic: if you just need help with the most common things that trip people up, you click once and get the hints they show now; but if you're truly stuck and need to be walked through it, there should be more in-depth hints for true beginners.

The site estimates it will take you 13 hours to complete the course. I don't know how much time I spent on it in total, since it was broken up over months. It took me about an hour to finish the final 10% of the course, covering classes, inheritance, overrides, file input/output, and reviews, and then to go back and figure out where the final 1% was that it said I hadn't completed (apparently I accidentally skipped a topic mid-course) so I could get the 100% complete status.

The topics covered are:

- Python Syntax
- Strings and Console Output
- Conditionals and Control Flow
- Functions
- Lists & Dictionaries
- Loops
- Iteration over Data Structures
- Bitwise Operators
- Classes
- File Input & Output

I thought this was a good set of topics for an intro course. If they dropped anything, I think Bitwise Operators was a "bit" unnecessary for beginners. I liked the projects they included to test out the skills you learned, like writing a program as if you are a teacher who needs to calculate statistics on your class' test scores. Overall, I think Codecademy did a good job with this course, and I would point other programmers who want to quickly get up to speed on Python to it. I would also point beginners to the course, but with a warning that there are tricky spots where they may need outside resources to get...

Read More

Machine Learning Project 4

May 11, 2014

Immediately after I turned in Project 3, I started on Project 4, our final project in my Machine Learning grad class. We had a few options that the professor gave us, but we could also propose our own. One of the options was learning how to implement Random Forest (an ensemble learning method that combines many decision trees) and analyzing a given data set, so I proposed using Random Forest on University Advancement (Development/Fundraising) data I got from my "day job". The professor approved it, so I started learning about Random Forest classification.
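To give a feel for the technique (this is not the project code, which isn't shown here), a Random Forest classifier in Python via scikit-learn looks roughly like the sketch below; the feature matrix and labels are random placeholders standing in for real data.

# Rough illustration of Random Forest classification with scikit-learn
# (not the actual project code; X and y are random placeholders).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: 500 records with 10 numeric features and a binary label
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An ensemble of many decision trees, each fit on a bootstrap sample of the rows
# and random subsets of the features; the forest predicts by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)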

Read More

ML Projects 2 & 3 Results

April 29, 2014

I was in such a rush to finish Project 3 by Sunday night that I didn't post about the rest of my results, and now, before I got a chance to write about them, the professor has already graded it! I got 100% on the one I just turned in, and I also just found out I got 100% on Project 2!! This makes me feel so good, especially since I didn't do so well on the midterm, and it confirms that I can do this!

Read More

ML Project 3 (Post 2)

April 27, 2014

Tonight I learned how to use PyBrain's Feed-Forward Neural Networks for classification. Yay! I had already created a neural network and used it on the project's regression data set earlier this week, then used those results to "manually" classify (by picking which class the output was closer to, then counting up how many points were correctly classified; a small sketch of that comparison appears after the code listing below). Tonight, though, I fully implemented the PyBrain classification, using the 1-of-k method of encoding the classes, and it appears to be working great! The neural network still takes a while to train, but it's much quicker on this 2-input, 2-class data than it was on the 8-input, 7-output data for part 1 of the project. I'm actually writing this as it trains for the next task (see below). The code I wrote (PyBrain FNN classification, Python 2.7) is:

print("\nImporting training data...")
from pybrain.datasets import ClassificationDataSet

#bring in data from training file
traindata = ClassificationDataSet(2,1,2)
f = open("classification.tra")
for line in f.readlines():
    #using classification data set this time (subtracting 1 so first class is 0)
    traindata.appendLinked(list(map(float, line.split()))[0:2],int(list(map(float, line.split()))[2])-1)

print("Training rows: %d " % len(traindata))
print("Input dimensions: %d, output dimensions: %d" % (traindata.indim, traindata.outdim))

#convert to have 1 in column per class
traindata._convertToOneOfMany()
#raw_input("Press Enter to view training data...")
#print(traindata)
print("\nFirst sample: ", traindata['input'][0], traindata['target'][0], traindata['class'][0])

print("\nCreating Neural Network:")
#create the network
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
#change the number below for neurons in hidden layer
hiddenneurons = 2
net = buildNetwork(traindata.indim, hiddenneurons, traindata.outdim, outclass=SoftmaxLayer)
print('Network Structure:')
print('\nInput: ', net['in'])
#can't figure out how to get hidden neuron count, so making it a variable to print
print('Hidden layer 1: ', net['hidden0'], ", Neurons: ", hiddenneurons)
print('Output: ', net['out'])

#raw_input("Press Enter to train network...")
#train neural network
print("\nTraining the neural network...")
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(net, traindata)
trainer.trainUntilConvergence(dataset=traindata, maxEpochs=100, continueEpochs=10, verbose=True, validationProportion=.20)

print("\n")
for mod in net.modules:
    for conn in net.connections[mod]:
        print conn
        for cc in range(len(conn.params)):
            print conn.whichBuffers(cc), conn.params[cc]

print("\nTraining Epochs: %d" % trainer.totalepochs)

from pybrain.utilities import percentError
trnresult = percentError(trainer.testOnClassData(dataset=traindata), traindata['class'])
print("  train error: %5.2f%%" % trnresult)
#result for each class
trn0, trn1 = traindata.splitByClass(0)
trn0result = percentError(trainer.testOnClassData(dataset=trn0), trn0['class'])
trn1result = percentError(trainer.testOnClassData(dataset=trn1), trn1['class'])
print("  train class 0 samples: %d, error: %5.2f%%" % (len(trn0), trn0result))
print("  train class 1 samples: %d, error: %5.2f%%" % (len(trn1), trn1result))

raw_input("\nPress Enter to start testing...")

print("\nImporting testing data...")
#bring in data from testing file
testdata = ClassificationDataSet(2,1,2)
f = open("classification.tst")
for line in f.readlines():
    #using classification data set this time (subtracting 1 so first class is 0)
    testdata.appendLinked(list(map(float, line.split()))[0:2],int(list(map(float, line.split()))[2])-1)

print("Test rows: %d " % len(testdata))
print("Input dimensions: %d, output dimensions: %d" % (testdata.indim, testdata.outdim))
#convert to have 1 in column per class
testdata._convertToOneOfMany()
#raw_input("Press Enter to view testing data...")
#print(testdata)
print("\nFirst sample: ", testdata['input'][0], testdata['target'][0], testdata['class'][0])

print("\nTesting...")
tstresult = percentError(trainer.testOnClassData(dataset=testdata), testdata['class'])
print("  test error: %5.2f%%" % tstresult)
#result for each class
tst0, tst1 = testdata.splitByClass(0)
tst0result = percentError(trainer.testOnClassData(dataset=tst0), tst0['class'])
tst1result = percentError(trainer.testOnClassData(dataset=tst1), tst1['class'])
print("  test class 0 samples: %d, error: %5.2f%%" % (len(tst0), tst0result))
print("  test class 1 samples: %d, error: %5.2f%%" % (len(tst1), tst1result))
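As mentioned above, before this I had classified the regression network's outputs "manually" by picking whichever class value each output was closer to. A hypothetical sketch of that comparison (placeholder names, not the actual project code):

# Hypothetical sketch of the "manual" classification mentioned above: given a
# regression network's single output per sample, pick whichever class value the
# output is closer to, then count how many samples were classified correctly.
# (Placeholder names; not the actual project code.)

def classify_by_nearest(outputs, class_values=(1.0, 2.0)):
    # assign each output to the closest of the candidate class values
    return [min(class_values, key=lambda c: abs(out - c)) for out in outputs]

def percent_correct(predicted, actual):
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

# usage (placeholder variables):
# preds = classify_by_nearest(net_outputs)
# print("correctly classified: %.1f%%" % percent_correct(preds, true_classes))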

Read More

Machine Learning Project 3

April 24, 2014

I'm in the midst of working on Project 3 for my Machine Learning class. This one has the following tasks:

1. Train a 3-layer (input, hidden, output) neural network on the given training set, which has 8 inputs and 7 outputs. Obtain training and testing errors with the number of hidden units set at 1, 4, and 8.
2. Design a neural network for classification and train it on the given training set with 2 inputs and 2 classes. Apply the trained network to the testing data. Let the number of hidden units be 1, 2, and 4 respectively, and obtain training and testing classification accuracies for each.
3. Repeat task 2 on the training data set with 16 inputs and 10 classes, using 5, 10, and 13 hidden units.
4. Repeat tasks 2 and 3 using an SVM classifier. Choose several kernel functions and parameters and report the training and testing accuracies for each. (A rough illustrative sketch of this task appears at the end of this post.)

Thank goodness we're allowed to use built-in functions this time! The prof recommended MATLAB, but said I could use Python if I could find a good library for neural networks, so I decided to try PyBrain. I had a hard time attempting to install PyBrain because I was using Python 3.3. Once I realized it was incompatible, and that I didn't want to make the modifications necessary to get it working within a one-week project turnaround, I went looking for another package that could do neural networks. I tried neurolab and just couldn't get it to work, and everywhere I read about such problems online, people suggested the solution was to use PyBrain. I already had Python 2.7 installed, so I configured my computer to install PyBrain for 2.7, run Python 2.7, and use it in Visual Studio (my current IDE), and finally got it up and running.

As of last night, I had some preliminary solutions for task 1, but I don't fully trust the results, so I'm playing around with it a bit tonight. I do have a little more time to experiment, since the due date got moved from Friday night to Monday (once I pointed out that handing out a project on the Saturday of Easter weekend, when I was actually working on a major project for my other grad course, Risk Analysis, and having it due the following Friday wasn't very workable for those of us with full-time jobs, and that extending it to include even one weekend day would be beneficial). So, that's underway, and I'm actually writing this blog post while I wait for my latest neural network setup to train to 100 epochs in PyBrain! I'll update when I have some results to...
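Referring back to task 4 above: as a rough illustration only (not my project code; the data here is a random placeholder for the 2-input, 2-class set), trying a few kernels and parameters with scikit-learn's SVC and reporting training and testing accuracy might look like this:

# Hypothetical sketch of the SVM task: try several kernels/parameters and
# report training and testing accuracy for each. Placeholder data, not the
# project data set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.rand(200, 2)
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)
X_test = rng.rand(100, 2)
y_test = (X_test[:, 0] > X_test[:, 1]).astype(int)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),
                       ("rbf", {"gamma": 1.0}),
                       ("rbf", {"gamma": 10.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    clf.fit(X_train, y_train)
    print("%s %s  train acc: %.3f  test acc: %.3f" % (
        kernel, params, clf.score(X_train, y_train), clf.score(X_test, y_test)))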

Read More