IPython, Requests, lxml, and the NPR API

June 7, 2015

Last week, I decided to learn how to use Python to get data from an API. I started with the Codecademy “Introduction to APIs in Python” course, which got me oriented to how requests work, and in the subsequent NPR API lesson, specifically how the NPR Stories API works. Certain parts of the course assumed you knew more Python than you had learned in the course, so heads-up that there are places where you will probably have to google for help, since the hints aren’t always related to what you’re stuck on. The course isn’t really a requirement for learning this stuff (and I thought it could use a lot of improvement), but it does give you a guided walk-through, which is nice when you are totally new to a topic.

Then I tweeted about my experience, and got two responses encouraging me to use the requests library instead of the urllib library that Codecademy used:

“@BecomingDataSci the urllib api is terrible. You should take a look at http://t.co/CzIPob2tBV” — Daniel Moisset (@dmoisset) June 1, 2015

“@dmoisset @BecomingDataSci 2nding using of requests over urllib; esp. with HTTPS, requests tends to do saner things (e.g., cert validation)” — Cheng H. Lee (@chenghlee) June 1, 2015

I decided to redo what I had learned from scratch, but using requests. I also wanted to learn how to use IPython, so I used an IPython notebook to play around with the code. Below is the HTML export of my IPython notebook, with comments explaining what I was doing. I’m sure there are better ways to do what I did (feel free to comment with suggestions!), but this was my first time doing any of this without any guidance, so I don’t mind posting it even if it’s a little ugly :) I definitely spent a lot of time understanding the hierarchy of the NPR XML and how to loop through it and display it. If you have done something similar in a more elegant way, please point me to your code!

Here are the main resources I used to learn how to do what is in the code:

- python requests library documentation
- NPR API documentation
- python lxml library documentation
- IPython videos

I also wanted to mention that there are a lot of frustrations you can run up against when you’re a Python beginner. I was having a lot of problems with seemingly basic stuff (like installing packages with pip), and it took a couple hours of googling and asking someone for help to figure out there was a problem with my PATH environment variables in Windows. I’ll post about that another time, but I just wanted to 1) encourage people not to give up if you get stuck on something that seems so basic that most “intro” articles don’t even cover it, and 2) encourage people writing intro articles to make some suggestions about what could go wrong and how to problem-solve.

Here’s one example: when I tried to export my IPython notebook to HTML, it gave me a 500 server error saying I needed Python packages I didn’t already have. After I installed the first, it told me I needed pandoc, so I installed that as well, but it kept giving me the same error. It turns out that you have to run IPython Notebook as an Administrator in Windows in order to get the HTML export to work properly, but the error message didn’t indicate that at all. This is the kind of frustration that may make beginners think they’re not “getting it” and give up, when in fact it’s something outside the scope of what they’re learning. Python seems to require a lot of this sort of problem-solving. (Note: on my other laptop, I installed Python and the SciPy stack using Anaconda, and have had a lot fewer issues like this.)
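To give a flavor of the request-and-parse pattern the notebook walks through, here is a minimal sketch using requests and lxml. The query endpoint, apiKey parameter, and story/title element names are my reading of the NPR API docs of the time, so treat this as illustrative rather than the notebook’s exact code:

```python
# Minimal sketch: fetch stories from the NPR Story API and print titles.
# The endpoint, parameters, and XML element names are assumptions based
# on the NPR API documentation; substitute your own API key.
import requests
from lxml import etree

response = requests.get(
    "http://api.npr.org/query",
    params={
        "id": "1007",              # hypothetical topic id
        "numResults": 5,
        "apiKey": "YOUR_API_KEY",  # placeholder
    },
)
response.raise_for_status()  # one of the saner things requests does for you

root = etree.fromstring(response.content)
for story in root.findall(".//story"):
    # findtext returns a default instead of raising if the element is missing
    print(story.findtext("title", default="(no title)"))
```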
Without further ado, here’s my IPython notebook! (I’m having issues making it look readable while embedded in WordPress, so click the link to view it in a new tab for now, and I’ll fix it for embedded viewing later!) Renee’s 1st IPython Notebook (NPR API using requests and lxml). Here’s the actual .ipynb file if you have IPython installed and want to run it yourself: First Python API Usage. NOTE: WordPress wouldn’t let me upload it with the IPython notebook extension for security reasons, so after you download it, change the “.txt” extension to...

Read More

Data Science Practice – Classifying Heart Disease

January 19, 2015

This post details a casual exploratory project I did over a few days to teach myself more about classifiers. I downloaded the Heart Disease dataset from the UCI Machine Learning repository and thought of a few different ways to approach classifying the provided data.

“MANUAL” APPROACH USING EXCEL

So first I started out by seeing if I could create a scoring model in Excel which could be used to classify the patients. I started with the Cleveland data set that was already processed, i.e. narrowed down to the most commonly used fields. I did some simple exploratory data analysis on each field using pivot tables and percentage calculations, and decided on a value (or values) for each column that appeared to correlate with the final finding of heart disease or no heart disease, then used those results to add up points for each patient.

For instance, I found that 73% of patients with Chest Pain Type 4 ended up being diagnosed with heart disease, whereas no more than 30% of patients with any other Chest Pain Type ended up with that result. So the people that had a 4 in the “chest pain type” column got 2 points. For Resting Blood Pressure, I grouped the values into 10-point buckets and found that patients with a resting blood pressure below 150 had a 33–46% chance of being diagnosed with heart disease, whereas those with 150 and up had a 55–100% chance. So patients with 150 or above in that column got an additional point.

Points ended up being scored for the following 12 categories (translated into a Python sketch below):

- Age group >= 60
- Sex = male
- Chest Pain Type 4 [2 points]
- Resting Blood Pressure group >= 150
- Cholesterol >= 275
- Resting ECG > 0
- Max Heart Rate <= 120 [2 points], between 130 and 150 [1 point]
- Exercise-induced Angina = yes [2 points]
- ST depression group >= 1 [1 point], >= 2 [2 points]
- Slope of ST segment > 1
- Major Vessels colored by Fluoroscopy = 1 [1 point], > 1 [2 points]
- thal = 6 [1 point], thal = 7 [2 points]

This was a very “casual” approach which I would refine dramatically for actual medical diagnosis, but this was just an exercise to see if I could create my own “hand-made” classifier that could actually predict anything.

So at this point I had to decide how many points merited a “positive” (though not positive in the patients’ eyes) diagnosis of heart disease. I tried score thresholds between 6 and 9 points, and also tried a percentage-based scoring system, and the result with the most correct classifications was at 8 points (8+ points classified the patient as likely having heart disease). However, the 8-point threshold had a higher false negative rate than the lower thresholds, so if you wanted to make sure you were not telling someone they didn’t have heart disease when in fact they did, you would use a lower threshold. The final results with the 8-point threshold were:

true positive: 112 (37%)
false positive: 18 (6%)
true negative: 146 (48%)
false negative: 27 (9%)
correctly classified: 258 (85%)

I remembered that the symptoms of heart disease in males and females can look very different, so I also looked at the “correct class” percentages for each sex. It turns out that my classifier classified 82% of males correctly, and 93% of females. Because of how long it took to create this classifier “the manual way”, I decided not to redo this step to create a separate classification scheme for each sex, but I decided to test out training separate models for men and women when I try this again using Python.
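Here is roughly what that hand-built scoring looks like translated into Python. This is my own sketch for illustration (not the Excel model itself); the keys are the standard UCI heart disease attribute names (cp, trestbps, chol, etc.), and the thresholds come straight from the list above:

```python
# A sketch of the hand-built point scoring described above, using the
# standard UCI heart disease column names. Thresholds are taken from
# the 12-category list; this is illustrative, not the Excel model.
def score_patient(p):
    points = 0
    if p["age"] >= 60:
        points += 1
    if p["sex"] == 1:                  # 1 = male in the UCI encoding
        points += 1
    if p["cp"] == 4:                   # chest pain type 4
        points += 2
    if p["trestbps"] >= 150:           # resting blood pressure
        points += 1
    if p["chol"] >= 275:               # serum cholesterol
        points += 1
    if p["restecg"] > 0:               # resting ECG result
        points += 1
    if p["thalach"] <= 120:            # max heart rate achieved
        points += 2
    elif 130 <= p["thalach"] <= 150:
        points += 1
    if p["exang"] == 1:                # exercise-induced angina
        points += 2
    if p["oldpeak"] >= 2:              # ST depression
        points += 2
    elif p["oldpeak"] >= 1:
        points += 1
    if p["slope"] > 1:                 # slope of the ST segment
        points += 1
    if p["ca"] > 1:                    # major vessels colored by fluoroscopy
        points += 2
    elif p["ca"] == 1:
        points += 1
    if p["thal"] == 7:
        points += 2
    elif p["thal"] == 6:
        points += 1
    return points

def classify(patient, threshold=8):
    """True means 'likely has heart disease' at the chosen score threshold."""
    return score_patient(patient) >= threshold
```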
TESTING THE MODEL

I made the mistake of not checking out the other data sets before dividing the data into training and test sets. Luckily the Cleveland dataset I used for training was fairly complete; however, the VA dataset had a lot of missing data. I used my scoring model on the VA data, but ended up having to change the score threshold for classifying, because the Cleveland (training) dataset had rows with an average of almost 11 data points out of 12, while the VA (test) dataset averaged only about 8. So I lowered the threshold to 6 points and got these results (see the tallying sketch below):

true positive: 140 (70%)
false positive: 38 (19%)
true negative: 13 (7%)
false negative: 9 (5%)
correctly classified: 153 (77%)

This time, separating out the sexes, it classified only 67% of the females correctly, but 77% of the males. However, there were only 6 females (out of 200 records) in the dataset, so that result doesn’t mean much. I think classifying...
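Here is a small sketch of how those true/false positive tallies can be computed. The predict argument and the num target column are my assumptions (in the UCI data, num > 0 indicates heart disease); with the VA data you would pass something like the classify() sketch above with a threshold of 6:

```python
# Sketch: tally a confusion matrix for any point-scoring classifier.
# `records` is a list of dicts with the UCI columns plus the `num`
# target (num > 0 means heart disease); `predict` is any function
# returning True for a predicted positive.
def confusion_counts(records, predict):
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p in records:
        predicted = predict(p)
        actual = p["num"] > 0
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# e.g. confusion_counts(va_records, lambda p: score_patient(p) >= 6)
```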

Read More

Codecademy Python Course: Completed

September 21, 2014

I can cross off another item on my Goals list, since I finally jumped back into the Codecademy “Python Fundamentals” course and completed the final topics this afternoon. I think the course would be good for people that have had at least an introductory programming course in the past. I didn’t have much trouble with the tasks (though a few were pretty tricky), but I have programming experience (and taught myself some advanced Python outside of the course for my Machine Learning class), and I can imagine that someone who had never programmed before and was unfamiliar with basic concepts might get totally stuck at points in the course. I think they need 2 levels of “hints” per topic: if you just need hints on the most common difficult things that trip people up, you click once and get the hints they show now; but if you’re truly stuck and need to be walked through it, they should have more in-depth hints for true beginners.

The site estimates it will take you 13 hours to complete the course. I don’t know how much time I spent on it in total, since it was broken up over months. It took me about an hour to finish the final 10% of the course, covering classes, inheritance, overrides, file input/output, and reviews, then also going back and figuring out where the final 1% was that it said I hadn’t completed (apparently I accidentally skipped a topic mid-course) so I could get the 100% topic complete status.

The topics covered are:

- Python Syntax
- Strings and Console Output
- Conditionals and Control Flow
- Functions
- Lists & Dictionaries
- Loops
- Iteration over Data Structures
- Bitwise Operators
- Classes
- File Input & Output

I thought this was a good set of topics for an intro course. If they dropped anything, I think Bitwise Operators was a “bit” unnecessary for beginners. I liked the projects they included to test out the skills you learned, like writing a program as if you are a teacher and need to calculate statistics on your class’s test scores (a quick sketch of that kind of exercise follows below). Overall, I think Codecademy did a good job with this course, and I would point other programmers that want to quickly get up to speed on Python to this course. I would also point beginners to the course, but with a warning that there are tricky spots where they may need outside resources to get...
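For anyone curious, here is a tiny sketch of the kind of gradebook exercise I mean; the student names and scores are made up for illustration, not Codecademy’s actual project code:

```python
# A toy "teacher's gradebook" sketch: compute average, min, and max of
# test scores per student and for the whole class. Data is invented.
students = {
    "Alice": [88, 92, 79],
    "Bob": [72, 85, 90],
    "Carla": [95, 98, 91],
}

def average(scores):
    return float(sum(scores)) / len(scores)

for name, scores in students.items():
    print("%s: avg %.1f, min %d, max %d"
          % (name, average(scores), min(scores), max(scores)))

all_scores = [s for scores in students.values() for s in scores]
print("Class average: %.1f" % average(all_scores))
```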

Read More

Machine Learning Project 4

May 11, 2014

So immediately after I turned in project 3, I started on Project 4, our final project in Machine Learning grad class. We had a few options that the professor gave us, but could also propose our own. One of the options was learning how to implement Random Forest (an ensemble learning method using many decision trees) and analyzing a given data set, so I proposed using Random Forest on University Advancement (Development/Fundraising) data I got from my “day job”. The professor approved it, so I started learning about Random Forest Classification.
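To give a sense of what that looks like in code, here is a minimal Random Forest sketch using scikit-learn with synthetic data; it is a generic illustration, not the project code or the advancement data:

```python
# Minimal Random Forest example with scikit-learn: an ensemble of
# decision trees votes on each prediction. Data here is synthetic,
# purely for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy: %.2f" % clf.score(X_test, y_test))
```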

Read More

ML Projects 2 & 3 Results

April 29, 2014

I was in such a rush to finish Project 3 by Sunday night that I didn’t post about the rest of my results, and now, before I even got a chance to write about them, the professor has already graded it! I got 100% on the one I just turned in, and also just found out I got 100% on Project 2!! This makes me feel so good, especially since I didn’t do so well on the midterm, and confirms that I can do this!

Read More

ML Project 3 (Post 2)

April 27, 2014

Tonight I learned how to use PyBrain’s feed-forward neural networks for classification. Yay! I had already created a neural network and used it on the project’s regression data set earlier this week, then used those results to “manually” classify (by picking which class the output was closer to, then counting up how many points were correctly classified), but tonight I fully implemented the PyBrain classification, using the 1-of-k method of encoding the classes, and it appears to be working great! The neural network still takes a while to train, but it’s much quicker on this 2-input, 2-class data than it was on the 8-input, 7-output data for part 1 of the project. I’m actually writing this as it trains for the next task (see below). The code I wrote is:

```python
# Python 2 + PyBrain (note raw_input); the __future__ import makes the
# multi-argument print calls behave consistently.
from __future__ import print_function

print("\nImporting training data...")
from pybrain.datasets import ClassificationDataSet

# bring in data from the training file: 2 inputs, 1 target, 2 classes
traindata = ClassificationDataSet(2, 1, nb_classes=2)
f = open("classification.tra")
for line in f.readlines():
    # using a classification data set this time (subtracting 1 so the first class is 0)
    values = list(map(float, line.split()))
    traindata.appendLinked(values[0:2], int(values[2]) - 1)
f.close()

print("Training rows: %d" % len(traindata))
print("Input dimensions: %d, output dimensions: %d" % (traindata.indim, traindata.outdim))

# 1-of-k encoding: one target column per class (class 0 -> [1, 0], class 1 -> [0, 1])
traindata._convertToOneOfMany()
print("\nFirst sample: ", traindata['input'][0], traindata['target'][0], traindata['class'][0])

print("\nCreating Neural Network:")
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer

# change the number below for neurons in the hidden layer
hiddenneurons = 2
net = buildNetwork(traindata.indim, hiddenneurons, traindata.outdim, outclass=SoftmaxLayer)
print('Network Structure:')
print('\nInput: ', net['in'])
# can't figure out how to get the hidden neuron count from the net, so printing the variable
print('Hidden layer 1: ', net['hidden0'], ", Neurons: ", hiddenneurons)
print('Output: ', net['out'])

# train the neural network with backpropagation
print("\nTraining the neural network...")
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(net, traindata)
trainer.trainUntilConvergence(dataset=traindata, maxEpochs=100, continueEpochs=10,
                              verbose=True, validationProportion=0.20)

# print the learned connection weights
print("\n")
for mod in net.modules:
    for conn in net.connections[mod]:
        print(conn)
        for cc in range(len(conn.params)):
            print(conn.whichBuffers(cc), conn.params[cc])

print("\nTraining Epochs: %d" % trainer.totalepochs)

from pybrain.utilities import percentError
trnresult = percentError(trainer.testOnClassData(dataset=traindata), traindata['class'])
print("  train error: %5.2f%%" % trnresult)

# error for each class separately
trn0, trn1 = traindata.splitByClass(0)
trn0result = percentError(trainer.testOnClassData(dataset=trn0), trn0['class'])
trn1result = percentError(trainer.testOnClassData(dataset=trn1), trn1['class'])
print("  train class 0 samples: %d, error: %5.2f%%" % (len(trn0), trn0result))
print("  train class 1 samples: %d, error: %5.2f%%" % (len(trn1), trn1result))

raw_input("\nPress Enter to start testing...")

print("\nImporting testing data...")
# bring in data from the testing file
testdata = ClassificationDataSet(2, 1, nb_classes=2)
f = open("classification.tst")
for line in f.readlines():
    # subtracting 1 again so the first class is 0
    values = list(map(float, line.split()))
    testdata.appendLinked(values[0:2], int(values[2]) - 1)
f.close()

print("Test rows: %d" % len(testdata))
print("Input dimensions: %d, output dimensions: %d" % (testdata.indim, testdata.outdim))

# 1-of-k encoding to match the training set
testdata._convertToOneOfMany()
print("\nFirst sample: ", testdata['input'][0], testdata['target'][0], testdata['class'][0])

print("\nTesting...")
tstresult = percentError(trainer.testOnClassData(dataset=testdata), testdata['class'])
print("  test error: %5.2f%%" % tstresult)

# error for each class separately
tst0, tst1 = testdata.splitByClass(0)
tst0result = percentError(trainer.testOnClassData(dataset=tst0), tst0['class'])
tst1result = percentError(trainer.testOnClassData(dataset=tst1), tst1['class'])
print("  test class 0 samples: %d, error: %5.2f%%" % (len(tst0), tst0result))
print("  test class 1 samples: %d, error: %5.2f%%" % (len(tst1), tst1result))
```

Read More