Machine Learning Project 1

I mentioned in my “Why Data Science” post that I thought about starting this blog when the first project was assigned in my graduate Machine Learning class.

The class to date has been mostly “math review” (which is not a review at all for me) and doing homework problems that take 4 pages of work to solve, such as “Show that the Gamma distribution is appropriately normalized”. Needless to say, I’m terrified of the impending midterm I have to take in about a month.

But in the meantime, we just got assigned our first project, having to do with Bayesian Classification! It’s due in 2 weeks. I barely know Python (we can choose any language for this project, and Python is one I’ve been wanting to learn and eventually master since it seems to be used frequently for Data Science work) and barely understand the Bayes classifier, but I’m still excited, since learning how to do tasks like this is why I signed up for the class.

Here’s the full description of the project:

Given a training data set and a testing data set (text files), design different classifiers based on the training data set and test the designed classifiers on the test data set.

1. Bayes Classifier: Assume a Gaussian distribution for each class using maximum likelihood to estimate the conditional distribution P(x|ci) and p(ci). Then use Bayes Theory to compute the posterior probability.

2. Naive Bayes Classifier: Assume Gaussian Distribution for each class and assume conditional independence for each variable. Use maximum likelihood to estimate parameters and design a classifier.

3. Use nonparametric method, kernel method, to estimate p(x|ci), use maximum likelihood to estimate p(ci), and design a Bayes Classifier. Determine a suitable value for “h”.

4. Use K-Nearest-Neighbor method to classify the testing data using different k values.

5. Compare your results.
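
As a rough illustration of what items 1 and 2 are asking for, here is a minimal sketch of a Gaussian Bayes classifier with maximum-likelihood estimates. It is not the assignment solution: the function names (`fit_gaussian_bayes`, `log_gaussian`, `predict`), the `naive` flag, and the synthetic two-class data are all assumptions made for the example, and it assumes the features arrive as an (n, d) array `X` with one label per row in `y`.

```python
import numpy as np

def fit_gaussian_bayes(X, y, naive=False):
    """Maximum-likelihood prior, mean, and covariance for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        centered = Xc - mean
        cov = centered.T @ centered / len(Xc)  # ML estimate divides by N, not N - 1
        if naive:
            # Naive Bayes: conditional independence, so keep only the variances
            cov = np.diag(np.diag(cov))
        params[c] = (len(Xc) / len(X), mean, cov)
    return params

def log_gaussian(X, mean, cov):
    """Log of the multivariate Gaussian density, evaluated for each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def predict(params, X):
    """Pick the class with the largest log p(c) + log p(x|c) (Bayes' theorem)."""
    classes = np.array(list(params))
    scores = np.column_stack(
        [np.log(prior) + log_gaussian(X, mean, cov)
         for prior, mean, cov in params.values()]
    )
    return classes[scores.argmax(axis=1)]

# Synthetic stand-in data (hypothetical, not the course files):
# two well-separated 2D clusters labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([5, 5], 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)

params = fit_gaussian_bayes(X, y)
pred = predict(params, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(pred)
```

Working in log space avoids underflow when the densities get tiny, and the evidence term p(x) is dropped because it is the same for every class and does not change which posterior is largest.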

I think this is a lot to do in 2 weeks, especially since we have another difficult homework due in 1 week in this class, and I have homework for another class as well, but I’m going for it!

I had already installed Python into Visual Studio (which I already use for .NET development), and last night I installed scipy, matplotlib, and numpy along with all of their dependencies. We’re not allowed to use pre-built functions and have to write our own classifiers, but I figure it can’t hurt to have a plan to check all of my work with something established!
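
To make the "check my work with something established" idea concrete, here is a small sketch that compares a hand-written maximum-likelihood mean and covariance against numpy's built-ins. The helper name `ml_mean_and_cov` and the random stand-in data are assumptions for illustration only; the real check would use the training file.

```python
import numpy as np

def ml_mean_and_cov(X):
    """Hand-written maximum-likelihood mean and covariance (divide by N)."""
    n = len(X)
    mean = X.sum(axis=0) / n
    centered = X - mean
    cov = centered.T @ centered / n
    return mean, cov

# Synthetic stand-in data (hypothetical): 200 rows, 3 columns
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))

mean, cov = ml_mean_and_cov(X)

# bias=True makes np.cov divide by N, matching the ML estimate;
# rowvar=False says each column (not each row) is a variable.
mean_ok = np.allclose(mean, np.mean(X, axis=0))
cov_ok = np.allclose(cov, np.cov(X, rowvar=False, bias=True))
print(mean_ok, cov_ok)
```

The one thing to watch is the normalization: `np.cov` divides by N - 1 by default, so a correct hand-written ML estimate would look "wrong" unless `bias=True` is passed.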

My first triumph was importing the data from a text file, discovering it contained 3 columns of data, and creating this 3D plot last night:

[Image: 3D Scatterplot]

Here’s the code for my first-ever scatterplot in three dimensions:

import matplotlib.pyplot as plt
import numpy as np
# importing Axes3D registers the 3D projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D

# get data from training file (three space-delimited float columns)
data = np.genfromtxt('train.txt', delimiter=' ',
                     dtype="float, float, float", names="col1, col2, col3")

# plot columns 1, 2, 3 as a 3D scatterplot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data["col1"], data["col2"], data["col3"], zdir='z')
ax.set_title("Column 1 vs Column 2 vs Column 3")
plt.show()

(note to self, get a syntax-highlighting wordpress plugin!)

Don’t tell me how to do any of this, since I don’t want to cheat! I’ll post my progress as I go along (though I won’t be back to working on this for about a week since the other homework is due first). Once I’m done with it, I’ll ask for feedback from those of you that know how I could have done it better, so I can keep learning.

I’m excited, and feeling time-crunched!

5 Comments

  1. Renee
    Feb 17, 2014

    Oh, I forgot to mention that I’m also learning Python on Codecademy http://www.codecademy.com/tracks/python
    I’m about 50% of the way through that track.

  2. Scott Edwards
    Feb 22, 2014

    Hi Renee,
    Congrats on some real progress on your first project. Thanks for starting this blog, as I find myself in a similar situation, having mostly worked in SQL-type positions, with some past formal statistics courses. The math review sounds tough, but I would love the opportunity to have the support of a prof during that process, as I’ve found that to be the hardest part in some machine learning classes online. How has that gone for you? Was the book really helpful, and if so, would you mind sharing the title? Or was it mainly having the support of being in a real course? What school are you earning your online degree at? Good luck with everything, and I look forward to following (and learning from) your progress!

    • Renee
      Feb 22, 2014

      Hi Scott! Thanks for the comment! I’m about to post tonight or tomorrow with a little more progress.

      The textbook has been helpful with the mathematical side of things so far. It’s Pattern Recognition and Machine Learning by C. Bishop. However, I’ve had to do a lot of online research to really “get” most of what is being said in class. Luckily, I am married to a physicist, and he has been really helpful with explaining the math, too!

      Unfortunately, I don’t feel I get a lot of support, even though I’m in “real” courses. Being an online student alongside in-class students is difficult in terms of getting support. Sometimes the professors forget we’re there, though my ML prof has been good so far. But not being able to stop by a prof’s office and just talk things over, or work through problems with other students in person, really adds a level of difficulty.

      I’m in the CGEP Systems Engineering program at UVA. Honestly, I wouldn’t recommend it to anyone; it has been a struggle to enroll in courses and get any kind of support. I have had some classes at ODU and they are definitely more supportive online so far, but it’s still tough with topics this advanced!

      I’ll continue to post resources as I find them!

    • Scott Edwards
      Feb 23, 2014

      Thanks, Renee! I appreciate you reminding me of Bishop’s book. I have it, but didn’t get far into it, and had forgotten about it.

      It did strike me as pretty dense, and pretty mathematical – but very good. I’ll give it another go.

      Wow, you are so lucky to be married to a physicist, who can help you with the math! I’m afraid that, although I have a few comp sci friends, I don’t have any that can really help me with heavy math. I’m stuck with trying to find online resources and/or books, but it’s really hard to know what to read/buy.

      I’d love to know any math resources that you have found helpful, books or online.

      It inspires me that you were able to “Show that the Gamma distribution is appropriately normalized” in four pages of math. I’m geeky enough to aspire to be able to do that some day!

      Thanks also for the advice on taking the class. I probably wouldn’t be able to afford it anyway right now, but good to know I’m not missing out on a ton. Does the class have a decent site?

      It’s nice, as well, to read about someone who’s trying to learn all this stuff while holding down a full-time job (owning your own firm, no less). As someone who’s struggled through designing fully-normalized databases (as well as dimensionally-modeled ones) that encapsulate as much of the business logic as possible in the design, I appreciate that that is quite a feat as well.

    • Renee
      Feb 23, 2014

      The site for the class is on Blackboard at the school, and you have to log in to get the materials, so it’s not publicly visible. Pretty sure it’s against the honor code to share them online, otherwise I would.

      I appreciate your interaction here on the site and hope you will get something out of watching my struggles! :)

      In terms of math help, one site I’ve found that helps is Wolfram Alpha. When I get stuck, I can sometimes paste in what I’m doing there and get some visualizations and explanations that help. (Though the free version is somewhat limited.) I often google a topic and find lecture notes from schools that post them publicly (probably not always on purpose!), so sometimes that helps to get an alternate explanation.

      Khan Academy is good for the more straightforward statistics stuff, and sometimes I can find videos on YouTube explaining more advanced concepts, so it’s always worth looking!

      But yes, I’m spoiled to have an in-house tutor :)