ML Project 1: Summary

In class on Thursday, it turned out that I was one of only 2 of the 12 students who had actually finished the project, and the professor gave an extension because he realized he had forgotten to post a file that would help us understand the data and what we needed to do! I had already done all but one item on the full project description, so I added that small item tonight and it's all submitted. I'll update you when I get a grade!

I wanted to have one post summarizing the project since I was posting bit by bit as I worked on it.

The description of the project is in this first post: Machine Learning Project 1

It involved implementing 4 classifiers:

  • Naive Bayes
  • Gaussian Bayes
  • Gaussian Kernel Density Estimator
  • K-Nearest Neighbor

Naive Bayes involves finding the mean and variance of each class in the training data and creating a 2-dimensional Gaussian from the 2 feature columns of each class. Then, for each 2-column point in the test set, you check it against each class's probability distribution and assign it to the class under which it has the higher probability. With the data provided, mine classified 86.5% of the points in the test set correctly. [my code, more info]
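If a code sketch helps, here's roughly what that idea looks like (a simplified sketch, not my actual submitted code). It assumes the same data layout as the slicing snippet near the end of this post (features in the first 2 columns, class label in the 3rd) and equal class priors; the function name is just a placeholder.

import numpy as np
from scipy.stats import norm

def naive_bayes_classify(data_train, data_test):
    # Assumes features in columns 0-1 and the class label in column 2.
    labels = np.unique(data_train[:, 2])
    predictions = []
    for point in data_test[:, 0:2]:
        best_label, best_prob = None, -1.0
        for label in labels:
            Xc = data_train[data_train[:, 2] == label, 0:2]
            mean, std = Xc.mean(axis=0), Xc.std(axis=0)
            # The "naive" part: treat the 2 columns as independent, so the
            # joint likelihood is just the product of two 1-D Gaussians.
            prob = np.prod(norm.pdf(point, loc=mean, scale=std))
            if prob > best_prob:
                best_label, best_prob = label, prob
        predictions.append(best_label)
    return np.array(predictions)

Something like predicted = naive_bayes_classify(data_train, data_test) then gives one label per test row, and comparing those to the test set's own 3rd column gives the accuracy.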

The Gaussian Bayes Classifier was the same as the Naive Bayes, except it included the covariance between the two columns. (The “naive” part of naive Bayes is that you assume the columns of data are independent of one another and drop the covariances.) This one classified 88% of the data correctly. [my code, more info (pdf)]
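A sketch of that version, again simplified and using SciPy's multivariate_normal as a shortcut rather than computing the 2-D density by hand (the function name and data layout are the same assumptions as above):

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_bayes_classify(data_train, data_test):
    labels = np.unique(data_train[:, 2])
    class_dists = {}
    for label in labels:
        Xc = data_train[data_train[:, 2] == label, 0:2]
        # Full 2-D Gaussian per class: mean vector plus the 2x2 covariance
        # matrix, so the relationship between the two columns is kept.
        class_dists[label] = multivariate_normal(mean=Xc.mean(axis=0),
                                                 cov=np.cov(Xc, rowvar=False))
    test_points = data_test[:, 0:2]
    densities = np.column_stack([class_dists[label].pdf(test_points)
                                 for label in labels])
    # Assign each test point to the class giving it the highest density
    # (equal class priors assumed).
    return labels[np.argmax(densities, axis=1)]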

Gaussian Kernel Density Estimator (KDE) involves creating a tiny normal distribution "hat" over each point, then adding up these distributions to generate an overall "smoothed" distribution for each class. This is apparently often used for image classification. For the provided data, when I used an h-value of 1.6, it classified 88.25% of the data points correctly. [my code, more info]
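Here's a rough sketch of the KDE idea (not my exact code), with the same assumed data layout and placeholder names, where h is the width of each little Gaussian "hat":

import numpy as np

def kde_density(Xc, points, h):
    # Average a small Gaussian "hat" of width h, centered on every training
    # point in Xc, evaluated at each query point.
    d = Xc.shape[1]
    norm_const = (2 * np.pi * h ** 2) ** (d / 2)
    densities = []
    for point in points:
        sq_dist = np.sum((Xc - point) ** 2, axis=1)
        densities.append(np.mean(np.exp(-sq_dist / (2 * h ** 2))) / norm_const)
    return np.array(densities)

def kde_classify(data_train, data_test, h=1.6):
    # Each test point goes to whichever class's smoothed distribution
    # gives it the higher density.
    labels = np.unique(data_train[:, 2])
    densities = np.column_stack([
        kde_density(data_train[data_train[:, 2] == label, 0:2],
                    data_test[:, 0:2], h)
        for label in labels])
    return labels[np.argmax(densities, axis=1)]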

The K-Nearest-Neighbors classifier takes each point, creates a spherical volume around it that contains “k” points, then classifies the point based on which class has the most points within that sphere. So, it looks at the “neighbors” closest to each point to determine how to classify it. I got the best results with a k of 9, which correctly classified 87.25% of the test data. [my code, more info]
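And a sketch of the K-Nearest-Neighbors version (simplified, written as a straight majority vote over the k nearest training points, which amounts to the same "sphere containing k points" picture; same assumed layout and placeholder names):

import numpy as np

def knn_classify(data_train, data_test, k=9):
    train_X, train_y = data_train[:, 0:2], data_train[:, 2]
    predictions = []
    for point in data_test[:, 0:2]:
        # Distance from this test point to every training point.
        dists = np.sqrt(np.sum((train_X - point) ** 2, axis=1))
        # Labels of the k nearest training points, then a majority vote.
        nearest_labels = train_y[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest_labels, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)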

All of my Python files for this project are attached (zip file here), along with a file where I created a scatterplot of the data and a 2nd version of the Bayes classifier that I tidied up a bit with new Python tricks I'm learning, such as this shortcut for grabbing a subset of rows from a NumPy array without having to loop through it:

Xtrn_1 = data_train[data_train[:,2] == 1, 0:2]

(This grabs the first 2 columns of my training set for each row where the value of the 3rd column is 1, which lets me quickly split each class in the training data into its own array.)

If you are more experienced than I am at Machine Learning or Python, I would love any suggestions on improving these algorithms! (I know one thing I could do is make each into a more general reusable function.)

If you are new to Machine Learning like I am, I’d love to hear from you about what you think about this project!