Machine Learning Demystified

Several friends encouraged me to apply for a Data Scientist position at ID Insights, an organization I greatly admire, in a role I would be passionate about. Before doing so, I familiarized myself thoroughly with numpy, pandas and sklearn, three of the most important libraries for machine learning in Python.

I used a dataset from Kaggle, Health Care Cost Analysis, referenced as “insurance.csv” throughout the code. The reader will also have to change the variable “directory” to match their own setup.
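As a minimal sketch of that setup step, the dataset could be loaded along the following lines. The helper name `load_insurance` is hypothetical and is not taken from the repository's code; only the file name “insurance.csv” and the idea of a “directory” variable come from the text above.

```python
from pathlib import Path

import pandas as pd


def load_insurance(directory, filename="insurance.csv"):
    """Read the Kaggle CSV found in `directory` into a pandas DataFrame.

    `load_insurance` is a hypothetical helper written for illustration;
    the actual project code may load the file differently.
    """
    path = Path(directory) / filename
    if not path.is_file():
        # Fail early with a readable message if `directory` is wrong.
        raise FileNotFoundError(f"Expected the dataset at {path}")
    return pd.read_csv(path)
```

With `directory` set to the folder that holds the download, `df = load_insurance(directory)` followed by `df.head()` gives a quick look at the first rows.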

Otherwise, the current files in this directory are:

Thoughts on sklearn

The exercise proved highly instructive: sklearn is easy to use, and its documentation is excellent. The following captures my current state of mind:

It came as a surprise to me that understanding an algorithm and implementing it were two completely different steps.

All KMeans plots produced by the code.
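For readers who want a feel for how plots of this kind can be produced, here is a rough sketch using sklearn's `KMeans`. It is not the code from this repository: the function names, the choice to standardize the features first, and the default of three clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_features(X, n_clusters=3, seed=0):
    """Standardize X, fit KMeans, and return (X_scaled, labels, centers).

    Hypothetical helper; standardizing first is an assumption, made so
    that features on larger scales do not dominate the distances.
    """
    X_scaled = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X_scaled)
    return X_scaled, labels, model.cluster_centers_


def plot_clusters(X_scaled, labels, centers, out_path="kmeans.png"):
    """Scatter the first two columns, colored by cluster, and save to disk."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, s=15)
    ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=80)
    ax.set_xlabel("feature 1 (standardized)")
    ax.set_ylabel("feature 2 (standardized)")
    ax.set_title(f"KMeans with {len(centers)} clusters")
    fig.savefig(out_path)
    plt.close(fig)
```

Given a two-column numeric array `X` taken from the dataset, `plot_clusters(*cluster_features(X))` writes a single scatter plot. Repeating the call for several values of `n_clusters` would give a set of plots to compare.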

Three highlights.

The code produces the above visualizations for all algorithms. Here are three highlights.