Thursday, February 6, 2014

How To Learn Data Mining

Data mining is the science and art of extracting patterns and relationships between variables from datasets.  It is an exciting field that blends techniques from statistics and computer science, and it is increasingly important as we enter the world of Big Data.  Google, Microsoft, Apple, Amazon, Facebook, and many other tech companies use data mining to learn more about your computer-using habits in order to increase their profits.  Our government has used (and probably still uses) data mining to combat terrorism (even though this may violate the Fourth Amendment of the US Constitution).  For scientifically inclined individuals, I think this is one of the best fields to go into.  If you want to become very employable, learn how to mine data.

Learning to mine data is not always an easy process.  Make sure you have a sound foundation in linear algebra, multivariate calculus, probability, and statistics.  Many of the more basic data mining techniques, such as linear/non-linear regression and linear/quadratic discriminant analysis, are quite old.  Beyond that, knowing at least one programming language is beneficial.  MATLAB, R, and Stata are popular among scientists.  Computers are used because the datasets are often very large, even if the statistical algorithm is fairly straightforward.  However, some techniques come mainly from computer science, such as support vector machines, and can be learned with comparatively little statistics or probability.
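To give a feel for how simple one of the older techniques is, here is a minimal sketch of simple linear regression (ordinary least squares with one predictor).  I wrote it in Python only for illustration; the function name fit_line and the five data points are made up and do not come from any real dataset.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    fit_line is a hypothetical helper written for this post.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data that lies close to the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(xs, ys)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

With real datasets you would normally let R, MATLAB, or a library do this for you, but the arithmetic underneath is no more than what is shown here.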

I used this textbook for a data mining course at Columbia University (which was mostly geared towards statistics graduate students): http://www-bcf.usc.edu/~gareth/ISL/
Here is a free, more advanced textbook you can use that is fairly popular: http://statweb.stanford.edu/~tibs/ElemStatLearn/

I would follow the order presented in the books.  They focus on supervised learning, where each observation comes with a known response (a label or outcome) and you want to learn how the other variables relate to it.  For example, you look through sample data containing information about 10,000 Americans (e.g. sex, age, height, weight, IQ, religion, ethnicity, region, income) and you want to know whether income can be predicted from weight.  Unsupervised learning is where there is no response variable to predict because the dataset is not labeled, so you look for structure in the data itself.  An example would be plotting a bunch of data points on a graph and noticing that they fall into certain clusters or form a certain shape.
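The clustering example can be sketched with k-means, one common unsupervised method.  This is only a rough illustration in Python: the function name kmeans, the six points, and the hand-picked starting centers are all assumptions of mine, and real software chooses starting centers more carefully.

```python
import math


def kmeans(points, centers, iterations=10):
    """Basic k-means (Lloyd's algorithm) on a list of coordinate tuples.

    kmeans is a hypothetical helper; the starting centers are passed in
    by hand to keep the example deterministic.
    """
    def nearest(p, cs):
        # Index of the center closest to point p (Euclidean distance).
        return min(range(len(cs)), key=lambda i: math.dist(p, cs[i]))

    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster,
        # keeping the old center if a cluster ended up empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]

    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Made-up, unlabeled points that visibly form two groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

Notice that the code is never told which group a point belongs to; it only uses distances between points, which is what makes this unsupervised.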

This is definitely a subject that is more fun to do than to write or read about.  The point of this article was simply to give more exposure to an important topic.  If you have an interest in anything science-related, look into data mining and data science more generally.  Taking a course in it is probably one of the quickest ways to land a decently paying job.  I even know an actuary and a physicist who left their jobs to work as data scientists.  Those are just some things to consider.
