A collection of notebooks and methods for exploring data and implementing machine learning methods. Machine Learning, Neural Networks, Feature Selection, Time-Series Prediction
This is a compilation of notebooks that I created for data analysis or for exploring machine learning algorithms. It covers data visualization and wrangling, feature selection, classifiers, and neural networks.
Given historical weather data, can I predict whether tomorrow will be rainy in New York? In this example, I use historical weather data from New York to ask whether, given the previous days, I can predict if the next day (tomorrow) will be rainy. I largely use LSTM neural networks and Keras/TensorFlow to address this.
Datasource: Kaggle, SelfishGene, https://www.kaggle.com/selfishgene/historical-hourly-weather-data
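Framing "previous days predict tomorrow" as a supervised problem usually means cutting the series into fixed-length windows, each paired with the label of the day that follows. The sketch below shows one way to do that in plain Python. The function name, the window length, and the humidity and rain values are made up for illustration and are not taken from the notebook or the Kaggle dataset.

```python
def make_windows(values, labels, window):
    """Pair each run of `window` consecutive days with the label of the next day."""
    X, y = [], []
    for start in range(len(values) - window):
        X.append(values[start:start + window])  # the previous `window` days
        y.append(labels[start + window])        # whether the following day was rainy
    return X, y


# Hypothetical daily humidity readings and rain flags (1 = rainy).
humidity = [60, 72, 85, 90, 55, 65, 88]
rained = [0, 0, 1, 1, 0, 0, 1]

X, y = make_windows(humidity, rained, window=3)
for features, target in zip(X, y):
    print(features, "->", target)
```

Each row of X could then be fed to a sequence model such as an LSTM, with y as the binary target.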
One useful data science approach in manufacturing is to predict the pass/fail outcome or quality of a product from sensor readouts on the manufacturing line. The SECOM dataset has a large number of such sensor readings, along with pass/fail labels, for semiconductor manufacturing. However, the data set suffers from some common problems, such as missing values and class imbalance.
In the first notebook, I deal with missing data, perform feature selection using three methods (a wrapper, a filter, and a Lasso/L1 embedded approach), and deal with class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE). I then use SVM, decision tree, and random forest classifiers to predict pass/fail.
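The core idea behind SMOTE is to create new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that single step, not the notebook's code. The function name, the value of k, and the toy points are assumptions for illustration; in practice a library such as imbalanced-learn provides a full implementation.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point on the segment between a random minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class ("fail") sensor readings with two features.
fails = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [3.0, 4.0]]
synthetic = smote_like_sample(fails, k=2, rng=random.Random(42))
print(synthetic)
```

Repeating this step until the classes are balanced gives the oversampled training set; the test set should be left untouched.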
As a follow-up to the non-neural-network classifiers, I tested three neural network classifiers (a simple neural network, a deep neural network, and an LSTM neural network) on the SECOM dataset. The goal was to build a classifier that predicts the class (pass/fail) of the manufactured product.
Neural networks can be used for text prediction and natural language processing. Keras includes a dataset of 11,228 newswires from Reuters, labeled over 46 topics. Predicting the topic depends not just on the words, but on the sequence in which they are presented. The newswires are therefore an ideal test case for LSTM neural networks, which take sequences as inputs.
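Because newswires differ in length and a sequence model typically expects fixed-length input, a common preprocessing step is to pad or truncate each list of word indices to the same length. The sketch below illustrates that step in plain Python, assuming each newswire is a list of integer word indices. The helper name and the sample indices are hypothetical, and the left-padding choice is an assumption that mirrors the default of Keras's own `pad_sequences` utility.

```python
def pad_sequence(seq, maxlen, pad_value=0):
    """Left-pad a list of word indices to `maxlen`, keeping only the last
    `maxlen` items when the sequence is too long."""
    trimmed = seq[-maxlen:]
    return [pad_value] * (maxlen - len(trimmed)) + trimmed


# Hypothetical encoded newswires of different lengths.
short_wire = [5, 8, 2]
long_wire = [1, 2, 3, 4, 5, 6]

print(pad_sequence(short_wire, maxlen=5))
print(pad_sequence(long_wire, maxlen=4))
```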
One of the potentially transformative applications of machine learning is in medical diagnostics. In this notebook, I explore decision trees that split using either the Gini index or entropy. This notebook also demonstrates decision tree visualization, allowing the user to see which attributes and values are useful for predicting cancer in this dataset.
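For reference, both splitting criteria measure how mixed the class labels in a node are: Gini impurity is 1 minus the sum of squared class proportions, and entropy is the negative sum of each proportion times its base-2 logarithm. The sketch below computes both for a list of labels. The label values are hypothetical and are not taken from the notebook's dataset.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


# Hypothetical labels in a tree node: an even mix versus a pure node.
mixed = ["benign", "benign", "malignant", "malignant"]
pure = ["benign", "benign", "benign"]

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

A decision tree picks the split whose child nodes reduce the chosen impurity measure the most.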
A useful business application of machine learning is to predict a customer's purchase choice based on their demographics. In this notebook, I explore random forests and gradient-boosted decision trees for predicting whether bank customers will purchase a specific product.
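As a rough illustration of how the two ensemble methods can be compared side by side, the sketch below fits both on the same train/test split using scikit-learn. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for the bank dataset. The hyperparameters are placeholders, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer features and a purchased / not purchased label.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.2f}")
```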
I demonstrate the use of support vector machines (SVM) for predicting the number of rings in abalone shells from other parameters in the dataset, such as length and weight. The number of rings is related to the age of the abalone. This notebook also demonstrates the use of GridSearchCV for hyperparameter tuning.
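A minimal sketch of hyperparameter tuning with GridSearchCV might look like the following. It assumes scikit-learn and NumPy are installed, treats ring count as a regression target with `SVR` (the notebook may frame it differently), and uses synthetic "length" and "weight" columns rather than the real abalone data. The parameter grid is illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: two features ("length", "weight") and a noisy ring count.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(120, 2))
y = 5 + 10 * X[:, 0] + 4 * X[:, 1] + rng.normal(0, 0.5, size=120)

# Scale the features before the SVM, then search over C and the kernel.
pipeline = make_pipeline(StandardScaler(), SVR())
param_grid = {"svr__C": [1, 10, 100], "svr__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated R^2 of the best setting
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, which avoids leaking information from the validation split.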