Machine Learning is the science and art of programming computers so they can learn from data.
For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (flagged by users or detected by other methods) and examples of regular (non-spam, also called “ham”) emails.
The examples that the system uses to learn are called the training set. The newly ingested data it is evaluated on is called the test set. The performance measure used for the prediction model is accuracy, the fraction of correct predictions, and maximizing it is the objective of this project.
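As a quick illustration of what accuracy measures, here is a minimal sketch using scikit-learn's accuracy_score on invented labels:
from sklearn.metrics import accuracy_score

# Accuracy is the fraction of predictions that match the true labels
y_true = ["spam", "ham", "spam", "spam"]
y_pred = ["spam", "ham", "ham", "spam"]
print(accuracy_score(y_true, y_pred))  # 0.75: three of four predictions are correct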
The tools
To tackle this, Python (version 3) will be used, along with the scikit-learn package. You can find more info about this package on the official page:
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
Supervised learning
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
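A minimal sketch of that interface, on an invented toy dataset (the values here are for illustration only):
from sklearn.linear_model import LogisticRegression

# Toy data: 4 samples with a single feature each, and their string labels
X = [[0.0], [1.0], [2.0], [3.0]]
y = ['ham', 'ham', 'spam', 'spam']
clf = LogisticRegression()
clf.fit(X, y)                # learn the link between X and y
print(clf.predict([[1.8]]))  # expected: ['spam']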
If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.
When doing classification in scikit-learn, y is a vector of integers or strings.
The Models
LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.
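A minimal sketch of that on invented points roughly following y = 2x + 1:
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a line by minimizing the sum of squared residuals
X = np.array([[0], [1], [2], [3]])
y = np.array([1.0, 2.9, 5.1, 7.0])
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)  # close to [2.] and 1.0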
LogisticRegression, whose name is somewhat counter-intuitive since it is a classification model, is a better choice here: plain linear regression is not the right approach for classification, as it gives too much weight to data far from the decision frontier. The logistic approach instead fits a sigmoid, or logistic, function to the data.
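For reference, a minimal sketch of the logistic function itself, which squashes any real value into the (0, 1) range:
import numpy as np

# The logistic (sigmoid) function maps any real value into (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, right on the decision frontier
print(sigmoid(-5.0))  # ~0.0067, far on one side
print(sigmoid(5.0))   # ~0.9933, far on the other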
The Data
Data comes in a CSV file with around 2,500 rows and 5 columns. Correct formatting and integrity of the values cannot be assured, so additional processing will be needed. The sample file looks like this:
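Since the original sample is not reproduced here, the following is a hypothetical sketch of the layout, using the column names that appear in the code below; the real values will differ.
DRAFT,ACT,SLAST,FLAST,PREDICTION
value1,value2,value3,value4,label1
value5,value6,value7,value8,label2
...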
The Code
We need three main libraries to start:
- numpy, which basically provides an N-dimensional array object, along with tools for linear algebra, Fourier transforms and random numbers. It can be used as an efficient multi-dimensional container of generic data, where arbitrary data types can be defined.
- pandas, which provides high-performance, easy-to-use data structures and data analysis tools.
- sklearn (scikit-learn), the main machine learning library. It has capabilities for classification, regression, clustering, dimensionality reduction, model selection and data preprocessing.
A non-essential but useful library is matplotlib, used here to plot sets of data.
In order for the sklearn models to work on the data, it has to be encoded first. As the sample data contains strings, or labels, a LabelEncoder is needed. Next, the prediction model is declared; here a LogisticRegression model is used.
The input data file path is also declared, so the file can be loaded with pandas.read_csv().
import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
# Encoder to turn string labels into integer codes
encoder = LabelEncoder()
# Multinomial logistic regression; the lbfgs solver handles multiple classes
# natively, so the deprecated multi_class argument is no longer needed
model = LogisticRegression(solver='lbfgs', max_iter=5000)
# Input dataset
file = "sample_data.csv"
The CSV file can be loaded into a pandas dataframe in a single line. The library also provides a convenient method to remove any rows with missing values.
# Use pandas to load the CSV; it copes with mixed columns of numbers and strings.
# on_bad_lines='skip' drops malformed rows (it replaces the deprecated
# error_bad_lines=False of older pandas versions)
data = pd.read_csv(file, header=0, on_bad_lines='skip')
# Remove missing values
data = data.dropna()
print("Valid data items : %s" % len(data))
Once loaded, the data needs to be encoded so it can be fitted into the prediction model. This is handled by the previously declared LabelEncoder. Once encoded, the x and y datasets are selected. The pandas library provides a way to drop entire columns from a dataframe, which makes it easy to split the features from the target.
# Encode each column's strings as integer codes
encoded_data = data.apply(encoder.fit_transform)
# Features: every column except the target
x = encoded_data.drop(columns=['PREDICTION'])
# Target: only the PREDICTION column
y = encoded_data.drop(columns=['DRAFT', 'ACT', 'SLAST', 'FLAST'])
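To make what the encoder does concrete, a minimal sketch on invented labels (not from the real file):
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Classes are sorted alphabetically and mapped to 0, 1, 2, ...
print(enc.fit_transform(['red', 'green', 'red', 'blue']))  # [2 1 2 0]
print(enc.classes_)  # ['blue' 'green' 'red']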
The main objective is to test against different lengths of training and test data, to find out how much data provides the best accuracy. The lengths will be increased in steps of 100 to get a broad variety of results.
length = 100
scores = []
lengths = []
while length < len(x):
    # Train on the first `length` rows
    x_train = x[:length]
    y_train = y[:length]
    # Sample the same rows for features and target so they stay aligned
    # (note: drawn from the full dataset, so they may overlap the training slice)
    test_index = x.sample(n=length).index
    x_test = x.loc[test_index]
    y_test = y.loc[test_index]
    print("Fitting model for %s training values" % length)
    model.fit(x_train, y_train.values.ravel())
    score = model.score(x_test, y_test)
    print("Score for %s training values is %0.6f" % (length, score))
    # Record the score against the length actually used to train
    scores.append(score)
    lengths.append(length)
    length = length + 100
Finally, a plot is made with the accuracy scores.
pyplot.plot(lengths, scores)
pyplot.ylabel('accuracy')
pyplot.xlabel('training values')
pyplot.show()