Machine Learning – Weighted Train Data

The last post gave an introduction to Machine Learning and showed how outcomes can be predicted using sklearn’s LogisticRegression.

Sometimes the input data requires additional processing to favor certain classes of information that are considered more valuable or more representative of the outcome.

The LogisticRegression model lets you set this preference, or weight, when the model is created (through the class_weight parameter), or later, when it is fitted (through the sample_weight argument of fit()).

The data used in the previous entry had four main classes: DRAFT, ACT, SLAST and FLAST. Once the data is encoded and fitted, each one can be selected by its index. I prefer to define some mnemonic selectors to ease the coding and make the whole code more human-friendly.

x_columns_names = ['DRAFT', 'ACT', 'SLAST', 'FLAST']
y_columns_names = ['PREDICTION']

# Indexes for columns, used for weighting
DRAFT = 0
ACT = 1
SLAST = 2
FLAST = 3

# Weights
DRAFT_WEIGHT = 1
ACT_WEIGHT = 1
SLAST_WEIGHT = 1
FLAST_WEIGHT = 1

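The model can be initialized later as follows, where the class_weight parameter references the previous helpers. Keep in mind that scikit-learn matches the class_weight keys against the labels found in the target y, so these indexes only have the intended effect if they correspond to the encoded PREDICTION values.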

model = LogisticRegression(
    solver='lbfgs',
    multi_class='multinomial',
    max_iter=5000,
    class_weight={
        DRAFT: DRAFT_WEIGHT,
        ACT: ACT_WEIGHT,
        SLAST: SLAST_WEIGHT,
        FLAST: FLAST_WEIGHT,
    })
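As a minimal sketch of the two weighting options, the example below uses random toy data (hypothetical, not the data set from the post) and assumes the target has four encoded classes, 0 to 3. It gives class 2 three times the weight of the others, first with class_weight at creation time and then with sample_weight at fit time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 4 encoded feature columns and a target with classes 0..3
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 4))
y = rng.integers(0, 4, size=200)

# Option 1: weights per class, set when the model is created.
# The dictionary keys are labels of y, not column positions of X.
weighted = LogisticRegression(max_iter=5000, class_weight={0: 1, 1: 1, 2: 3, 3: 1})
weighted.fit(X, y)

# Option 2: weights per row, passed when the model is fitted.
row_weights = np.where(y == 2, 3.0, 1.0)
fitted_later = LogisticRegression(max_iter=5000)
fitted_later.fit(X, y, sample_weight=row_weights)

print(weighted.classes_)
print(weighted.coef_.shape, fitted_later.coef_.shape)
```

With all weights left at 1, as in the helpers above, the model should behave like an unweighted one, so the weights only start to matter once one of them is changed.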

Machine Learning – Classification and Regression Analysis

Machine Learning is the science and art of programming computers so they can learn from data.

For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (flagged by users, detected by other methods) and examples of regular (non-spam, also called “ham”) emails.

The examples that the system uses to learn are called the training set. The newly ingested data is called the test set. The performance measure of the prediction model is called accuracy, and maximizing it is the objective of this project.

The tools

To tackle this, Python (version 3) will be used, along with the scikit-learn package. You can find more information about this package on its official page.

https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Supervised learning

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.

All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.

If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.

When doing classification in scikit-learn, y is a vector of integers or strings.
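A minimal sketch of the fit(X, y) and predict(X) pattern, using made-up one-feature data with string labels (the values and label names are only illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical, clearly separated toy data: one feature, two string labels
X_train = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y_train = ["low", "low", "low", "high", "high", "high"]

estimator = LogisticRegression()
estimator.fit(X_train, y_train)          # learn the link between X and y

# Predict labels for unlabeled observations
predicted = estimator.predict([[0.5], [9.5]])
print(predicted)
```

The target here is a plain list of strings, so no manual encoding of y is needed for a classifier.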

The Models

LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.
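To illustrate the least squares idea, here is a small sketch on made-up points that lie exactly on the line y = 2x + 1, so the fitted parameters can be checked by eye and the sum of squared residuals should be essentially zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

reg = LinearRegression().fit(X, y)

# Residuals are the differences between observed and predicted values
residuals = y - reg.predict(X)
sum_squared_residuals = float(np.sum(residuals ** 2))

print(reg.coef_, reg.intercept_, sum_squared_residuals)
```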

LogisticRegression, which despite its counter-intuitive name is a classification model, is a better choice when linear regression is not the right approach, since linear regression gives too much weight to data far from the decision frontier. A linear approach that avoids this is to fit a sigmoid, or logistic, function.
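The logistic function itself is short enough to sketch in plain Python. The helper name sigmoid below is my own choice for illustration; it maps any real number into the range between 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real value into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# Values far from zero saturate towards 0 or 1, so distant points
# cannot pull the fit as strongly as they do in a straight-line model.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, round(sigmoid(z), 4))
```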

The Data

The data comes in a CSV file with around 2500 rows and 5 columns. Correct formatting and integrity of the values cannot be assured, so additional processing will be needed. The sample file is like this.

The Code

We need three main libraries to start:

  • numpy, which is basically an N-dimensional array object. It also has tools for linear algebra, Fourier transforms and random numbers.
    It can be used as an efficient multi-dimensional container of generic data, where arbitrary data types can be defined.
  • pandas, which provides high-performance, easy-to-use data structures and data analysis tools.
  • sklearn, the main machine learning library. It has capabilities for classification, regression, clustering, dimensionality reduction, model selection and data preprocessing.

A non-essential but useful library is matplotlib, used to plot sets of data.

Before sklearn models can work with the data, it has to be encoded. As the sample data contains strings, or labels, a LabelEncoder is needed. Next, the prediction model is declared, in this case a LogisticRegression model.

The input data file path is also declared, in order to be loaded with pandas.read_csv().

import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

encoder = LabelEncoder()
model = LogisticRegression(
    solver='lbfgs', multi_class='multinomial', max_iter=5000)

# Input dataset
file = "sample_data.csv"

The CSV file can be loaded into a pandas dataframe in a single line. The library also provides a convenient method to remove any rows with missing values.

# Use pandas to load the CSV. Pandas can handle mixed data with numbers and strings.
# on_bad_lines='skip' (pandas 1.3 or newer) replaces the removed error_bad_lines=False
data = pd.read_csv(file, header=0, on_bad_lines='skip')
# Remove missing values
data = data.dropna()

print("Valid data items : %s" % len(data))
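To see what the loading and cleaning steps do, here is a small sketch that reads a made-up CSV from memory instead of a file. The rows are invented for illustration, and it assumes pandas 1.3 or newer, where on_bad_lines is available.

```python
import io
import pandas as pd

# Hypothetical CSV content with one malformed row and one missing value
raw = io.StringIO(
    "DRAFT,ACT,SLAST,FLAST,PREDICTION\n"
    "a,b,c,d,yes\n"
    "a,b,c,d,no,EXTRA\n"   # too many fields: skipped by on_bad_lines='skip'
    "a,,c,d,yes\n"         # empty ACT value: loaded as NaN, removed by dropna()
    "e,f,g,h,no\n"
)

sample = pd.read_csv(raw, header=0, on_bad_lines="skip")
clean = sample.dropna()

print("Loaded rows : %s" % len(sample))
print("Valid rows  : %s" % len(clean))
```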

Once loaded, the data needs to be encoded before it can be fitted into the prediction model. This is handled by the previously declared LabelEncoder. Once encoded, the x and y datasets are selected. The pandas library provides a way to drop entire columns from a dataframe, which makes it easy to select data.

encoded_data = data.apply(encoder.fit_transform)
x = encoded_data.drop(columns=['PREDICTION'])
y = encoded_data.drop(columns=['DRAFT', 'ACT', 'SLAST', 'FLAST'])
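One thing to be aware of is that a single LabelEncoder refitted on every column only remembers the classes of the last column it saw. If the predictions need to be translated back to their original labels, one option is to keep one encoder per column. The sketch below shows this variant on a tiny made-up dataframe; the encoders dictionary is my own helper, not part of the code above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column sample
frame = pd.DataFrame({
    "DRAFT": ["x", "y", "x"],
    "PREDICTION": ["no", "yes", "no"],
})

# Keep one fitted encoder per column so each can be reversed later
encoders = {}
encoded = frame.copy()
for column in frame.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(frame[column])

# Translate encoded predictions back to the original labels
decoded = encoders["PREDICTION"].inverse_transform(encoded["PREDICTION"])

print(encoded)
print(decoded)
```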

The main objective is to test against different lengths of train and test data, to find out how much data provides the best accuracy. The lengths of data will be incremented in steps of 100 to get a broad variety of results.

length = 100
scores = []
lengths = []
while length < len(x):
    x_train = x[:length]
    y_train = y[:length]
    # Sample the test rows once and reuse their index, so x and y stay aligned
    x_test = x.sample(n=length)
    y_test = y.loc[x_test.index]
    print("Fitting model for %s training values" % length)
    model.fit(x_train, y_train.values.ravel())
    score = model.score(x_test, y_test.values.ravel())
    print("Score for %s training values is %0.6f" % (length, score))
    # Store the results before incrementing, so each score matches its length
    scores.append(score)
    lengths.append(length)
    length = length + 100

Finally, a plot is made with the accuracy scores.

pyplot.plot(lengths, scores)
pyplot.ylabel('accuracy')
pyplot.xlabel('values')
pyplot.show()
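A variant worth trying, sketched here on synthetic data rather than the CSV from this post, is to hold out a fixed test set with train_test_split, so the model is never scored on rows it was trained on. The data generation and the 25% test size are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data: 4 features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows; they are never used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

sizes = []
scores = []
for size in range(100, len(X_train) + 1, 100):
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:size], y_train[:size])
    sizes.append(size)
    scores.append(model.score(X_test, y_test))

for size, score in zip(sizes, scores):
    print("Score for %s training values is %0.6f" % (size, score))
```

The sizes and scores lists can be plotted the same way as before.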

Customizing NetBox Templates

NetBox is an IP address management (IPAM) and data center infrastructure management (DCIM) tool. Initially conceived by the network engineering team at DigitalOcean, NetBox was developed specifically to address the needs of network and infrastructure engineers.

When I started using NetBox at my daily job, I planned to use it as a replacement for all the spreadsheets I had for switch configurations, IP address management, secrets, and VLAN assignments. NetBox can handle all of this and more, but the interface didn’t suit my needs.

NetBox is built using the Python Django framework, which I have used for other projects. I used Visual Studio Code to clone the repository and debug, as it has native support for the Django template language.

I keep a copy of the repository on my local machine to make modifications easier. Beforehand, I set DEBUG = True in netbox/configuration.py, and allowed localhost and my local network to access the development server. I also adjusted the settings to connect to the existing PostgreSQL database.
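As a rough sketch, the relevant part of netbox/configuration.py could look like the fragment below. The setting names DEBUG, ALLOWED_HOSTS and DATABASE come from NetBox's configuration file, but every value shown is a placeholder I made up for illustration, and the exact keys may differ between NetBox versions.

```python
# Hostnames or addresses the development server will answer to (placeholders)
ALLOWED_HOSTS = ['localhost', '192.168.1.10']

# Connection to the existing PostgreSQL database (all values are placeholders)
DATABASE = {
    'NAME': 'netbox',
    'USER': 'netbox',
    'PASSWORD': 'change-me',
    'HOST': 'db.example.internal',
    'PORT': '',
}

# Enable detailed error pages; only for development, never for production
DEBUG = True
```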

Connecting the existing DB to my local development server

This environment works for test purposes, but the best approach is to set up separate development and production environments, and commit your changes to production once everything is tested.

Using VSCode to debug Django

The URL definition for the single device view is around line #147 of the netbox/dcim/urls.py file, and it looks like this.

 url(r'^devices/(?P<pk>\d+)/$', views.DeviceView.as_view(), name='device'),
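The named group in that pattern is what hands the device ID to the view. A quick way to see it, outside Django and using only Python's re module, is the sketch below; it assumes the dcim/ prefix has already been stripped by the parent URL configuration.

```python
import re

# Same regular expression as in the URL definition
pattern = re.compile(r'^devices/(?P<pk>\d+)/$')

match = pattern.match('devices/570/')
pk = match.group('pk')          # captured as a string

print(pk)
print(pattern.match('devices/abc/'))   # non-numeric IDs do not match
```

Django passes the captured pk to the view as a keyword argument, which the view then uses to look up the device.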

Heading to the DeviceView view, I put a breakpoint on the interfaces QuerySet of the view definition, and launched the debugger. The default location is http://localhost:8000.

Setting up the debugger
Breakpoints

I headed to http://localhost:8000/dcim/devices/570/, where I had defined a switch with several VLANs, to hit the breakpoint and find out whether the QuerySet had information about the VLANs, or whether they were queried on a per-interface basis in the interface view.

QuerySet returns this

Lucky me, the QuerySet retrieved all the information I needed, and it is passed to the template via a render() call.

All the information I want is rendered on this table. This is the power of the Django framework. I added line #513 as an additional header for the VLANs column.

This table has a for loop which iterates over each interface of the device, so I edited the included template file at dcim/inc/interface.html.

Both the tagged and untagged VLAN groups have a bold title, followed by the VID and name of each VLAN. I used the dictsort filter, which is part of the Django framework, to sort all the VLANs by their VID.
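For readers unfamiliar with dictsort, its effect is roughly that of Python's sorted() with a key. The sketch below reproduces the idea in plain Python, outside the template; the VLAN entries and the vid and name keys are invented for the example and are not taken from the NetBox models.

```python
# Hypothetical VLANs, in the arbitrary order a query might return them
vlans = [
    {"vid": 30, "name": "voice"},
    {"vid": 10, "name": "mgmt"},
    {"vid": 20, "name": "users"},
]

# Rough equivalent of applying dictsort with the key "vid" in a template
sorted_vlans = sorted(vlans, key=lambda vlan: vlan["vid"])

# Build the "VID (name)" text shown for each VLAN
labels = ["%s (%s)" % (vlan["vid"], vlan["name"]) for vlan in sorted_vlans]
print(labels)
```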

dcim/inc/interface.html

The final result looks like the following image, and it makes it possible to keep track of all the VLANs on all ports at a glance. This is easier and more user-friendly than getting that information interface by interface, or making a new custom view.

New Template Rendering