Machine Learning – Classification and Regression Analysis

Machine Learning is the science and art of programming computers so they can learn from data.

For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (flagged by users, detected by other methods) and examples of regular (non-spam, also called “ham”) emails.

The examples that the system uses to learn are called the training set. Newly ingested data, which the model has not seen during training, is called the test set. The performance of the prediction model is measured by its accuracy, which is the objective of this project.
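As a minimal sketch of the accuracy measure, assuming it means the fraction of test examples whose predicted label matches the true one (the labels below are invented for illustration):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test-set labels and model predictions
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(y_true, y_pred))  # 4 of 5 predictions match -> 0.8
```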

The tools

To tackle this, Python (version 3) will be used, along with the scikit-learn package. You can find more info about this package on the official page.

https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Supervised learning

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
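To make the shapes concrete, here is a small sketch with invented numbers: X holds one row per sample and one column per feature, and y holds one target value per sample.

```python
import numpy as np

# Hypothetical data: 4 samples, each described by 3 features
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])
# One target value per sample
y = np.array([0, 0, 1, 1])

n_samples, n_features = X.shape
print(n_samples, n_features)  # 4 3
print(y.shape)                # (4,)
```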

All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.

If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.

When doing classification in scikit-learn, y is a vector of integers or strings.
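The fit/predict interface and the two task types described above can be sketched as follows. The dataset is hypothetical and deliberately tiny; it only shows that a classifier accepts string labels while a regressor predicts a continuous value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical one-feature dataset
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: y holds a finite set of labels (strings here)
y_labels = np.array(["small", "small", "small", "large", "large", "large"])
classifier = LogisticRegression()
classifier.fit(X, y_labels)
predicted_labels = classifier.predict(np.array([[2.5], [10.5]]))
print(predicted_labels)

# Regression: y is a continuous value (here exactly 2 * x)
y_values = 2.0 * X.ravel()
regressor = LinearRegression()
regressor.fit(X, y_values)
predicted_values = regressor.predict(np.array([[4.0]]))
print(predicted_values)
```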

The Models

LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.
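A small sketch of what "as small as possible" means here, using invented data points: the fitted line has a smaller sum of squared residuals than a line with a slightly different slope.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy samples of roughly y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 3.9, 7.2, 9.8, 13.1])

model = LinearRegression().fit(X, y)

# Residuals are the differences between observed and predicted values
residuals = y - model.predict(X)
fitted_sse = float(np.sum(residuals ** 2))

# A line with a slightly different slope gives a larger sum of squares
other_predictions = (model.coef_[0] + 0.1) * X.ravel() + model.intercept_
other_sse = float(np.sum((y - other_predictions) ** 2))

print(model.coef_[0], model.intercept_)
print(fitted_sse, other_sse)
```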

LogisticRegression, despite its somewhat counter-intuitive name, is a classification model. It is the better choice for classification tasks, where linear regression is not the right approach because it gives too much weight to data far from the decision frontier. Instead, a sigmoid (logistic) function is fitted to the data.
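The sigmoid idea can be sketched as follows, on an invented two-class dataset. The check at the end assumes scikit-learn's default settings for a two-class problem, where the predicted probability of the second class is the sigmoid of the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical one-feature, two-class dataset
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# decision_function returns the linear score w * x + b for each sample;
# passing it through the sigmoid gives the probability of class 1
scores = model.decision_function(X)
probabilities = model.predict_proba(X)[:, 1]
print(np.allclose(sigmoid(scores), probabilities))
```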

The Data

The data is provided in a CSV file with around 2500 rows and 5 columns (DRAFT, ACT, SLAST, FLAST and PREDICTION). Correct formatting and integrity of values cannot be assured, so additional processing is needed.

The Code

We need three main libraries to start:

  • numpy, which provides an N-dimensional array object, along with tools for linear algebra, Fourier transforms and random numbers.
    It can also be used as an efficient multi-dimensional container of generic data, where arbitrary data types can be defined.
  • pandas, which provides high-performance, easy-to-use data structures and data analysis tools.
  • sklearn (the import name of scikit-learn), the main machine learning library, with simple and efficient tools for data mining and data analysis. It has capabilities for classification, regression, clustering, dimensionality reduction, model selection and data preprocessing.

A non-essential but useful library is matplotlib, used to plot sets of data.

Before sklearn models can work with the data, it has to be encoded as numbers. As the sample data contains strings, or labels, a LabelEncoder is needed. Next, the prediction model is declared; here a LogisticRegression model is used.

The input data file path is also declared, in order to be loaded with pandas.read_csv().

import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

encoder = LabelEncoder()
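# With the lbfgs solver, multi-class problems are fitted as a multinomial
# model by default; the multi_class argument is deprecated in recent releases
model = LogisticRegression(solver='lbfgs', max_iter=5000)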

# Input dataset
file = "sample_data.csv"

The CSV file can be loaded into a pandas dataframe in a single line. The library also provides a convenient method to remove any rows with missing values.

# Use pandas to load the CSV. Pandas can read mixed data with numbers and strings.
# on_bad_lines='skip' (pandas 1.3+) replaces the removed error_bad_lines=False
data = pd.read_csv(file, header=0, on_bad_lines='skip')
# Remove missing values
data = data.dropna()

print("Valid data items : %s" % len(data))
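To illustrate what skipping malformed rows and calling dropna() do, here is a sketch on a small in-memory CSV. The content is invented, and the example assumes pandas 1.3 or later, where the option for skipping malformed rows is spelled on_bad_lines='skip'.

```python
import io

import pandas as pd

# Hypothetical CSV content: one row has an extra field, one has a missing value
raw_csv = io.StringIO(
    "DRAFT,ACT,PREDICTION\n"
    "a,1,yes\n"
    "b,2,no,unexpected\n"
    "c,,yes\n"
    "d,4,no\n"
)

# Rows with too many fields are skipped instead of raising an error
sample = pd.read_csv(raw_csv, header=0, on_bad_lines="skip")
rows_after_load = len(sample)

# dropna() removes the row whose ACT value is missing
sample = sample.dropna()
rows_after_dropna = len(sample)

print(rows_after_load, rows_after_dropna)
```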

Once loaded, the data needs to be encoded before it can be fitted by the prediction model. This is handled by the previously declared LabelEncoder. Once encoded, the x and y datasets are selected. The pandas library provides a way to drop entire columns from a dataframe, which makes it easy to select data.

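# Encode every column as integers (the encoder is refitted for each column)
encoded_data = data.apply(encoder.fit_transform)
# x keeps the four feature columns; y keeps only the PREDICTION column
x = encoded_data.drop(columns=['PREDICTION'])
y = encoded_data.drop(columns=['DRAFT', 'ACT', 'SLAST', 'FLAST'])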
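The per-column encoding step can be seen in isolation in this sketch. The frame is hypothetical; it only shows that each column gets its own integer codes and that the encoder keeps the classes of the last column only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with string columns
frame = pd.DataFrame({
    "ACT": ["run", "walk", "run", "swim"],
    "PREDICTION": ["yes", "no", "no", "yes"],
})

encoder = LabelEncoder()
# apply() calls fit_transform once per column, so each column gets its own
# integer codes, assigned in sorted order of its distinct values
encoded = frame.apply(encoder.fit_transform)
print(encoded)

# After apply(), the encoder only remembers the last column it was fitted on
print(list(encoder.classes_))
```

Because the encoder is refitted for every column, decoding predictions back to the original labels would need a separate LabelEncoder kept for the target column.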

The main objective is to test against different lengths of train and test data, to find out how much data provides the best accuracy. The lengths of data will be incremented in steps of 100 to get a broad variety of results.

length = 100
scores = []
lengths = []
while length < len(x):
    # Train on the first `length` rows
    x_train = x[:length]
    y_train = y[:length]
    # Test on a random sample of the rows not used for training;
    # labels are picked by index so they match the sampled rows
    x_rest = x[length:]
    x_test = x_rest.sample(n=min(length, len(x_rest)))
    y_test = y.loc[x_test.index]
    print("Fitting model for %s training values" % length)
    model.fit(x_train, y_train.values.ravel())
    score = model.score(x_test, y_test.values.ravel())
    print("Score for %s training values is %0.6f" % (length, score))
    # Record the training size that produced this score, then move on
    scores.append(score)
    lengths.append(length)
    length = length + 100

Finally, a plot is made with the accuracy scores.

pyplot.plot(lengths, scores)
pyplot.ylabel('accuracy')
pyplot.xlabel('training values')
pyplot.show()
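As a cross-check on the manual loop above, scikit-learn also ships a learning_curve helper that trains on increasing amounts of data and scores on held-out folds. The sketch below is not the method used in this project; it runs on a synthetic dataset standing in for the encoded CSV, and the sizes and fold count are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the encoded dataset (assumption: 4 features, 2 classes)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Train on five increasing fractions of each training fold and score on the
# held-out fold, using 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=5000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average the held-out accuracy over the folds for each training size
mean_test_scores = test_scores.mean(axis=1)
for size, score in zip(train_sizes, mean_test_scores):
    print("Score for %s training values is %0.6f" % (size, score))
```

The resulting train_sizes and mean_test_scores can be passed to pyplot.plot in the same way as the lists built by the loop.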

Spiceworks Customization

Andrew Foster at Topland Communications reached out to me via Upwork, looking to customize and fine-tune an existing Spiceworks installation.

After a quick inspection, I decided to tackle the project by compacting the database first. Spiceworks keeps a lot of logs about system activity, which are located in C:\Program Files\Spiceworks\Log. In order to clean them, the first step is to stop the Spiceworks service.

Logs are stored in two main locations:

  • C:\Program Files\Spiceworks\Log, for the Spiceworks service
  • C:\Program Files\Spiceworks\httpd\log\, where the Apache server keeps them
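The cleanup of these folders could be scripted along the following lines. This is only a sketch: the helper name is hypothetical, and the "*.log" pattern is an assumption about how the log files are named, so check the folder contents before deleting anything.

```python
from pathlib import Path


def remove_log_files(log_dir, pattern="*.log"):
    """Delete files matching `pattern` directly inside `log_dir`.

    Returns the names of the deleted files. Intended to be run only
    while the Spiceworks service is stopped.
    """
    removed = []
    directory = Path(log_dir)
    if not directory.is_dir():
        # Nothing to do if the folder does not exist
        return removed
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

With the service stopped, calling remove_log_files(r"C:\Program Files\Spiceworks\Log") and then the same function on the httpd\log folder would clear both locations.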

Once the logs were cleaned, I compacted the database to improve performance and started the service again.

Ticket rules were configured to auto-assign support tickets, saving the support operators time.

The user portal was also customized to match the company colors and logo.

ISPConfig 3 in Digital Ocean Droplet

The client wanted to set up an ISPConfig 3 control panel on a Digital Ocean droplet.

Digital Ocean works well for this kind of service, because public IP addresses are provisioned directly on the server. The configuration is also easier to build and maintain, thanks to the integrated Digital Ocean firewall.

ISPConfig makes it possible to manage servers and hosting plans from a friendly GUI.

A new order of IT

There is a new order of IT. In recent years, a very disruptive element has appeared in the field under the name of IT as a Service (ITaaS). At its top there is a crystal-clear examination and understanding of business and technology needs, and at its bottom there is a foundation built from a massive set of virtualized resources.

Now, IT administrators can find a set of previously configured building blocks that can be combined and deployed very quickly. Using this technology, the IT departments can respond to the changing needs of the business with optimized yet highly standardized solutions.

Using the ITaaS model, most information technology solutions can be deployed when they are needed, at any time, paying only for what is used. It is a shift in operational and organizational procedures to run IT like a business and a service provider.

This approach allows IT areas to be a strategic partner of the business.

This service model requires a platform or catalog comprising information about the users and the services each one consumes. It should also show which services each user is subscribed to, and how service usage will be charged back to the respective business unit.

Once all the services are cataloged and published, some questions remain:

  • Can the business units act upon it?
  • Is it just a static document, or is it a dynamic tool?
  • Can the services be requested directly from within the catalog?
  • Is it as easy to use as any online store?

Most IT departments already have a set of tools to manage and monitor their infrastructure. These tools often also keep track of costs, orders, helpdesk requests and many other functions within IT. There may even be another service catalog in another division of the organization. All of these possibilities must be considered when selecting a service catalog tool.

  • Can the new catalog integrate with the existing tools?
  • Will it replace an existing tool?
  • Is process automation already bundled into the platform, or will the IT department need to engineer it? Is it scalable?
  • Like any service to a business, the catalog tool carries a cost. When updates or fixes become available, will the vendor charge for them? How is the licensing scheme calculated?

Answers to these and other questions are needed before knowing how much the new service catalog tool will really cost the organization, and how to design a business case for its acquisition. Even when all these questions are answered, it takes time to retrain the staff and restructure the habitual policies and procedures.

In the traditional IT approach, everything is organized vertically. There is a storage team, a networking team, a system administration team and a DBA team. But in the ITaaS world, the approach is now horizontal. There is a cloud services architecture team, and most of the infrastructure is virtualized and abstracted, so everybody in the IT team can work across different functions.

This newer horizontal organization usually produces highly skilled personnel for cloud computing implementations. This kind of employee is rare and in high demand.

When the ITaaS model is deployed in a company or organization, it can be difficult to retain the skilled cloud personnel. Sometimes the solution is found in service providers, because that is where the talent is now working.

The first step in the transformation is to understand what the organization is dealing with today. IT infrastructures are complex and usually have an unstructured approach to the delivery of IT services.

Mobile users and helpdesk requests are sometimes serviced ad hoc, often without attention to business requirements. This leads to a complex mesh of user requirements and available services that can be difficult to untangle.

Also, should the IT team try to preserve the current user experience? Or should it set a breakpoint where many elements are replaced with a brand new user experience?

In conclusion, IT teams should discover which services they are delivering today, take control of these services, and put in place a delivery platform capable of delivering current and future services. They must also ensure that the platform can integrate with the most widely used desktop and application delivery approaches, simplifying the user experience while meeting all security and compliance requirements.

Service delivery should not just focus on application installation; it must consider other requirements so the services can be delivered fully. The solution should integrate the existing tools and processes, while also giving enough flexibility to enable any other services that your users need.

When the service delivery platform is ready to go, the catalog of services should be distributed. All users should receive a service offering relevant to their needs and their position in the organization. Also, in an optimal service delivery catalog, users should be able to select a service and, subject to previously established rules and approvals, have it delivered directly to them in an automated process, fully provisioned and working.

A well-designed and efficient service catalog can result in huge advantages for the IT department and for the business.

  • Better communication between the IT team and users, because of ease of administration and the service-oriented approach
  • Improved understanding of the business requirements, issues and challenges
  • Costs are allocated to specific business units
  • Standards are established and consistency is achieved
  • IT operational costs are reduced by identifying and eliminating unnecessary IT services
  • Computing resources are reallocated to critical business systems

It is also important that, whatever platform is used to provide the catalog, the solution is adapted to the user base and to the services delivered. This information is critical for implementing chargeback, so a good service catalog platform should be able to answer some questions.

  • What is the cost of delivering each service?
  • How much should be charged for each service?
  • Who consumes each service?
  • Who should be billed?
  • Are some services provided free of charge?
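The chargeback questions above can be made concrete with a small sketch. Everything in it is hypothetical (service names, prices, users and business units); it only shows how a catalog with prices and subscriptions is enough to work out who consumes each service and what each business unit should be billed.

```python
from collections import defaultdict

# Hypothetical catalog: monthly price charged per subscription.
# A price of 0.0 marks a service provided free of charge.
CATALOG_PRICES = {
    "virtual-desktop": 40.0,
    "email": 5.0,
    "wiki": 0.0,
}

# Hypothetical subscriptions: (user, business unit, service)
SUBSCRIPTIONS = [
    ("alice", "finance", "virtual-desktop"),
    ("alice", "finance", "email"),
    ("bob", "sales", "email"),
    ("bob", "sales", "wiki"),
    ("carol", "engineering", "virtual-desktop"),
]


def chargeback(prices, subscriptions):
    """Total monthly charge per business unit."""
    totals = defaultdict(float)
    for _user, unit, service in subscriptions:
        totals[unit] += prices[service]
    return dict(totals)


def consumers(subscriptions):
    """Users subscribed to each service."""
    by_service = defaultdict(list)
    for user, _unit, service in subscriptions:
        by_service[service].append(user)
    return dict(by_service)


print(chargeback(CATALOG_PRICES, SUBSCRIPTIONS))
print(consumers(SUBSCRIPTIONS))
```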

ITaaS doesn’t have to be an additional layer of complexity. IT departments and organizations should partner with a vendor who understands the process, so you can get solutions that help you to address each step of the way.

It can deliver huge benefits to you and your business.