## Introduction

In preparation for my third year at university, in the Summer of 2021 I read through Tom Mitchell's 1997 textbook *Machine Learning*. Despite being written in the 90s the book covers machine learning techniques that are employed in industry today, and it provided me a thorough introduction to the foundations of the field.

While I have tried to stay away from equations and technical jargon on this page, there are a handful of terms that I can't really avoid using and that benefit from a proper definition. So before we dive in, some vocabulary:

**Machine learning**- a machine is said to*"learn"*if it improves at a task as it gains experience. The performance metric for "improvement" is defined on a task-by-task basis, and we'll discuss experience more later.- (Machine learning)
**model**- the algorithm and data structures used to actually achieve this learning. **Instance**- a single object from the data set you are using. For example, if your data set is country demographics then an instance would be the demographics for a single country.**Training and test data**- "training data" refers to the data the model uses to learn how to perform its task, and "test data" is used to assess the model's performance.**Classification**- attempting to correctly predict the*"class"*of a given instance. To use the countries example again, the class of a country instance might be the continent it is on. Generally, the performance metric used for classification tasks is the accuracy of the model on classifying test data.

In order to solidify my understanding of what I had read in the Mitchell textbook, I have implemented from scratch three classification models that were introduced: decision trees, *k*-nearest-neighbours, and artificial neural networks. Below I describe these models and their behaviour, using the Iris Data Set as example data for all three.

Speaking of which,

### The Iris Data Set

The Iris Data Set is a classic set of data records that is widely used to demonstrate various machine learning principles. It comprises measurements for 150 iris flowers, 50 from each of the species *Iris versicolor*, *Iris virginica*, and *Iris setosa*. The measurements taken are the sepal height (SH), sepal width (SW), petal height (PH), and petal width (PW).

150 examples is not a lot, and as we'll see later this leads to limitations with model selection and design. Additionally there are challenges with properly classifying instances specific to this data set - in our case the classes of our instances are the 3 species of iris considered. The biggest issue at play is that the 3 classes are not *linearly separable*. Let's say that you don't need to predict all the classes properly, but instead just know if an instance's class is "SPECIES-X" or "NOT-SPECIES-X". To make this classification, you might want to see if you could just plot the instances and draw a line that would cleanly separate our instances into the two categories.

Strictly speaking, the Iris Data Set has 4 measurements and so would be plotted in a 4-D space with a 3-D hyperplane as our divider of choice, but whatever. What's relevant for our discussion is that in this data set it is actually impossible to draw these lines for each SPECIES-X - that is, for some species there is no perfect straight line that separates it from the others. Natural variances in the flowers lead to edge cases overlapping, so no perfect linear divider is possible and we say that the data is not linearly separable.

## Decision Trees

The classic decision tree is a simpler version of a flow chart without loops. At each node an attribute (measurement) of the instance is considered, and child nodes are constructed for each possible outcome of that consideration - a child node per value the attribute could take. If at a given node all instances in the training data that reach it have the same class, we can stop there and return their class as our final classification. If we have run out of attributes to consider, then we return the most common classification of instances that reach that node. If we still have attributes left to consider and we have not completely separated our instances, then we choose an attribute to ask questions about and go from there.

The above explanation gives a general idea of what a decision tree is and the basis of how an algorithm might construct one. However, there's an important piece of information missing. If you were to actually try and build a decision tree for a given data set, you would need a root node and a first attribute to select. Which attribute should that be? Putting it a bit more formally, how do we evaluate attributes of the data at non-terminal nodes? There are a couple of metrics available, but the most commonly used one comes from information theory.

### Information Gain

The point of asking a question is (arguably) to gain knowledge, so we'll want to ask questions that maximise this gain. Consider the subset of the data that actually makes it to a given node in the tree. How much space do you have to spend storing the class of each of those instances? If every instance in a set has the same class, then you don't have to do anything - you just label the set itself. Compare this to when you have a set of two instances (A and B) of different classes - now you need at least a bit of space to describe the class of an instance. We say that the *entropy* of a set is the theoretical amount of space needed to store the class of any instance in that set. If a set's entropy decreases then we must have learned something about its classes - information we don't have to store anymore must be information we have gained. We can find the *information gain* of a node by considering the entropy of the data set before and after it is split among the child nodes, and so we pick the attribute to consider that maximises that gain.

One thing to note about using information gain is that it won't favour an attribute outside of its entropy. Because of this, real-world knowledge about the data may not necessarily be reflected in the resulting tree. Consider our toy countries example - let's say that there were another 50 entries, but by chance the only "High" countries were in Europe. If you partitioned the data on population size, the entropy of the "High" set would be 0 as they all have the same class. As a result, it's likely that population would give the best information gain and so would be the first attribute considered. This would imply that our model of the real world thinks population size is the most important metric for determining country location, which is of course not representative of reality - there are plenty of highly populated countries outside of Europe, and there are several tiny and moderately-sized countries within it.

The problem is that our model is tailored too closely to our training data and it "hallucinates" structures within the data set that aren't really there. We refer to this as *overfitting*, and it is an issue models throughout the field have to combat. The tell-tale sign of it is that while the accuracy of a model in training increases, its test accuracy starts to decrease after a certain point. There are techniques to help avoid overfitting that we'll look at later, but an issue with the Iris Data Set specifically is that overfitting is rife among smaller data sets. The smaller the set, the more likely it will vary from a representative sample of the real world, and so the more likely the model will learn patterns from it that would not necessarily exist in a larger sample.

### Continuous Domains

That's not the only issue with the Iris Data Set. In our description of how a decision tree is constructed, we stated that child nodes are created for each possible value of the attribute considered. This works well for attributes with a discrete domain (the number of values is limited), but all the attributes for the Iris set are real numbers and are therefore continuous (an infinite domain, as there is an infinity of numbers). Clearly you can't create a child node for every possible outcome when there are endless outcomes.

The solution employed by the C4.5 algorithm is to choose a value from the data set. Using it as a threshold, you split the data into all instances with values above the threshold and those with values below. An example of this is illustrated below for a dataset with 2 continuous attributes. Notice how the boundries drawn at each stage are straight lines, but the overall boundaries between classes are not necessarily straight lines (i.e. they have corners). Decision trees are therefore *nonlinear models* as they can approximate these non-straight decision boundaries. However, while they might approximate a nonlinear boundary overall, they still do so with linear segments, so we call them *piecewise linear* models. The Iris Data Set is not linearly separable, so it would require a nonlinear model like this one to be perfectly classified. In practice, decision trees constructed in this way still aren't powerful enough to get more than about 90% accuracy on the set.

Using all the techniques described above, here is an example tree created using a random half of the Iris Data Set. When tested on the other half, it achieved an accuracy of 0.86667.

*k*-Nearest-Neighbours

*k*-nearest-neighbours (kNN) is one of the simplest models to describe and implement, but it offers some interesting insights into how we conceptualise machine learning. In kNN, you represent each training instance as a point in Euclidean space (i.e. you plot them). To find the classification of a test instance, you pick an integer *k* and consider the nearest *k* points to the test instance in the model's space, returning the most popular class among them as the classification.

Consider how the behaviour of the model changes with our choice of *k*. When *k* is 1, its lowest value, we are essentially asking "which training instance does this most resemble?" - intuitively, we are asking for the "closest" example we have seen. With decision trees, we partitioned data at a given node among the child nodes based on an attribute. A form of partitioning is also at play here: each training example uniquely classifies instances in an area around it defined by how close it is to other examples. Instead of partitioning the data set, we are instead divvying the space of all possible instances itself. This is also referred to as a *Voronoi Tessellation*. In contrast with decision trees which partition the instances one attribute at a time, kNN partitions the instance space across all attributes simultaneously.

As *k* increases, we start to see partitions that contain training instances with different classes but that get averaged out to the same class. This has the effect of controlling for random variation within the training data, and can be useful depending on the data set. Once *k* takes on it's maximal value, that of the size of the training set, the entire instance space shares the same classification. In terms of the entropy of these partitions, it increases from 0 to the entropy of the training set itself as k increases. Here we see that there is a trade-off between entropy (and accuracy) and controlling for variance, which is another recurring idea in machine learning.

If this is a machine learning model, when did the learning happen? Previously we defined learning as improving at a task as experience increases, but what exactly was experienced? In the case of decision trees, we could continue the logic of information gain and roughly say that we learn every time we partition the data set - our experiences are the questions we ask. For kNN we partition the data set all at once, so does that mean we learn all at once? Well, another way to define learning is as the search for the final model. Consider the space/set/whathaveyou of all possible decision trees for the Iris Data Set. With our algorithm, we gradually reduce the size of the subset that we consider for our final tree each time we add another node. As for kNN, the set we search over is this shattering of the instance space, each instance adding another shard and narrowing down the possible partitions we could end up with. With this way of looking at it, we can see that storing the data itself was the learning. The model is implicitly constructed rather than the explicitly with decision trees, but it is still made.

What kind of classifier is kNN then? Well, just like decision trees it is a piecewise linear classifier. In the example below, with *k* = 1 we can see that the curved boundary between the classes is modeled quite well. However, the boundary between any pair of instances is still a straight line segment. Interestingly, we can see here a limitation of decision trees - what set of threshold values could possibly replicate this decision boundary?

There are a few ways to improve the kNN algorithm, the one explored here being that of centering the training instances before plotting them. Consider a dataset of 2 attributes and a class label. Say the values of the first attribute range from 0 to 1, but those of the second range from 0 to 100. A change in value in the second attribute is going to affect the distance from one instance to another far more than a change in the first one, unduly dictating how the instance space is partitioned just by virtue of the measurement unit. In order to ensure that attributes are evenly weighted, we work out the means and standard deviations of all attributes in the training set and center each datapoint, subtracting the mean and dividing by the standard deviation for each attribute. In essence this moves the data to be clustered around the origin point, and then evens the spread of the data along each axis. For example:

With the same split of the Iris Data Set used for training and testing before, I considered the values of *k* between 1 and 5 to see how it affected accuracy. Additionally, I employed the centering method described above. Here are my results:

Value of k |
Accuracy |
---|---|

1 | 0.86667 |

2 | 0.84 |

3 | 0.78667 |

4 | 0.77333 |

5 | 0.74667 |

## Artificial Neural Networks

Neurons in this context are mathematical models of biological ones. They take in inputs from neurons below them, and as in biology, they *activate* and send out an output. An artificial neural network (ANN) is a collection of neurons wired together, typically organised into a layered structure as illustrated below. The final output layer has one neuron per class, and the output with the highest activation for a given instance is taken as its class. In the mathematical model, the inputs a neuron receives are weighted by some value and a constant bias is added to their sum. What separates ANNs from the previous two models is that the activation function chosen is often a nonlinear function of this weighted sum - it's curved. The previous piecewise linear models had no way of truly representing curved boundaries between classes, as everything was based off of straight lines and dividers to try and separate instances. But with using curved activation functions, as well as increasing the number of layers within the network, it is proven that an ANN can faithfully represent any class boundary.

The exact way that we choose the weights and biases for each neuron is found via *backpropagation*. The idea is that we find the error of the network on classifying a given instance and pass it back through the network to determine how each weight and bias should change to reduce it. Once we are sufficiently accurate, or the error gradient is 0, we stop the process. This is because it may not be possible to reduce the error itself all the way to 0, but if the gradient is 0 then this signals that no more improvement is possible with these weights.

Or at least it's meant to signal that, but the reality is much messier. Considering the graph below, there are 3 values for *x* that produce a gradient of 0. Only one of them is a minimum of the error, one being a plateau or region of no change and the other being a maximum. If we started the algorithm with *x* = 3, then it is possible we would terminate upon reaching the plateau at *x* = 4.8. The strategy used here to resolve this is by modelling momentum in the updates of weights: by imagining the value of *x* as being a ball, we can see that it would build up speed on its way down and pass right through the plateau, eventually settling in the true minimum. Real-world examples are not this easily analysable and may contain multiple minima (of which only one is the ideal), but the momentum method is shown to work extremely well in practice.

Going through the data set, making small adjustments to our weights to gradually reduce the error, is our learning process. Characterising learning as search, we are searching through the space of all possible weight assignments. Another space we can search through is the size of the network we actually use - the more neurons and hidden layers we have, the higher our training accuracy will be. But, as we have more weights to optimise we also have more chances to overfit to patterns not present in the greater data distribution. In practice, the behaviour of large networks trained on small data sets is that they memorise each individual instance without properly learning the underlying patterns, and for this reason (as well as space efficiency) smaller networks are often preferred.

Another reason to prefer smaller networks is that this is by far the hardest model to interpret the behaviour of. For example, if I tell you the weights of a neuron are 0.5, 3.4, 1.432 with a bias of -0.8, what does that mean? You would have to consider it in the context of the entire network, which may include hundreds of neurons and weights in larger ones. Comparing this model against the other two, neural networks are by far the most powerful of the three discussed, but they are also the furthest removed from what we can effectively reason about.

As an aside, there's a useful way to build intuition about why networks with more weights perform better in training. If you were to graph the error from all the assignments of weights and biases, you could imagine it as a high-dimensional surface with our ball trying to find the lowest point it can. If the ball is at a local minimum in one dimension, the more other dimensions there are the more likely it will not be a local minumum in one of those, giving it a way to progress. Because of this, it is less likely to get trapped in local minima early in training. On the flip side, more dimensions leads to more ways to overfit the training data, so again we have a trade-off.

For the purposes of this project, I used a sigmoid activation function with momentum updates. My network had an input layer of 4 neurons and an output layer of 3, and I searched over the size of a single hidden layer in the middle. The resultant accuracies are:

Size of hidden layer | Accuracy |
---|---|

1 | 0.23333 |

2 | 0.94667 |

3 | 0.94667 |

4 | 0.96 |

5 | 0.96 |

## Final Thoughts

These three models are just a sample of those used in machine learning but still offer a broad view of the techniques available, and despite their simplicity they still produce respectable results. They vary not only on the spaces they search to produce their final model, but also on how they interpret what it means to "*learn*". As for my own experience in implementing these from scratch, kNN and decision trees were the easiest to actually code by hand, with kNN only taking an afternoon to get working correctly. Getting a functional version of the ANNs with all the bells and whistles took much longer, despite my familiarity with the math behind them. ANNs are ubiquitous in industry however, and there are a multitude of free code libraries available to implement them (Tensorflow, Pytorch, etc.). Due to their flexibilty in what they can handle and their inhuman efficiency in exploiting patterns in data, it's not a real surprise to me that the ANNs consistently produced the best results in this project across random restarts. I was personally surprised by the effectiveness of kNN, as I had thought it too simplistic to properly account for variability within the data set. I imagine that kNNs effectiveness might depend on the actual data you give it, whereas ANNs can work with pretty much anything.

All of the code used for this project is available on my GitHub. All images are my own.

Thank you for reading!