“Machine Learning”, “Deep Learning” or even “Artificial Intelligence” are major buzzwords where hype often obscures the true value they can bring. If you want to learn how to separate fact from science-fiction, read on.
In this post we’ll explain to a lay audience exactly what machine learning is, starting from the basic concepts, and build to an overview of more sophisticated methods. We’ll show you the difference between statistics and machine learning.
As it turns out, the real expertise comes from picking the right tool for the job. Experience helps a data scientist judge which tool is best for any given combination of business question and data source.
The harsh reality is that many companies are not ready to handle, store, or use massive amounts of data without experienced help. We often observe companies that hire data scientists before setting clear objectives, forgetting that the goal is to integrate this data with business strategy to produce tangible returns.
What’s the difference between statistics and machine learning?
Most people have a reasonable gut feeling of what statistics is – basically a branch of maths where you try to explain or ‘model’ some phenomenon. Maybe you’re modelling the weather to predict whether it will rain tomorrow. Maybe you’re running a DIY store and looking at customer transaction histories to work out which products to put in a bundle. Either way, you’re looking for a pattern (such as differences in averages, or observing correlations) and applying some sensible rules e.g. a weather rule that the temperature is unlikely to exceed 60 degrees C, or a business rule that the price of a DIY bundle should be less than the total price of its component products.
Statistics 101 says that you get your data, find the relationship you want to test and the formula that can test it – then plug your data in and crank the handle. Out come insights and correlations between the data – such as showing thick cloud and low pressure usually precede rain. In DIY land, your data might show that people buy smoke alarms and ladders together in September, just before their student tenants go back to University for the year. At this point it usually helps to produce graphical output (visualisations) for human interpretation, and stakeholder buy-in. So, statistics is just a formalised and rules-based method to glean insights from data.
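To make “plug your data in and crank the handle” concrete, here is a minimal sketch of one such formula, the Pearson correlation coefficient, written in plain Python. The cloud cover and rainfall figures are invented purely for illustration and are not real weather observations.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two series around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative observations (assumption, not real data)
cloud_cover_pct = [10, 25, 40, 60, 75, 90]
rainfall_mm = [0.0, 0.0, 1.0, 3.0, 6.0, 9.0]

r = pearson(cloud_cover_pct, rainfall_mm)
print(f"correlation between cloud cover and rainfall: {r:.2f}")
```

A value close to 1 means the two series rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship to speak of.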
Next, machine learning. This can be defined as a growing collection of algorithms and methods that can learn from the data without relying on a clear set of rules-based strategies. You let the “machine” train itself on your data (or a subset of it) and then get a set of predicted results. The key differences are:
- Machine learning uses computational power to achieve its outcomes, often iteratively. Statistics is a subfield of mathematics that summarizes a sample of your data, and draws inferences from it, using measures such as the mean or standard deviation.
- To build a model the initial stage is to do feature engineering. This asks: which attributes can be used and which attributes have a role to play in influencing the output. If we’re trying to predict a customer’s propensity to buy a new car, is their age important? Is their income bracket important? Or is the dominant factor the distance between their home and a train station? You can see there are a huge number of possible attributes in play.
- In order to derive the right features, you must identify how the independent variables correlate with the output, and with each other.
- Without statistics you can’t select features to use in your model in an intelligent way. Equally, in many situations it’s not enough to apply statistical analysis alone to your data. Most roads lead to building a model.
- In terms of applications, machine learning and statistics are coupled in such a way that one leads to the other.
- Even after you’ve built a model or applied a machine learning procedure, statistics will still come into play for measuring its performance and validating results.
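As a rough illustration of the last point, the sketch below scores a model’s predictions against what actually happened, using three common measures: accuracy, precision and recall. The labels are invented for the example (1 = bought a car, 0 = did not), and the function name `evaluate` is our own.

```python
def evaluate(actual, predicted):
    """Return accuracy, precision and recall for binary labels (1 or 0)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)

    accuracy = (true_pos + true_neg) / len(pairs)
    # Of the customers we flagged as buyers, how many really bought?
    precision = true_pos / (true_pos + false_pos)
    # Of the customers who really bought, how many did we flag?
    recall = true_pos / (true_pos + false_neg)
    return accuracy, precision, recall


# Made-up illustrative labels (assumption, not real customer data)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, precision, recall = evaluate(actual, predicted)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Which measure matters most depends on the business question: a campaign with an expensive offer may care more about precision, while one that must not miss likely buyers may care more about recall.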
Basic examples of machine learning algorithms
Linear regression and classification are the simplest “beginner step” in machine learning – these are the familiar-looking plots of x versus y and the line of best fit. Despite its simplicity, linear regression is really useful in a large number of cases where more complicated algorithms suffer from overfitting. An example would be in the retail sector, where a company’s total sales could be modelled as a linear function of its number of stores.
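The line of best fit can be computed directly with the ordinary least squares formulas. The sketch below does this in plain Python for the stores-versus-sales example; the figures are invented for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative figures (assumption, not real retail data)
number_of_stores = [5, 10, 15, 20, 25]
sales_millions = [12.0, 21.0, 33.0, 41.0, 52.0]

slope, intercept = fit_line(number_of_stores, sales_millions)
predicted_sales = slope * 30 + intercept
print(f"each extra store adds roughly {slope:.2f}m in sales")
print(f"predicted sales for 30 stores: {predicted_sales:.1f}m")
```

The slope is the useful business number here: it estimates how much total sales change for each additional store, assuming the straight-line relationship holds.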
Logistic regression is the next step on from linear regression. The straight line (a linear combination of parameters) is passed through a non-linear function, the sigmoid, which squashes the result into a value between 0 and 1 that can be read as a probability. This makes it a natural fit for binary classifiers such as ‘loyal (1) vs lapsing customer (0)’. Another example is the relationship between ice cream demand and temperature, framed as whether or not demand is elevated on a given day. It is important to consider the context for such relationships, so in the UK 18C is the threshold where you see demand starting to increase.
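Here is a minimal sketch of logistic regression trained with plain gradient descent, using the ice cream example. The temperature readings and demand labels are invented, and we have placed the switch from low to high demand around 18C only because that is the threshold mentioned above. The learning rate and iteration count are arbitrary choices that happen to work for this toy data.

```python
from math import exp


def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, labels, learning_rate=0.5, iterations=2000):
    """Fit p(label = 1) = sigmoid(w * x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # predicted probability minus truth
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Made-up illustrative data (assumption): 1 = high ice cream demand that day
temps_c = [8, 11, 14, 16, 17, 19, 20, 23, 26, 29]
high_demand = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Rescale temperatures so gradient descent behaves well
scaled = [(t - 18) / 10 for t in temps_c]
w, b = fit_logistic(scaled, high_demand)


def predict(temp_c):
    """Estimated probability of high demand at a given temperature."""
    return sigmoid(w * (temp_c - 18) / 10 + b)


print(f"p(high demand at 10C) = {predict(10):.3f}")
print(f"p(high demand at 28C) = {predict(28):.3f}")
```

The output is a probability rather than a hard yes or no, so the business can choose its own cut-off depending on how costly a wrong call is.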
Decision trees are conceptually easy to understand: at first sight they appear similar to flowcharts or people’s decision processes, and are therefore easy to interpret. The most commonly used methods built on them are CART and random forest, which combines many trees (stay tuned to our blog for more on these).
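A full tree is built by repeatedly asking one question: which yes/no split separates the classes most cleanly? The sketch below answers that question once, for a single attribute, using Gini impurity (the criterion CART uses for classification). It is a single decision step rather than a complete tree, and the customer data and variable names are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold for 'value <= threshold' with the lowest impurity."""
    best = None
    candidates = sorted(set(values))
    # Try a threshold halfway between each pair of neighbouring values
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up illustrative data (assumption): 1 = loyal, 0 = lapsing
months_since_purchase = [1, 2, 3, 4, 8, 9, 11, 12]
is_loyal = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, impurity = best_split(months_since_purchase, is_loyal)
print(f"split at {threshold} months, weighted impurity {impurity:.2f}")
```

The result reads like a line in a flowchart (“has the customer bought within the last N months?”), which is exactly why trees are so easy to explain to stakeholders.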
K-means clustering is a type of unsupervised learning, used when you have unlabeled data and the goal is to find groups in it that have not been defined beforehand.
This might be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.
Imagine a school disco dancefloor, viewed from above. Cluster on age and you can draw a circle around two distinct clusters – parents and students. Cluster by gender and you have two clusters on opposite sides of the room (boys and girls) plus a mixed group of parents.
Interesting results happen when you don’t go in with a pre-conceived idea, and let the algorithm find new clusters on its own. Suddenly, out of thousands of possible variables, your model clusters on ‘distance of home address from a body of water’ and now you understand why some people at the disco are doing the fish dance and others haven’t a clue what the fish dance is.
This gives you a feel for how clustering works. Applying it to your customer data, across tens of thousands of potential variables, can reveal very interesting and unexpected trends in their buying patterns. Exploit these trends to make your marketing more useful, and less intrusive.
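For the curious, here is a minimal sketch of k-means on the school disco, clustering on age alone to keep it one-dimensional. The ages are invented, and the starting centroids are simply set to the youngest and oldest person so that the example is repeatable (real implementations usually pick starting points at random).

```python
def k_means_1d(points, initial_centroids, max_iterations=100):
    """Cluster numbers by repeatedly assigning points and moving centroids."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iterations):
        # Step 1: assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up illustrative ages on the dancefloor (assumption)
ages = [14, 15, 15, 16, 16, 17, 41, 43, 45, 47, 48, 52]

centroids, clusters = k_means_1d(ages, initial_centroids=[min(ages), max(ages)])
print("cluster centres:", centroids)
print("cluster sizes:", [len(c) for c in clusters])
```

Nobody told the algorithm that students and parents exist. It only sees numbers, and the two groups fall out of the data on their own.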
Neural networks are a newer generation of machine learning algorithms and can be applied to many tasks, but training them requires a great deal of computation. Similar to biological neural networks (the interconnected web of neurons in the brain which transmit information via their connections), a neural network consists of layers of nodes that transmit data to each other. This approach is particularly useful for highly complex classification problems such as distinguishing chihuahuas from blueberry muffins – and let’s face it, who hasn’t given that one a go on their lunch break?
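To show what “layers of nodes that transmit data to each other” means in practice, here is a tiny network with two inputs, two hidden nodes and one output node. The weights are hand-picked by us so that the network computes “one or the other, but not both” (XOR), a pattern no single straight line can separate. A real network would learn its weights from data during training, which is the computationally expensive part; this sketch only shows data flowing forward through the layers.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def layer(inputs, weights, biases):
    """Each node sums its weighted inputs, adds a bias, and applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hand-picked weights (assumption: chosen for illustration, not learned)
HIDDEN_WEIGHTS = [[20.0, 20.0], [-20.0, -20.0]]  # node 1 acts like OR, node 2 like NAND
HIDDEN_BIASES = [-10.0, 30.0]
OUTPUT_WEIGHTS = [[20.0, 20.0]]  # output node acts like AND of the hidden nodes
OUTPUT_BIASES = [-30.0]


def forward(inputs):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", round(forward([a, b])))
```

Image classifiers work on the same principle, just with millions of weights and many more layers.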
How do you know which algorithm to use for the problem at hand?
This all depends on a combination of factors including the type of data you are handling and whether it is structured or unstructured, and what the aims of the analysis are. For business applications this could either be very open ended (for example a client might want an exploratory overview of what they have) or there could be one clear goal or hypothesis that the task is aiming at proving or disproving.
Other more technical factors that would have a significant effect include:
- Accuracy – getting the most accurate answer possible isn’t always necessary. This is where the dialogue between the data science specialists and the company stakeholders is crucial. Having a partner that takes the time to understand the context of your data, and the business objectives of the challenge, will help determine what level of accuracy is sufficient. This will ensure that the business realises the right value from that effort relative to the cost. Sometimes an approximation is adequate, depending on what you want to use it for. Maybe a rough prediction of customer spend next year to the nearest £100 is fine, when deciding who to short-cut into your VIP programme. If that’s the case, you may be able to cut your processing time dramatically by sticking with more approximate methods. Another advantage of more approximate methods is that they naturally tend to avoid overfitting.
- Training time – the number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time is often closely tied to accuracy: one typically accompanies the other. In addition, some algorithms are more sensitive to the number of data points than others. Limited time may drive your choice of algorithms, especially when handling big datasets.
- Linearity – lots of machine learning algorithms make use of linearity. Linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog). Linear regression algorithms assume that data trends follow a straight line. These assumptions aren’t bad for some problems, but on others they bring accuracy down. Despite their dangers, linear algorithms are very popular as a first line of attack. They tend to be algorithmically simple and fast to train.
- Number of parameters – parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm’s behaviour, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.
The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often achieve very good accuracy, provided you can find the right combination of parameter settings.
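The trial and error over parameter settings is usually automated as a “grid search”: try every combination from a list and keep the best. The sketch below does this for a deliberately simple model (fitting y = w * x by gradient descent) with two knobs, the learning rate and the number of iterations. The data and the grid values are invented, and for brevity the settings are scored on the training data itself, whereas in practice you would score them on data held back for that purpose.

```python
from itertools import product


def mean_squared_error(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate, iterations):
    """Fit y = w * x by gradient descent, starting from w = 0."""
    w = 0.0
    for _ in range(iterations):
        gradient = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
    return w


# Made-up illustrative data (assumption): the true relationship is y = 2x
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

# The knobs we are willing to try (assumed values)
LEARNING_RATES = [0.001, 0.01, 0.05, 0.2]
ITERATION_COUNTS = [10, 100]

results = {}
for learning_rate, iterations in product(LEARNING_RATES, ITERATION_COUNTS):
    w = train(xs, ys, learning_rate, iterations)
    results[(learning_rate, iterations)] = (mean_squared_error(w, xs, ys), w)

best_settings = min(results, key=lambda settings: results[settings][0])
best_error, best_w = results[best_settings]
print("best (learning rate, iterations):", best_settings)
print(f"fitted w = {best_w:.4f}, error = {best_error:.2e}")
```

Printing the full `results` dictionary shows how sensitive the outcome can be: the smallest learning rate barely moves in the time allowed, while the largest one overshoots and gets worse with every step.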
After categorizing the problem and surveying the available algorithms, you can identify the ones that are applicable and practical to implement using the tools at your disposal.
Then, in the final step: throw all of the algorithms at the problem… even the kitchen sink!
This requires setting up a machine learning pipeline that compares the performance of each algorithm on the dataset using a set of carefully selected evaluation criteria. The best one should then be selected automatically. You can either do this once, or have a service running that repeats it at intervals as new data is added.
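In miniature, such a pipeline might look like the sketch below. It trains each candidate on one portion of the data, scores it on a held-back portion, and picks the winner automatically. The two candidates (a “predict the average” baseline and a straight-line fit), the spend figures and all names are our own illustrative assumptions; a real pipeline would compare far more capable algorithms and use more than one evaluation criterion.

```python
def fit_mean(xs, ys):
    """Baseline candidate: always predict the average of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate: ordinary least squares straight line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


CANDIDATES = {"mean baseline": fit_mean, "linear fit": fit_linear}


def select_best(train_x, train_y, holdout_x, holdout_y):
    """Train every candidate, score each on held-out data, return the winner."""
    scores = {}
    for name, fit in CANDIDATES.items():
        model = fit(train_x, train_y)
        squared_errors = [(model(x) - y) ** 2 for x, y in zip(holdout_x, holdout_y)]
        scores[name] = sum(squared_errors) / len(squared_errors)
    return min(scores, key=scores.get), scores


# Made-up illustrative data (assumption): years as a customer vs annual spend
train_x, train_y = [1, 2, 3, 4, 5, 6], [110, 205, 290, 410, 495, 610]
holdout_x, holdout_y = [7, 8], [700, 805]

best_name, scores = select_best(train_x, train_y, holdout_x, holdout_y)
print("selected:", best_name)
```

Running `select_best` again whenever new data arrives is the “service running at intervals” idea: the winner can change as the data changes.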
So, for the record – while you can probably outrun the army of Roombas, we’re making no guarantees about Skynet. At least now you know the fundamentals of how it will think.