Seven key points generally not listed in modeling tutorials

Besides the technical aspects of the procedure itself (training, testing, cross-validation, etc.), there are seven points I found non-trivial and useful.

 

1. Intelligence versus memory

I would like to start by emphasising the difference between learning and photographic memory.

Regardless of the memorisation technique, the aim of photographic memory is to reproduce or retrieve exactly the same object, be it a single instance or a set of 10,000 items. It requires neither a minimum number of examples nor any notion of what is essential versus what is secondary: it takes both signal and noise as input and returns both as output.

Learning is a totally different process. First, a single instance is not enough for learning to operate, contrary to photographic memory. Learning consists in extracting “something” out of a set of items. We can call this thing to be extracted a signal, an invariant, a universal feature, a general principle, a conjunction of criteria, whatever: something necessary AND sufficient to identify or discriminate the elements of the set, and to generalise beyond that set.

What is invariant and universal can be generalised to other items. If the object to be learnt is a face, then learning will consist in (i) extracting the invariant features of a collection of faces, (ii) identifying a face based on these invariant features, and/or (iii) discriminating or classifying new faces based on these invariant features. These invariants can even be used to build ad hoc datasets with similar features and distribution.

So, the first point to bear in mind when building models is that Machine Learning, as its name indicates, is supposed to learn, not to memorise. If a model returns a photographic copy of what it has been exposed to, it has failed to extract an invariant that could be generalised to new instances or used as a template.
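To make the distinction concrete, here is a minimal sketch in Python (scikit-learn) on a purely invented noisy dataset: a 1-nearest-neighbour classifier "photographically memorises" the training set and scores perfectly on it, while a simple logistic regression extracts a single boundary and generalises better to unseen points. The data and the 20% label-noise level are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
noise = rng.random(400) < 0.2            # flip 20% of labels: pure noise, not signal
y[noise] = 1 - y[noise]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)  # stores every example
learner = LogisticRegression().fit(X_tr, y_tr)                   # extracts one boundary

print("memoriser train/test accuracy:",
      memoriser.score(X_tr, y_tr), round(memoriser.score(X_te, y_te), 2))
print("learner   train/test accuracy:",
      round(learner.score(X_tr, y_tr), 2), round(learner.score(X_te, y_te), 2))
```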

 

2. Intelligence is parsimonious: less is more

This distinction between photographic memory and learning was made on purpose: the classical pitfall of overfitting can be compared to a model that “photographically memorises” the dataset and ends up learning the noise (variance) instead of discarding it.

The current state of the art is still somewhat in its adolescence. It is not (yet) a fully mature branch of mathematics able to provide a theorem that determines in advance, on the basis of (hidden) variables characterising the dataset, the most appropriate type of model and, where applicable, the optimal values of the hyperparameters or of the beta coefficients.

But one can still find one's way by following a parsimonious and methodical, step-by-step process: start with a simple solution, test it, perhaps increase the number of predictors if the simple version is not satisfactory, test the performance (accuracy first, speed second) of each predictor separately, and see if and how their contributions accumulate. It is all about assessing what each additional step or change brings, and selecting it on that basis.
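As an illustration, here is a minimal sketch of this serial workflow with scikit-learn, using a built-in dataset purely as a stand-in for real data: start from one predictor, try adding the others one at a time, and keep a candidate only if it clearly improves cross-validated accuracy. The 0.005 improvement threshold is an arbitrary choice for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy using only the given predictors."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

selected = [0]                        # start with the simplest model: one predictor
best = cv_accuracy(selected)
for candidate in range(1, X.shape[1]):
    score = cv_accuracy(selected + [candidate])
    if score > best + 0.005:          # keep the addition only if it clearly helps
        selected.append(candidate)
        best = score

print(f"kept {len(selected)} of {X.shape[1]} predictors, CV accuracy ~ {best:.3f}")
```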

 

3. Planning and building on clear assumptions

All models, all theories rest on assumptions, and these assumptions ultimately impact the predicted outcomes. It is worth making them explicit. Being clear from the start about what we know versus what we simply hold to be true, and keeping the model's assumptions in mind, is very powerful for diagnosing and solving the issues that will show up later in the process.

For example, we can opt for linear regression models, which are powerful tools widely used in the life sciences, economics, physics and the social sciences, perhaps because many phenomena are linear, or because assuming they are allows satisfactory predictions. In some cases linear regression is not applicable, for instance when the variables are known to be non-linearly related or when the residuals are not normally distributed. This is a very simple case, but some are much trickier.
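As a small illustration of stating an assumption and then testing it, the sketch below fits a linear model to synthetic data whose true relationship is quadratic, and then checks whether the residuals still carry structure; the data and coefficients are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)   # true relation is quadratic

linear = LinearRegression().fit(x, y)
residuals = y - linear.predict(x)

# If the linearity assumption held, the residuals would show no structure;
# here they are strongly correlated with x^2, pointing at the assumption
# rather than at the fitting procedure.
print("R^2 of the linear fit:", round(linear.score(x, y), 3))
print("corr(residuals, x^2): ", round(np.corrcoef(residuals, x[:, 0] ** 2)[0, 1], 3))
```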

 

4. Complex versus complicated


When building a model and coping with the bias/variance trade-off, something useful to bear in mind is that the “bricks” complex intelligent systems are made of are always very simple, all articulated by a simple rule that admits no exception. It is this simplicity of elements and rules that makes outstanding performance possible, and that distinguishes complex systems from "complicated" ones, which suffer from an ad hoc proliferation of parameters and as many rules as exceptions.

If we zoom in on how a computer or even the brain works, we can see that they are made of very simple, often binary, units or processes. The architecture of the nervous system rests on ubiquitous binary divisions: sensory versus motor neurons, resting versus action potentials, bottom-up versus top-down connections... Human colour vision spans a wide range, yet rests on the combination of only three types of receptors (the cones). Who would suspect that motor outputs are encoded with only two variables, namely the direction and the strength of the movement? The principles of the architecture are actually very simple.

The same applies, or should apply, to models. This is exactly what we do when we grow simple decision trees separately and combine them into a random forest to aggregate their predictions. An assembly of many very simple units articulated for a clear purpose makes a complex system that runs smoothly and does not suffer from exceptions.
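Here is a minimal sketch of that idea with scikit-learn, on an illustrative built-in dataset: many shallow trees combined by one simple rule (average their votes) versus a single deep, "complicated" tree. The hyperparameters are arbitrary choices for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_deep_tree = DecisionTreeClassifier(random_state=0)     # one "complicated" unit
shallow_forest = RandomForestClassifier(n_estimators=200,     # many simple units,
                                        max_depth=3,          # combined by one rule:
                                        random_state=0)       # average their votes

print("single deep tree:       ", cross_val_score(single_deep_tree, X, y, cv=5).mean().round(3))
print("forest of shallow trees:", cross_val_score(shallow_forest, X, y, cv=5).mean().round(3))
```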

 

5. The poverty of stimulus/data

I really like this principle because it insists on the active nature of learning, and the creative nature of modelling in science.

There is a famous principle in psychology, neuroscience and linguistics that originates from epistemology. The so-called principle of “the poverty of the stimulus” claims that stimuli (or empirical data) are necessary but not sufficient to account for learning (or for powerful predictive theories). Examples abound: speaking to babies is enough for them to understand English (or any language) after two or three years, while speaking to a kitten does not work. The visual system can identify many objects, words or faces very accurately and quickly from very poor visual inputs (important parts of the object are missing, the light on the face is very dim or harsh, the prescription was written by a medical doctor... :) ).

The stimuli (data) provided cannot explain the performance. Only the underlying neuronal architecture (model) actively processing the signal (data) can.

Let’s consider vision again. Light stimulates receptors at a specific location on the plane of the retina, generating a signal encoded within a two-dimensional space. Depth (the third spatial dimension), and by extension 3D space, is not included in the raw signal coming from the retina; it is computed by the model implemented in the cortex. Having two sensors (two eyes) makes it possible to compute depth, adding a dimension, with a simple geometric operation (triangulation based on the disparity between the two images). This is exactly what a Support Vector Machine does when it adds a third dimension to a two-dimensional raw dataset that resists classification.
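The sketch below makes that lift explicit on a synthetic dataset of two concentric circles: no straight line separates them in the raw 2D plane, but adding a third, computed dimension (the squared radius) makes them linearly separable. A kernel SVM performs this kind of lift implicitly; here the extra feature is computed by hand so the effect is visible.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: no straight line in the raw 2-D plane separates them.
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Create a third dimension that is absent from the raw data: the squared radius.
X_lifted = np.c_[X, (X ** 2).sum(axis=1)]

linear_on_raw = SVC(kernel="linear")
linear_on_lifted = SVC(kernel="linear")

print("linear SVM, raw 2-D data:   ", cross_val_score(linear_on_raw, X, y, cv=5).mean().round(3))
print("linear SVM, lifted 3-D data:", cross_val_score(linear_on_lifted, X_lifted, y, cv=5).mean().round(3))
```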


The point to bear in mind is that passively sticking to the raw variables is not enough to learn, whether for a natural or an artificial learning system. The data (stimuli) are generally poor, and the model (the mind) needs to create something from them, by extracting a more abstract relationship between the variables the data provide. It can be something as simple as a slope or a ratio, or something more complex such as a third dimension or a non-linear transformation.

 

6. Some knowledge of the domain

Having (access to) expert knowledge of the domain of the data being manipulated is useful and sometimes essential. Consider a predictive model used for clinical or medical applications.
Even if it provides excellent predictions of an illness based on various biological markers, this remains a probabilistic relationship between several variables, one that may be contingent on a single hidden variable.
Only an expert can identify a hidden variable that would be the best predictor. Put another way, only an expert can identify an underlying cause behind a set of correlations.

 

7. Be aware of the limits of learning / modelling

In light of the previous point, I would like to raise a kind of epistemological red flag.
As scientists, our job is to track and try to predict events. We use the notion of cause for this purpose, but in practice a “cause” is only the best predictor(s) of an event. Being an expert in statistics does not change anything: every model rests on assumptions, often held to be true rather than true. (Dynamic) causal modelling is nothing other than a type of probabilistic model; a famous instance is Granger causality, extensively used but with its own limits as well (see here).
These limits are also relevant for the brain, which seems to work as a powerful Bayesian predictor, for better and for worse: Bayesian predictions explain both correct perceptions and illusions.
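A hedged illustration of “a cause is only the best predictor”, using invented time series and statsmodels: two signals share a hidden common driver with different delays, neither causes the other, and yet one “Granger-causes” the other simply because its past happens to be the better predictor.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
hidden = rng.normal(size=600)                    # unobserved common driver
a = hidden[2:] + 0.3 * rng.normal(size=598)      # a reflects the driver two steps earlier
b = hidden[:-2] + 0.3 * rng.normal(size=598)     # b reflects the same driver, later

# Test whether a "Granger-causes" b, i.e. whether past values of a improve the
# prediction of b. They do here, purely through the hidden driver, so the test
# reports a tiny p-value even though a has no causal effect on b.
# (The function also prints a summary for each tested lag.)
result = grangercausalitytests(np.column_stack([b, a]), maxlag=3)
p_value = result[2][0]["ssr_ftest"][1]           # p-value of the F-test at lag 2
print(f"\np-value at lag 2: {p_value:.2e}")
```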

This last point can seem far-fetched, but it is very practical: it reminds us to stay humble and aware of the limits of modelling. There is no such thing as 100% performance.

 
