This is the second part of a three-part blog series. You can read the first part here.
Why are kids so annoying by age 4?
Most toddlers can say about 20 words by the time they are 18 months old. By age two, they start combining two words into simple sentences, such as "baby crying". As all parents are painfully aware, by the time they are three or four years old, they become extremely curious and can't stop asking questions. Russian even has a special word for kids of that age: pochemuchka, from the word pochemu, which means "why".
But what happens in children's brains as they start discovering the world around them? What happens is that their brains start building models - and we shall see later what building a model implies.
As we grow up, we learn many different things. Some of them we learn by doing, and some through other people's experiences (from stories, books, movies…). So we learn, for instance, that ice is slippery. But why is ice slippery? The simple answer could be: well, because everyone knows that ice is slippery! That answer was satisfactory enough to humans for millions of years, until we started building knowledge based on maths and science. Then we discovered that ice being slippery has something to do with Brownian motion, the alignment of water molecules as water turns into a solid, and the fact that water under pressure can change from solid back to liquid. For a better explanation, enjoy this video!
We don’t learn only by observing the things around us and labelling them as facts, but by trying to understand the underlying principles that drive these events.
What does it really mean, to learn something?
In very simple terms, if we look at a given phenomenon described by the formula y = f(x), we are not only interested in what happens to the output y as the input x changes; we are even more intrigued by the transfer function f itself! And this is where knowledge modelling comes in: in understanding the transfer function. Once we learn such models, we can quickly start deriving all sorts of ideas from them by combining these transfer functions together, like in one giant convolutional network. And as mentioned earlier, some of these functions we learn ourselves, while others we have been told to believe in (which sometimes can get us into all sorts of other problems).
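To make the distinction concrete, here is a minimal sketch (the quadratic f and all numbers are my own illustrative assumptions): instead of just memorising (x, y) pairs, the learner recovers the transfer function itself by fitting its parameters.

```python
import numpy as np

# Hypothetical example: the "true" transfer function f is unknown to the learner.
def f(x):
    return 0.5 * x ** 2 + 1.0

# We only get to observe pairs (x, y) of inputs and outputs.
x = np.linspace(-3, 3, 50)
y = f(x)

# Learning here means recovering f itself, not just memorising the pairs:
# fit a degree-2 polynomial and read off its coefficients.
coeffs = np.polyfit(x, y, deg=2)  # [a, b, c] for a*x^2 + b*x + c

print(np.round(coeffs, 3))  # recovers roughly [0.5, 0.0, 1.0]
```

Once the coefficients are known, the learner can answer questions about inputs it has never seen, which is exactly what "understanding f" buys over memorising (x, y) pairs.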
That is also the reason we say that humans are good at working with small data sets - and it is also the reason we tend to take shortcuts when confronted with larger data sets.
As explained in the first post of this three-part series, Deep Learning nets do not model these relations. Going back to the world of y = f(x), they try to fit the best set of matrix decompositions inside f, while having huge sets of x and y values to play with, and while making no assumptions about what f actually represents. That is why Deep Learning will label a goat as a bird or as a giraffe (even though one may argue that seeing a goat in a tree is not as unusual as it seems).
As we will see later on, the best knowledge modelling comes from a Bayesian causal understanding of the world around us, so it is no surprise that a lot of the latest research effort in Deep Learning is based on Bayesian methods, in a field called Bayesian Deep Learning. But even that direction of research is focused more on getting better parameter estimations than on fixing the explainability problem of Deep Learning.
Achieving human level intelligence
In one of his latest articles, Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution, Judea Pearl says that in order "to achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks."
Further on, he describes a three-level causal hierarchy, together with the characteristic questions that can be answered at each level. The levels are titled Association ("what if I see…?"), Intervention ("what if I do…?") and Counterfactual ("what if I had done…?").
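The gap between the first two levels can be shown with a small simulation (a toy model whose probabilities I made up for illustration): a hidden confounder Z drives both X and Y, so the associational P(Y | X) looks strong, while the interventional P(Y | do(X)), which forces X and thereby cuts the Z → X arrow, reveals that X has no effect at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy causal model: a confounder Z drives both X and Y; X itself does nothing.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.9, 0.1)  # Z makes X likely
y = rng.random(n) < np.where(z, 0.8, 0.2)  # Z makes Y likely; X plays no role

# Level 1, Association: P(Y=1 | X=1), computed from passive observation.
p_assoc = y[x].mean()

# Level 2, Intervention: P(Y=1 | do(X=1)) -- force X=1 for everyone,
# breaking the Z -> X arrow while leaving Y's mechanism unchanged.
x_do = np.ones(n, dtype=bool)
y_do = rng.random(n) < np.where(z, 0.8, 0.2)
p_do = y_do[x_do].mean()

print(round(p_assoc, 2), round(p_do, 2))  # roughly 0.74 vs 0.50
```

A purely associational learner, no matter how much data it sees, would report the 0.74 and conclude X matters; only a model of the mechanism lets you compute the 0.50.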
So the real question is: how do we combine Deep Learning, which is superior in so many applications, with Bayesian nets?
My view is that rather than “fixing” Deep Learning with learning inference models from the Bayesian world, or bringing some of the ideas of Deep Learning into Bayesian nets, a much more natural solution would be to place Deep Learning nets below Bayesian nets, and use Deep Learning nets as one of many sensory inputs into the Bayesian rule engine. That way, we can combine the best out of both worlds.
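As a minimal sketch of that layering (the labels, prior, and likelihood numbers are all illustrative assumptions of mine), the Deep Learning net acts as a noisy sensor whose output is just one input into a Bayesian update sitting above it:

```python
import numpy as np

labels = ["goat", "bird", "giraffe"]

# Prior knowledge encoded in the Bayesian layer: in this scene,
# goats are far more plausible than birds or giraffes.
prior = np.array([0.90, 0.08, 0.02])

# Mock output of the Deep Learning "sensor": a confused likelihood
# that slightly favours "bird" for a goat standing in a tree.
likelihood = np.array([0.30, 0.45, 0.25])

# Bayes' rule: posterior is proportional to likelihood times prior.
posterior = likelihood * prior
posterior /= posterior.sum()

print(labels[int(np.argmax(posterior))])  # the prior overrules the sensor: "goat"
```

The point of the design is that the net's mistakes stay contained: its output is treated as evidence to be weighed, not as a final verdict, and the Bayesian layer's reasoning remains inspectable.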
Deep Learning and Bayesian Modelling, building the automation of the future
I wish I could say that this is a novel idea, but while browsing the literature I bumped into the Attention Schema Theory (AST), a neuroscience theory of how we came to be aware of ourselves: A New Theory Explains How Consciousness Evolved. Here are some important parts of that article (with my emphasis):
“Even before the evolution of a central brain, nervous systems took advantage of a simple computing trick: competition. Neurons act like candidates in an election, each one shouting and trying to suppress its fellows. At any moment only a few neurons win that intense competition, their signals rising up above the noise and impacting the animal’s behavior. This process is called selective signal enhancement, and without it, a nervous system can do almost nothing. Selective signal enhancement is so primitive that it doesn’t even require a central brain. [...] The cortex is like an upgraded tectum. Unlike the tectum, which models concrete objects like the eyes and the head, the cortex must model something much more abstract. According to the AST, it does so by constructing an attention schema—a constantly updated set of information that describes what covert attention is doing moment-by-moment and what its consequences are.”
In short, the AST theory of brain evolution is very similar to the idea of having Deep Learning nets, pattern matching and Bayesian models put together, one on top of the other! In such a stack, the Bayesian rules at the top can be built:
- on top of 3rd party APIs
- based on real-time data
- based on Deep Learning prediction models (learning parameters from ML models)
- based directly on top of the ML model
In my next blog post, I will take one real example to show how all these blocks work together. Stay tuned!
PS: from Judea Pearl
The philosopher Stephen Toulmin (1961) identifies the model-based vs. model-blind dichotomy as the key to understanding the ancient rivalry between Babylonian and Greek science. According to Toulmin, the Babylonian astronomers were masters of black-box prediction, far surpassing their Greek rivals in accuracy and consistency (Toulmin, 1961, pp. 27–30). Yet science favored the creative-speculative strategy of the Greek astronomers, which was wild with metaphysical imagery: circular tubes full of fire, small holes through which celestial fire was visible as stars, and a hemispherical earth riding on turtle backs. Yet it was this wild modeling strategy, not Babylonian rigidity, that jolted Eratosthenes (276–194 BC) to perform one of the most creative experiments in the ancient world and measure the radius of the earth. This would never have occurred to a Babylonian curve-fitter.
Coming back to strong AI, we have seen that model-blind approaches have intrinsic limitations on the cognitive tasks that they can perform. We have described some of these tasks and demonstrated how they can be accomplished in the SCM framework, and why a model-based approach is essential for performing these tasks. Our general conclusion is that human-level AI cannot emerge solely from model-blind learning machines; it requires the symbiotic collaboration of data and models. Data science is only as much of a science as it facilitates the interpretation of data – a two-body problem, connecting data to reality. Data alone is hardly a science, regardless of how big it gets and how skillfully it is manipulated.