Introducing Machine Wisdom

The first post where we explain what Machine Wisdom is all about

Posted by Younes Abouelnagah on 4 Sep, 2016
In Predictive Modeling, Stories Of Brilliance, Prescriptive Modeling, Software Development, Big Data, Editorials
Tags math, stats, applied math, control theory, computer science, distributed systems, managing expectations

NOTE: If you have read the bits and pieces of text on our homepage, you have read all the content of this post. I am just putting it all together here in one place. Please follow us on Twitter and stay tuned for our blog posts. Also, please join the conversation on our Gitter rooms.

To be wise is to find patterns in what has passed to foresee what is yet to become. To be conscious of such patterns, wise people spend a lot of time contemplating and patiently exercising loops of thought that take them from one idea to another. A computer program can exercise billions of loops in a single wall-clock second, allowing it to go through an amount of data that would take a human years to read. This is the edge machines have in terms of wisdom: they are fast and they don’t know boredom.

Machines don’t have any edge in terms of cognitive capacity, because it is we humans who have to tell them what to do in each iteration of the loop in order to make any progress. Even the notion of progress has to be simplified for machines to comprehend it. Machines of the currently prevalent von Neumann sort can only deal with numbers, and they are very clever when it comes to comparing something to zero. Humans devised a mathematical trick that allows machines to comprehend progress, but only if the goal is to minimize a particular outcome function. This trick is called mathematical optimization, with convex optimization as its best-behaved form, and it is at the heart of almost all methods that allow machines to learn something from data.

What do machines learn anyway? They learn whatever the programmer tells them they should find in the data. For example, if the programmer tells the machines that there should be a straight line passing through the data points, then the machines will go look for the best way to reconcile the data they see with this claim. They will use a measure of success given to them, in the form of a loss function, and they will keep measuring how successful they are in reconciling the data with the programmer’s claim. By using optimization methods such as gradient descent, they work to minimize the loss function in order to make progress towards reconciling the programmer’s claim with the data at hand; this is called model fitting. Such dumb minions will never counter the programmer’s claim by proposing something else according to the data, even if the model does not fit the data at all. They will always form a view of the world that is consistent with the programmer’s claim, and they will not appear to be very intelligent when you ask them to make predictions!
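The straight-line example can be made concrete with a minimal sketch in plain Python. It assumes mean squared error as the loss function and a fixed learning rate; the function names, the toy data, and the step count are illustrative choices, not anything prescribed by a particular library.

```python
def mean_squared_error(slope, intercept, xs, ys):
    """Loss function: the average squared gap between the line and the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by plain gradient descent on the loss."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        # Step downhill: this is the only notion of "progress" the machine has.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data that really does lie on a line (y = 2x + 1): the claim fits.
line_xs = [0.0, 1.0, 2.0, 3.0, 4.0]
line_ys = [1.0, 3.0, 5.0, 7.0, 9.0]
line_fit = fit_line(line_xs, line_ys)
line_loss = mean_squared_error(*line_fit, line_xs, line_ys)

# Toy data on a parabola (y = x^2): the machine still returns a line.
curve_xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
curve_ys = [4.0, 1.0, 0.0, 1.0, 4.0]
curve_fit = fit_line(curve_xs, curve_ys)
curve_loss = mean_squared_error(*curve_fit, curve_xs, curve_ys)

print(line_fit, line_loss)
print(curve_fit, curve_loss)
```

On the first data set the loss shrinks to essentially zero. On the second, the best the machine can do is a flat line near y = 2 with a loss of about 2.8, and it reports that line without complaint.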

Luckily, humans continue to come up with all sorts of objective functions, loss functions, and learning algorithms which are suitable for different kinds of data. The brilliance of some humans makes machines appear to be intelligent; Alex Krizhevsky et al. are a good example. In 2012, they came up with a way to get machines to learn how to identify objects in pictures. They used a family of algorithms called Neural Networks, but they did it in a way that came to be popularly known as Deep Learning. They were not the first to do this, and they didn’t coin the term Deep Learning; it is just a buzzword that is not well defined after all. The brilliance of their method IMHO was in the way they formulated the problem to allow their machines to identify an object in a picture with roughly 10 percentage points more accuracy than the machines of the next best participant in the ImageNet competition. They did that by showing their machines the photos 10 times, and they didn’t just show the machines the photos over and over; they did something very smart. Each time, they showed the machine a different section of each photo, and sometimes they used a mirror image of the photo. This allowed their machines to find patterns that generalize better, leading to their accuracy when asked about photos they hadn’t seen before. Of course, their use of a deep neural network is also brilliant, and their implementation was amazing craftsmanship. However, I believe that their way of passing insights to the machines, through showing them the data many times, is pure mastery in “the science of coaching machines”. Notice that they never showed the machines an upside-down version of the photos, because this would not help in the task at hand, where images are already right-side-up.
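The idea of showing a different section of each photo, sometimes mirrored, can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the authors’ implementation: the image is a small grid of numbers standing in for pixels, and the names `random_crop`, `mirror`, and `augmented_views`, the crop size, and the 50% mirroring probability are all assumptions made for the example.

```python
import random


def random_crop(image, crop_height, crop_width, rng):
    """Cut a randomly placed crop_height x crop_width window out of the image."""
    top = rng.randrange(len(image) - crop_height + 1)
    left = rng.randrange(len(image[0]) - crop_width + 1)
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]


def mirror(image):
    """Flip the image left to right. There is deliberately no upside-down flip."""
    return [row[::-1] for row in image]


def augmented_views(image, crop_height, crop_width, count, seed=0):
    """Produce `count` views of one image: random crops, mirrored half the time."""
    rng = random.Random(seed)
    views = []
    for _ in range(count):
        view = random_crop(image, crop_height, crop_width, rng)
        if rng.random() < 0.5:
            view = mirror(view)
        views.append(view)
    return views


# A hypothetical 4x4 "photo" whose pixel values are just 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
views = augmented_views(image, crop_height=3, crop_width=3, count=10)
for view in views[:2]:
    print(view)
```

Every view is still a faithful piece of the original photo, so the label stays correct while the machine sees the object in slightly different positions and orientations.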

I think of Machine Learning as a science of taming a wonderful tool, to get it to do something amazing by repeating simple steps. The repetition, and the faithfulness of the machines to whatever the programmer tells them, make the process reminiscent of following a Zen master to achieve a wisdom of sorts. It is such a beautiful and romantic analogy, especially since lots of programmers (myself included) would love to think of themselves as Zen masters. But I’d rather get down to reality and call things what they really are: it is a control system. Well, it is a very special one, because it is not grounded in Control Theory. Actually, it seems that multiple disciplines have been approaching the same problem from different angles and have finally converged on a new science that is yet to be named.

Another contributor to the great success of machines in acquiring wisdom is the innovation in Distributed Systems, spearheaded by Google since the early 2000s. Earlier I claimed that machines look into data and find patterns using very simple repetitive loops, crafted by humans to make the task mind-numbingly simple. The strength of machines is the speed with which they can attain this wisdom, especially if they are working together in a distributed system. This speed allows machines to go through billions of observations in seconds, growing their collective consciousness by an amount equivalent to the growth of a human over several years of persistent study. This scalability became possible because of a simplification of distributed systems that requires algorithm programmers to be even more crafty in developing their algorithms. For an algorithm to be scalable, it has to be written so that multiple worker machines can make progress independently, sharing nothing.

Shared-nothing distributed systems are very scalable because they have no locks, so workers don’t need to wait to acquire a shared resource. The first such system to become mainstream was the Map/Reduce processing framework, which Google published in a 2004 paper and which was later adopted in the Apache Hadoop open source project. The strength of Map/Reduce lies in how workers are spawned on the machines containing the data, thus reducing the need to move data back and forth. More modern frameworks in the Map/Reduce tradition, like Apache Spark, reduce the disk and network I/O overhead even further, especially for iterative workloads. This allows programmers to implement machine learning algorithms efficiently on Spark, shifting the role of the Hadoop ecosystem from being more of a data preparation tool-set to being an end-to-end toolbox for creating machine wisdom applications.
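To make the map and reduce steps less abstract, here is a minimal single-process sketch of how one gradient computation for a straight-line model could be split across workers that share nothing. It does not use Hadoop or Spark; the partitions, the function names, and the squared-error loss are assumptions chosen for illustration, and the "workers" simply run one after another.

```python
from functools import reduce


def map_partial_gradient(partition, slope, intercept):
    """Map step: one worker summarizes its own (x, y) pairs, sharing nothing."""
    grad_slope = sum(2.0 * (slope * x + intercept - y) * x for x, y in partition)
    grad_intercept = sum(2.0 * (slope * x + intercept - y) for x, y in partition)
    return grad_slope, grad_intercept, len(partition)


def reduce_partials(left, right):
    """Reduce step: combine two partial results by element-wise addition."""
    return tuple(a + b for a, b in zip(left, right))


def distributed_gradient(partitions, slope, intercept):
    # Each call below is independent, so a real framework could run them on
    # separate machines; here they run sequentially for illustration only.
    partials = [map_partial_gradient(p, slope, intercept) for p in partitions]
    total_slope, total_intercept, count = reduce(reduce_partials, partials)
    return total_slope / count, total_intercept / count


partitions = [
    [(0.0, 1.0), (1.0, 3.0)],  # data held by hypothetical worker 1
    [(2.0, 5.0), (3.0, 7.0)],  # data held by hypothetical worker 2
    [(4.0, 9.0)],              # data held by hypothetical worker 3
]
gradient = distributed_gradient(partitions, slope=0.0, intercept=0.0)
print(gradient)  # prints (-28.0, -10.0)
```

Because addition does not care how the data was split, the result is the same gradient a single machine would compute over all five points. Only three small tuples have to travel over the network, never the data itself.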

The last sentence intentionally uses “machine wisdom applications” as if it is really something, in a tongue-in-cheek fashion. I don’t want to coin a term, but I also hope that you have found it as acceptable as “machine intelligence”, “artificial intelligence”, “machine cognition”, or any other such terms that we have learned to accept. I believe that “wisdom” is more accurate than “intelligence” for describing the cognitive capacity of machines, or rather the lack thereof. I love artificial intelligence, machine learning, data mining, statistical modeling, and whatever else you could call the family of algorithms that get machines to do something that the programmer did not actually code. I don’t care what they are called as long as people realize they are just metaphors. It helps if people remember that all such algorithms can very easily fall into the pitfall of “overfitting” the training data, which renders what they learn useless when they are evaluated on new data points. This is much like a wise shaman who has never left the village: the wisdom is perfect for leading life in the village, but it is not generally applicable. The difference is that the shaman has intelligence and may be able to rectify the wisdom to be more applicable to different situations. On the other hand, we have not yet devised any machine learning algorithm that can do this!
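Overfitting is easy to reproduce on a toy problem. The sketch below compares a model that memorizes every training point (a polynomial forced through all of them) with a humble straight line. Everything here is an assumption made for illustration: the true relationship y = 2x + 1, the alternating +0.5 and -0.5 offsets standing in for measurement noise, and the function names.

```python
def lagrange_predict(train_xs, train_ys, x):
    """Evaluate the polynomial that passes through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_xs, train_ys)):
        weight = 1.0
        for j, xj in enumerate(train_xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def least_squares_line(xs, ys):
    """Closed-form best-fit line; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    return 2.0 * x + 1.0


# "The village": ten noisy observations at x = 0..9.
train_xs = [float(i) for i in range(10)]
train_ys = [truth(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_xs)]
# "Outside the village": unseen points halfway between the training points.
test_xs = [x + 0.5 for x in train_xs[:-1]]
test_ys = [truth(x) for x in test_xs]

slope, intercept = least_squares_line(train_xs, train_ys)


def memorizer(x):
    return lagrange_predict(train_xs, train_ys, x)


def line(x):
    return slope * x + intercept


memorizer_train_error = mean_squared_error(memorizer, train_xs, train_ys)
memorizer_test_error = mean_squared_error(memorizer, test_xs, test_ys)
line_train_error = mean_squared_error(line, train_xs, train_ys)
line_test_error = mean_squared_error(line, test_xs, test_ys)

print("memorizer:", memorizer_train_error, memorizer_test_error)
print("line:     ", line_train_error, line_test_error)
```

The memorizer scores a perfect zero error on the training points and then swings wildly on the unseen ones, while the line, which looks worse on the training data, does far better on the new points.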

You can comment about this blog post on its Gitter room.