Reinforcement Learning: Artificial Intelligence with all the time in the world

In his book "Mastery," the American writer Robert Greene argues that it takes more than 10,000 hours of practice to master an activity. That equates to 4 hours a day, 6 days a week, for 8 years. The idea seems reasonable in light of the hours that professional athletes and musicians dedicate to practice from an early age. Similarly, it seems reasonable to think that it takes a person about five years of full-time work to become an expert in their field.

An illustrative example of this idea is the medieval apprenticeship system, in which teenagers spent seven years working in a master's workshop. There, by observing their mentor, through endless repetition and the punishment of error, the young apprentices learned to master the craft. Today, fortunately under friendlier circumstances, the same principles of mentoring, repetition, and error avoidance apply to any learning process.


According to Greene, a mentor's guidance avoids wasting valuable years on experimentation; without it, the time needed to achieve mastery is much greater. But what if you had all the time in the world? What if we could travel to a dimension where time runs at a different pace, spend tens or hundreds of years experimenting with an activity without consequences, and return home with the acquired skill after only weeks or months? That is precisely what the area of artificial intelligence (AI) known as Reinforcement Learning achieves.

Reinforcement Learning (RL) studies techniques that let a program choose which actions to take in an environment in order to maximize a reward. In the words of Richard S. Sutton and Andrew G. Barto, two of the field's main pioneers, RL is learning what to do: discovering, by trying them, the actions that produce the greatest reward. Humans and other animals use this same learning mechanism, so the development of RL runs parallel to, and feeds back into, psychology and neuroscience.

Everything starts from a very simple idea, the action-reward cycle, which works as follows. At time t, the environment around us is in a state St, in which we take an action At. Our action produces a change in the environment, so at the next moment t+1 we are in a new state St+1 and we receive a reward signal Rt+1, which tells us whether the result of our action was good. With this information we choose an action At+1, which at the next moment t+2 brings us to state St+2 and a new reward signal Rt+2. The cycle continues like this until our interaction with the environment ends.
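The cycle can be sketched in a few lines of code. The toy environment below is my own hypothetical illustration, not from the article: an agent starts at position 0 on a line and must reach position 5; each step costs a small penalty, and reaching the goal pays a bonus.

```python
import random

random.seed(0)  # make the random walk reproducible

def step(state, action):
    """The environment: given state S_t and action A_t, return S_{t+1} and R_{t+1}."""
    next_state = state + action
    if next_state == 5:
        return next_state, 10.0   # reached the goal
    return next_state, -1.0       # every other step is mildly penalized

state, t, total_reward = 0, 0, 0.0
while state != 5:
    action = random.choice([-1, 1])      # a naive agent that acts at random
    state, reward = step(state, action)  # the environment responds with S_{t+1}, R_{t+1}
    total_reward += reward
    t += 1

print(f"Reached the goal after {t} steps, total reward {total_reward}")
```

A real RL agent would replace the random choice with a policy that it improves from the reward signals it receives; here the loop only shows the bare interaction cycle.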


It's like a little girl learning to walk. To get to where her dad is, she shyly takes a step, the situation changes, and she receives a reward signal: the pain of a fall, or words that encourage her to keep going. She takes another step and receives a new signal, then another and another, until she reaches Dad's arms, or falls and ends up in his arms anyway, but crying. Eventually, without rules or conscious reasoning, the girl learns the kinds of movements and the speed that let her walk safely and get where she wants to go without falls or pain.

Variations of this model, introduced to reflect the characteristics of different types of problems, generate the research lines of RL. For example, what if the available actions form a continuum, such as the steering positions of a car? How much importance should be given to the immediate reward (the pain) and how much to the final reward (getting to Dad)? How do you deal with interactions that never end? Should an action be chosen with a rule that indicates the optimal action for each state of the environment, or based on a way to estimate the total reward that can be expected from an action? These and other variations give rise to different models, techniques, and algorithms for learning in each set of circumstances.
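One of these questions, how to weigh the immediate reward against the final one, is commonly handled with a discount factor, usually written gamma, between 0 and 1: each future reward counts a little less the further away it is. The sketch below (my own illustration, not from the article) computes this discounted sum for the little girl's walk: three painful steps followed by reaching Dad.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Three painful steps (-1 each) followed by reaching Dad (+10):
rewards = [-1.0, -1.0, -1.0, 10.0]

print(discounted_return(rewards, gamma=1.0))  # prints 7.0: undiscounted, the +10 outweighs the pain
print(discounted_return(rewards, gamma=0.5))  # prints -0.5: heavily discounted, the distant +10 barely counts
```

Choosing gamma close to 1 makes the agent patient and far-sighted; choosing it close to 0 makes the agent chase immediate rewards, which is exactly the trade-off posed above.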

Armed with these tools, in environments simulated on powerful computer networks, RL systems rehearse a problem thousands or millions of times until they find the actions that maximize the defined reward, without prior rules or knowledge. In this way, they experience the equivalent of hundreds or thousands of years in weeks or months. According to the article describing the AlphaGo Zero project, a system that learned the game of Go by itself and defeated the program that beat South Korean champion Lee Sedol, the program trained on 29 million games. This would have taken a human being 3,700 years.


Robotic Hand Solves Rubik's Cube with Reinforcement Learning

Today researchers explore the application of RL in areas that include, in addition to game playing, training robots to do specific tasks, allocating resources in cloud computing services, reducing congestion with traffic-light systems, optimizing chemical reactions, and bidding for advertising on portals and social networks. In fact, any situation that can be modeled and simulated can be analyzed with RL, which opens up interesting possibilities in fields closer to the social and economic disciplines, such as strategy and negotiation.

While the recent development of RL has been strengthened by the use of deep neural networks to estimate the expected rewards of an action, it still has limitations. Like all AI tools, it does not generate knowledge that can be applied in a domain other than the one that gave rise to it; very complex situations become unfeasible to model and simulate, especially when we move to the physical realm, where the variables to be controlled are many; and, finally, the current state of robotics restricts practical applications.

Despite this, in my opinion reinforcement learning is one of the areas with the greatest potential to increase our human capabilities. Having a tool to rehearse a problem repeatedly until we find the actions that solve it in the best way, saving years of experimentation, in a simulated environment and without consequences, represents a great opportunity. Not so that a system can solve things for us, but so that we can acquire the mastery Robert Greene speaks of not after years of experience, but after only weeks or months of studying simulations.
