The protagonist of this article is AlphaGo, the Go AI developed by the Google DeepMind team. It attracted worldwide attention in 2016 by defeating Lee Sedol, one of the world's top players. Go is an ancient board game in which every move offers a huge number of choices, so the next position is very hard to predict, and players need strong intuition and abstract thinking. Because of this, people long believed that only humans could play Go well. Most researchers even believed it would take decades for AI to acquire this kind of thinking. Yet it has now been two years since AlphaGo played Lee Sedol (March 9 to March 15, 2016), and this article commemorates that great event!
Even more frightening, AlphaGo did not stop improving. Eight months later, playing on a Go website under the name "Master", it played 60 games against professional champions from around the world and won every one of them.
This is of course a huge achievement in the field of artificial intelligence, and it has set off a new wave of discussion around the world: should we be excited about the speed of AI development, or worried?
Today we will take DeepMind's original research paper, published in Nature, and give a simple and clear interpretation of it, explaining what AlphaGo is and how it works. I also hope that after reading this article you will no longer be intimidated by sensational media headlines, and will instead feel genuinely excited about the development of artificial intelligence.
You don't need to know how to play Go to follow this article. In fact, I have only read a little about Go in an online encyclopedia. Instead, I use basic chess examples to explain the algorithms. You only need to understand the basic rules of a two-player board game: the players take turns making moves, and in the end there is a winner. Beyond that, you don't need to know any physics or advanced mathematics.
This keeps the barrier to entry low, so that readers who are new to machine learning or neural networks can follow along. I have also deliberately kept the writing simple, in the hope that everyone can focus on the content itself.
As we all know, the goal of the AlphaGo project was to build an AI program that could compete with the world's top human players at Go.
To understand the challenges Go poses, let's first talk about a similar board game: chess. In the 1990s, IBM built the Deep Blue computer, which defeated the great world champion Garry Kasparov at chess in 1997. So how did Deep Blue do it?
In fact, Deep Blue used a very "brute-force" approach. At each step of the game, Deep Blue considered all possible legal moves and followed each one forward to analyze how the game might develop. This look-ahead analysis quickly produced a huge, ever-branching decision tree. Deep Blue then worked its way back along the tree to the root, checking which moves were most likely to produce positive results. But what is a "positive result"? In fact, many excellent chess players carefully designed chess strategies for Deep Blue to help it make better decisions: for example, should it protect the king or try to gain an advantage elsewhere on the board? They built a dedicated "evaluation function" for this purpose, which compares the strengths and weaknesses of different board positions (IBM hard-coded the experts' chess strategies into the evaluation function). In the end, Deep Blue chose its move based on these careful calculations. On the next turn, the whole process repeated.
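The look-ahead-then-evaluate idea described above can be sketched as a depth-limited minimax search. This is only an illustration under stated assumptions, not Deep Blue's actual program: the toy game (players alternately take 1 to 3 stones, and whoever takes the last stone wins), the function names, and the hand-written heuristic in `evaluate` are all made up for this example.

```python
def legal_moves(stones):
    """Moves in the toy game: take 1, 2, or 3 stones."""
    return [n for n in (1, 2, 3) if n <= stones]

def evaluate(stones, maximizing):
    """Hand-written heuristic, standing in for an expert-tuned evaluation function."""
    if stones == 0:
        # The previous player took the last stone and won.
        return -1.0 if maximizing else 1.0
    # Rule of thumb (assumption): the side to move is better off
    # when the pile is not a multiple of 4.
    score = 0.5 if stones % 4 != 0 else -0.5
    return score if maximizing else -score

def minimax(stones, depth, maximizing):
    """Explore every move down to a fixed depth, then fall back on the heuristic."""
    if stones == 0 or depth == 0:
        return evaluate(stones, maximizing)
    values = [minimax(stones - m, depth - 1, not maximizing)
              for m in legal_moves(stones)]
    return max(values) if maximizing else min(values)

def best_move(stones, depth=4):
    """Pick the move whose subtree looks best for the player to move."""
    return max(legal_moves(stones),
               key=lambda m: minimax(stones - m, depth - 1, False))

print(best_move(5))  # takes 1 stone, leaving the opponent with 4
```

Every legal move is expanded at every level, which is exactly why this style of search becomes impractical when the number of moves per position is large.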
This means Deep Blue considered millions of theoretical positions before each move. The most impressive thing about Deep Blue was therefore not its artificial intelligence software but its hardware: IBM claimed that Deep Blue was one of the most powerful computers on the market at the time. It could evaluate 200 million board positions per second.
Let us now return to Go. Go is far more open-ended, so repeating Deep Blue's strategy here does not give the desired results. Each move has so many possible locations that the computer simply cannot cover them all. For example, at the start of a chess game there are only 20 possible moves; in Go, the first player has 361 possible points to choose from, and the range of choices remains very wide throughout the game.
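As a rough illustration of how quickly the search tree grows, the sketch below raises the opening-move counts quoted above (20 for chess, 361 for Go) to the power of the number of moves looked ahead. Holding the number of choices constant is a simplifying assumption made only for this example; in a real game the number of legal moves changes from turn to turn.

```python
def count_sequences(branching_factor, moves_ahead):
    """Number of move sequences if every position offered the same number of moves."""
    return branching_factor ** moves_ahead

# Opening-move counts taken from the text; keeping them constant is an assumption.
CHESS_OPENING_MOVES = 20
GO_OPENING_MOVES = 361

for moves_ahead in (1, 2, 4):
    chess = count_sequences(CHESS_OPENING_MOVES, moves_ahead)
    go = count_sequences(GO_OPENING_MOVES, moves_ahead)
    print(f"{moves_ahead} moves ahead: chess ~{chess:,} sequences, Go ~{go:,} sequences")
```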
This is the so-called "enormous search space." Moreover, in Go it is not easy to judge how favorable or unfavorable a particular board position is: even late in a game, the two sides still have to play on for a while before it is finally clear who the winner is. But is there some magical way to make computers good at Go? The answer is yes: deep learning can accomplish this daunting task!
In this study, DeepMind uses neural networks to accomplish two tasks. They trained a "policy neural network" to decide which moves are the most sensible in a particular board position (this is similar to following an intuitive strategy to choose a move). In addition, they trained a "value neural network" to estimate how favorable a particular board position is for the player (in other words, how likely that position is to lead to winning the game). They first trained these neural networks on human games (the most traditional but still very effective supervised learning approach). After this training, the AI can imitate the way humans play to a certain extent; at this point it is like a novice human player. Then, to train the networks further, DeepMind had the AI play against itself millions of times (this is the "reinforcement learning" part). With all this extra practice, the AI's playing strength improved greatly.
With just these two networks, DeepMind's AI was already able to play at the same level as the most advanced Go programs of the time. The difference is that those earlier programs used a popular existing game-playing algorithm, namely "Monte Carlo Tree Search" (MCTS), which we will introduce later.
But obviously, we haven't reached the real core yet. DeepMind's AI does not rely only on the policy and value networks, and it does not use these two networks to replace Monte Carlo tree search; instead, it uses the neural networks to make the MCTS algorithm more effective. The results were indeed impressive: the performance of MCTS reached superhuman heights. This improved variant of MCTS is "AlphaGo", which defeated Lee Sedol and became one of the biggest breakthroughs in the history of artificial intelligence.
Let us recall the first part of this article. As mentioned above, Deep Blue builds a decision tree containing millions of board positions and moves at every step of a chess game: the computer has to simulate, observe, and compare every possible move. This is a simple and very straightforward approach. If an ordinary software engineer had to design a chess program, they would probably choose a similar solution.
But let us think about how humans play chess. Suppose you are at a particular point in a game. According to the rules, you have a dozen or so different options: move a pawn here, move the queen there, and so on. But do you really list every possible move in your head and choose from that long list? No. You "intuitively" narrow the options down to a few key moves (say you come up with 3 sensible moves), and then think about how the position on the board would change if you chose each of them. For each of these moves you may spend 15 to 20 seconds thinking, but note that in those 15 seconds you are not precisely working out every subsequent exchange. In fact, humans tend to "roll out" a few intuitively guided choices without much deliberation (of course, good players think further and deeper than ordinary players). You do this because your time is limited and you cannot accurately predict what your opponent will plan in reply. So you can only let your instincts guide you. I call this part of the thinking process a "rollout"; please keep it in mind for what follows.
After "rolling out" a few sensible moves, you finally decide to give up on this headache and simply play the move you think is best.
After that, the opponent responds. The response may be one you already anticipated, which means you are more confident about what to do next; in other words, you don't have to spend much time on further rollouts. Or your opponent may play a clever move that forces you onto the defensive and makes you think more carefully about your next move.
The game continues in this way, and as it progresses you can predict the outcome of each move more easily, so the time spent on rollouts shrinks accordingly.
The reason I have said so much is to explain the role of the MCTS algorithm in a relatively simple way: it imitates the thinking process above by repeatedly building a "search tree" of moves and positions. The innovation is that MCTS does not try every potential move at every position (unlike Deep Blue); instead, it intelligently selects a small set of reasonable moves and explores them. During exploration, it "rolls out" the changes in the position that these moves lead to and compares them based on the computed results.
(If you have understood everything so far, you have basically got what you need from this article.)
Now let's go back to the paper itself. Go is a "perfect information game." That is to say, in theory, no matter which stage of the game you are in (even if only one or two moves have been played), it is possible to determine exactly who will win and who will lose (assuming both players play "perfectly" for the rest of the game). I don't know who first proposed this basic theory, but as a premise of this research project it is really important.
In other words, for a game state s, there is a function v*(s) that predicts the final result, for example the probability of winning this game, ranging from 0 to 1. Researchers at DeepMind call this the "optimal value function." Since some board positions are more likely to lead to a win than others, those positions have a "higher value". Let me emphasize again: value = the probability, between 0 and 1, of winning the game.
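For a very small game, an optimal value function can be computed exactly by exhaustive search. The sketch below does this for a toy stone-taking game chosen purely for illustration (it is not from the paper): players alternately take 1 to 3 stones, and whoever takes the last stone wins. The name `v_star` mirrors v*(s); with perfect play the result is certain, so in this toy case the value is exactly 0 or 1.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def v_star(stones):
    """Return 1 if the player to move wins with perfect play, else 0."""
    if stones == 0:
        # No stones left: the opponent took the last one and won.
        return 0
    # The mover wins if at least one move leaves the opponent in a losing state.
    return max(1 - v_star(stones - take)
               for take in (1, 2, 3) if take <= stones)

for stones in range(1, 9):
    print(stones, v_star(stones))
```

The recursion visits every reachable state, which is feasible here only because the game is tiny.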
But hold on. Suppose a girl named Foma is sitting next to you, and every time you make a move she tells you whether it leads to a win or a loss. "You're winning... you're losing... no, still losing..." I think hints like this don't help much with choosing your moves, and they are very annoying. What would really help is to draw out the whole tree of possible moves and the states they lead to, and then have Foma tell you, across the whole tree, which states lead to victory and which lead to defeat. Suddenly Foma becomes your perfect partner rather than an annoying interrupter. Here, Foma serves as your optimal value function v*(s). Previously, people believed that games like Go could not have an accurate value function like Foma because there was too much uncertainty.
However, even if you do have Foma, her estimates of all possible board positions may not be usable in a real game. In games like chess or Go, as mentioned before, even looking just seven or eight moves ahead produces so many possibilities that it would take Foma a great deal of time to come up with results.
In other words, Foma alone is not enough. You need to narrow down the set of sensible moves before working out where each one leads. So how can our program do this? This is where Lusha makes her debut. Lusha is a highly skilled player who has spent decades watching grandmasters play. She can look at your board position, quickly consider all the reasonable moves available to you, and tell you how likely a professional player would be to choose each one. So if you have 50 possible moves at some point, Lusha will tell you the probability that a professional would pick each of them. Naturally, the sensible moves get a high probability, while pointless moves get a very low one. She is your policy function, p(a|s). For a given state s, she gives you the probability of each move a that a professional player might choose.
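To make the idea of a policy function concrete, here is a minimal sketch of turning per-move scores into a probability distribution p(a|s) and keeping only the most likely candidates. The scores, the move names, and the use of a softmax are assumptions for illustration; in AlphaGo the probabilities come out of a trained neural network.

```python
import math

def policy(move_scores):
    """Turn raw move scores into probabilities p(a|s) using a softmax."""
    highest = max(move_scores.values())
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = {move: math.exp(score - highest) for move, score in move_scores.items()}
    total = sum(exps.values())
    return {move: value / total for move, value in exps.items()}

def top_moves(probabilities, k):
    """Keep only the k moves a professional would most likely play."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:k]

# Hypothetical scores for three candidate moves in some position.
probs = policy({"move_a": 2.0, "move_b": 1.0, "move_c": -3.0})
print(probs)
print(top_moves(probs, 2))
```

Pruning to the top few moves is what keeps the later look-ahead manageable.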
Next, with Lusha's help you can find the more reasonable candidate moves, and Foma will tell you how these moves actually affect the outcome of the game. You can have Foma and Lusha give their advice together, or have Lusha suggest moves first and Foma evaluate the results. Then you pick some of these moves and analyze what follows from them, with Foma and Lusha continuing to guide the predictions; in this way, you get a much more efficient grasp of where the game is heading. This is what "reducing the search space" means in practice: use the value function (Foma) to predict outcomes, and use the policy function (Lusha) to provide expert-level move probabilities that narrow down the moves worth exploring further. This procedure is called "Monte Carlo rollouts". When you return to the current move, you can then take the average value estimate for each option and pick the most suitable move accordingly. However, at this point the approach still performs poorly at Go, because the guidance these two functions provide is still weak.
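The averaging step in Monte Carlo rollouts can be sketched as follows. This is a simplified illustration, not AlphaGo's procedure: it uses a toy stone-taking game (take 1 to 3 stones, whoever takes the last stone wins) and a uniformly random rollout policy, both of which are assumptions made for the example.

```python
import random

def rollout(stones, my_turn, rng):
    """Play random moves to the end; return 1 if we take the last stone, else 0."""
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return 1 if my_turn else 0
        my_turn = not my_turn

def estimate_moves(stones, n_rollouts=2000, seed=0):
    """Average the rollout outcomes that follow each candidate move."""
    rng = random.Random(seed)
    estimates = {}
    for take in range(1, min(3, stones) + 1):
        remaining = stones - take
        if remaining == 0:
            estimates[take] = 1.0  # taking the last stone wins immediately
            continue
        # After our move it is the opponent's turn, so my_turn starts as False.
        wins = sum(rollout(remaining, False, rng) for _ in range(n_rollouts))
        estimates[take] = wins / n_rollouts
    return estimates

print(estimate_moves(3))
```

With a random rollout policy the estimates are noisy and weak, which mirrors the point above: rollouts only become useful once the functions guiding them are strong.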
But that's okay.
First, a more specific explanation. In MCTS, Foma and Lusha do not have to be very skilled at the start. The more games are played, the better the two become at predicting reliable outcomes and moves. What the paper describes as narrowing the search down to high-probability moves is really just a more complicated way of saying: "Lusha helps narrow down the options that need to be considered by providing the probabilities that a professional player would choose them." Previous work mainly used this technique to build strong AI players even with simple policy functions.
Yes, convolutional neural networks are ideal for image processing tasks. And because a neural network takes a specific input and produces a corresponding output, it is essentially a function. This means you can use a neural network to stand in for a highly complex function. Following this idea, you can pass it an image of the board position and let the neural network judge the current situation. The resulting neural networks can serve as very accurate policy and value functions.
Below, we will discuss how Foma and Lusha are trained. To train the policy network (responsible for predicting the moves of professional players), we only need to take human games as training material and apply traditional supervised learning.
In addition, we also want a slightly different version of the policy network, one that is smaller and faster. You can imagine that if Lusha is very experienced, she takes correspondingly longer to process each position. She is better at narrowing down the reasonable moves, but because the whole process is repeated so many times, it takes too long overall. So we train a faster policy network for this job (let's call it... Lusha's younger brother, Jerry? Let's just go with that). Next, once we have trained policy networks that meet our needs using human game data, we can let Lusha play against herself on the Go board to get more practice. This is reinforcement learning at work: building a stronger version of the policy network.
After that, we need to train Foma to estimate value: the probability of winning. The AI plays against itself over and over in a simulated environment, each time observing the final result and learning from its mistakes to get better and better.
Due to space limitations, I will not go into the details of how the networks are trained. You can find more in the paper linked at the end of this article (see the 'Methods' section). In fact, the main purpose of the paper is not to show how researchers can apply reinforcement learning to these neural networks. In an earlier paper, DeepMind described how they used reinforcement learning to teach AI to master Atari games. So in this article I only touch on it briefly, in the overview section. To repeat, AlphaGo's biggest innovation is that DeepMind researchers used reinforcement learning plus neural networks to improve an already popular game-playing algorithm, MCTS. Reinforcement learning is indeed a cool tool. The researchers used it to fine-tune the policy and value function neural networks after conventional supervised training. However, the main role of this research paper is to demonstrate how versatile and effective this tool is, rather than to teach you how to use it.
OK, by now you have a fairly complete picture of AlphaGo. Below, we will dig deeper into the topics mentioned earlier. Inevitably there will be some mathematical formulas and expressions that look "scary", but believe me, they are very simple (I will explain them in detail). So please relax.
So the first step is to train our policy neural network (Lusha), which is responsible for predicting which move a professional player would make. The goal of this neural network is to make the AI play like a human expert. It is a convolutional neural network (as mentioned earlier, this kind of network is very good at image processing) that takes a simplified image of the board layout as input. "Rectifier nonlinearities" are added to the layers of the network architecture, which gives the network as a whole the ability to learn more complex things. If you have trained a neural network before, the "ReLU" layer will probably be familiar. That is what is used here.
The training data here consists of randomly sampled board positions, each labeled with the move a human player chose in that position. This part of the training uses ordinary supervised learning.
Here, DeepMind uses "stochastic gradient ASCENT", with the gradients computed by backpropagation. The aim is to maximize a reward function. The reward function represents the probability that the network predicts the move the human expert actually made; our goal is to make this probability as high as possible. In everyday network training, however, we usually just make a loss function as low as possible, which essentially means shrinking the error between the prediction and the actual label; this is called gradient descent. In their actual implementation of the research paper, they did use conventional gradient descent. You can easily define a loss function that is the opposite of the reward function, so that minimizing the former maximizes the latter.
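The equivalence between the two views can be written out in one line. The notation is an assumption for this sketch: \sigma stands for the weights of the policy network, p_\sigma(a \mid s) for the probability it assigns to the expert's move a in position s, and L for a loss defined as the negative of the log-probability reward.

```latex
\max_{\sigma}\; \log p_{\sigma}(a \mid s)
\quad\Longleftrightarrow\quad
\min_{\sigma}\; L(\sigma), \qquad L(\sigma) = -\log p_{\sigma}(a \mid s)

% Gradient ascent on the reward and gradient descent on the loss take the same step:
\Delta\sigma \;\propto\; \frac{\partial \log p_{\sigma}(a \mid s)}{\partial \sigma}
\;=\; -\,\frac{\partial L(\sigma)}{\partial \sigma}
```

Because the two gradients differ only in sign, climbing the reward and descending the loss update the weights in exactly the same direction.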
This policy network has 13 layers, and we call it the "SL policy" network (SL stands for Supervised Learning). The data it uses comes from a very popular website where millions of users play Go. So how well does the SL policy network actually perform?
First, it is more accurate than the earlier work of other researchers. As for the "rollout policy", you may remember that the researchers trained a faster version of Lusha, which we called Jerry. Here, Jerry plays that role. As you can see, Jerry's accuracy is only half that of Lusha, but he is thousands of times faster! When we apply the MCTS algorithm, Jerry will help us simulate how the game develops much more quickly.
To follow the next section, you don't need to understand reinforcement learning, but you do need to take it on trust that the explanations I give are true and valid. If you want to explore the details and try things out yourself, you may want to read some background material on reinforcement learning first.
Once you have the SL network, trained in a supervised manner on human players' moves, the next step is to let it sharpen its judgment by playing against itself. The implementation is very simple: take the SL policy network, save it in a file, and make a copy.
Then you can fine-tune the copy with reinforcement learning. In this way, the network plays against itself and learns from the results.
However, there is actually a problem with this type of training.
If you only ever practice against the same opponent, and that opponent is being trained along with you, you may not gain much new experience. In other words, the network only learns how to beat that one opponent rather than really mastering Go. That's right, this is overfitting: you do well against a particular opponent, but you cannot necessarily handle all kinds of players. So how do we solve this problem?
The answer is simple. Each time we fine-tune the neural network, it becomes another player with a slightly different style. So we can save each version of the network in a list of "players", each of which plays a little differently. Then, in subsequent training, we randomly pick different versions from the list as opponents. Although they all come from the same neural network, they play slightly differently. And the more we train, the more versions there are to play against. Problem solved!
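The "list of players" idea can be sketched as a training loop that keeps a pool of earlier snapshots and samples opponents from it. Everything here is hypothetical scaffolding: `play_game` and `improve` are placeholders the caller supplies, and the snapshot interval is an arbitrary parameter, not a value from the paper.

```python
import copy
import random

def self_play_training(initial_policy, play_game, improve,
                       iterations, snapshot_every, seed=0):
    """Train against randomly chosen earlier versions instead of one fixed opponent."""
    rng = random.Random(seed)
    current = copy.deepcopy(initial_policy)
    pool = [copy.deepcopy(initial_policy)]  # the list of saved "players"
    for step in range(1, iterations + 1):
        opponent = rng.choice(pool)            # random earlier version
        result = play_game(current, opponent)  # e.g. 1 for a win, 0 for a loss
        current = improve(current, result)     # one fine-tuning update
        if step % snapshot_every == 0:
            pool.append(copy.deepcopy(current))  # save the new style
    return current, pool

# Toy usage with stand-in functions: a "policy" is just a dict with a skill counter.
final, pool = self_play_training(
    {"skill": 0},
    play_game=lambda me, opponent: 1,
    improve=lambda policy, result: {"skill": policy["skill"] + result},
    iterations=10,
    snapshot_every=5,
)
print(final, [p["skill"] for p in pool])
```

Sampling from the whole pool means the current network keeps meeting a variety of styles, which is the point of the fix described above.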
In this training process, the only thing guiding the training is the ultimate goal: winning the game. We no longer need to train the network toward specific targets, such as capturing more territory on the board. We simply give it all the legal moves it can choose from and tell it: "You must win." This is why reinforcement learning is so powerful: it can be used to train a policy or value network for any game, not just Go.
Here, DeepMind researchers tested the strength of the RL policy network on its own, without using any MCTS algorithm. As mentioned before, this network can take a board position directly and work out the probabilities of a professional player's moves, so at this point it can already play by itself. As a result, the network fine-tuned with reinforcement learning defeated the supervised learning network trained only on human games. Not only that, it also beat other strong Go programs.
It must be emphasized that even before the RL policy network was trained, the SL policy network already played better than the existing state of the art, and now we have gone even further! More importantly, we did not even need other components such as the value network.
Here, we have finally finished training Lusha. Next, let's return to Foma, who represents the optimal value function v*(s): she tells you the probability of winning from the current position, assuming both players play perfectly from then on. Obviously, to train a neural network to act as our value function, we would need a perfect player... Unfortunately, we don't have one yet. So we send in the strongest player we have: the RL policy network.
The value network takes the current board state, state_s, as input and outputs the probability that you win the game. Each game state serves as a data sample, labeled with the final result of that game. So after a game of 50 moves, we have 50 samples for training the value prediction.
But this approach is actually very naive: after all, we cannot and should not add all 50 positions from the same game to the dataset.
In other words, we must select the training data carefully to avoid overfitting. Since each move changes the board by only one stone, consecutive positions in a Go game are very similar. If the states after every move are all added to the training data with the same label, the data will contain a lot of near "duplicates", which inevitably leads to overfitting. To prevent this, we choose only a few representative game states. For example, we might pick just five states from a game, rather than all 50, to add to the training dataset. DeepMind extracted 30 million states from 30 million different games, reducing the chance of duplicate data. It turns out that this works very well!
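The sampling idea (one position per game, as in the 30 million states from 30 million games) can be sketched like this. The data layout, a list of `(positions, outcome)` pairs, and the function name are assumptions made for the example.

```python
import random

def sample_value_examples(games, seed=0):
    """Keep a single randomly chosen position from each game, labeled with its outcome."""
    rng = random.Random(seed)
    examples = []
    for positions, outcome in games:
        # Positions within one game are highly correlated, so take only one of them.
        examples.append((rng.choice(positions), outcome))
    return examples

# Hypothetical games: each is a list of position labels plus a final result (1 = win).
games = [
    (["g1_move1", "g1_move2", "g1_move3"], 1),
    (["g2_move1", "g2_move2"], 0),
]
print(sample_value_examples(games))
```

Each training example then comes from a different game, so no two examples share a label merely because they sit a move apart in the same game.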
Now let's talk about a concept: we can evaluate the value of a board position in two ways. The first is to use the value function (that is, the function we just trained). The other is to use the existing policy (Lusha) to play the position out directly and see the final result of that rollout. Obviously, a real game will rarely go exactly according to our rollout. But DeepMind still compared how well the two methods work. You can also mix the two options. We will come back to this "mixing parameter" later, so please remember this important concept.
Here, our single neural network, which tries to approximate the optimal value function, does even better than thousands of simulations with the rollout policy! Foma's performance here is really amazing. On the other hand, DeepMind also tried running the rollouts with the Lusha RL policy, which is twice as accurate but very slow, and which needed tens of thousands of simulations to reach a conclusion; the final result was slightly better than Foma's. But it was only slightly better, and much slower. So Foma wins this contest, proving that she has irreplaceable value.
Now that we have finished training the policy and value functions, we can combine them with MCTS and bring you the conqueror of world champions and of a great many masters, the breakthrough of a generation, weighing in at 268 pounds... Alphaaaa GO!
In this section, you will get a deeper understanding of how the MCTS algorithm works. Don't worry, everything covered so far should be enough for you to follow along. The only thing to note is how we use the policy probabilities and the value estimates. We combine the two during rollouts, which narrows down the set of moves that need to be explored each time. Q(s, a) represents the value estimate for playing move a in state s, and u(s, a) represents a term based on the stored probability of that move. I will explain in more detail below.
It should also be noted that the policy network uses supervised learning to predict the moves of professional players. It provides not only the most likely moves but also the specific probability of each one. This probability can be stored with each move. DeepMind calls it the "prior probability" and uses it to select which moves are worth exploring. Basically, to decide whether to explore a particular move, we consider two things. First, how likely are we to win by playing this move? We already have a "value network" that can answer this question. Second, how likely is a professional player to consider this move? (If a professional is unlikely to consider it, why should we waste time exploring it? This part is provided by the policy network.)
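The two considerations can be combined into a single score per move. The sketch below uses a simplified form in which the bonus u(s, a) is the prior probability divided by one plus the visit count, scaled by a constant `c`. The exact form and the constant are assumptions for illustration; the paper's Methods section gives the precise rule.

```python
def select_move(stats, c=1.0):
    """Pick the move maximizing Q + u, where u shrinks as a move gets visited more.

    stats maps each move to a dict with:
      Q - average value seen so far for the move
      P - prior probability from the policy network
      N - number of times the move has been explored
    """
    def score(move):
        entry = stats[move]
        bonus = c * entry["P"] / (1 + entry["N"])  # simplified u(s, a), an assumption
        return entry["Q"] + bonus
    return max(stats, key=score)

# Hypothetical statistics for two candidate moves.
early = {"a": {"Q": 0.5, "P": 0.1, "N": 10}, "b": {"Q": 0.0, "P": 0.9, "N": 0}}
later = {"a": {"Q": 0.5, "P": 0.1, "N": 10}, "b": {"Q": 0.2, "P": 0.9, "N": 100}}
print(select_move(early))  # the unexplored move with a high prior is tried first
print(select_move(later))  # once explored, the move with the better value wins
```

The prior dominates while a move is unexplored, and the measured value takes over as visits accumulate.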
Next, let's talk about the "mixing parameter". As mentioned earlier, we have two options for evaluating each board position. First, we can directly use the value network that we trained to evaluate board states. Second, we can use the existing policy network to quickly play the game out (assuming the opponent also chooses moves in the same way) and see whether we win or lose. In general, the value function does better than ordinary rollouts. Combining the two means giving each estimate a weight, such as 50/50, 40/60, and so on. If the value network's estimate gets a weight of X%, the rollout gets (100-X)%. This is what the mixing parameter means. Its actual effect is explained later.
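The weighting described above is a one-line formula. In this sketch the name `lam` for the mixing parameter and the function name are assumptions; `lam` is the weight given to the rollout result, so the value network gets `1 - lam`.

```python
def leaf_value(value_net_estimate, rollout_outcome, lam=0.5):
    """Blend the value network's estimate with the result of a rollout.

    lam = 0 trusts only the value network, lam = 1 trusts only the rollout.
    """
    return (1 - lam) * value_net_estimate + lam * rollout_outcome

# The value network says 0.8, but the rollout ended in a loss (0).
print(leaf_value(0.8, 0.0, lam=0.5))  # 50/50 split
print(leaf_value(0.8, 0.0, lam=0.0))  # value network only
print(leaf_value(0.8, 0.0, lam=1.0))  # rollout only
```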
After each rollout, you update the search tree with whatever information was obtained in the simulation, so that future simulations are better informed. After all the simulations are over, you simply choose the best move.
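A minimal sketch of this update step: after a simulation, every move on the path that was followed gets its visit count incremented and its average value nudged toward the new result. Two simplifications are assumed here: values are kept from a single player's point of view, and the final choice is the most visited root move (my reading of the paper, which the text above describes only as choosing the best move).

```python
def backup(path, value):
    """Update the statistics of every move on the simulated path."""
    for entry in path:
        entry["N"] += 1
        # Running mean: shift Q toward the new value by 1/N of the difference.
        entry["Q"] += (value - entry["Q"]) / entry["N"]

def most_visited(root_moves):
    """After all simulations, play the move that was explored most often."""
    return max(root_moves, key=lambda move: root_moves[move]["N"])

# Hypothetical root with two candidate moves.
root = {"a": {"N": 0, "Q": 0.0}, "b": {"N": 0, "Q": 0.0}}
backup([root["a"]], 1.0)   # a simulation through "a" ended in a win
backup([root["a"]], 0.0)   # another one through "a" ended in a loss
backup([root["b"]], 1.0)   # one simulation through "b" ended in a win
print(root, most_visited(root))
```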
Let's look at the interesting conclusions!
You should remember that the RL fine-tuned policy network plays better than the SL policy network trained on human games. But when plugged into AlphaGo's MCTS algorithm, the human-trained network in turn outperforms the fine-tuned network. Meanwhile, for the value function (which you can think of as an approximation to perfect judgment), training Foma with the RL policy works better than training her with the SL policy.
"Carrying out all of these evaluations takes a great deal of computing power. We really had to bring out the big guns to run these damn programs."
But what DeepMind really means is...
"Hey, compared to our program, all the previous Go programs play like beginners."
Here again, back to the "mixing parameter". When evaluating positions, giving weight to both the value function and the rollouts works better than relying on either one alone. The rest is DeepMind's in-depth explanation, which leads to an interesting conclusion!
Please read the sentence underlined in red once more. I believe you can now see that this sentence is essentially a summary of the entire research project.