GSoC’18: AlphaGo.jl
Hello, world!
Phase 2 of GSoC is over and AlphaGo.jl is ready! In this post I am going to explain how to use it.
AlphaGo.jl is built to let you try and test the Alpha(Go)Zero algorithm with your own parameters on the game of Go. Today, I'll explain its higher-level methods; for more details, mainly the MCTS implementation, you can check out the repo. It is built using Flux.jl, a machine learning library for Julia.
Environment
GameEnv is an abstract type used to represent a game environment. Setting up the environment is the first thing to do, because the environment stores important information about the game, and the other modules take the environment as input in order to set themselves up using this information.
Example:
env = GoEnv(9)
Here we have set up an environment for Go with a board size of 9x9.
NeuralNet
The NeuralNet structure stores the AlphaZero neural network. This network is made up of three parts: a base network, which branches out into a value network and a policy network.
The base network accepts a Position of the board as input. The value network outputs a single value between -1 and 1 denoting who will win from the given position: -1 means White will win from that position and 1 means Black. The policy network returns a probability distribution over the different actions for that board position.
You can replace any of these three networks with your own Flux model, provided it is consistent with the whole pipeline of the NeuralNet.
mutable struct NeuralNet
    base_net::Chain
    value::Chain
    policy::Chain
    opt

    function NeuralNet(env::T; base_net = nothing, value = nothing, policy = nothing,
                       tower_height::Int = 19) where T <: GameEnv
        if base_net == nothing
            res_block() = ResidualBlock([256,256,256], [3,3], [1,1], [1,1])
            # tower of residual blocks (19 by default, as in AlphaGo Zero)
            tower = [res_block() for i = 1:tower_height]
            base_net = Chain(Conv((3,3), env.planes=>256, pad=(1,1)), BatchNorm(256, relu),
                             tower...) |> gpu
        end
        if value == nothing
            value = Chain(Conv((1,1), 256=>1), BatchNorm(1, relu), x->reshape(x, :, size(x, 4)),
                          Dense(env.N*env.N, 256, relu), Dense(256, 1, tanh)) |> gpu
        end
        if policy == nothing
            policy = Chain(Conv((1,1), 256=>2), BatchNorm(2, relu), x->reshape(x, :, size(x, 4)),
                           Dense(2env.N*env.N, env.action_space), x -> softmax(x)) |> gpu
        end
        all_params = vcat(params(base_net), params(value), params(policy))
        opt = Momentum(all_params, 0.02f0)
        new(base_net, value, policy, opt)
    end
end
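As a quick illustration of that flexibility, below is a rough sketch of constructing a NeuralNet with a custom value head while keeping the default base and policy networks. The layer sizes here are just an example, not something prescribed by the package; the only requirement is that the shapes stay consistent with the rest of the pipeline.

using AlphaGo, Flux

env = GoEnv(9)

# A custom value head: it takes the 256-channel output of base_net and reduces
# it to a single scalar in [-1, 1], so the reshape and the final tanh layer are
# kept from the default head.
my_value = Chain(Conv((1,1), 256=>1), BatchNorm(1, relu),
                 x -> reshape(x, :, size(x, 4)),
                 Dense(env.N*env.N, 128, relu), Dense(128, 1, tanh)) |> gpu

nn = NeuralNet(env; value = my_value, tower_height = 10)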
MCTSPlayer
The MCTSPlayer struct simulates a game using Monte-Carlo Tree Search and a NeuralNet. It takes a NeuralNet and the env as input, and performs the given number of readouts (MCTS simulations) for each move it plays. An MCTSPlayer can perform the following functions:
- Run MCTS
- Pick a move based on MCTS and play it
- Extract data from the games played by it
These functionalities are used during the training and testing phases.
Selfplay
The self-play stage is used in the training phase. In this stage, the MCTSPlayer plays a game against itself; every move is picked using MCTS and then played. After the game ends, the MCTSPlayer object is returned so that data can be extracted from it.
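Using the helpers that appear in the training loop below, one round of self-play looks roughly like this (the variable names are only illustrative):

# One self-play game with the current network, followed by data extraction.
player = selfplay(env, neural_net, readouts)   # MCTSPlayer returned after finishing a game against itself
positions, πs, results = extract_data(player)  # board states, MCTS policies and game results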
Training
The train() method is used to train the model and accepts the following parameters:
- env
- num_games: Number of self-play games to be played

Optional arguments:
- memory_size: Size of the memory buffer
- batch_size
- epochs: Number of epochs to train on the data
- ckp_freq: Frequency of saving the model and weights
- tower_height: The AlphaGo Zero architecture stacks residual networks together; this is called a tower of residual networks. tower_height specifies how many residual blocks are stacked.
- model: Object of type NeuralNet
- readouts: Number of readouts by the MCTSPlayer
- start_training_after: Number of games after which training will be started
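For example, a call that sets most of these explicitly might look like the following; the values here are placeholders rather than recommended settings:

neural_net = train(env, num_games = 1000, memory_size = 200000, batch_size = 32,
                   epochs = 1, ckp_freq = 10, tower_height = 19,
                   readouts = 800, start_training_after = 1000)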
train() starts off with a game of selfplay() using the current best NeuralNet. On completion of the game, the data from that game is extracted: the board states, the policy used at each move, and the result of the game. This data is stored in the memory buffer.
for i = 1:num_games
    # one game of self-play with the current best network
    player = selfplay(env, cur_nn, readouts)
    p, π, v = extract_data(player)

    pos_buffer = vcat(pos_buffer, p)
    π_buffer = vcat(π_buffer, π)
    res_buffer = vcat(res_buffer, v)

    # keep only the most recent memory_size samples in the buffer
    if length(pos_buffer) > memory_size
        pos_buffer = pos_buffer[end-memory_size+1:end]
        π_buffer = π_buffer[end-memory_size+1:end]
        res_buffer = res_buffer[end-memory_size+1:end]
    end

    # train on a batch sampled from memory once enough data has accumulated
    if length(pos_buffer) >= start_training_after
        replay_pos, replay_π, replay_res = get_replay_batch(pos_buffer, π_buffer, res_buffer;
                                                            batch_size = batch_size)
        loss = train!(cur_nn, (replay_pos, replay_π, replay_res); epochs = epochs)
        result = player.result_string
        num_moves = player.root.position.n
        println("Episode $i over. Loss: $loss. Winner: $result. Moves: $num_moves.")
    end

    if i % ckp_freq == 0
        save_model(cur_nn)
        print("Model saved. ")
    end
end
At every training step, batch_size samples are picked from the memory. Features are extracted from the picked board states and fed into the NeuralNet, which outputs the value and policy described above in the NeuralNet section.
We then compute the losses. There are three kinds of losses here: policy loss, value loss and L2 regularisation.
# Policy loss: π is the target policy from MCTS, p is the predicted policy
loss_π(π, p) = crossentropy(p, π; weight = 0.01f0)
# Value loss: z is the actual game result, v is the predicted value
loss_value(z, v) = 0.01f0 * mse(z, v)
The losses are added and backpropagated, after which the optimizer updates the weights. epochs can be specified in the train call to train on this data. Periodically, the NeuralNet and its weights are backed up using BSON.jl.
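Putting these pieces together, a single weight update looks roughly like the sketch below. This is not the exact code from the package; it assumes the Tracker-based Flux API of the time, where back! backpropagates through a scalar loss and the optimizer stored in the struct is a callable that applies the update. The L2 regularisation term is omitted for brevity, and train_step! is just an illustrative name.

using Flux

# One weight update on a replay batch: forward pass through the three networks,
# combined loss, backpropagation, and a parameter update via the stored optimizer.
function train_step!(nn::NeuralNet, x, π, z)
    features = nn.base_net(x)               # shared representation of the board positions
    v = nn.value(features)                  # predicted outcome in [-1, 1]
    p = nn.policy(features)                 # predicted move probabilities
    loss = loss_π(π, p) + loss_value(z, v)  # combined policy and value loss
    Flux.back!(loss)                        # backpropagate through all three networks
    nn.opt()                                # Momentum update of all parameters
    return loss
end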
Play
To play against a saved NeuralNet model, we have to load it using load_model. It accepts the path of the model and the env as parameters, and returns a NeuralNet object.
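For example (the path is only a placeholder, following the argument order described above):

# Load a previously saved model for the 9x9 Go environment.
neural_net = load_model("path/to/saved_model", env)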
play() takes the following arguments:
- env
- nn: an object of type NeuralNet
- tower_height
- num_readouts
- mode: specifies whether the human plays as Black or White. If mode is 0 then the human is Black, else White.
Sample usage
using AlphaGo
# This makes a Go board of 9x9
env = GoEnv(9)
# A NeuralNet with a tower_height of 10 is created, trained, and returned
neural_net = train(env, num_games=100, ckp_freq=10, tower_height=10, start_training_after=500)
# Plays a game against the trained network, with human as White
play(env, neural_net, mode = 1)