Community RSS Feed
http://community.wolfram.com
RSS Feed for Wolfram Community showing any discussions in tag Machine Learning sorted by active

Music Generation with GAN MidiNet
http://community.wolfram.com/groups/-/m/t/1435251
I generated music based on [MidiNet][1]. Most neural network models for music generation use recurrent neural networks; MidiNet, however, uses convolutional neural networks.
MidiNet comes in three models. Model 1 is a melody generator with no chord condition; Models 2 and 3 are melody generators with a chord condition. I try Model 1, because it is the most interesting of the three models compared in the paper.
**Get MIDI data**
-----------------------
My favorite jazz bassist is [Jaco Pastorius][2]. I get MIDI data from [here][3]; for example, the MIDI data of "The Chicken".
url = "http://www.midiworld.com/download/1366";
notes = Select[Import[url, {"SoundNotes"}], Length[#] > 0 &];
The notes contain several instrument styles. I take the bass style from them.
notes[[All, 3, 3]]
Sound[notes[[1]]]
![enter image description here][4]
![enter image description here][5]
I convert the MIDI data to image data. I fix the smallest note unit to be the sixteenth note: I divide the MIDI data into sixteenth-note periods and select the sound found at the beginning of each period. Since the pitch of the SoundNote function ranges from 1 to 128, I convert each bar to a grayscale image (h = 128, w = 16).
First, I create the rule that maps each note pitch (C-1, ..., G9) to a number (1, ..., 128), e.g. C4 -> 61.
codebase = {"C", "C#", "D", "D#", "E" , "F", "F#", "G", "G#" , "A",
"A#", "B"};
num = ToString /@ Range[-1, 9];
pitch2numberrule =
Take[Thread[
StringJoin /@ Reverse /@ Tuples[{num, codebase}] ->
Range[0, 131] + 1], 128]
![enter image description here][6]
Next, I change each bar to image (h = 128*w = 16).
tempo = 108;
note16 = 60/(4*tempo); (* length in seconds of one sixteenth note *)
select16[snlist_, t_] :=
Select[snlist, (t <= #[[2, 1]] <= t + note16) || (t <= #[[2, 2]] <=
t + note16) || (#[[2, 1]] < t && #[[2, 2]] > t + note16) &, 1]
selectbar[snlist_, str_] :=
select16[snlist, #] & /@ Most@Range[str, str + note16*16, note16]
selectpitch[x_] := If[x === {}, 0, x[[1, 1]]] /. pitch2numberrule
pixelbar[snlist_, t_] := Module[{bar, x, y},
bar = selectbar[snlist, t];
x = selectpitch /@ bar;
y = Range[16];
Transpose[{x, y}]
]
imagebar[snlist_, t_] := Module[{image},
image = ConstantArray[0, {128, 16}];
Quiet[(image[[129 - #[[1]], #[[2]]]] = 1) & /@ pixelbar[snlist, t]];
Image[image]
]
soundnote2image[soundnotelist_] := Module[{min, max, data2},
{min, max} = MinMax[#[[2]] & /@ soundnotelist // Flatten];
data2 = {#[[1]], #[[2]] - min} & /@ soundnotelist;
Table[imagebar[data2, t], {t, 0, max - min, note16*16}]
]
(images1 = soundnote2image[notes[[1]]])[[;; 16]]
![enter image description here][7]
**Create the training data**
-----------------------
First, I truncate images1 to an integer multiple of the batch size. With a batch size of 16, its length is 128 bars, about 284 seconds.
batchsize = 16;
getbatchsizeimages[i_] := i[[;; batchsize*Floor[Length[i]/batchsize]]]
imagesall = Flatten[Join[getbatchsizeimages /@ {images1}]];
{Length[imagesall], Length[imagesall]*note16*16 // N}
![enter image description here][8]
MidiNet proposes a novel conditional mechanism that uses the music of the previous bar to condition the generation of the present bar, taking into account temporal dependencies across bars. So each training example for MidiNet (Model 1: melody generator, no chord condition) consists of three parts: "noise", "prev", and "Input". "noise" is a 100-dimensional random vector. "prev" is the image data (1*128*16) of the previous bar. "Input" is the image data (1*128*16) of the present bar. The first "prev" of each batch is all 0.
I generate training data with a batch size of 16 as follows.
randomDim = 100;
n = Floor[Length@imagesall/batchsize];
noise = Table[RandomReal[NormalDistribution[0, 1], {randomDim}],
batchsize*n];
input = ArrayReshape[ImageData[#], {1, 128, 16}] & /@
imagesall[[;; batchsize*n]];
prev = Flatten[
Join[Table[{{ConstantArray[0, {1, 128, 16}]},
input[[batchsize*(i - 1) + 1 ;; batchsize*i - 1]]}, {i, 1, n}]],
2];
trainingData =
AssociationThread[{"noise", "prev",
"Input"} -> {#[[1]], #[[2]], #[[3]]}] & /@
Transpose[{noise, prev, input}];
**Create GAN**
-----------------------
I create generator with reference to MidiNet.
generator = NetGraph[{
1024, BatchNormalizationLayer[], Ramp, 256,
BatchNormalizationLayer[], Ramp, ReshapeLayer[{128, 1, 2}],
DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}],
BatchNormalizationLayer[], Ramp,
DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}],
BatchNormalizationLayer[], Ramp,
DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}],
BatchNormalizationLayer[], Ramp,
DeconvolutionLayer[1, {128, 1}, "Stride" -> {2, 1}],
LogisticSigmoid,
ConvolutionLayer[16, {128, 1}, "Stride" -> {2, 1}],
BatchNormalizationLayer[], Ramp,
ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}],
BatchNormalizationLayer[], Ramp,
ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}],
BatchNormalizationLayer[], Ramp,
ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}],
BatchNormalizationLayer[], Ramp, CatenateLayer[],
CatenateLayer[], CatenateLayer[],
CatenateLayer[]}, {NetPort["noise"] ->
1, NetPort["prev"] -> 19,
19 -> 20 ->
21 -> 22 -> 23 -> 24 -> 25 -> 26 -> 27 -> 28 -> 29 -> 30,
1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7, {7, 30} -> 31,
31 -> 8 -> 9 -> 10, {10, 27} -> 32,
32 -> 11 -> 12 -> 13, {13, 24} -> 33,
33 -> 14 -> 15 -> 16, {16, 21} -> 34, 34 -> 17 -> 18},
"noise" -> {100}, "prev" -> {1, 128, 16}
]
![enter image description here][9]
I create a discriminator without BatchNormalizationLayer and LogisticSigmoid, because I use a [Wasserstein GAN][10], which makes the training easier to stabilize.
discriminator = NetGraph[{
ConvolutionLayer[64, {89, 4}, "Stride" -> {1, 1}], Ramp,
ConvolutionLayer[64, {1, 4}, "Stride" -> {1, 1}], Ramp,
ConvolutionLayer[16, {1, 4}, "Stride" -> {1, 1}], Ramp,
1},
{1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7}, "Input" -> {1, 128, 16}
]
![enter image description here][11]
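For reference, the Wasserstein GAN objective from the cited paper can be written as follows (with generator $G$, critic $D$, and the critic's weights clipped to keep it approximately Lipschitz):

```latex
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D(x)\big] \;-\; \mathbb{E}_{z \sim p(z)}\big[D(G(z))\big]
```

This appears to be what the network below implements: the "scale" layer with scaling {-1, 1} forms D(real) - D(fake), and the "WeightClipping" option passed to NetTrain performs the clipping.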
I create the Wasserstein GAN network.
ganNet = NetInitialize[NetGraph[<|"gen" -> generator,
"discrimop" -> NetMapOperator[discriminator],
"cat" -> CatenateLayer[],
"reshape" -> ReshapeLayer[{2, 1, 128, 16}],
"flat" -> ReshapeLayer[{2}],
"scale" -> ConstantTimesLayer["Scaling" -> {-1, 1}],
"total" -> SummationLayer[]|>,
{{NetPort["noise"], NetPort["prev"]} -> "gen" -> "cat",
NetPort["Input"] -> "cat",
"cat" ->
"reshape" -> "discrimop" -> "flat" -> "scale" -> "total"},
"Input" -> {1, 128, 16}]]
![enter image description here][12]
**NetTrain**
-----------------------
I train using the training data created above. Following the Wasserstein GAN paper, I use RMSProp as the NetTrain method. Training takes about one hour on a GPU.
net = NetTrain[ganNet, trainingData, All, LossFunction -> "Output",
Method -> {"RMSProp", "LearningRate" -> 0.00005,
"WeightClipping" -> {"discrimop" -> 0.01}},
LearningRateMultipliers -> {"scale" -> 0, "gen" -> -0.2},
TargetDevice -> "GPU", BatchSize -> batchsize,
MaxTrainingRounds -> 50000]
![enter image description here][13]
**Create MIDI**
-----------------------
I create image data for 16 bars using the generator of the trained network.
bars = {};
newbar = Image[ConstantArray[0, {1, 128, 16}]];
For[i = 1, i < 17, i++,
noise1 = RandomReal[NormalDistribution[0, 1], {randomDim}];
prev1 = {ImageData[newbar]};
newbar =
NetDecoder[{"Image", "Grayscale"}][
NetExtract[net["TrainedNet"], "gen"][<|"noise" -> noise1,
"prev" -> prev1|>]];
AppendTo[bars, newbar]
]
bars
![enter image description here][14]
Because images generated by a Wasserstein GAN tend to be blurred, I keep only the pixel with the maximum value in each column of the image. This cleans up the images.
clearbar[bar_, threshold_] := Module[{i, barx, col, max},
barx = ConstantArray[0, {128, 16}];
col = Transpose[bar // ImageData];
For[i = 1, i < 17, i++,
max = Max[col[[i]]];
If[max >= threshold,
barx[[First@Position[col[[i]], max, 1], i]] = 1]
];
Image[barx]
]
bars2 = clearbar[#, 0.1] & /@ bars
![enter image description here][15]
I convert the images to SoundNote expressions and concatenate identical consecutive pitches.
number2pitchrule = Reverse /@ pitch2numberrule;
images2soundnote[img_, start_] :=
SoundNote[(129 - #[[2]]) /.
number2pitchrule, {(#[[1]] - 1)*note16, #[[1]]*note16} + start,
"ElectricBass", SoundVolume -> 1] & /@
Sort@(Reverse /@ Position[(img // ImageData) /. (1 -> 1.), 1.])
snjoinrule = {x___, SoundNote[s_, {t_, u_}, v_, w_],
SoundNote[s_, {u_, z_}, v_, w_], y___} -> {x,
SoundNote[s, {t, z}, v, w], y};
I generate the music and attach its mp3 file.
Sound[Flatten@
MapIndexed[(images2soundnote[#1, note16*16*(First[#2] - 1)] //.
snjoinrule) &, bars2]]
![enter image description here][16]
**Conclusion**
-----------------------
I tried music generation with a GAN. I am not satisfied with the result; I think the causes are various: poor training data, too little training time, etc.
Jaco is gone. I hope neural networks will one day be able to express Jaco's bass.
[1]: https://arxiv.org/abs/1703.10847
[2]: https://en.wikipedia.org/wiki/Jaco_Pastorius
[3]: http://www.bock-for-pastorius.de/midi.htm
[4]: http://community.wolfram.com//c/portal/getImageAttachment?filename=317901.jpg&userId=1013863
[5]: http://community.wolfram.com//c/portal/getImageAttachment?filename=567502.jpg&userId=1013863
[6]: http://community.wolfram.com//c/portal/getImageAttachment?filename=476803.jpg&userId=1013863
[7]: http://community.wolfram.com//c/portal/getImageAttachment?filename=744004.jpg&userId=1013863
[8]: http://community.wolfram.com//c/portal/getImageAttachment?filename=586405.jpg&userId=1013863
[9]: http://community.wolfram.com//c/portal/getImageAttachment?filename=707106.jpg&userId=1013863
[10]: https://arxiv.org/abs/1701.07875
[11]: http://community.wolfram.com//c/portal/getImageAttachment?filename=435507.jpg&userId=1013863
[12]: http://community.wolfram.com//c/portal/getImageAttachment?filename=170508.jpg&userId=1013863
[13]: http://community.wolfram.com//c/portal/getImageAttachment?filename=324809.jpg&userId=1013863
[14]: http://community.wolfram.com//c/portal/getImageAttachment?filename=965210.jpg&userId=1013863
[15]: http://community.wolfram.com//c/portal/getImageAttachment?filename=706311.jpg&userId=1013863
[16]: http://community.wolfram.com//c/portal/getImageAttachment?filename=177112.jpg&userId=1013863
Kotaro Okazaki, 2018-09-02T02:30:04Z

[Event] Shanghai User Meetup Review
http://community.wolfram.com/groups/-/m/t/1450141
*All notebooks used in the presentation can be downloaded at the end of the post.*
----------
The idea of this post is to encourage our lovely users to share their experience with Wolfram products in local meetup groups, building friendship and partnership within our community.
On Saturday 9/8/2018, WRI developer Mr. Shenghui Yang hosted a 12-person private Mathematica user panel to discuss the latest R&D achievements of Wolfram Language V11.1, 11.2, and 11.3, including
- Updates and Improvements for Geo system and Entity
- Neural Network in V11.3
- Wolfram Cloud user interface and deployment
- Several appealing examples of Mathematica dynamic feature in K-12 teaching project
![lecturing][1]
![beginning][2]
## Geo system ##
To make Wolfram Language features more accessible and relatable to our domestic users, Shenghui mixed elements of his real life into the Wolfram Language examples. The whole presentation became daily-life storytelling built on the Wolfram Language knowledge base.
The W|A command-line interface briefly describes the weather conditions on the day of this event:
![weather][3]
GeoMagneticData and GeoGravityModelData demonstrate important geophysical properties of Shanghai at the moment of the presentation ;-) No need to worry about any anomalies
![geodata][4]
GeoPosition with a customized GeoMarker visualizes the location of this event along the riverbank of the Yangtze
![marker][5]
GeoDistance, GeoPath, and several powerful projection options show our users how Wolfram headquarters relates to the meeting place. One of roughly 530 projection types is used in the example.
In[]:= GeoProjectionData["LambertAzimuthal"]
Out[]= {LambertAzimuthal,{Centering->{0,0},GridOrigin->{0,0},ReferenceModel->1}}

In[]:= GeoProjectionData[]//Short
Out[]= {Airy,Aitoff,Albers,AmericanPolyconic,ApianI,<<525>>,WinkelTripel}
![path][6]
GeoArea + GeoPosition: after marking the places the host visits most frequently in Shanghai, the marks form a large triangle. Combined with EntityValue and related functions, it is easy to extract the ratio of the triangle's area to that of Shanghai
![area][7]
GeoPath and TravelDirectionsData also accurately reported how long it takes to route through and visit all three marked places
![travel][8]
Finally, Shenghui mentioned that this event was hosted in a nice tea house once owned by Du Yuesheng, the Shanghai-born mob king and "Godfather of the Far East" of the Chiang Kai-shek era. Related background information can be retrieved both with the built-in Entity functions and via ExternalService with BingSearch V5
![history][9]
![bing][10]
## Discussion on K-12 Math Topics ##
This section is aimed specifically at users in the K-12 education industry, and at parents of kids in this academic interval who are looking for new ways to help their kids understand school material.
Shenghui and several local users reached out to domestic teachers in public and private schools, ranging from elite to mid-level.
Real test problems were collected for the demo. The audience was given a brief moment to think about the challenging problems before seeing the notebook with the solution, which uses Mathematica's strong built-in visualization, dynamic, and CloudDeploy features. One of the most stressful and painful problems in current domestic K-12 math education is that students must take math-olympiad-level exams for middle and high school admission. Most kids have no choice but to memorize hard-coded tricks for solving these tricky problems in a short time; the lack of understanding and intuitive explanation makes the process even more challenging. The host brought a new vision to these problems via graphical presentation.
Here is an example of the non-stop trains problem with a graphical explanation (a 10th-grade math problem). The question asks for the distance between each crossing point. The demo is designed to help students understand the physical process and solve the problem by hand in the exam, rather than hand them a Mathematica solution
![question][11]
![solution][12]
## Neural Network and AI ##
The presentation is based on the updated version of [Taliesin's][13] [notebook][14] and demo session on [YouTube][15] (some NN layers' names were updated in V11.3, e.g. DotPlusLayer -> LinearLayer). The examples are fully tested in the attached notebook for V11.3. Though the topic is quite involved for first-time users, the audience was eager to learn Wolfram Language. Shenghui and his college roommate, a [Tencent AI Lab][16] senior researcher and veteran Mathematica user, collaboratively initiated a bi-weekly online discussion for domestic Mathematica users. The one-hour AI-topic paper-reading session aims to familiarize users with the basic NN layers in Wolfram Language and with the different networks available in the [Wolfram Neural Network Repository][17].
[1]: http://community.wolfram.com//c/portal/getImageAttachment?filename=1.jpg&userId=23928
[2]: http://community.wolfram.com//c/portal/getImageAttachment?filename=2.jpg&userId=23928
[3]: http://community.wolfram.com//c/portal/getImageAttachment?filename=3.png&userId=23928
[4]: http://community.wolfram.com//c/portal/getImageAttachment?filename=4.png&userId=23928
[5]: http://community.wolfram.com//c/portal/getImageAttachment?filename=5.png&userId=23928
[6]: http://community.wolfram.com//c/portal/getImageAttachment?filename=6.png&userId=23928
[7]: http://community.wolfram.com//c/portal/getImageAttachment?filename=7.png&userId=23928
[8]: http://community.wolfram.com//c/portal/getImageAttachment?filename=8.png&userId=23928
[9]: http://community.wolfram.com//c/portal/getImageAttachment?filename=9.png&userId=23928
[10]: http://community.wolfram.com//c/portal/getImageAttachment?filename=10.png&userId=23928
[11]: http://community.wolfram.com//c/portal/getImageAttachment?filename=11.png&userId=23928
[12]: http://community.wolfram.com//c/portal/getImageAttachment?filename=12.png&userId=23928
[13]: https://twitter.com/taliesinb
[14]: https://wolfr.am/gLSyxCEE
[15]: https://www.youtube.com/watch?v=FnpqI4REiak
[16]: https://ai.tencent.com/ailab/index.html
[17]: https://resources.wolframcloud.com/NeuralNetRepository/
Shenghui Yang, 2018-09-11T14:30:05Z

ImageAugmentationLayer on image and target mask
http://community.wolfram.com/groups/-/m/t/1445573
Hi, I'd like to use ImageAugmentationLayer in my binary image segmentation neural network. However, it seems like I can't get the ImageAugmentationLayer to do exactly the same transform on my input image as on my target mask. Is there a hidden way to do this that's not mentioned in the docs? It seems like every invocation of the layer uses a new random crop, but I need to do the _exact same_ random crop on pairs of images.
Cheers!

Carl Lange, 2018-09-09T12:40:13Z

[WSS18] Reinforcement Q-Learning for Atari Games
http://community.wolfram.com/groups/-/m/t/1380007
## Introduction ##
This project aims to create a neural network agent that plays Atari games. The agent is trained with Q-learning and has no a priori knowledge of the game: it learns by playing and only being told when it loses.
##What is reinforcement learning? ##
Reinforcement learning is an area of machine learning inspired by behavioral psychology. The agent learns what to do, given a situation and a set of possible actions to choose from, in order to maximize a reward. To model a problem as a reinforcement learning problem, the game should have a set of states, a set of actions that transfer one state into another, and a reward associated with each state. The mathematical formulation of a reinforcement learning problem is called a Markov Decision Process (MDP).
![A visual representation of the reinforcement learning problem][1]
Image From:https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe
## Markov Decision Process ##
Before applying a Markov decision process to the problem, we need to make sure the problem satisfies the Markov property: the current state completely represents the state of the environment. In short, the future depends only on the present.
An MDP can be defined by **(S,A,R,P,γ)** where:
- S — set of possible states
- A — set of possible actions
- R — probability distribution of reward given (state, action) pair
- P — probability distribution over how likely any of the states is to
be the new states, given (state, action) pair. Also known as
transition probability.
- γ — reward discount factor
At the initial state $S_{0}$, the agent chooses action $A_{0}$. Then the environment gives reward $R_{0}=R(.|S_{0}, A_{0})$ and next state $S_{1}=P(.|S_{0},A_{0})$. This repeats until the environment terminates.
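The reward discount factor $\gamma$ enters through the discounted return that the agent tries to maximize; a value below 1 makes near-term rewards count more than distant ones:

```latex
G_{t} = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma \le 1
```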
##Value Network##
In value-based RL, the input is the current state (or a combination of a few recent states), and the output is the estimated future reward of every possible action in that state. The goal is to optimize the value function so that the predicted value is close to the actual reward. In the following graph, each number in a box represents the distance from that box to the goal.
![Value network example][2]
Image From:https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe
## Deep Q-Learning ##
Deep Q-learning is the algorithm I used to construct my agent. The basic idea of the Q function is to take a state and an action and output the corresponding sum of rewards until the end of the game. In deep Q-learning, we use a neural network as the Q function, so we can feed in one state and let the network generate predictions for all possible actions.
The Q function is stated as follows:
$Q(S_{t},A) = R_{t+1}+\gamma \max_{A'} Q(S_{t+1},A')$

where:

- $Q(S_{t},A)$ is the predicted sum of rewards given the current state and the selected action
- $R_{t+1}$ is the reward received after taking the action
- $\gamma$ is the discount factor
- $\max_{A'} Q(S_{t+1},A')$ is the best prediction for the next state
As we can see, given the current state and action, the Q function outputs the current reward plus the discounted maximum of the predictions for the next state. The function iteratively predicts the reward until the end of the game, where Q[S,A] = R. Therefore, we can calculate the loss as the difference between the prediction for the current state and the sum of the reward and the prediction for the next state. When the loss equals 0, the function perfectly predicts the rewards of all actions. In a sense, the Q function is predicting the future value of its own prediction. One might ask how such a function could ever converge. Indeed, it is usually hard to make it converge, but when it does, the performance is very good. There are many techniques that can speed up the convergence of the Q function; I describe a few that I used in this project.
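Concretely, the loss described above is the squared temporal-difference error (this is the role played by MeanSquaredLossLayer in the NetTrain call later in this post); in practice the target term is treated as a constant when differentiating:

```latex
L = \Big( \underbrace{R_{t+1} + \gamma \max_{A'} Q(S_{t+1}, A')}_{\text{target}} \; - \; Q(S_{t}, A_{t}) \Big)^{2}
```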
## Experience Replay ##
Experience replay means that the agent remembers the states it has experienced and learns from those experiences during training. It uses generated data more efficiently by learning from it multiple times, which matters when gaining experience is expensive for the agent. Since the Q function usually doesn't converge quickly, many outcomes of successive experiences are similar, so multiple passes over the same data are useful.
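As a minimal sketch (the helper names `addExperience` and `sampleBatch` are hypothetical, not part of this post's actual code), a replay buffer only needs to append new experiences, truncate to a maximum size, and sample random batches:

```mathematica
(* hypothetical sketch of an experience-replay buffer *)
addExperience[buffer_List, exp_, maxSize_Integer] :=
  With[{b = Append[buffer, exp]},
   If[Length[b] > maxSize, Take[b, -maxSize], b]]

sampleBatch[buffer_List, n_Integer] := RandomChoice[buffer, n]
```

The generator code below maintains the same invariants for its `processed` association.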
## Decaying Random Factor ##
The random factor is the probability that the agent chooses a random action instead of the best predicted action. It lets the agent start as a random player, increasing the diversity of the samples. The random factor decreases as more games are played, so the agent is increasingly reinforced on its own action pattern.
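The schedule used later in the code (`Power[randomDiscount, #AbsoluteBatch]`, clipped from below by `Max[rand, 0.1]`) amounts to:

```latex
\epsilon_{t} = \max\!\left(0.1,\; d^{\,t}\right), \qquad 0 < d < 1
```

where $d$ is the decay rate (`randomDiscount` here) and $t$ is the batch number.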
## Combine Multiple Observations As Input ##
The following image shows a single frame taken from the Atari game Breakout. From this image, the agent can capture the location of the ball, the location of the paddle, and so on, but some important information is missing. If you played as the agent and only this image were shown to you, what action would you choose? Feel something is missing? Is the ball going right or left? Is the ball going up or down?
![breakout frame1][3]
Generated Using openAI Gym
The following images are two consecutive frames taken from Breakout. From these two images, the agent can capture the direction and the speed of the ball. Many people tend to forget this, since processing recent memories while playing a game is second nature to us, but not to a reinforcement agent.
![breakout frame1][4]![frame 2][5]
Generated Using openAI Gym
## Agent Play in CartPole environment ##
The main environment in which the agent is trained and tested is the CartPole environment. It consists of two movable parts: the cart, which is controlled by the agent and has two possible actions in every state (move left or right), and the pole. The environment simulates the effect of gravity on the pole, which makes it fall to the left or right depending on its orientation relative to the horizon. For the environment to be considered solved, the average number of steps the agent survives over 100 games must exceed 195. The following graph is a visual representation of the environment: the blue rectangle represents the pole, the black box is the cart, and the black line is the horizon.
![cart pole sample][6]
First, let's create an environment
$env = RLEnvironmentCreate["WLCartPole"]
Then, initialize a network for this environment and a generator
policyNet =
NetInitialize@
NetChain[{LinearLayer[128], Tanh, LinearLayer[128], Tanh,
LinearLayer[2]}, "Input" -> 8,
"Output" -> NetDecoder[{"Class", {0, 1}}]];
generator := creatGenerator[$env, 20, 10000, False, 0.98, 1000, 0.95, False]
The generator function plays the game and generates input-output pairs to train the network.
Inside the generator, the replay buffer (processed) is initialized; the reward list is used to record the performance, and best records the peak performance.
If[#AbsoluteBatch == 0,
processed = <|"action"->{},"observation"->{},"next"->{},"reward"->{}|>;
$rewardList = {};
$env=env;
best = 0;
];
Then environment data are generated by the game function and preprocessed. At the start of training, the generator produces more data to fill the replay buffer.
If[#AbsoluteBatch == 0,
experience = preprocess[game[start,maxEp,#Net, render, Power[randomDiscount,#AbsoluteBatch], $env], nor]
,
experience = preprocess[game[1,maxEp,#Net, render, Power[randomDiscount,#AbsoluteBatch],$env], nor]
];
The game function is below. It joins the current observation with the previous observation to form the input to the network.
game[ep_Integer,st_Integer,net_NetChain,render_, rand_, $env_, end_:Function[False]]:= Module[{
states, list,next,observation, punish,choiceSpace,
state,ob,ac,re,action
},
choiceSpace = NetExtract[net,"Output"][["Labels"]];
states = <|"observation"->{},"action"->{},"reward"->{},"next"->{}|>;
Do[
state["Observation"] = RLEnvironmentReset[$env]; (* reset every episode *)
ob = {};
ac = {};
re = {};
next = {};
Do[
observation = {};
observation = Join[observation,state["Observation"]];
If[ob=={},
observation = Join[observation,state["Observation"]]
,
observation = Join[observation, Last[ob][[;;Length[state["Observation"]]]]]
];
action = If[RandomReal[]<=Max[rand,0.1],
RandomChoice[choiceSpace]
,
net[observation]
];
(*Print[action];*)
AppendTo[ob, observation];
AppendTo[ac, action];
state = RLEnvironmentStep[$env, action, render];
If[Or[state["Done"], end[state]],
punish = - Max[Values[net[observation,"Probabilities"]]] - 1;
AppendTo[re, punish];
AppendTo[next, observation];
Break[]
,
AppendTo[re, state["Reward"]];
observation = state["Observation"];
observation = Join[observation, ob[[-1]][[;;Length[state["Observation"]]]]];
AppendTo[next, observation];
];
,
{step, st}];
AppendTo[states["observation"], ob];
AppendTo[states["action"], ac];
AppendTo[states["reward"], re];
AppendTo[states["next"], next];
,
{episode,ep}
];
(* close the $environment when done *)
states
]
The preprocess function flattens the input and has an option to normalize the observations:
preprocess[x_, nor_:False] := Module[{result},(
result = <||>;
result["action"] = Flatten[x["action"]];
If[nor,
result["observation"] = N[Normalize/@Flatten[x["observation"],1]];
result["next"] = N[Normalize/@Flatten[x["next"],1]];
,
result["observation"] = Flatten[x["observation"],1];
result["next"] = Flatten[x["next"],1];
];
result["reward"] = Flatten[x["reward"]];
result
)]
Let's continue with the generator. After getting the data from the game, the generator measures and records the performance:
NotebookDelete[temp];
reward = Length[experience["action"]];
AppendTo[$rewardList,reward];
temp=PrintTemporary[reward];
Record the net with the best performance:
If[reward>best,best = reward;bestNet = #Net];
Add this experience to the replay buffer:
AppendTo[processed["action"],#]&/@experience["action"];
AppendTo[processed["observation"],#]&/@experience["observation"];
AppendTo[processed["next"],#]&/@experience["next"];
AppendTo[processed["reward"],#]&/@experience["reward"];
Make sure the total size of the replay buffer does not exceed the limit:
len = Length[processed["action"]] - replaySize;
If[len > 0,
processed["action"] = processed["action"][[len;;]];
processed["observation"] = processed["observation"][[len;;]];
processed["next"] = processed["next"][[len;;]];
processed["reward"] = processed["reward"][[len;;]];
];
Add the input of the network to the result:
pos = RandomInteger[{1,Length[processed["action"]]},#BatchSize];
result = <||>;
result["Input"] = processed["observation"][[pos]];
Calculate the output based on the next state and reward, and add it to the result:
predictionsOfCurrentObservation = Values[#Net[processed["observation"][[pos]],"Probabilities"]];
rewardsOfAction = processed["reward"][[pos]];
maxPredictionsOfNextObservation = gamma*Max[Values[#]]&/@#Net[processed["next"][[pos]],"Probabilities"];
temp = rewardsOfAction + maxPredictionsOfNextObservation;
MapIndexed[
(predictionsOfCurrentObservation[[First@#2,(#1+1)]]=temp[[First@#2]])&,(processed["action"][[pos]]-First[NetExtract[#Net,"Output"][["Labels"]]])
];
result["Output"] = predictionsOfCurrentObservation;
result
In the end, we can start training
trained =
NetTrain[policyNet, generator,
LossFunction -> MeanSquaredLossLayer[], BatchSize -> 32,
MaxTrainingRounds -> 2000]
## Performance of the agent ##
![enter image description here][7]
The graph above shows the performance of the agent over 1000 games in the CartPole environment. The agent starts with random play, surviving only a small number of steps per game. Performance stays low until about game 800, then starts to increase sharply. At the end of training, within 4 games the performance jumps from 3k to 10k, the maximum number of steps per game. This shows that although the Q function is hard to make converge, when it does converge the performance is very good.
##Future Directions##
The current agent uses classical DQN as its main structure. Other techniques like Noisy Nets, DDQN, Prioritized Replay, etc. can help the Q function converge in a shorter time. Other algorithms, such as the Rainbow algorithm, which is based on Q-learning, will be the next step for this project.
The code can be found at the [GitHub link][8].
[1]: http://community.wolfram.com//c/portal/getImageAttachment?filename=rl.png&userId=1363029
[2]: http://community.wolfram.com//c/portal/getImageAttachment?filename=vn.png&userId=1363029
[3]: http://community.wolfram.com//c/portal/getImageAttachment?filename=breakout1.png&userId=1363029
[4]: http://community.wolfram.com//c/portal/getImageAttachment?filename=breakout1.png&userId=1363029
[5]: http://community.wolfram.com//c/portal/getImageAttachment?filename=breakout2.png&userId=1363029
[6]: http://community.wolfram.com//c/portal/getImageAttachment?filename=cp.png&userId=1363029
[7]: http://community.wolfram.com//c/portal/getImageAttachment?filename=performance.png&userId=1363029
[8]: https://github.com/ianfanx/wss2018Project
Ian Fan, 2018-07-11T20:52:09Z