1701.07274

May 4, 2018 | Author: Diego Alejandro Gomez Mosquera | Category: Deep Learning, Artificial Neural Network, Machine Learning, Estimation Theory, Systems Science


Comments



Description

D EEP R EINFORCEMENT L EARNING : A N OVERVIEWYuxi Li ([email protected]) A BSTRACT We give an overview of recent exciting achievements of deep reinforcement learn- ing (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and arXiv:1701.07274v5 [cs.LG] 15 Sep 2017 reinforcement learning. Next we discuss core RL elements, including value func- tion, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including at- tention and memory, unsupervised learning, transfer learning, multi-agent RL, hi- erarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, In- dustry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions. This is the first overview about deep reinforcement learning publicly available online. It is comprehensive. Comments and criticisms are welcome. 1 C ONTENTS 1 Introduction 5 2 Background 6 2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.3 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.4 Multi-step Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.5 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.6 Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.7 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.8 RL Parlance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.9 Brief Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Core Elements 14 3.1 Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.1 Deep Q-Network (DQN) . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.2 Double DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1.3 Prioritized Experience Replay . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.4 Dueling Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.5 More DQN Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.3 Combining Policy Gradient with Off-Policy RL . . . . . . . . . . . . . . . 19 3.3 Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.6 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4 Important Mechanisms 22 4.1 Attention and Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2.1 Horde . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2.2 Unsupervised Auxiliary Learning . . . . . . . . . . . . . . . . . . . . . . 23 4.2.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . 23 2 4.3 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.4 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 24 4.5 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 25 4.6 Learning to Learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5 Applications 26 5.1 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.1.1 Perfect Information Board Games . . . . . . . . . . . . . . . . . . . . . . 26 5.1.2 Imperfect Information Board Games . . . . . . . . . . . . . . . . . . . . . 28 5.1.3 Video Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.2 Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2.1 Guided Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2.2 Learn to Navigate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.3 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3.1 Dialogue Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3.2 Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.3.3 Text Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.4 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.5 Neural Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.6 Business Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.7 Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.8 Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.9 Industry 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.10 Smart Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.11 Intelligent Transportation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.12 Computer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6 More Topics 36 7 Resources 37 7.1 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 7.2 More Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 7.3 Surveys and Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 7.4 Courses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 7.5 Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 7.6 Conferences, Journals and Workshops . . . . . . . . . . . . . . . . . . . . . . . . 39 7.7 Blogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 7.8 Testbeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 7.9 Algorithm Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3 8 Brief Summary 42 9 Discussions 44 4 . (2016) are recent deep learning books. i. and strong financial support. We present a list of topics not reviewed yet in Section 6.0 in Section 5.11. even made a formula: artificial intelligence = reinforcement learning + deep learning (Silver. robotics in Section 5. in the breakthroughs and novel architectures and applications discussed above. including games in Sec- tion 5. we have been witnessing the renaissance of reinforcement learning (Krakovsky. including value function in Section 3. new algorithmic techniques. It covers RL fundamentals and reflects new progress.1. value iteration networks (Tamar et al. transfer learning in Section 4. 2017).. 2016). unsupervised reinforcement and auxiliary learning (Jaderberg et al. conferences. especially. or deep neural networks.9. healthcare in Section 5. 2012.. smart grid in Section 5. and novel ar- chitectures and applications. hierarchical RL in Section 4. Deng and Dong (2014) and Goodfellow et al. 2016b).g. 1996. asynchronous methods (Mnih et al. The integration of reinforcement learning and neural networks has a long history (Sutton and Barto. 2015) and AlphaGo (Silver et al.2.1.. tutorials. spoken dialogue systems (Su et al.3... With recent exciting achievements of deep learning (LeCun et al.5. including attention and memory in Section 4.1. in deep Q-network.12. e. natural language processing. dueling network architectures (Wang et al. e.e. dual learning for machine translation (He et al. After that. like differentiable neural computer (Graves et al. benefiting from big data. it is Sutton and Barto’s RL book (Sutton and Barto. and exploration in Section 3. etc. We invent a sentence: Deep reinforcement learning is artificial intelligence. 1996.. learning to learn in Section 4. Feature engineering used to be done manually and is usually time- consuming. deep reinforcement learning (deep RL). and engineering (Sutton and Barto. reports.. journals and workshops. has been prevailing in reinforcement learning in the last several years.3. Hastie et al. Next we discuss core RL elements. and Murphy (2012) 5 .3. online courses. Goodfellow et al. 2016). We have been witnessing break- throughs. 2016). and. 2017. Deep learning. Szepesvári..5. we discuss various RL applications. in- telligent transportation systems in Section 5. 2017).5. in games.2. powerful computation. Then we discuss important mechanisms for RL. (2009). Bertsekas and Tsitsiklis. distributed representations exploit the hierar- chical composition of factors in data to combat the exponential challenges of the curse of dimen- sionality. mature software packages and architectures. 2016a). so that reliance on domain knowledge is significantly reduced or even removed. reward in Section 3.1 I NTRODUCTION Reinforcement learning (RL) is about an agent interacting with the environment.. 2015. plan- ning in Section 3. 2nd edition in preparation. Deep learning and reinforcement learning. Bishop (2011)...g. In Section 7.10. multi-agent RL in Section 4. and computer systems in Section 5. 2010. the major contributor of AlphaGo (Silver et al. David Silver. 2011). deep learn- ing and reinforcement learning in Section 2. natural language processing in Section 5. 2017. finance in Section 5. and open sources. Powell.4. 2016)..6. and close with discussions in Section 9. 2016a). 2017. First we discuss background of machine learning.. 2016). unsupervised learning in Section 4. as well as in psychology and neuroscience. for sequential decision making problems in a wide range of fields in both natural and social sciences. learning an optimal policy. we list a collection of RL resources including books. AlphaGo. give a brief summary in Section 8. neural architecture design (Zoph and Le. Bertsekas.6. 2016a). policy in Section 3. Industry 4. Why has deep learning been helping reinforcement learning make so many and so enormous achieve- ments? Representation learning with deep learning enables automatic feature engineering and end- to-end learning through gradient descent. business management in Section 5. surveys. 2016). information extraction (Narasimhan et al.4..8. Generality.. The outline of this overview follows.4. being selected as one of the MIT Technology Review 10 Breakthrough Technologies in 2013 and 2017 respectively. blogs. the combination of deep neural networks and reinforcement learning. Mirowski et al. 1998. model in Section 3.6. over-specified. like deep Q-network (Mnih et al. and generative adversarial imitation learning (Ho and Ermon. 2016). computer vision in Section 5. robotics. expressiveness and flexibility of deep neural networks make some tasks easier or possible. will play their crucial role in achieving artificial general intelligence.. Bertsekas and Tsitsiklis. neural architecture design in Section 5. If picking a single RL resource.2. 2016). 2017). and incomplete. 2015).7. 2016b). etc. guided policy search (Levine et al. 2016a). policy gra- dient methods.. Deep. by trail and error. Schmidhuber.. Figure 1 illustrates the conceptual organization of the overview. In the future. (2013) gives an introduction to machine learn- ing. personalized web services.. Industry 4.0. hierarchical RL. we hope this overview would be helpful as a reference. unsupervised learning. we will also improve the depth by conducting deeper analysis of the issues involved and the papers discussed. In this version. robotics. The main readers of this overview would be those who want to get more familiar with deep re- inforcement learning. deep learn- ing (Goodfellow et al. The agent-environment interaction sits in the center. as well as new comers. James et al. For reinforcement learning experts. In this overview. we mainly focus on contemporary work in recent couple of years. Provost and Fawcett (2013) and Kuhn and Johnson (2013) discuss practical issues in machine learning applications. Comments and criticisms are welcome. and computer systems. policy. reward. Next come important mechanisms: attention and memory. and Simeone (2017) is a brief introduction to machine learning for engineers. We do not give detailed background introduction for machine learning and deep learning. and make slight effort for discussions of historical context. and exploration. ITS (intelligent transportation systems). smart grid. we briefly introduce concepts and fundamentals in machine learning. Instead. and learning to learn. for which the best material to consult is Sutton and Barto (2017). by no means complete. model. besides further refinements for the width. about core elements. We endeavour to provide as much relevant information as possible. 2016) and reinforcement learning (Sutton and Barto. Then come various appli- cations: games. Figure 1: Conceptual Organization of the Overview are popular machine learning textbooks. neural architecture design. healthcare. 2 BACKGROUND In this section. multi-agent RL. transfer learning. planning. NLP (natural language processing). and applications. 2017). we recommend 6 . computer vision. important mechanisms. we endeavour to provide a wide coverage of fundamental and contemporary RL issues. finance. around which are core elements: value function. 7 . but no supervised signals. like mean square error in regression and classification error rate. which differentiates machine learning from optimization. nat- ural language processing. Reinforcement learning is kin to optimal control (Bertsekas. A model is under-fitting if it can not achieve a low training error. clustering and density estimation. Unsupervised learning attempts to extract information from data without labels.1 M ACHINE L EARNING Machine learning is about learning from data and making predictions and/or decisions. and can be integrated with reinforcement learning. e. Training error measures the error on the training data. and testing subsets.. and in reinforcement learning. Usually we categorize machine learning as supervised. Littman (2015). measures the error on new input data. at the same time. there are no labeled data. As model capacity increases. or vari- ance of the estimator. and operations research and management (Powell. Machine learning is based on probability theory and statistics (Hastie et al. with low-dimensional. We find the optimal capacity to achieve low training error and small gap between train- ing error and generalization error. and becomes a critical ingredient for computer vision. is the basis for big data. and LeCun et al. 2004). simple models are preferred. In supervised learning. with respect to accuracy. usually for su- pervised or unsupervised learning. a model is over-fitting if the gap between training error and test error is large. Deep learning. VC dimension measures the capacity of a binary classifier. 2012). We try to find the optimal capacity point. No free lunch theorem states that there is no universally best model. to reduce the general- ization error. there are evaluative feedbacks. and is also related to psychology and neuroscience (Sutton and Barto. validation. far-sighted. 2017. consid- ering instant reward. sparse. (2015) for deep learning. and is evolving to be critical for all fields of AI. etc. Machine learning is a subset of artificial intelligence (AI). 2. and the gap be- tween training error and testing error small. of which under-fitting occurs on the left and over-fitting occurs on the right. However. considering long-term accumulative reward. An implication is that deep learning may not be the best model for some problems. while variance tends to increase. or best regularizor. A cost/loss function measures the model performance. We also collect relevant resources in Section 7. Training error and generalization error versus model capacity usually form a U-shape relationship. Generaliza- tion error.the following recent Nature/Science survey papers: Jordan and Mitchell (2015) for machine learn- ing. 2016). we recommend the textbook. A machine learning algorithm is composed of a dataset. or test error. and hyperparameters for model capacity and regularization. However. there are labeled data.. A dataset is divided into non-overlapping training. an optimization pro- cedure. There are model parameters. etc. in unsupervised learning. data mining. Representation learning is a classical type of unsupervised learning. with the same expressiveness. usually as a function approximator.. unsupervised.g. predictive modeling (Kuhn and Johnson. while variance measures the deviation of the estimator from the expected value. Sutton and Barto (2017). and independent representations. to keep the representation simpler or more accessible than the original data. but not training error. 2017). with categorical and numerical outputs respectively. A machine learning algorithm tries to make the training error.g. and reinforcement learning. 2013).. Occam’s Razor states that. 2008). is a particular machine learning scheme. 2009) and optimiza- tion (Boyd and Vandenberghe. 2011).. informa- tion retrieval (Manning et al. and the recent Nature survey paper. e. 2013). and a model (Goodfellow et al. minimizing which is an optimization problem. training feedforward networks or convolutional neural networks with supervised learning is a kind of representation learning. data science (Blei and Smyth. A model’s capacity measures the range of functions it can fit. yielding another U-shape relationship between generalization error versus model capacity. Regularization add a penalty term to the cost function. Supervised and unsupervised learning are usually one-shot. myopic. We cover some RL basics. Classification and regression are two types of supervised learning problems. or deep neural networks. for reinforcement learning. Bias measures the expected deviation of the estimator from the true value. Representation learning finds a representation to preserve as much informa- tion about the original data as possible. Provost and Fawcett. while reinforcement learning is sequential. a cost/loss function. robotics. bias tends to decrease. or equivalently. MLE is equivalent to minimizing KL divergence. Dis- tributed representation is a central idea in deep learning. rectified linear unit (ReLU). colour image. between input and output layers. when being unfolded in time of forward computation. or activation function. with gating mechanisms to manipulate information through recurrent cells. A RNN can be seen as a multilayer neural network with all layers sharing the same weights. 2. we have one or more hidden layers. to apply to the input of a unit. decision trees. pooling layers and fully connected layers. Importance sampling is a technique to estimate properties of a particular distribution. or more popular recently. Deep neural networks learn representations automatically from raw inputs to recover the compo- sitional hierarchies in many natural signals. Minimiz- ing KL divergence between two distributions corresponds to minimizing the cross-entropy between the distributions..Cross-validation is used to tune hyperparameters.g. support vector machines (SVMs). linear regression. i. distributed representations combat the exponential challenges of the curse of dimensionality.. and backpropagate gradients towards the input layer. motifs. Dropout is a regularization strategy to train an ensemble of sub-networks by removing non-output units randomly from the original network.e. CNNs are designed to process data with multiple arrays. We have weights on links between units from layer to layer. i. by samples from a different distribution. then we usually use nonlinear transformation.. Frequentist statistics estimates a single value. Gradient descent is a common approach to solve optimization problems. A recurrent neural network (RNN) is often used to process sequential inputs like speech and language.g. at output layer and each hidden layer. parts. and usually with minibatches.. Stochastic gradient descent extends gradient descent by working with a single sample each time. and local combinations of edges. as the weighted sum of units from the previous layer.e. e. to accelerate training by reducing internal covariate shift. In short. e. Long short term memory networks (LSTM) and gated recurrent unit (GRU) were proposed to address such issues. It is hard for RNN to store information for very long time and the gradient may vanish. with convolutional layers. A feedforward deep neural network or multilayer perceptron (MLP) is to map a set of input values to output values with a mathematical function formed by composing many simpler functions at each layer. pooling and the use of many layers. and characterizes variance by confidence interval. with hidden units to store history of past elements. benefit from the properties of such signals: local connections. and boosting.g. the change of parameters of previous layers will change each layer’s inputs distribution. In deep learning. and the inputs may be transformed with manual feature en- gineering before training. the hierarch of objects. and each feature may represent many inputs. Bayesian statistics considers the distribution of an estimate when making predictions and decisions. so that weights can be updated to optimize some loss function. we compute the input to each unit. The exponential advantages of deep. 2015). For issues like numerical underflow. After computations flow forward from input to output.. e. Batch normalization performs the normalization for each training mini-batch. in images. we can compute error derivatives backward. the dissimilarity between the empirical distribution defined by the training data and the model distribution. to lower the variance of the estimation. logistic regression. Maximum likelihood estimation (MLE) is a common approach to derive good estimation of param- eters. and video. maximization of likelihood becomes minimization of the negative log- likelihood (NLL). language. element by element. Gradient backpropagation or its variants can be used for training all deep neural networks mentioned above. which implies that many features may represent each input. For many machine learning algorithms. minimization of cross entropy.2 D EEP L EARNING Deep learning is in contrast to ”shallow” learning. and to select the optimal model. and are inspired by simple cells and complex cells in visual neuroscience (LeCun et al. A convolutional neural network (CNN) is a feedforward deep neural network. shared weights. such as logistic. At each layer except input layer. audio spectrogram. higher-level features are composed of lower-level ones. to obtain a new representation of the input from previous layer. to strike a balance between bias and variance. or when sampling from the distribution of interest is difficult.. 8 . the product in MLE is converted to summation to obtain negative log-likelihood (NLL). we have input layer and output layer. tanh. we use dynamic programming methods: policy evaluation to calculate value/action value function for a policy. following a policy π(at |st ).. qπ (s. receives a scalar reward rt . a)[r + γ maxa q∗ (s . a) is the maximum state value achievable by any policy for state s. When there is no model. 2014) with raw sentences for machine translation.3 R EINFORCEMENT L EARNING We provide background of reinforcement learning briefly in this section. a) = s0 . a) decomposes into the 0 0 0 Bellman equation: q∗ (s..r p(s0 . Seq2Seq (Sutskever et al. a game. we discuss value function. a mapping from state st to actions at . deep RL.r p(s0 . TD learning is a prediction problem. a) = maxπ qπ (s. but not on the past. 9 . is.. The action value P qπ (s. 2. To have a good understanding of deep reinforcement learning. a) is the maxi- P mum action value achievable byP any policy for state s and action a. at ) respectively.r p(s . It is not hard to extend it to continuous spaces. where α is a learning rate. future reward. P. discounted. an MDP.r p(s0 .The notion of end-to-end training refers to that a learning model uses raw inputs without manual feature engineering to generate outputs. defined by the 5-tuple (S. 2. AlexNet (Krizhevsky et al.. 2017) and Q- learning (Watkins and Dayan. γ). 2. The agent aims to maximize the expectation of such long term return from each state. An optimal action value function q∗ (s. a) = E[Rt |st = s. 2015) with raw pixels and score to play games. q∗ (s. r|s. it is essential to have a good understanding of rein- forcement learning first.3.g. with bootstrapping. a)[r + γvπ (s0 )]. a)[r + γ a0 π(a0 |s0 )qπ (s0 . and transitions to the next state st+1 . a RL environment can be a multi- armed bandit. etc. i. measur- ing how good each state. We denote an optimal 0 policy by π ∗ .3. The problem is set up in discrete state and action spaces. vπ (s) decomposes into the Bellman equation: vπ (s) = a π(a|s) s0 . a) = s0 . RL methods also work when the model is available. P∞ this process continues until the agent reaches a terminal state and then it restarts. which is the agent’s behavior.3 T EMPORAL D IFFERENCE L EARNING When a RL problem satisfies the Markov property. r|s. a) decomposes into the Bellman equation: qπ (s. in a model-free. The return Rt = k=0 γ k rt+k is the discounted. v∗ (s) decom- poses into the Bellman equation: v∗ (s) = maxa s0 . Additionally. When the system model is available.e. the future depends only on the current state and action. R.3. it is formulated as a Markov Decision Process (MDP). the agent receives a state st in a state space S and selects an action at from an action space A. In an episodic problem. a POMDP. value iteration and policy iteration for finding an optimal policy. 2. r|s. r|s. SARSA (Sutton and Barto. 1]. 2012) with raw pixels for image classification. Algorithm 1 presents the pseudo code for tabular TD learning. where 0 indicates it is based on one-step return. for reward function R(s. we resort to RL methods. a)[r + γv∗ (s0 )]. or model. accumulative. according to the environment dynamics. 1992) are also regarded as temporal difference learning. at = a] is the expected return for selecting action a in Pstate s and then fol- lowing policy π. The update rule is V (s) ← V (s) + α[r + γV (s0 ) − V (s)]. online.1 P ROBLEM S ETUP A RL agent interacts with an environment over time. accumulated reward with the discount factor γ ∈ (0.2 VALUE F UNCTION A value function is a prediction of the expected. TD learning (Sutton. and DQN (Mnih et al. At each time step t. After setting up the RL problem. a0 )].. function approximation. and r + γV (s0 ) − V (s) is called TD error. 1988) learns value function V (s) directly from experience with TD error. A. i.e. policy optimization. and close this section with a brief summary. The state value vπ (s) = E[Rt |st = s] is the expected Preturn for following policy π from state s. a) and state transition probability P(st+1 |st . and fully incremental way. or state-action pair. RL parlance. Temporal difference (TD) learning is central in RL. e. TD learning is usually refer to the learning methods for value function evaluation in Sutton (1988). it is tabular TD(0) learning.. Precisely. temporal difference learning. An optimal state value v∗ (s) = maxπ vπ (s) = P maxa qπ∗ (s. a )]. e. with the update rule. a)] s ← s0 . s0 a0 ← action for s0 derived by Q. like TD learning. e. and enable learning to be online and continual. Algorithm 2 presents the pseudo code for tabular SARSA. adapted from Sutton and Barto (2017) SARSA. a ← a0 end end Algorithm 2: SARSA. a) ← Q(s. From an optimal action value function.g. a0 ) − Q(s. observe r. a0 ) − Q(s. a)].. a) ← Q(s. since the target depends on the weights to be estimated. state s is not terminal do a ← action for s derived by Q. is common in RL. to 0 for all states for each episode do initialize state s for each step of episode. state s is not terminal do a ← action given by π for s take action a. Q learning refines the policy greedily with respect to action values by the max operator. a)]. tabular Q(0) learning. Q(s. Bootstrapping methods are usually faster to learn.Bootstrapping. e. and actor-critic. (next) state. like the TD update rule. a)] s ← s0 end end Algorithm 3: Q learning. Q-learning is an off-policy control method to find the optimal policy.g.. TD-learning. s0 Q(s. a) + α[r + γQ(s0 . set action value for terminal states as 0 for each episode do initialize state s for each step of episode.. precisely tabular SARSA(0). Input: the policy π to be evaluated Output: value function V initialize V arbitrarily. 10 .. we can derive an optimal policy. Algorithm 3 presents the pseudo code for Q learning. a0 ) − Q(s. set action value for terminal states as 0 for each episode do initialize state s for each step of episode.g. with the update rule. a) + α[r + γ maxa0 Q(s0 . state s is not terminal do a ← action for s derived by Q. -greedy take action a. 2017). -greedy Q(s. Q(s. Bootstrapping methods are not instances of true gradient decent. is an on-policy control method to find the optimal policy. The concept of semi-gradient descent is then introduced (Sutton and Barto. estimates state or action value based on subsequent esti- mates. to 0 for all states. e.. -greedy take action a. Q-learning learns action value function.. a) + α[r + γQ(s0 . observe r.g. e. reward. precisely. adapted from Sutton and Barto (2017) Output: action value function Q initialize Q arbitrarily. observe r. (next) action. a) + α[r + γ maxa0 Q(s0 . e. s0 V (s) ← V (s) + α[r + γV (s0 ) − V (s)] s ← s0 end end Algorithm 1: TD learning. a0 ) − Q(s. a) ← Q(s. Q-learning and SARSA converge under certain conditions. adapted from Sutton and Barto (2017) Output: action value function Q initialize Q arbitrarily.g. a) ← Q(s. action. Q learning. representing state.g. to 0 for all states. Function approximation is a way for generalization when the state and/or action spaces are large or continuous. 2017. w)]∇v̂(s. w) s ← s0 end end Algorithm 4: TD(0) with function approximation. v̂(terminal. 1]. 2017). local minima or complex non-linear function approximation are not. the integration of reinforcement learning and neural networks dated back a long time ago (Sutton and Barto. Eligibility trace is a short-term memory. 2015). The weight vector is a long-term memory. studied in the fields of machine learning. esp. 1996. and the weight vector is updated fol- lowing the update rule. bootstrapping. since dynamic programming suffers from divergence with function approximation... Input: the policy π to be evaluated Input: a differentiable value function v̂(s. We have TD learning and Q learning variants and Monte-Carlo approach with multi-step return in the forward view. e.2. incremental implementation. It is unclear what is the root cause for instability – each single factor mentioned above is not – there are still many open problems in off-policy learning (Sutton and Barto. and off-policy learning for freeing behaviour policy from target policy. lasting the whole duration of the system. 2017). All these three elements are necessary: function approximation for scalability and gener- alization. it is usually a concept in supervised learning. explo- ration. and delayed targets (Sutton and Barto. 11 . s0 w ← w + α[r + γv̂(s0 . Schmidhuber. instability and divergence may occur (Tsitsiklis and Van Roy. w)]∇v̂(s.STEP B OOTSTRAPPING The above algorithms are referred to as TD(0) and Q(0). since prediction alone can diverge. for prediction algorithms. and encounters new issues like nonstationarity. 2015). w) is the ap- proximate value function. resulting in TD(λ) and Q(λ) algorithms. learning with one-step return. (2017) made unification for multi-step TD control algorithms. 2.3. determines the estimated value. before the work of Deep Q-Network (Mnih et al. TD(1) is the same as the Monte Carlo approach.5 F UNCTION A PPROXIMATION We discuss the tabular cases above. function approximation.4 M ULTI . where λ ∈ [0.g. where a value function or a policy is stored in a tabular form. w) initialize value function weight w arbitrarily. ∇v̂(s. w) − v̂(s. and bootstrapping. 2017). TD(0). 2017). since linear function approximation can produce instabil- ity (Sutton. However. w). The eligibility trace from the backward view provides an online. w) − v̂(s. using eligibility traces and the decay parameter λ. Function approximation aims to generalize from examples of a function to construct an approximate of the entire function. which is called the deadly triad issue (Sutton and Barto. ·) = 0 Output: value function v̂(s.3. partially due to its desirable theoretical properties. Linear function approximation is a popular choice. state s is not terminal do a ← π(·|s) take action a. w is the value function weight vector. assists the learning pro- cess. De Asis et al. TD(1). What is the root cause for the instability? Learning or sampling are not. greedification. Bertsekas and Tsitsiklis. v̂(s. by affecting the weight vector. bootstrapping for computational and data efficiency. observe r. with Monte Carlo methods. w). usually lasting within an episode. 1997). function approximation in reinforcement learning usually treats each backup as a training example. or control are not. w ← w + α[r + γv̂(s0 . w) is the gradient of the approximate value function with respect to the weight vector. patten recognition. TD(λ) unifies one-step TD prediction. w = 0 for each episode do initialize state s for each step of episode. Algorithm 4 presents the pseudo code for TD(0) with function approximation. 2016). Eligibility trace helps with the issues of long-delayed rewards and non-Markov tasks (Sutton and Barto. adapted from Sutton and Barto (2017) When combining off-policy. and statistical curve fitting. 2009a. since Rt is an estimate of Q(at . Algorithm 5 presents the pseudo code for REINFORCE algorithm in the episodic case. with function approximation. LSPE (Nedić and Bertsekas. st ) = Q(at .Table 1 presents various algorithms that tackle various issues (Sutton. the critic updates action-value function parameters. Algorithms 2. Residual gradi- ent algorithms (Baird. θ) or 12 . TD fix point is then vπ = Bπ vπ . and the actor updates policy parameters. yet with squared time complexity. w). Algorithm 6 presents the pseudo code for one-step actor-critic algorithm in the episodic case.. 2016) are not presented here. policy-based methods opti- mize the policy π(a|s.3. st ). yet keeping its unbiasedness. Recall that Bellman equa- 0 0 tion for value function is vπ (s) = a π(a|s) s0 ..r p(s . Gradient-TD (Sutton et al. 2003) extended LSTD. See Sutton and Barto (2017) for more details. st ) − V (st ). 2005) learn action values in batch mode. REINFORCE (Williams. θ) (with function approximation) directly.6 P OLICY O PTIMIZATION In contrast to value-based methods like TD learning and Q-learning. Before explaining Table 1. w)] − v̂ π (s. yet being a semi-gradient method.3. 2014) methods are true gradient algorithms. ADP algorithms refer to dynamic programming algorithms like policy evaluation. r|s. Du et al. in the direction suggested by the critic. since they do not have theoretical guarantee. Least square temporal difference (LSTD) (Bradtke and Barto. we introduce P some P background definitions. a)[r + P P γv̂ π (s0 . Emphatic-TD (Sutton et al. Usually a baseline bt (st ) is subtracted from the return to reduce the variance of gradient estimate.. and made suggestions about their practical use. LSTD is data efficient. Using V (st ) as the baseline bt (st ). 2015) and A3C (Mnih et al.b. 1995) minimize Bellman error. White and White (2016) performed empirical comparisons of linear TD methods. P P 0 0 is defined as (Bπ v)(s) = a π(a|s) s0 .. 1992) is a policy gradient method. 2016). Fitted-Q algo- rithms (Ernst et al. the right side of Bellman equation with function approximation minus the left side. 2.r p(s0 . r|s. Riedmiller. a)[r + γvπ (s )]. perform SGD in the projected Bellman error (PBE). Bellman operator . Deep RL algorithms like Deep Q-Network (Mnih et al.r p(s .7 D EEP R EINFORCEMENT L EARNING We obtain deep reinforcement learning (deep RL) methods when we use deep neural networks to approximate any of the following component of reinforcement learning: value function. 1996) computes TD fix-point directly in batch mode.. Bellman error for the function approximation case is then a π(a|s) s0 . although they achieve stunning performance empirically. v̂(s. and value iteration. online X X X X converges to PBE = 0 X X X X X Table 1: RL Issues vs. we have the advantage func- tion A(at . θ)(Rt − bt (st )). updating θ in the direction of ∇θ log π(at |st . 2005. r|s. and update the parameters θ by gradient ascent on E[Rt ]. 2016) emphasizes some updates and de-emphasizes others by reweight- ing. Mah- mood et al. a)[r + γvπ (s )]. It can be written as Bπ vw − vw . policy iteration. converge robustly under off-policy training and non-linear function approximation. algorithm TD(λ) LSTD(λ) Residual GTD(λ) SARSA(λ) ADP LSPE(λ) Fitted-Q Gradient GQ(λ) linear computation X X X X nonlinear convergent X X X issue off-policy convergent X X X model-free. In actor-critic algorithms. to yield the gradient direction ∇θ log π(at |st .. θ)Rt . (2017) proposed variance reduction techniques for policy eval- uation to achieve fast convergence. Bellman error is the expectation of the TD error. improving computational efficiency. . w) = 0) w ← w + βδ∇w v̂(st . observe s0 . θ). The distinct difference between deep RL and ”shallow” RL is what function approximator is used. Planning constructs a value function or a policy with a model. w) w ← w + βδ∇w v̂(st . e.. a. θ) for each step t of episode 0. SARSA fits the action-value function to the current policy. adapted from Sutton and Barto (2017) q̂(s. a0 . SARSA evaluates the policy based on samples from the same policy. sT −1 . recent work like Deep Q-Network (Mnih et al. θ) end end Algorithm 5: REINFORCE with baseline (episodic). 2.e. instability and divergence may occur (Tsitsiklis and Van Roy. decision trees. β > 0 Output: policy π(a|s. v̂(s0 . 2015) and AlphaGo (Silver et al. When off-policy. decision trees. in particular.Input: policy π(a|s. tile coding and so on as the function approximator. maybe following an unrelated behavioural policy. w) − v̂(s. The control problem is to find the optimal policy. The prediction problem. adapted from Sutton and Barto (2017) Input: policy π(a|s. then refines the policy greedily with respect to action values. T − 1 do Gt ← return from step t δ ← Gt − v̂(st . and the parameters θ are the weight parameters in these models. β > 0 Output: policy π(a|s. θ). a shallow model. w) θ ← θ + αIδ∇θ logπ(at |st .. This is similar to the difference between deep learning and ”shallow” learning. When we use ”shallow” models.. rT . the first state of the episode I←1 for s is not terminal do a ∼ π(·|s. function approximation. and bootstrapping are combined together. 1997). w) Parameters: step sizes. an agent learns an optimal value function/policy. may be non-linear. we obtain ”shallow” RL. θ) initialize policy parameter θ and state-value weights w for true do initialize s. w) θ ← θ + αγ t δ∇θ logπ(at |st . i. θ) I ← γI s ← s0 end end Algorithm 6: Actor-Critic (episodic). Note. θ). e. w) Parameters: step sizes. In off-policy methods.3. or policy evaluation. On-policy methods evaluate or improve the behavioural policy. and model (state transition function and reward function). θ). Q- 13 . v̂(s. · · · . · · · . θ) initialize policy parameter θ and state-value weights w for true do generate an episode s0 .8 RL PARLANCE We explain some terms in RL parlance. policy π(a|s.g. like linear function. the parameters θ are the weights in deep neural networks.g. v̂(s. δ ← r + γv̂(s0 .. w) (if s0 is terminal. We usually utilize stochastic gradient descent to update weight parameters in deep RL. Here. non-linear function approximation. r .g. following π(·|·. aT −1 . is to compute the state or action value function for a policy. e. r1 . θ) take action a.. α > 0. However. 2016a) stabilized the learning and achieved outstanding results. α > 0. Reinforcement learning algorithms may be based on value func- tion and/or policy. the model (state transition function) is not known or learned from experience. and temporal difference (TD) learning (Sutton.b. policy in Section 3. 3. Control algorithms find optimal policies. 2016a) stabilized the learning and achieved stunning results.1 VALUE F UNCTION Value function is a fundamental concept in reinforcement learning.. Theoretical guarantee has been established for linear function approximation. we discuss core RL elements: value function in Section 3.4. P. 1992). DQN made several important contri- 14 . model-free or model-based. yet it has to explore the environment to find better actions. with value function and/or policy.9 B RIEF S UMMARY A RL problem is formulated as an MDP when the observation about the environment satisfies the Markov property. planning in Section 3. A RL problem may be formulated as a prediction. RL methods that use models are model-based methods. and about the depth of backups. 2016) and Du et al. When combining off-policy. algorithms like Deep Q-Network (Mnih et al. i. γ). on-policy or off-policy. R. An MDP is defined by the 5-tuple (S. Monte Carlo. Q-learning (Watkins and Dayan. e. either one-step return (TD(0) and dynamic pro- gramming) or multi-step return (TD(λ). with major components of value function. with function approximation or not. control or planning problem. The exploration-exploitation dilemma is about the agent needs to exploit the currently best action to maximize rewards greedily. In this section.5. In offline mode. 1988) and its extension. an estimate of state or action value is updated from subsequent estimates..g.2. With non-linear function ap- proximation. Gradient-TD (Sutton et al.. A central concept in RL is value function.3. or the system is non-stationary. in particular deep learning. Emphatic-TD (Sutton et al. Mahmood et al. In the following. Before DQN. with sample backups (TD and Monte Carlo) or full backups (dynamic programming and exhaustive search). and exhaustive search).e. In model-free methods. 2015). it is well known that RL is unstable or even divergent when action value function is approximated with a nonlinear function like neural networks.6. or batch mode. not necessarily fitting to the policy generating the data. training algorithms are executed on data acquired in sequence. policy and model. (2015) introduced Deep Q-Network (DQN) and ignited the field of deep RL. In online mode. 2015) and AlphaGo (Silver et al. (2017). and its extensions. the deadly triad issue (Sutton and Barto.. which is the focus of this overview. 2. 2009a. Temporal difference learning algorithms are fundamental for evaluating/predicting value functions. model in Section 3.. the policy Q-learning obtains is usually different from the policy that generates the samples.. are classical algo- rithms for learning state and action value functions respectively.. 3 C ORE E LEMENTS A RL agent executes a sequence of actions and observe states and rewards. function approximation. a recent breakthrough. Bellman equations are cornerstone for developing RL algorithms. Exploration-exploitation is a fundamental tradeoff in RL. and bootstrapping.3..1. reward in Section 3. and solution methods may be model-free or model-based. A. 2014). and exploration in Section 3. we focus on Deep Q-Network (Mnih et al. 3. 2017).learning attempts to find action values for the optimal policy directly. we face instability and divergence (Tsitsiklis and Van Roy. The notion of on-policy and off-policy can be understood as same-policy and different-policy. when the policy is not optimal yet. the agent learns with trail-and-error from experience explicitly. models are trained on the entire data set. With bootstrapping.1 D EEP Q-N ETWORK (DQN) Mnih et al. We present DQN pseudo code in Algorithm 7. 1997).1. θ − ) otherwise Perform a gradient descent step on (yj − Q(φj . aj . θt− ). rt . select at = arg maxa Q(φ(st ). θt ). where ytQ = rt+1 + γ max Q(st+1 . at . and outperforming previous algorithms and performing comparably to a human professional tester. (2016a) proposed Double DQN (D-DQN) to tackle the over-estimate problem in Q-learning. θt ))∇θt Q(st . 15 . set θ − = θ end end Algorithm 7: Deep Q-Nework (DQN). a D-DQN found better policies than DQN on Atari games. it is more likely to select over-estimated values. as well as in DQN.r. θt ). In standard Q-learning. i. 1992) and target network. van Hasselt et al. at .. a. (2016a) proposed to evaluate the greedy policy according to the online network. and results in over-optimistic value esti- mates.Input: the pixels and the game score Output: Q action value function (from which we obtain policy and select action) Initialize replay memory D Initialize action-value function Q with random weight θ Initialize target action-value function Q̂ with weights θ − = θ for episode = 1 to M do Initialize sequence s1 = {x1 } and preprocessed sequence φ1 = φ(s1 ) for t = 1 to T do  a random action with probability  Following -greedy policy. ytQ can be written as ytQ = rt+1 + γQ(st+1 . network architecture and hyper- parameters to perform well on many different tasks. arg max Q(st+1 . at . a so that the max operator uses the same values to both select and evaluate an action. but to use the target network to estimate its value. a0 .e. θ))2 w. arg max Q(st+1 .2 D OUBLE DQN van Hasselt et al. 2) designing an end-to-end RL ap- proach. at .com/research/dqn/. the network parameter θ // periodic update of target network Every C steps reset Q̂ = Q. a where θt is the parameter for online network and θt− is the parameter for target network. θt ). xt+1 and preprocess φt+1 = φ(st+1 ) Store transition (φt . For reference. at . replacing ytQ with ytD−DQN = rt+1 + γQ(st+1 . φt+1 ) in D // experience replay Sample random minibatch of transitions (φj . (2015) butions: 1) stabilize the training of action value function approximation with deep neural networks (CNN) using experience replay (Lin. at . so that only minimal domain knowledge is required. with only the pixels and the game score as inputs. 2013). adapted from Mnih et al.1. 49 Atari games (Bellemare et al. As a conse- quence. This can be achieved with a minor change to the DQN algorithm. See Deepmind’s description of DQN at https://deepmind. a. i. θt ). the parameters are updated as follows: θt+1 = θt + α(ytQ − Q(st . See Chapter 16 in Sutton and Barto (2017) for a detailed and intuitive description of Deep Q- Network. φj+1 ) from D  rj if episode terminates at step j + 1 Set yj = rj + γ maxa0 Q̂(φj+1 . aj .. θt ). 3.e. rj .t.. 3) training a flexible network with the same algorithm. θ) otherwise Execute action ai in emulator and observe reward rt and image xt+1 Set st+1 = st . θ. 2016). benchmark results (Duan et al. a0 . Reactor (Gruslys et al. using importance sampling to avoid the bias in the update distribution.3. β) + A(s. (2016) attempted to understand the success of DQN and reproduced results with shallow RL. and to improve accuracy over DQN. θ.3 P RIORITIZED E XPERIENCE R EPLAY In DQN. to estimate value function and advantage function separately. 16 . • Liang et al. on Atari games. to learn more efficiently. 3. a. • He et al. (2016b) proposed to replace max operator with average as following for better stability.5 M ORE DQN E XTENSIONS DQN has been receiving much attention. for Retrace-actor. Usually we use the following to combine V (s) and A(s. θ.. In dueling architecture. and its actor-critic extension. experience transitions are uniformly sampled from the replay memory. trust region policy optimization (Schulman et al. 2016). regardless of the significance of experiences. α) − |A| Dueling architecture implemented with D-DQN and prioritized experience replay improved previous work. 2017). β) = V (s. to propagate reward faster. then the two streams are combined to estimate action value function. 2017. (2015) proposed spatio-temporal video prediction conditioned on actions and previous video frames with deep neural networks in Atari games. (2016) designed better exploration strategy to improve DQN. a constrained opti- mization approach.4 D UELING A RCHITECTURE Wang et al. α. to converge faster than Q-learning. a. The authors used prioritized experience replay in DQN and D-DQN. (2017a) proposed to accelerate DQN by optimality tightening. a. 2017). Wang et al. See Retrace algorithm (Munos et al. A(s. a). and improved their performance on Atari games.3. a0 . 2017). Schaul et al.. 2016). • Osband et al. Lillicrap et al. θ.. a CNN layer is followed by a fully connected (FC) layer. We list several extensions/improvements here. including deterministic policy gradient (Silver et al. as discussed in Section 3. α)  Q(s.2.1. a. Then we discuss policy gradient... Next we discuss the combination of policy gradient and off-policy RL (O’Donoghue et al. 3. and policy optimization is to find an optimal mapping. Gu et al.. and then combine them to estimate action value function Q(s. β) + A(s. θ. a safe and efficient return-based off-policy control algo- rithm. (2017) proposed to reduce variability and instability by an average of previ- ous Q-values estimates. 2017. α. (2016b) proposed the dueling network architecture to estimate state value function V (s) and associated advantage function A(s. The authors designed a stochastic prioritization based on the TD errors.. Nachum et al. • O’Donoghue et al. α)  Q(s. a). α) − max0 a where α and β are parameters of the two streams of FC layers. See distributed proximal policy optimization (Heess et al. • Oh et al. 2015). (2016) proposed to prioritize experience replay.1.. We discuss actor-critic (Mnih et al. θ. DQN and D-DQN with prioritized experience replay. 2016). a A(s. The importance of experience transitions are measured by TD errors. 3. θ..2 P OLICY A policy maps state to action.1.. (2017) proposed policy gradient and Q-learning (PGQ). 2014. • Anschel et al.. and. In DQN. a CNN layer is followed by two streams of FC layers. a) to obtain Q(s. β) = V (s. θ. so that important experience transitions can be replayed more frequently. a). . θv0 )) accumulate gradients wrt θv0 : dθv ← dθv + ∇θv0 (R − V (si . updating a state from subsequent estimates.2. a navigating task in random 3D mazes using visual inputs. Sutton et al. θv ). we focus on asynchronous advantage actor-critic (A3C) (Mnih et al. A3C ran much faster yet performed better than or comparably with DQN. θv ) is an estimate of the advantage function. similar to using minibatches. one-step Q-learning and n-step Q-learning.. θ. tstart } do R ← ri + γR accumulate gradients wrt θ 0 : dθ ← dθ + ∇θ0 log π(ai |si . 17 . i. thread-specific parameter vectors θ 0 and θv0 Global shared counter T = 0. get state st for st not terminal and t − tstart ≤ tmax do Take at according to policy π(at |st . 2016). (2016) also discussed asynchronous one-step SARSA. so that experience replay is not utilized. θv0 ) otherwise for i ∈ {t − 1. Q-learning as discussed in Section 3.e. Wang et al. (2017b) proposed a stable and sample efficient actor-critic deep RL model using experi- ence replay.. Gorila (Nair et al. stochastic dueling network (Wang et al. 2015). Babaeizadeh et al. in which an agent will face a new maze in each new episode. at . A3C also succeeded on continuous motor control problems: TORCS car racing games and MujoCo physics manipulation and locomotion. with k upbounded by tmax . θ 0 )(R − V (si .1. where A(st .3. 2017).2. 2015) as discussed in Section 3. with truncated importance sampling.. Different from most deep learning algorithms.. and the value function is used for bootstrapping. each actor-learner thread. (2016) We present pseudo code for asynchronous advantage actor-critic for each actor-learner thread in Algorithm 8. 2016b) as discussed in Section 3. and Prioritized D-DQN. at . being updated with n-step returns in the forward view. Relatively speaking. θ) and an estimate of the value function V (st .1 is sample efficient.. θ..2. In the following. asynchronous methods can run on a single multi-core CPU. and Labyrinth.4. Global shared parameter vectors θ and θv . The gradient update can be seen as Pk−1 i ∇θ0 log π(at |st . D-DQN. In A3C. θv ) − V (st . and trust region policy optimization (Schulman et al. while policy gradient is stable.2. Tmax Initialize step counter t ← 1 for T ≤ Tmax do Reset gradients. Dueling D-DQN. T ← T + 1 end  0 for terminal st R= V (st . and θv using dθv end Algorithm 8: A3C. to reduce variance and accelerate learning (Sutton and Barto. θ 0 ) Receive reward rt and new state st+1 t ← t + 1.1 ACTOR -C RITIC An actor-critic algorithm learns both a policy and a state-value function. 1992. For Atari games. θv ).. dθ ← 0 and dθv ← 0 Synchronize thread-specific parameters θ 0 = θ and θv0 = θv Set tstart = t. parallel actors employ different exploration policies to stabilize training. θ 0 )A(st . θv0 ))2 end Update asynchronously θ using dθ. based on Mnih et al. 3.. (2017) proposed a hybrid CPU/GPU implementation of A3C. A3C maintains a policy π(at |st . 2000) is a popular policy gradient method. Mnih et al.. after every tmax actions or reaching a terminal state.2 P OLICY G RADIENT REINFORCE (Williams. θv ) = k i=0 γ rt+i + γ V (st+k . so that it needs to learn a general strategy to explore random mazes. D ETERMINISTIC P OLICY G RADIENT Policies are usually stochastic. However, Silver et al. (2014) and Lillicrap et al. (2016) proposed deterministic policy gradient (DPG) for efficient estimation of policy gradients. Silver et al. (2014) introduced the deterministic policy gradient (DPG) algorithm for RL problems with continuous action spaces. The deterministic policy gradient is the expected gradient of the action-value function, which integrates over the state space; whereas in the stochastic case, the pol- icy gradient integrates over both state and action spaces. Consequently, the deterministic policy gradient can be estimated more efficiently than the stochastic policy gradient. The authors intro- duced an off-policy actor-critic algorithm to learn a deterministic target policy from an exploratory behaviour policy, and to ensure unbiased policy gradient with the compatible function approxima- tion for deterministic policy gradients. Empirical results showed its superior to stochastic policy gradients, in particular in high dimensional tasks, on several problems: a high-dimensional bandit; standard benchmark RL tasks of mountain car and pendulum and 2D puddle world with low dimen- sional action spaces; and controlling an octopus arm with a high-dimensional action space. The experiments were conducted with tile-coding and linear function approximators. Lillicrap et al. (2016) proposed an actor-critic, model-free, deep deterministic policy gradient (DDPG) algorithm in continuous action spaces, by extending DQN (Mnih et al., 2015) and DPG (Sil- ver et al., 2014). With actor-critic as in DPG, DDPG avoids the optimization of action at every time step to obtain a greedy policy as in Q-learning, which will make it infeasible in complex action spaces with large, unconstrained function approximators like deep neural networks. To make the learning stable and robust, similar to DQN, DDPQ deploys experience replay and an idea similar to target network, ”soft” target, which, rather than copying the weights directly as in DQN, updates the soft target network weights θ 0 slowly to track the learned networks weights θ: θ 0 ← τ θ + (1 − τ )θ 0 , with τ  1. The authors adapted batch normalization to handle the issue that the different com- ponents of the observation with different physical units. As an off-policy algorithm, DDPG learns an actor policy from experiences from an exploration policy by adding noise sampled from a noise process to the actor policy. More than 20 simulated physics tasks of varying difficulty in the Mu- JoCo environment were solved with the same learning algorithm, network architecture and hyper- parameters, and obtained policies with performance competitive with those found by a planning algorithm with full access to the underlying physical model and its derivatives. DDPG can solve problems with 20 times fewer steps of experience than DQN, although it still needs a large number of training episodes to find solutions, as in most model-free RL methods. It is end-to-end, with raw pixels as input. DDPQ paper also contains links to videos for illustration. T RUST R EGION P OLICY O PTIMIZATION Schulman et al. (2015) introduced an iterative procedure to monotonically improve policies theoreti- cally, guaranteed by optimizing a surrogate objective function. The authors then proposed a practical algorithm, Trust Region Policy Optimization (TRPO), by making several approximations, includ- ing, introducing a trust region constraint, defined by the KL divergence between the new policy and the old policy, so that at every point in the state space, the KL divergence is bounded; approximat- ing the trust region constraint by the average KL divergence constraint; replacing the expectations and Q value in the optimization problem by sample estimates, with two variants: in the single path approach, individual trajectories are sampled; in the vine approach, a rollout set is constructed and multiple actions are performed from each state in the rollout set; and, solving the constrained opti- mization problem approximately to update the policy’s parameter vector. The authors also unified policy iteration and policy gradient with analysis, and showed that policy iteration, policy gradient, and natural policy gradient (Kakade, 2002) are special cases of TRPO. In the experiments, TRPO methods performed well on simulated robotic tasks of swimming, hopping, and walking, as well as playing Atari games in an end-to-end manner directly from raw images. Wu et al. (2017) proposed scalable TRPO with Kronecker-factored approximation to the curvature. B ENCHMARK R ESULTS Duan et al. (2016) presented a benchmark for continuous control tasks, including classic tasks like cart-pole, tasks with very large state and action spaces such as 3D humanoid locomotion and tasks with partial observations, and tasks with hierarchical structure, implemented various algorithms, 18 including batch algorithms: REINFORCE, Truncated Natural Policy Gradient (TNPG), Reward- Weighted Regression (RWR), Relative Entropy Policy Search (REPS), Trust Region Policy Opti- mization (TRPO), Cross Entropy Method (CEM), Covariance Matrix Adaption Evolution Strategy (CMA-ES); online algorithms: Deep Deterministic Policy Gradient (DDPG); and recurrent variants of batch algorithms. The open source is available at: https://github.com/rllab/rllab. Duan et al. (2016) compared various algorithms, and showed that DDPG, TRPO, and Truncated Nat- ural Policy Gradient (TNPG) (Schulman et al., 2015) are effective in training deep neural network policies, yet better algorithms are called for hierarchical tasks. Islam et al. (2017) 3.2.3 C OMBINING P OLICY G RADIENT WITH O FF -P OLICY RL O’Donoghue et al. (2017) proposed to combine policy gradient with off-policy Q-learning (PGQ), to benefit from experience replay. Usually actor-critic methods are on-policy. The authors also showed that action value fitting techniques and actor-critic methods are equivalent, and interpreted regularized policy gradient techniques as advantage function learning algorithms. Empirically, the authors showed that PGQ outperformed DQN and A3C on Atari games. Nachum et al. (2017) introduced the notion of softmax temporal consistency, to generalize the hard- max Bellman consistency as in off-policy Q-learning, and in contrast to the average consistency as in on-policy SARSA and actor-critic. The authors established the correspondence and a mutual compatibility property between softmax consistent action values and the optimal policy maximizing entropy regularized expected discounted reward. The authors proposed Path Consistency Learning, attempting to bridge the gap between value and policy based RL, by exploiting multi-step path-wise consistency on traces from both on and off policies. Gu et al. (2017) proposed Q-Prop to take advantage of the stability of policy gradients and the sample efficiency of off-policy RL. Schulman et al. (2017) showed the equivalence between entropy- regularized Q-learning and policy gradient. 3.3 R EWARD Rewards provide evaluative feedbacks for a RL agent to make decisions. Rewards may be sparse so that it is challenging for learning algorithms, e.g., in computer Go, a reward occurs at the end of a game. There are unsupervised ways to harness environmental signals, see Section 4.2. Reward function is a mathematical formulation for rewards. Reward shaping is to modify reward function to facilitate learning while maintaining optimal policy. Reward functions may not be available for some RL problems, which is the focus of this section. In imitation learning, an agent learns to perform a task from expert demonstrations, with samples of trajectories from the expert, without reinforcement signal, without additional data from the expert while training; two main approaches for imitation learning are behavioral cloning and inverse rein- forcement learning. Behavioral cloning, or apprenticeship learning, or learning from demonstration, is formulated as a supervised learning problem to map state-action pairs from expert trajectories to policy, without learning the reward function (Ho et al., 2016; Ho and Ermon, 2016). Inverse reinforcement learning (IRL) is the problem of determining a reward function given observations of optimal behaviour (Ng and Russell, 2000). Abbeel and Ng (2004) approached apprenticeship learning via IRL. In the following, we discuss learning from demonstration (Hester et al., 2017), and imitation learning with generative adversarial networks (GANs) (Ho and Ermon, 2016; Stadie et al., 2017). We will discuss GANs, a recent unsupervised learning framework, in Section 4.2.3. Su et al. (2016b) proposed to train dialogue policy jointly with reward model. Christiano et al. (2017) proposed to learn reward function by human preferences from comparisons of trajectory segments. See also Hadfield-Menell et al. (2016); Merel et al. (2017); Wang et al. (2017); van Seijen et al. (2017). 19 L EARNING FROM D EMONSTRATION Hester et al. (2017) proposed Deep Q-learning from Demonstrations (DQfD) to attempt to accel- erate learning by leveraging demonstration data, using a combination of temporal difference (TD), supervised, and regularized losses. In DQfQ, reward signal is not available for demonstration data; however, it is available in Q-learning. The supervised large margin classification loss enables the policy derived from the learned value function to imitate the demonstrator; the TD loss enables the validity of value function according to the Bellman equation and its further use for learning with RL; the regularization loss function on network weights and biases prevents overfitting on small demonstration dataset. In the pre-training phase, DQfD trains only on demonstration data, to obtain a policy imitating the demonstrator and a value function for continual RL learning. After that, DQfD self-generates samples, and mixes them with demonstration data according to certain proportion to obtain training data. The authors showed that, on Atari games, DQfD in general has better initial performance, more average rewards, and learns faster than DQN. In AlphaGo (Silver et al., 2016a), to be discussed in Section 5.1.1, the supervised learning policy network is learned from expert moves as learning from demonstration; the results initialize the RL policy network. See also Kim et al. (2014); Pérez-D’Arpino and Shah (2017). See Argall et al. (2009) for a survey of robot learning from demonstration. G ENERATIVE A DVERSARIAL I MITATION L EARNING With IRL, an agent learns a reward function first, then from which derives an optimal policy. Many IRL algorithms have high time complexity, with a RL problem in the inner loop. Ho and Ermon (2016) proposed generative adversarial imitation learning algorithm to learn poli- cies directly from data, bypassing the intermediate IRL step. Generative adversarial training was deployed to fit the discriminator, the distribution of states and actions that defines expert behavior, and the generator, the policy. Generative adversarial imitation learning finds a policy πθ so that a discriminator DR can not dis- tinguish states following the expert policy πE and states following the imitator policy πθ , hence forcing DR to take 0.5 in all cases and πθ not distinguishable from πE in the equillibrium. Such a game is formulated as: max min −Eπθ [log DR (s)] − EπE [log(1 − DR (s))] πθ DR The authors represented both πθ and DR as deep neural networks, and found an optimal solution by repeatedly performing gradient updates on each of them. DR can be trained with supervised learning with a data set formed from traces from a current πθ and expert traces. For a fixed DR , an optimal πθ is sought. Hence it is a policy optimization problem, with − log DR (s) as the reward. The authors trained πθ by trust region policy optimization (Schulman et al., 2015). T HIRD P ERSON I MITATION L EARNING Stadie et al. (2017) argued that previous works in imitation learning, like Ho and Ermon (2016) and Finn et al. (2016b), have the limitation of first person demonstrations, and proposed to learn from unsupervised third person demonstration, mimicking human learning by observing other humans achieving goals. 3.4 M ODEL A model is an agent’s representation of the environment, including the transition model and the reward model. Usually we assume the reward model is known. We discuss how to handle unknown reward models in Section 3.3. Model-free RL approaches handle unknown dynamical systems, however, they usually require large number of samples, which may be costly or prohibitive to obtain for real physical systems. Model-based RL approaches learn value function and/or policy in a data- efficient way, however, they may suffer from the issue of model identification so that the estimated models may not be accurate, and the performance is limited by the estimated model. 20 by introducing information gain. The authors focused on time-varying linear-Gaussian policies. The authors implemented the proposed exploration strategy with minor modifications to REIN- FORCE. The under- appreciated reward exploration strategy resulted from importance sampling from the optimal policy. and integrated a model- based linear quadratic regulator (LQR) algorithm with a model-free path integral policy improve- ment algorithm. VIN is model-free. e. (2016). See recent work on model-based learning. to relate to confidence intervals in count-based exploration.6 E XPLORATION A RL agent usually uses exploration to reduce its uncertainty about the reward function and tran- sition probabilities of the environment. for the first time with a RL method. Houthooft et al. In tabular cases. 3.g. policies in RL. and combined a mode seeking and a mean seeking terms to tradeoff exploration and exploitation. the authors combined the proposed approach with guided policy search (GPS) (Levine et al. The proposed approach does not generate synthetic samples with estimated models to avoid degradation from modelling errors. a fully differentiable CNN plan- ning module to approximate the value iteration algorithm. Hester and Stone (2017). (2016) proposed pseudo-count. (2017). However... 2016)..5 P LANNING Planning constructs a value function or a policy usually with a model. 21 . as well as Du- eling Network(Wang et al. Intrinsic motivation suggests to explore what is surprising. See classical Dyna-Q (Sutton. Gu et al. on several algorithmic tasks. Mars Rover Navigation. (2016) proposed bootstrapped DQN to combine deep exploration with deep neural networks to achieve efficient learning. Watter et al. to unify count-based exploration and intrinsic motivation. 2016b). Nachum et al. VIN can be trained end-to-end with backpropagation.g. typically in learning process based on change in prediction error. count-based methods are not directly useful in large domains. Fortunato et al. Bellemare et al. Osband et al.Chebotar et al.com/karpathy/paper- notes/blob/master/vin. See also Azar et al. Intrinsic motivation methods do not require Markov property and tabular repre- sentation as count-based methods require. Oh et al. To generalized the method for arbitrary parameterized policies such as deep neural networks. and to relate to learn- ing progress in intrinsic motivation. Henaff et al. which are related to the state-action visit counts.md. (2017). this uncertainty can be quantified as con- fidence intervals or posterior of environment parameters. (2016b). One merit of Value Iteration Network. is that they design novel deep neural networks architectures for reinforcement learning problems. (2017). Jiang et al. undirected exploration strategies of the reward landscape. (2016) proposed variational information maximizing exploration for continuous state and action spaces. Ostrovski et al. a density model over the state space. With count-based exploration. Tamar et al. e. so that it avoids issues with system identification. in which the log-probability of an action sequence under the current policy under-estimates the resulting reward. to learn to plan. (2015). In contrast to conventional planning. (2017) proposed NoisyNet for efficient exploration by adding parametric noise added to weights of deep neural net- works. (2017). Silver et al. and validated it with Atari games. where reward and transition probability are part of the neural network to be learned. (2017) attempted to combine the advantages of both model-free and model-based RL approaches. VIN can generalize in a diverse set of tasks: sim- ple gridworlds.4. so that planning is usually related to model-based RL methods as discussed in Section 3. (2017) proposed an under-appreciated reward exploration technique to avoid the pre- vious ineffective. 2016a). 1990). a RL agent uses visit counts to guide its behaviour to re- duce uncertainty. The author established pseudo-count’s theoretical advantage over previous intrinsic motivation methods. as in -greedy and en- tropy regularization. (2016b) proposed the predictron to integrate learning and planning into one end-to-end training procedure with raw input in Markov reward process. 3. (2016) introduced Value Iteration Networks (VIN). continuous control and WebNav Challenge for Wikipedia links navigation (Nogueira and Cho. and validated it. and to promote directed exploration of the regions. which can be regarded as Markov decision process without actions. See a blog about VIN at https://github. moment matching networks. DNC outperformed normal neural net- work like LSTM or DNC’s precursor Neural Turing Machine (Graves et al.. 2015.. we have generative adversarial networks. 4. in which. Finn et al.. (2016). 2015). POMDP.. Luo et al. a weighted addressing scheme to all memory locations.g. Similar to a conventional computer. where density functions are concerned. neural autoregressive distribution estima- tors. k-means etc. (2016). Sukhbaatar et al. some are with tractable models. in differentiable neural computer (Graves et al. like Botlzmann machines. and attention is an approach for memory addressing. 2017). with harder problems. etc. Xu et al.. 2016). Zagoruyko and Komodakis (2017). 2017a. Jader- berg et al. (2015). Memory provides data storage for long time.4 I MPORTANT M ECHANISMS In this section. 2009). Ba et al. (2016). (2015). Mnih et al. Zaremba and Sutskever. 2015. a DNC can solve synthetic question answering problems. e. (2014. etc. a neural network can read from and write to an external memory. Oquab et al. Oh et al. Xu et al. including attention and memory. transfer learning. (2017). like sparse coding. See Deepmind’s description of DNC at https://deepmind..pub/2016/augmented-rnns/ and http://www. Among probabilistic (generative) models with explicit density functions. (2015). When trained with supervised learning.1 ATTENTION AND M EMORY Attention is a mechanism to focus on the salient parts. e. e. (2015). Zhu and Goldberg. a DNC can solve a moving blocks puzzle with changing goals specified by symbol sequences. in a DNC. an LSTM may simply fail.g. 2016)... (2016). Zaremba and Sutskever (2015). 2015). (2015) in- tegrated attention to image captioning. the neural network is the controller and the external memory is the random-access memory. which a neural network without read-write memory can not solve. like Bayesian RL (Ghavamzadeh et al. 2014). Kadlec et al. (2017). Luo et al. When trained with reinforcement learning. and would be a crit- ical mechanism to achieve general artificial intelligence. We note that we do not discuss in detail some im- portant mechanisms. in Bahdanau et al. DNC minimizes memory allocation interference and enables long-term storage. like fully observable belief nets. See http://distill. Liang et al. so that DNC can solve complex. 2015). Hausknecht and Stone (2015). structured prob- lems. variational autoencoders.. There are endeavours for hard attention (Gulcehre et al. Vaswani et al. Kaiser and Bengio (2016). we expect to see further improvements and applications of DNC. 2016). autoencoders. Duan et al. a DNC learns such representation and manipulation end-to-end with gradient descent from data in a goal-directed manner. (2015. Unsupervised learning is categorized into non-probabilistic models. some are with non-tractable models. We briefly discuss application of attention in computer vision in Section 5. Although these experiments are relatively small-scale. (2016) proposed differentiable neural computer (DNC). Chen et al. Gregor et al. unsupervised learning. and with external memory. either explicitly or implicitly (Salakhut- dinov. 22 . semi-supervised learning.. we discuss important mechanisms for the development of (deep) reinforcement learning. which is the way conven- tional computers access memory. and probabilistic (gen- erative) models.2 U NSUPERVISED L EARNING Unsupervised learning is a way to take advantage of the massive amount of data. Most works follow a soft attention mechanism (Bahdanau et al. See recent work on attention and/or memory. and semi-supervised RL (Audiffren et al.g. 2016. Helmhotz machines. The attention mechanism is also deployed in NLP. (2016). 2016.com/2016/01/attention- and-memory-in-deep-learning-and-nlp/ for blogs about attention and memory. (2014) applied attention to image classification and object detection.. Graves et al. it can solve the shortest path finding problem between two stops in transportation networks and the relationship inference prob- lem in a family tree. and a DNC represents and manipulates complex data structures with the memory. (2016a). and PixelRNN. Danihelka et al. Weston et al. Yang et al. Eslami et al. (2015). for reasoning and inference in natural language. 2017. and learning to learn.com/blog/differentiable-neural-computers/.. Differently.wildml. 4. For probabilistic (generative) models with implicit density functions.4. (2015).. etc. hierarchical RL. Chen et al. See also Le et al. and performed well on 3D Labyrinth game. (2014) proposed generative adversarial nets (GANs) to estimate generative models via an adversarial process by training two models simultaneously. 2016). it learns in real-time while following some other behaviour policy.2 U NSUPERVISED AUXILIARY L EARNING Environments may contain abundant possible training signals. e.2. and unsupervised auxiliary learning (Jader- berg et al. and learns with gradient-based temporal difference learning methods. 4.In the following. 2014). (2014) modelled G and D with multilayer perceptrons: G(z : θg ) and D(x : θd ). reward function. together with a deconvolutional network. rewards and actions are stored in a reply buffer. 2011). Jaderberg et al.. Experiences of observations. See also Lample and Chaplot (2016). and value function replay. As a result. a generative model G to capture the data distribution.e. and auxiliary reward tasks may help to achieve a good representation of rewarding states. to tackle the issue of reward sparsity. 4. i. reward prediction. We discuss robotics navigation with similar unsupervised auxiliary learning (Mirowski et al.2. D will be trained to maximize the probability of assigning labels correctly to samples from both training data and G. to maximize changes in pixel intensity of different regions of the input images. for being used by auxiliary tasks. a scalable real-time architecture for learning in parallel general value functions for independent sub-agents from unsupervised sensorimotor interaction. Define a prior on input noise variable pz (z). The auxiliary policies use the base CNN and LSTM.2. and a discriminative model D to estimate the probability that a sample comes from the training data but not the generative model G. G is a differentiable function and D(x) outputs a scalar as the probability that x comes from the training data rather than pg . Horde can learn to predict the values of many sensors. The base agent is trained on-policy with A3C (Mnih et al. and z are input noise variables. See Deepmind’s description of UNREAL at https://deepmind. i. pixel control. besides the usual cumulative reward. (2017) proposed UNsupervised REinforcement and Auxiliary Learning (UNREAL) to improve learning efficiency by maximizing pseudo-reward functions. D and G form the two-player minimax game as follows: min max Ex∼pdata (x) [log D(x)] + Ez∼pz (z) [log(1 − D(G(z)))] G D 23 . Simultaneously. pixel changes may imply important events. UNREAL has a shared representation among signals. nonreward signals and observations.e... two ways to take advantages of possible non-reward training signals in environ- ments. Liu et al.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/. with constant time and memory complexity per time step. 4.2. with general value functions.. G will be trained to minimize such classification accuracy.. we discuss Horde (Sutton et al.. x are data points. where θg and θd are parameters.g.. (2017). UNREAL improved A3C’s performance on Atari games.3 G ENERATIVE A DVERSARIAL N ETWORKS Goodfellow et al. Value function replay further trains the value function. The reward prediction module predicts short-term extrinsic reward in next frame by observing the last three frames. (2012). 2017) in Section 5. This may be even helpful when the extrinsic rewards are rarely observed.1 H ORDE Sutton et al. (2016). 2017). log(1 − D(G(z))). We also discuss generative adversarial networks (Goodfellow et al. UNREAL is composed of RNN-LSTM base agent. Horde is off-policy. and answer predictive or goal-oriented questions. Goodfellow et al. where policy.. termination function. which may help to expedite achieving the main goal of maximizing the accumulative rewards. and terminal reward function are parameters. The authors then proposed Horde. (2011) proposed to represent knowledge with general value function. while sharing a common representation. while Horde trains each value function separately with distinct weights. the generative distribution we want to learn. and policies to maximize those sensor values. inductive transfer learning includes self-taught learning and multi- task learning. (2017a). (2017) unified GANs and Variational Autoencoders (VAEs). (2016). (2014) showed that as G and D are given enough capacity. de- fined the common representation using which to map states and to project the execution of skills. the criteria for evaluation. (2017).2. 2016). Smart Grid in Section 5. and compute systems in Section 5. Busoniu et al. (2016). transfer learning can be inductive. another sta- ble model. See NIPS 2015 Transfer and Multi-Task Learning: Trends and New Perspectives Workshop. as we will discuss. Lowe et al. (2017).g. or unsupervised. 2007). Multi-agent systems have many applications. Ganin et al. Rajendran et al. Gupta et al. Long et al. scale sensitivity. and even fundamental issues like what is the question for multi-agent learning. autoencoder. and energy-based models. (2017) for Wasserstein GAN (WGAN) as a stable GANs model. (2016). (2017). whether convergence to an equilibrium is an appropriate goal. and set a new milestone in visual quality for image generation.10. etc.12. (2016). (2015). Finn et al. We discuss imitation learning with GANs in Section 3. 4.. with similarity loss metric. Mao et al. There are several recent works. (2008) surveyed works in multi-agent RL. 2017. robotics in Section 5. (2017).. possibly with different feature spaces and/or different data distributions (Taylor and Stone. and transductive transfer learning includes domain adaptation and sample selection bias/covariance shift. and designed an algorithm for two agents to transfer the informative feature space maximally to trans- fer new skills. (2017) formulated the multi-skill problem for two agents to learn multiple skills. and coherent research agen- das (Shoham et al. As reviewed in Pan and Yang (2010).. games in Section 5. The authors validated their proposed approach with two simulated robotic manipulation tasks. (2017) pro- posed Cramér GAN to satisfy three machine learning properties of probability divergences: sum invariance. GANs have received much attention and many works have been appearing after the tutorial.. 2009) with RL. e. (2015.1. Papernot et al. Gulrajani et al. Bellemare et al. there are new issues like multiple equilibria. Parisotto et al.3 T RANSFER L EARNING Transfer learning is about transferring knowledge learned from different domains. instead of clipping weights as in Arjovsky et al. Pfau and Vinyals (2016) established the connection between GANs and actor-critic algorithms. Kaiser et al. GANs are notoriously hard to train. 2017.g. and provided a training algorithm with backpropa- gation by minibatch stochastic gradient descent.. Maurer et al. Weiss et al. 2017.4 M ULTI -AGENT R EINFORCEMENT L EARNING Multi-agent RL (MARL) is the integration of multi-agent systems (Shoham and Leyton-Brown. Andreas et al. (2017) proposed to improve stability of WGAN by penalizing the norm of the gradient of the discriminator with respect to its input. Mo et al. See Ruder (2017) for an overview about multi-task learning. Intelligent Transportation Systems in Section 5. Besides issues in RL like convergence and curse-of-dimensionality. about new deep MARL algorithms (Foerster et al. Yosinski et al. Berthelot et al. (2016a) established a connection between GANs. and reinforcement learning. Kansky et al. 4. including generative adversarial imitation learning. Hu et al. Conse- quently. transductive. See also recent work in transfer learning e. inverse RL. 2009. Whye Teh et al.. and unbiased sample gradients.11. See Goodfellow (2017) for Ian Goodfellow’s summary of his NIPS 2016 Tutorial on GANs. generative adversarial nets can recover the data generating distribution. 2010.Goodfellow et al.3. 2016). (2017). thus it is at the intersection of game theory (Leyton-Brown and Shoham. (2017). Pérez-D’Arpino and Shah (2017). Dong et al.. Pan and Yang. Omidshafiei 24 . 2008) and RL/AI communities. multi-agent learning is challenging both technically and conceptually. See Arjovsky et al. and third person imitation learning. Foerster et al. (2017) proposed BEGAN to improve WGAN by an equilibrium enforcing model. (2014). and demands clear understanding of the problem to be solved. (2016) proposed Least Squares GANs (LSGANs). (2015) proposed siamese neural networks with metric learning for one-shot image recognition. for learning high-level temporally abstracted macro-actions in an end-to-end man- ner based on observations from the environment. for handwritten characters in par- ticular.1 4. the former learns a policy over intrinsic sub-goals. 2017). or options (Sutton et al. Duan et al. Dietterich. Sharma et al. 2016. (2016) proposed strategic attentive writer (STRAW). (2016) validated STRAW on next character prediction in text. (2017). (2017).. to improve sample efficiency. (2016) proposed hierarchical-DQN (h-DQN) by organizing goal-driven intrinsically motivated deep RL modules hierarchically to work at different time-scales. learn new tasks in a few samples. Sukhbaatar et al. 2016). We add this section due to the importance of multi-agent reinforcement learning. (2016) and Wang et al. 25 . 2D maze navigation. Peng et al. (2014). and convergence rate as reward. Lake et al. which is a form of intrinsic motivation to explore agent’s own capabilities. (2016) proposed to learn a flexible RNN model to handle a family of RL tasks. the latter learns a policy over raw actions to satisfy given sub-goals. (2015) proposed an one-shot concept learning model. See also Andrychowicz et al. meta learning.et al. Duan et al. See a survey on hierarchical RL (Barto and Mahadevan. 2017). (2017b) designed a large scale memory module for life- 1 We leave this section as in progress. Hierarchical RL is an approach for issues of sparse rewards and/or long hori- zons (Sutton et al.. Vezhnevets et al. then on top of these skills. new communication mechanisms in MARL (Foerster et al. h-DQN integrates a top level action value function and a lower level action value function.. and the pre-training follows an unsupervised way. Their method combined hierarchical methods with intrinsic motivation. Pre-training is based on a proxy reward signal. 1999. (2017) proposed a model for one-shot imitation learn- ing with attention for robotics. or sub-goals. (2017). Vezhnevets et al. updated periodically based on observing re- wards. Li and Malik (2017) proposed to automate unconstrained continuous optimization algorithms with guided policy search (Levine et al. The authors tested their approach on the game of Minecraft. Montezuma’s Revenge. its design requires minimal domain knowledge about the downstream tasks. and Atari games. and learns for how long to commit to the plan by following it without replanning. 2016a) by representing a particular optimization algorithm as a policy. Machado et al.6 L EARNING TO L EARN Learning to learn is about learning to adapt rapidly to new tasks. 2000. with probabilistic program induction. a deep recurrent neural net- work architecture.5 H IERARCHICAL R EINFORCEMENT L EARNING Hierarchical RL is a way to learn. It is related to transfer learning. STRAW builds a multi-step action plan. plan. and represent knowledge with spatio-temporal abstraction at multiple levels. are learned to transfer knowledge to new tasks. 2016). including DQN and A3C. (2017a). 1999). (2017) proposed to pre-train a large span of skills using Stochastic Neural Networks with an information-theoretic regularizer.. (2016) designed matching net- works for one-shot classification. Vinyals et al. Ravi and Larochelle (2017) proposed a meta-learning model for few shot learning. and benefit from prior knowledge. Kompella et al. Koch et al. Kaiser et al. in contrast to the manual approach in pre- vious work. (2017). Johnson et al. In a hard Atari game. (2016) presented zero-shot translation for Google’s multilingual neural machine translation system. as well as for selective RL papers in NIPS 2017. Yao et al. to train high-level policies for downstream tasks.. (2017). and one/few/zero-shot learning. h-DQN outperformed previous methods. See also Bacon et al. (2015). Kulkarni et al. and sequential social dilemmas with MARL (Leibo et al. Barto and Mahadevan. representation learning. multi-task learning. Schaul et al. (2017) proposed a hierarchical deep RL network architecture for lifelong learning. STRAW learns to discover macro-actions automatically from data. It is a core ingredient to achieve strong AI (Lake et al. (2016). Macro-actions are sequences of actions commonly occurring. 2003). Vezhnevets et al.. Tessler et al. Florensa et al.. 2003). 4.. Reusable skills. and will refine it when we get more ready. 2. Games will still be an important testbed for AI. Tesauro (1994) approached Backgam- mon by using neural networks to approximate value function learned with TD learning.1. We discuss smart grid in Section 5. We discuss healthcare in Sec- tion 5. in which. and they used Labyrinth as the testbed. Next we discuss natural language processing in Section 5. as it includes several application areas here: healthcare. We do not discuss some interesting applications. (2017) pro- posed prototypical networks for few/zero-shot classification by learning a metric space to compute distances to prototype representations of each class. We focus on computer Go.g. after the success of deep learning. checker and Othello.1 and its extensions. In such games. etc. 2017. customer management. and game theory may be deployed or not. and focus on Texas Hold’em Poker. in particular.. We discuss Industry 4. including majiang/mahjong.1 and robotics in Section 5. There are optimization and control problems in these areas. Many countries have made plans to integrate AI with manufacturing. we discuss business management. In video games.0. 26 .. Neural architecture design in Section 5. an important application area of AI.ly/2pDEs1Q.4. It is desirable to do a deeper analysis of all application areas listed in the following. Jaques et al. resource management. Backgammon and computer Go are perfect information board games.1. We discuss finance in Section 5. and Mirowski et al.11. chess. and http://bit.0 in Section 5... all of which experimented with Atari games. We do not list smart city.1. esp. 2017).1. intelligent transportation system. AlphaGo.3. which we leave as a future work. for its significance. e. supply chain.1. and computer systems in Section 5.9.1 G AMES Games provide excellent testbeds for RL/AI algorithms.6.3.1. We discuss Mnih et al.2.8. in particular. and marketing. information may be perfect or imperfect. (2017) in Section 5. like music generation (Briot et al. In Section 5. We discuss Deep Q-Network (DQN) in Section 3. which we discuss in Section 5. See Yannakakis and Togelius (2017) for a draft book on artificial intelligence and games. new application of RL. 5. and many of them are concerned with networking and graphs.5 is an exciting. recommendation. AlphaGo (Silver et al. 2017). and achieved human level performance. players reveal prefect information. like ads.. Variants of card games. 2011). Go. intelligent transportation systems in Section 5.1 P ERFECT I NFORMATION B OARD G AMES Board games like Backgammon. 5. which enjoys wide and deep applications of RL recently. we do not list it as an application area — it is implicitly a component in application areas like intelligent transportation system and Industry 4.1. smart grid. which receives much attention recently.2. there are efforts for integration of vision and language.g. (2017) in Section 4. We discuss video games in Section 5. are classical testbeds for RL/AI al- gorithms. (2017) proposed Schema Networks for zero-shot transfer with a generative causal model of intuitive physics. In Section 5. (2016) in Section 3.7. and retrosyn- thesis (Segler et al. We may only touch the surface of some application areas. Business and finance have natural problems for RL.12. Jaderberg et al. two classical RL application areas. etc. e. and focus on computer Go.2. Computer vision follows in Section 5.2. These application areas may overlap with each other.ly/2rjsmaz. We discuss games in Section 5.10. inventory management. 2016a). a robot may need skills for many or even all of the application areas.long one-shot learning to remember rare events. Robotics will be critical in the era of AI. Snell et al. 5 A PPLICATIONS Reinforcement learning has a wide range of applications. we dis- cuss briefly Backgammon. Reinforcement learning is widely used in operations research (Powell. Kansky et al. are imperfect information board games. and their applications.. See previous work on lists of RL applications at: http://bit. We will see more achievements in imperfect information games and video games. 5 games to 0. and a RL value network. in the following stages: 1) select a promising node to explore further. and became the first computer Go program to won a human profes- sional Go player without handicaps on a full-sized 19 × 19 board. The fast rollout policy uses a linear softmax with small pattern features. The RL policy network improves SL policy network. and our guess is that it improved the accuracy of policy network and value network by self play so that it needs less search with MCTS. us- ing stochastic gradient descent to minimize the mean squared error between the prediction and the corresponding outcome.gl/lZoQ1d.. a fast rollout policy. (2016a) and Sutton and Barto (2017). together with the SL policy network and the rollout network. Soon after that in March 2016. and Monte Carlo tree search (MCTS) (Browne et al. an 18-time world champion Go player. The SL policy network has convolutional layers. 2007. 4 games to 1. This might explain the one game loss against Lee Sedol. See Deepmind’s description of AlphaGo at goo. to stabilize the learning and to avoid overfitting. This set a landmark in AI. Games are played between the current policy network and a random. D ISCUSSIONS The Deepmind team integrated several existing techniques together to engineer AlphaGo and it has achieved tremendous results. so that the RL value network. in October 2015. an astronomical number. so it is not entirely an end-to-end solution yet. A LPHAG O : T RAINING PIPELINE AND MCTS We discuss briefly how AlphaGo works based on Silver et al. The reward function is +1 for winning and -1 for losing in the terminal states. Ke Jie worked on a single machine with TPU. State- action pairs are sampled from expert moves to train the network with stochastic gradient ascent to maximize the likelihood of the move selected in a given state. which was successfully used in solving many other games. The 2017 version of AlphaGo vs. The weights are trained by regression on state-outcome pairs. and 0 otherwise. the RL policy network and RL value network are not strong/accurate enough. and policy gradient for training. It builds a partial game tree starting from the current state. 2012. except the out- put is a single scalar predicting the value of a position. 2012). A move is then selected. a RL policy network. 2) expand a leaf node guided by the SL policy network and collected statistics. and an output softmax layer representing probability distribution over legal moves. However. with the same network architecture.C OMPUTER G O The challenge of solving Computer Go comes from not only the gigantic search space of size 250150 . assist MCTS to search for the move. AlphaGo was built with techniques of deep CNN. in contrast.. 2002). Moverover. but also the hardness of position evaluation (Müller. AlphaGo still requires manually defined features with human knowledge. reinforcement learning. AlphaGo defeated Lee Sedol. The training pipeline phase includes training a supervised learning (SL) policy network from expert moves. Gelly et al. Gelly and Silver. making headline news worldwide. where 19 is the dimension of a Go board and 48 is the number of features. ReLU nonlinearities. AlphaGo is composed of two phases: neural network training pipeline and MCTS. DQN requires 27 . like Backgammon and chess. See Sutton and Barto (2017) for a detailed and intuitive description of AlphaGo. 3) evaluate a leaf node with a mixture of the RL value network and the rollout policy. data are generated by self-play between the RL policy network and itself until game termination. The inputs to the CNN are 19 × 19 × 48 image stacks. AlphaGo defeated Ke Jie 3:0 in May 2017. 2016a). supervised learning. The value network is learned in a Monte Carlo policy evaluation approach. previous iteration of the policy network.. The RL value network still has the same network architecture as SL policy network. a computer Go program. won the human European Go champion. and the weights of SL policy network as initial weights. AlphaGo (Silver et al. In MCTS phase. Weights are updated by stochastic gradient ascent to maximize the expected outcome. 4) backup evaluations to update the action values. To tackle the overfitting problem caused by strongly correlated successive positions in games. AlphaGo selects moves by lookahead search. . we see the success of AlphaGo as the triumph of AI.3 V IDEO G AMES Video games would be great testbeds for artificial general intelligence. 2017). accessible perfect simulator. 2017). a real-world scale imperfect-information game.1. significant progress has been made for Heads-up No-Limit Hold’em Poker (Moravčı́k et al. etc. NFSP was evaluated on two-player zero-sum games. Heads-up Limit Hold’em Poker was essentially solved (Bowling et al. and huge datasets of human play games. However. New deep neural network architectures are called for.g. like planning. superhuman algorithms which are based on significant domain expertise. 2017). or game theory in general.. 5. in context”. like classical AI problems. (2016a) in solving problems requiring titanic search spaces. He characterized properties of Computer Go as: fully deterministic. Being more practical. without MCTS. Heinrich and Silver (2016) proposed Neural Fictitious Self-Play (NFSP) to combine fictitious self- play with deep RL to learn approximate Nash equilibria for games of imperfect information in a scalable end-to-end approach without prior domain knowledge. and constraint satisfaction. these techniques are present in many recent achievements in AI.g. discrete action space. relatively short episode/game. and new areas for AI. It is nontrivial to apply A3C to such 3D games directly. a novel deep neural network architecture. which is an iterative method to approximate a Nash equilibrium of an extensive-form game with repeated self-play between two regret-minimizing algorithms. Such a room for improvements would inspire intellectual inquisition for better computer Go programs.2 I MPERFECT I NFORMATION B OARD G AMES Imperfect information games. and Monte Carlo tree search... we expect more applications/extensions of techniques in Silver et al. This would be based on a novel RL algorithm.. In Limit Texas Hold’em. and is far from artificial general intelligence. have many applications. such endeavour would be illusive at large currently. from recent four raw frames and game variables. In Leduc poker. with little domain knowledge or human expert games. for the sophistication to represent complex scenarios in Go and the elegance for learning in a reasonable time. 28 . AlphaGo will probably shed lights on classical AI areas. scheduling. Reportedly. 2009) approach of starting with simple tasks and gradually transition to harder ones. fully observable.1. focusing computation on specific situations arising when making decisions and use of value functions trained automatically. See Wang et al. like retrosynthesis (Segler et al. partly due to sparse and long term reward. learning from demonstration (as supervised learning). and the full version of Texas Hold’em. possibly for better reasoning.. DeepStack utilized the recursive reasoning of CFR to handle information asymmetry. It is true that computer Go has limitations in the problem setting and thus potential applications.e. after AlphaGo defeated Ke Jie in May 2017. potentially with deep RL only.only raw pixels and scores as inputs. in particular. deep learning. NFSP performed similarly to state-of-the-art. to illustrate the narrowness of AlphaGo. Admittedly. 2015) with counterfactual regret minimization (CFR). the success of AlphaGo’s conquering titanic search space inspired quantum physicists to solve the quantum many-body problem (Carleo and Troyer. Wu and Tian (2017) deployed A3C with CNN to train an agent in a partially observable 3D envi- ronment. the DeepStack computer program defeated professional poker players for the first time. 2017). (2017) for an endeavour in this direction. i. like TD-Gammon (Sutton and Barto. and constraint satisfaction (Silver et al. reinforcement learning. e. 2015). New RL algorithms are called for.. e. AlphaGo’s underlying techniques. It is interesting to see more progress of deep RL in such applications. clear and fast evaluation conducive for many trail- and-errors. security and medical decision support (Sandholm.. scheduling. following the curriculum learning (Bengio et al.2 5. to predict next action and value function. 2016a). planning. Doom. The authors won the champion in Track 1 of ViZDoom 2 Andrej Karpathy posted a blog titled ”AlphaGo. without abstraction and offline computation of complete strategies as before as in Sandholm (2015). NFSP approached a Nash equilibrium. D EEP S TACK Recently. As a whole technique. while common RL methods diverged. and powerful computation. so that an optimal policy and/or an optimal value function can be directly approximated to make decisions without the help of MCTS to choose moves. Popov et al. Chen and Yi (2017). The authors used StarCraft as the testbed. See Kober et al. Zhu et al. Usunier et al.cs. Peng et al. 1) unsupervised reconstruction of a low-dimensional depth map for representation learning to aid obstacle avoidance and short-term trajectory planning. Yahya et al. yet it was solved with supervised learning. Mahler et al. Chebotar et al. (2016b). and visualization of the reasoning process. 2) self-supervised loop clo- sure classification task within a local trajectory. 2016a) and learn to navi- gate (Mirowski et al. (2016) and Tessler et al. (2017). (2017) for StarCraft Dataset. In the following.1 G UIDED P OLICY S EARCH Levine et al. Boston Dynamics robots did not use machine learning. (2017). to obtain the training data coming from the policy’s own state distribution. by transforming policy search into su- pervised learning to achieve data efficiency. and won the Full Deathmatch track of the Visual Doom AI Competition. Pérez-D’Arpino and Shah (2017). GPS alternates between trajectory-centric RL and su- pervised learning.. See Lin et al.2. (2013) for a survey on policy search for robotics. (2017) proposed Schema Networks and empirically studied variants of Breakout in Atari games. long-horizon performance. (2017) obtained the navigation ability by solving a RL problem maximizing cumu- lative reward and jointly considering un/self-supervised tasks to improve data efficiency and task performance. and simulated comparisons with previous policy search methods. (2017) for a survey about applying deep (reinforcement) learning to video games. (2016). (2016. to address the issue that supervised learning usually does not achieve good. with a bidirectionally-coordinated network to form coordination among multiple agents in a team. and Argall et al. Good performance was achieved on a range of real-world manipulation tasks requiring localization. See Justesen et al. (2017). and focus fire without overkill. Duan et al.shtml. See the journal Science Robotics. ”this is the first method that can train deep visuomotor policies for complex. high-dimensional manipulation skills with direct torque control”.. As the authors mentioned. Without human demonstration or labelled data as supervision. Dosovitskiy and Koltun (2017) approached the problem of sensorimotor control in immersive en- vironments with supervised learning. deploying the concept of dynamic grouping and parameter sharing for better scalability. hit and run.. See more recent robotics papers. with training data provided by a trajectory-centric RL method operating under unknown dynamics. It is interesting to note that from NIPS 2016 invited talk. Justesen and Risi (2017) also studied StarCraft. a global plan to act. (2017) studied Super Smash Bros. visual tracking. (2009) for a survey of robot learning from demonstration. 5.2 ROBOTICS Robotics is a classical area for reinforcement learning.mun. (2016a). Gu et al. (2013) for a survey of RL in robotics. like move without collision. The authors incorporated a stacked LSTM to use 29 . localization. cover attack.2. (2017). Firoiu et al. Oh et al. 5. (2017b) proposed a multiagent actor-critic framework. Lee et al. (2017) studied Minecraft. 2017). and handling complex contact dynamics. and plan the following future work: a map from an unknown envi- ronment. Levine et al. and Kansky et al. The authors introduced guided policy search (GPS) to train policies represented as CNN. We list it here since it is usually a RL problem. (2017). The authors addressed the sparse reward issues by augmenting the loss with two auxiliary tasks. See Ontañón et al.Competition by a large margin. the proposed approach learned strate- gies for coordination similar to the level of experienced human players. (2016a) proposed to train the perception and control systems jointly end-to-end. Deisenroth et al. GPS utilizes pre-training to reduce the amount of experience data to train visuomotor policies.2 L EARN TO NAVIGATE Mirowski et al. (2017).ca/˜dchurchill/starcraftaicomp/history. and its history at https://www. to map raw image observations directly to torques at the robot’s motors. 5. Finn and Levine (2016). (2013) for a survey about Starcraft. we discuss guided policy search (Levine et al. Lample and Chaplot (2016) also discussed how to tackle Doom. 2017). Check AIIDE Starcraft AI Competitions. e.g. Radford et al. Liang et al. (2017). Guu et al.g. Wang et al. in con- trast to conventional approaches such as Simultaneous Localisation and Mapping (SLAM).. We intentionally remove ”spoken” before ”dialogue systems” to accommodate both spoken and written language user interface (UI). more about synergy than competition. (2017a). e. See an article by Christopher D. He et al. dialogue manager (dialogue state tracker and dialogue policy learning). 30 . • language tree-structure learning. (2017).memory at different time scales for dynamic elements in the environments. the latter are set up to mimic human-human interactions with extended conversations. and Narasimhan et al. Choi et al. (2017)...2.g.g. (2016). The proposed agent learn to navigate in complex 3D mazes end-to-end from raw sensory input.us/article/170/last-words-computational-linguistics-and-deep-learning. 5.g.3. 2013). e. e. Some non-deep learning algo- rithms are effective and perform well. e. e. e. conversational agents.. and Wang et al. (2017) • semantic parsing. where explicit position inference and mapping are used for navigation. Yogatama et al. Xiong et al.. infobots (interactive question answering). 2017b). and performed similarly to human level. 2017). and many works that study syntax and semantics of languages.1 D IALOGUE S YSTEMS In dialogue systems. A dialogue system usually include the following modules: (spoken) language understanding. e.. (2011. machine translation in Section 5. Jurafsky and Martin (2017) categorize dialogue systems as task-oriented dialog agents and chatbots. which usually requires manual processing. 2012). and helping make significant progress.. titled ”Last Words: Computa- tional Linguistics and Deep Learning. (2016). Zhang and Lapata (2017) • sentiment analysis (Liu. even when start/goal locations change frequently.3. task completion bots (task-oriented or goal-oriented) and personal assistant bots. and Mitra and Craswell (2017) • information extraction. and NAACL for papers about NLP. (2016) • automatic query reformulation. Celikyilmaz et al. sometimes with entertainment value. e. (2017c) • text games. and text generation in Section 5. word2vec (Mikolov et al.. (2017a). e. (2017b) • question answering. (2016b). there are four categories: social chatbots..3 NATURAL L ANGUAGE P ROCESSING In the following we talk about natural language processing (NLP). We are now experiencing generation three: data driven with deep learning. non-deep learning algorithms. or chatbots. Trischler et al. Zhang et al. and for approaches based on no domain knowledge (end-to-end) vs linguistics knowledge. (2015) Deep learning has been permeating into many subareas in NLP. Nogueira and Cho (2017) • language to executable program. This may have the chance to replace the popular SLAM. A look at the importance of Natural Language Processing”.. Xiong et al.nautil. and generation two: data driven with (shallow) learning.g. 5.1. 2013) and fastText (Joulin et al. e. Socher et al.g.g. (2017). and reinforcement learning usually play an important role. Manning.g. (2011.. (2017) • knowledge graph reasoning. We have seen generation one dialogue systems: symbolic rule/template based. The above is a partial list.g. the former are set up to have short conversa- tions to help complete particular tasks.3.g. As in Deng (2017). Yogatama et al. e..3.3. at http://mitp. e. See con- ferences like ACL... (2016a). dialogue systems in Section 5.. Socher et al. Paulus et al. and we list some in the following.. for deep learning vs.. EMNLP. Shen et al. 2008).g.g. There are many interesting issues in NLP. It appears that NLP is still a field. navigation is a by-product of the goal-directed RL optimization problem. Some deep learning approaches to NLP prob- lems incorporate explicitly or implicitly linguistics knowledge. (2017) • information retrieval (Manning et al. see a recent exam- ple in semantic role labeling (He et al. In this approach. e. (2017) • summarization.g. 2013). Narasimhan et al. human and computer interacts with natu- ral language. See Deng (2017) for more details. (2017a). Chen et al.3. and humans may not act optimally. (2014) used two RNNs to 31 . 2015) utilizes end-to-end deep learning for machine translation. (2016). 2016d) and a neural dialogue system. and the infinite loop of repetitive responses.. (2017a). The authors showed em- pirically that the proposed framework reduced manual data annotations significantly and mitigated noisy user feedback in dialogue policy learning. and becomes dominant. Zhang et al. Lewis et al. Shah et al. (2017b). (2017b) and Wen et al. (2016a). (2016). Kandasamy et al. Williams et al. Mo et al. The authors designed a reward function to reflect the above desirable properties. Some recent papers follow: Asri et al. 5. (2016). a dialogue system accesses real world knowledge from KB by symbolic. In task-oriented systems. SQL-like operations.com/MiuLab/TC-Bot. (2017c). Li et al. to tackle the issue that it is unreliable and costly to use explicit user feedback as the reward signal. The user simulator consists of user agenda modelling and natural language generation. (2016b) proposed an on-line learning framework to train the dialogue policy jointly with the reward model via active learning with a Gaussian process model. (2017). (2017). and then decodes it to a variable-length target sentence. (2017). (2017) proposed KB-InfoBot. Zhou et al.. Cho et al. 2013). including those for accessing external knowledge database (KB). Peng et al. (2017). Lipton et al.and a natural language generation (Young et al. Li et al. Yang et al. (2014): the myopia and misalignment of maximizing the probability of generating a response given the previous dialogue turn. although Su et al. Mesnil et al. Li et al. KB-InfoBot achieved the differentiability by inducing a soft pos- terior distribution over the KB entries to indicate which ones the user is interested in. and ease of answering. Zhang et al. there is usually a knowledge base to query. coherence. to attempt to address the issues in the sequence to sequence models based on Sutskever et al. (2017a). Sutskever et al. See NIPS 2016 Workshop on End-to-end Learning for Speech and Audio Processing. (2017). See Serban et al. and using a rule-based agent to warm-start the system with supervised learning. The source code is available at http://github. The proposed framework includes a user sim- ulator (Li et al. The neural machine translation approach usually first encodes a variable-length source sentence.The authors deployed imitation learning from rule-based belief trackers and policy to warm up the system.2 M ACHINE T RANSLATION Neural machine translation (Kalchbrenner and Blunsom. The neural dialogue system is composed of language understanding and dialogue management (dialogue state tracking and policy learning). (2014) and Sutskever et al. Bahdanau et al. (2017b). Saon et al. (2017). Wen et al. Li et al. It would be interesting to investigate the reward model with the approach in Su et al. (2016). Li et al. (2017). 2015). In previous work. (2016b) or with inverse RL and imitation learning as discussed in Section 3. A deep learning approach. (2016b).. Weiss et al. (2016a).. Wen et al. and NIPS 2015 Workshop on Machine Learning for Spoken Language Understanding and Interactions. (2016b) mentioned that such methods are costly. Fatemi et al. e. The authors deployed RL to train dialogue management end-to-end. Serban et al. attempts to make the learning of the system parameters end-to-end. 2013). representing the dialogue policy as a deep Q-network (Mnih et al. against the traditional statistical machine translation techniques..3. Xiong et al. (2015).g. (2015) for a survey of corpora for building dialogue systems. Li et al. with the tricks of a target network and a customized experience re- play. 2014. Su et al. (2017). Bordes et al. (2017). KB-InfoBot is trained end-to-end using RL from user feedback with differentiable operations. (2016). (2015a). The authors designed a modified version of the episodic REINFORCE algorithm to explore and learn both the policy to select dialogue acts and the posterior over the KB entries for correct retrievals. 2013. Eric and Manning (2017). (2017). (2016). (2017b) presented an end-to-end task-completion neural dialogue system with parameters learned by supervised and reinforcement learning. and deployed policy gradient to optimize the long term reward. Su et al. a goal-oriented dialogue system for multi-turn infor- mation access... Cho et al. (2016c) proposed to use deep RL to generate dialogues to model future reward for better informativity. She and Chai (2017). 2014. Xiong et al. (2016). Zhao and Eskenazi (2016). Dhingra et al. as usual. Williams and Zweig (2016). See a survey paper on applying machine learning to speech recognition (Deng and Li. which is non-differentiable and disables the dialogue system from fully end-to-end trainable. g. speech recognition and text to speech. (2017) for OpenNMT. the dual learning approach performed comparably with previous neural machine translation methods with full bilingual data in English to French tasks. (2016) proposed Mixed Incremental Cross-Entropy Reinforce (MIXER) for sequence prediction. The dual learning mechanism may have extensions to many tasks. and the dual. to train the model. Bahdanau et al. Johnson et al. using the generated words as inputs. and Wu et al.g. Experiments showed that. if the task has a dual form. MIXER is a sequence level training algorithm.. an open source neural machine translation system. to maximize the probability of next word. the models are evaluated on a different metrics like BLEU. abstractive summarization. to overcome the issue that the model is non-differentiable. search and keyword extraction. 32 . Yu et al. See Wu et al. Moreover. etc. to avoid the feedback loop when actor and critic need to be trained based on the output of each other. trained to predict next word given the previous ground truth words as inputs. (2016). (2016a) proposed dual learning mechanism to tackle the data hunger issue in machine translation.. The authors utilized a critic network to predict the value of a token. attempting to further improve Ranzato et al. See Zhang et al.encode a sentence to a fix-length vector and then decode the vector into a target sentence.3 T EXT G ENERATION Text generation is the basis for many NLP problems. See Monroe (2017) for a gentle introduction to translation. Klein et al. ma- chine translation. Some techniques are deployed to improve performance: SARSA rather than Monter-Carlo method to lessen the variance in estimating value functions. with only 10% bilingual data for warm start and monolingual data. 5. these models are trained with word level losses. inspired by the observation that the information feedback between the primal. Li et al. (2016). aligning training and testing objective. inte- grating the adversarial scheme in Goodfellow et al. See Vaswani et al. translation from language A to language B. Gehring et al. in particular. (2017a) proposed to improve sequence generation by considering the knowledge about the future. (2014) introduced the recurrent attention model (RAM) to focus on selected sequence of regions or locations from an image or video for image classification and object detection. e. (2016) for semi-supervised learning for neural machine translation. sequence generative adversarial nets with policy gradient. and experimented on an image classification task and a dynamic visual control problem. with a policy gradient method.3. like conversational response generation. (2015) introduced the soft-attention technique to learn to jointly align and translate. He et al. (2017) for adversarial neural machine translation.4 C OMPUTER V ISION Computer vision is about how computers gain understanding from digital images or videos. with incremental learning and a loss function combining both REINFORCE and cross- entropy. target network for stability. Ranzato et al. Mnih et al. etc. question answering and question generation. The errors will accumulate on the way. 5. We discuss attention in Section 4. translation from B to A. the expected score following the sequence prediction policy. i. sampling prediction from a delayed actor whose weights are updated more slowly than the actor to be trained. (2014). (2016) for Google’s Neural Machine Translation System. rather than predicting the next word as in previous works. can help improve both translation models.e. then in testing. feed-forward neural networks. or recurrent neural networks.1. cross entropy. REINFORCE algorithm. causing the exposure bias issue. Text generation models are usually based on n-gram. Bahdanau et al. such as BLEU. The authors used RL methods. (2017) proposed SeqGAN.. (2017b) for an open source toolkit for neural machine translation. (2017) for a new approach for translation that replaces CNN and RNN with attention and positional encoding. trained by the predicted value of tokens. the trained models are used to generate a sequence word by word. Cheng et al. using the language model likelihood as the reward signal. e. however. defined by an actor network. reward shaping to avoid the issue of sparse training signal. image caption and image generation. (2017) proposed an actor-critic algorithm for sequence prediction. (2017) for convolutional sequence to sequence learning for fast neural machine trans- lation. Strub et al. (2017) proposed end-to-end optimization with deep RL for goal-driven and visually grounded dialogue systems for GuessWhat?! game. to learn an algorithm to select articles sequentially for users based on contextual information of the user and articles. and better results for a language modeling task with Penn Treebank. customer management. (2017a) proposed to train a single model. (2016) introduced Value Iteration Networks. in particular. Zoph and Le (2017) proposed the neural architecture search to generate neural networks architec- tures with an RNN trained by RL. See also Kottur et al. each gradient update to the policy parameters corresponds to training one generated network to convergence. an attention mechanism. 5. The neural architecture search can generate convolutional layers. (2015) integrated attention to image captioning. Xu et al.6 B USINESS M ANAGEMENT Reinforcement learning has many applications in business management. 5. Das et al. (2010) formulated personalized news articles recommendation as a contextual bandit prob- lem. MultiModel. (2017). Neural architecture search provides a promising avenue to explore. (2016b) proposed the predictron to integrate learning and planning into one end-to-end training procedure with raw input in Markov reward process. (2017) proposed to learn cooperative Visual Dialog agents with deep RL. Vaswani et al. and recurrent cell architecture. In the RL formulation. Flickr30k. (2016b) proposed the dueling network architecture to estimate state value function and associated advantage function. nontrivial engineering issue. Zhong et al.. and marketing. (2015) formulated a personalized Ad recommendation systems as a RL problem to maximize life-time value (LTV) with theoretical guarantees. trained by Q-learning. including image classification. which are actions chosen from hyperparameters spaces. (2017) proposed to transfer the architectural building block learned with the neural archi- tecture search (Zoph and Le. such as historical activities of the user and descriptive information and cate- gories of content. image captioning and machine translation. (2017) proposed a new achichitecture for translation that replaces CNN and RNN with attention and positional encoding. the proposed approach achieved competitive results for an image classification task with CIFAR-10 dataset. (2017). This is in contrast to a myopic solution with supervised learning or contextual bandit formulation. recommendation. 33 .e.5 N EURAL A RCHITECTURE D ESIGN Neural networks architecture design is a notorious. See also Liu et al. usually with the performance metric of click through rate (CTR). 2017) on small dataset to large dataset for scalable image recognition. There are recent works exploring new neural architectures. (2017) proposed a meta-learning approach. REINFORCE. to generate CNN architectures automatically for a given learning task. As the models are hard to learn. and showed the effectiveness of attention on Flickr8k. (2017) proposed to construct network blocks to reduce the search space of network design. (2016) and Lu et al. using Q-learning with -greedy exploration and experience replay. trained the hard version attention with REINFORCE. to learn to plan. and sparsely-gated layers. to maximize the expected accuracy of the generated architectures on a validation set. Silver et al. searching from scratch in variable- length architecture space. to combine them to estimate action value function for faster convergence. and MS COCO datasets. like ads. a controller generates hyperparameters as a sequence of to- kens. Li et al. which is composed of convolutional layers. Kaiser et al. Baker et al. how to evaluate a RL policy without deployment. The authors designed a param- eter server approach to speed up training. See also Bello et al.Some are integrating computer vision with natural language processing. the authors deployed a model- free approach to compute a lower-bound on the expected return of a policy to address the off-policy evaluation problem. Wang et al. Comparing with state-of-the-art methods. with skip connections or branching layers. and to take user-click feedback to adapt article selection policy to maximize total user clicks in the long run. to learn multiple tasks from various domains. a fully differentiable CNN planning module to approx- imate the value iteration algorithm. Theocharous et al. (2016) for image captioning. i. See Pasunuru and Bansal (2017) for video captioning. Tamar et al. an accuracy on a valida- tion set is the reward signal. Zoph et al. Personalized medicine is getting popular in healthcare. 2004).8 H EALTHCARE There are many opportunities and challenges in healthcare for machine learning (Miotto et al. real- time diagnostics. Saria. 2017. We may also be aware that financial firms would probably hold state-of-the-art research/application results.2.. 2014. (2016) may be regarded as an exception. Silver et al. Heaton et al. It is interesting to see the applications of deep RL methods in this field. The market efficiency hypothesis is fundamental in finance. Deep (reinforcement) learning would provide better solutions in some issues in risk management (Hull.. (2013) pro- posed concurrent reinforcement learning for the customer interaction problem. 2010).. artificial intelligence in general. 2009).. there is a lecture in AFA 2017 annual meeting: Machine Learning and Prediction in Economics and Finance.0. Surana et al. However.. However. 2005. which may be approached by reinforcement learning. Neuneier. 5. potentially from electronic health/medical record (EHR/EMR). 34 . predictive maintenance. Robots will prevail in Industry 4. Goldberg and Kosorok (2012) pro- posed methods for censored data (patients may drop out during the trial) and flexible number of stages..g. 2016). See Chakraborty and Murphy (2014) for a recent survey. Shortreed et al. (2016) proposed to apply guided policy search (Levine et al. Li et al.Li et al. FinTech has been attracting attention. 5. see O’Donovan et al. consumer credit risk (Khandani et al.0 The ear of Industry 4. where value function based RL methods were used.. Reinforcement learning in particular. e.0 is approaching.google..g. and multi-period portfolio optimization (Brandt et al.. 2009). Yu et al. The authors generated cold spray surface simulation profiles to train the model. A reconciliation is the adaptive markets hypothesis (Lo. 1997). Moody and Saffell (2001) proposed to utilize policy search to learn to trade.2. will be critical enabling techniques for many aspects of Industry 4. etc. and Preuveneers and Ilie-Zudor (2017). Deng et al. (2016) extended it with deep neural networks. 5. 2001. 2016a) as discussed in Section 5.7 F INANCE RL is a natural solution to some finance and economics problems (Hull. for chronic conditions and cancers using individual patient infor- mation. in particular.com/site/nipsmlhc15/). It is nontrivial for finance and economics academia to accept blackbox methods like neural networks.google. and designed methods to quantify the evidence of the learned optimal policy.9 I NDUSTRY 4. like option pricing (Longstaff and Schwartz. in particular. 2001. 2014). Luenberger. and we discuss robotics in Section 5. (2015) also attempted to maximize lifetime value of customers. to handle complex trajec- tories traversing by robotic agents. (2015).1 to optimize trajectory policy of cold spray nozzle dynamics. (2011) tackled the missing data problem. Some recent workshops at the intersection of machine learning and healthcare are: NIPS 2016 Workshop on Machine Learning for Health (http://www. Dynamic treatment regimes (DTRs) or adaptive treatment strategies are sequential decision making problems. e. Currently Q-learning is the RL method in DTRs. It systematically optimizes the patient’s health care. FinTech employs machine learning techniques to deal with issues like fraud detection (Phua et al.com/view/icml2017-deep-health-tutorial/home). 2010). 1997). especially after the notion of big data. 2014.nipsml4hc.0. and Kosorok and Moodie (2015) for an edited book about recent progress in DTRs. See Sutton and Barto (2017) for a detailed and intuitive description of some topics discussed here under the section title of personalized web services. Some issues in DTRs are not in standard RL. See ICML 2017 Tutorial on Deep Learning for Health Care Applications: Challenges and Solutions (https://sites.. and management of manufacturing activities and processes. there are well-known behavioral biases in human decision-making under uncertainty. Tsitsiklis and Van Roy.ws) and NIPS 2015 Workshop on Machine Learning in Healthcare (https://sites. prospect theory (Prashanth et al. . facing challenges of stability. 2016). (2013) reviewed learning and reasoning techniques in cognitive radio networks. like congestion. The au- thors proposed to tackle multi-resource cluster scheduling with policy gradient. e. and con- trol (Fang et al. Self-driving vehicle is also a topic of intelligent transportation systems. and weather. (2017). 2014) and wireless sensor networks (Alsheikh et al. 5. (2017) proposed to optimize device placement for Tensorflow computational graphs with RL. etc. (2015b) took the exogenous prices as states.. safety. Wen et al. 2008). El-Tantawy et al. El-Tantawy et al. Ruelens et al. 2012).10 S MART G RID A smart grid is a power grid utilizing modern information technologies to create an intelligent elec- tricity delivery network for electricity generation. 2011). See also van der Pol and Oliehoek (2017) for a recent work. distribution. so that the computational complexity grows linearly with the number of devices. Wen et al. (2013) didn’t explore function approximation. The authors decomposed the RL formulation over devices. With suitable electricity prices. 2007. El-Tantawy et al. in an online manner with dynamic job arrivals.9. and reduce risks. and Haykin (2005) discussed issues in cognitive radio. e.. vehicles and users smart. and Mannion et al. like channel state prediction and resource allocation. mobile phones. computers. (2013) approached the issue of coordination by considering agents at neighbouring intersections. and real traffic data from the City of Toronto.. and conducted simulations using Q-learning.. See Bojarski et al. (2013) proposed to model the adaptive traffic signal control problem as a multiple player stochastic game. like electricity price. The authors deployed a seuqence-to-sequence model to predict how to place subsets of 35 . in Smart Grid as discussed in Section 5.11 I NTELLIGENT T RANSPORTATION S YSTEMS Intelligent transportation systems (Bazzan and Klügl. The authors validated their proposed approach with simulations. (2016) for an experimental review.0 as discussed in Section 5. temperature. The authors validated their proposed approach with simulation.12 C OMPUTER S YSTEMS Computer systems are indispensable in our daily life and work. reduce costs.. etc. Ruelens et al. (2016) proposed knowledge-defined networks.. See NIPS 2016 Workshop on Machine Learning for Intelligent Transportation Systems. Multi-agent RL integrates single agent RL with game theory. (2016) studied resource management in systems and networking with deep RL.. An important aspect is adaptive control (Anderson et al.11.10.g. Glavic et al. and Ruelens et al. Mestres et al. 2015b. consumption. to make trans- port networks. load of peak consumption may be rescheduled/lessened.5. Control and optimization problems abound in computer systems. Busoniu et al. An important issue in intelligent transportation systems is adaptive traffic signal control. and cloud computing.. 2014) apply advanced information technolo- gies for tackling issues in transport networks. (2016) utilized the average as feature extractor to construct states. tentative publication date December 2017. (2015b) proposed to design a fully automated energy management system with model-free reinforcement learning. optimizing various objectives like average job slowdown or completion time. Demand response systems motivate users to dynamically adapt electrical demands in response to changes in grid signals. transmission. and solve it with the approach of multi-agent RL (Shoham et al. Here we briefly discuss demand response (Wen et al. and curse of dimensionality. Gavrilovska et al. efficiency. so that it doesn’t need to specify a disutility function to model users’ dissatisfaction with job rescheduling. (2016) tackled the demand response problem with batch RL. and Bojarski et al. Mirhoseini et al. 5.. 2014) play an important role in Industry 4. about applying RL to adaptive traffic signal control. and in Intelligent Transportation Systems as discussed in Section 5. to improve efficiency.. We also note that Internet of Things (IoT)(Xu et al. (2017) reviewed application of RL for electric power system decision and control. nonstationarity. Check for a special issue of IEEE Transactions on Neural Networks and Learning Systems on Deep Reinforce- ment Learning and Adaptive Dynamic Programming.g. Mao et al. (2016). e. Han et al. Graves et al.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/. Such hierarchical frame- work/decomposition approach was to reduce state/action space. Optimizing memory control is discussed in Sutton and Barto (2017). e. (2016b)... (2016) also discussed combinatorial optimization problems.g. Klambauer et al. hoping it would provide pointers for those who may be interested in studying them further. Miller (2017). (2017). Some topics/papers may not contain RL yet. Pinto et al.. (2017) • large action space. (2017) • optimization. we believe these are interesting and/or important directions for RL in the sense of either theory or application. Wilson et al. (2016). Lei et al. Li et al.operations in a Tensorflow graph on available devices. • understanding deep learning. Vinyals et al. e. Doshi-Velez and Kim (2017).. (2016c) • robust RL. reducing energy consump- tion by 40%. Bellemare et al. (2017). Liu et al. Czarnecki et al..g. (2016) • new Q-value operators.g. (2017). recurrent neural language model and neural ma- chine translation. (2016). Bailis et al. (2016). He et al. (2017) • value distribution. (2016) • testing. (2016). Kavosh and Littman (2017). (2017). Andrychowicz et al. Google deployed machine learning for data centre power management. (2017) • interpretability. using the execution time of the predicted placement as reward signal for REINFORCE algorithm. (2017) • deep learning efficiency. Al-Shedivat et al. and to enable distributed operation of power management. Shalev-Shwartz et al. The authors decomposed the problem as a global tier for virtual machines resource allocation and a local tier for servers power management. Computation burden is one concern for a RL approach to search directly in the solution space of a combinatorial problem. 6 M ORE T OPICS We list more interesting and/or important topics we have not discussed in this overview as below. Held et al. (2017) proposed a hierarchical framework to tackle resource allocation and power man- agement in cloud computing with deep RL. (2017). e. (2017).g. Shwartz-Ziv and Tishby (2017). Matiisen et al. (2017) • normalization. The proposed method found placements of Tensorflow operations on devices for Inception-V3.. https://deepmind.g. 2016 • usable machine learning. Zhang et al. Dulac-Arnold et al. Harrison et al. van Hasselt et al. (2017) • professor forcing. (2017). we leave it as future work. (2017) • expressivity. Pei et al.. Sze et al. (2016b) • curriculum learning. (2017) • Bayesian RL (Ghavamzadeh et al. (2017). Lamb et al. Lipton (2016). Neyshabur et al. Ribeiro et al. (2017). (2016). Raghu et al. (2016). The au- thors validated their proposed approach with actual Google cluster traces. Spring and Shrivastava (2017). e. It would be definitely more desirable if we could finish reviewing these. Daniely et al.g. (2015) and Bello et al. however. Haarnoja et al. Huk Park et al. yielding shorter execution time than those placements designed by human experts. Hausknecht and Stone (2015) 36 . Karpathy et al. Koh and Liang (2017). However. e. 2015) • POMDP. (2016) – ICML 2017 Tutorial on Interpretable Machine Learning – NIPS 2016 Workshop on Interpretable ML for Complex Systems – ICML Workshop on Human Interpretability in Machine Learning 2017. (2017). online courses. (2017). Balog et al. the RL courses by Rich Sutton and by David Silver as the first two items in the Courses subsection below. Madry et al. (2015).g. (2017). conferences. Chen and Liu (2016).. Rocktäschel and Riedel (2017) • music generation. Jaques et al. (2017).. Barto (Sutton and Barto. Papernot et al.. e. (2016) – NIPS 2015 Workshop on Quantum Machine Learning 7 R ESOURCES We list a collection of deep RL resources including books. Cai et al.. Finn et al. Segler et al. Fernando et al. (2017) • retrosynthesis. tutorial at Deep Learning and Rein- forcement Learning Summer School 2017 • adversarial attacks. (2017). (2017) • deep learning games. blogs.g. This by no means is complete. (2017). (2017) • pathNet. Denil et al. Steps towards continual learning. e. • semi-supervised learning.g.. Santoro et al. (2017) • evolution strategies. testbed. e. (2017). (2016) • symbolic learning. (2017). (2017). Sutton and Andrew G. Goodfellow et al.. Kingma et al. 2017) • deep learning books (Deng and Dong. e. (2017) • proving.. (2017) • physics experiments. Crawford et al. 2014. e. etc. (2017). Zhu and Goldberg (2009) • neural episodic control.. (2017). Parisotto et al. Loos et al. Dai et al. Parisotto et al. tutorials. Yang et al. e. e. We recommend to start with the textbook by Sutton and Barto (Sutton and Barto. Salimans et al. Schuurmans and Zinkevich (2016) • combinatorial optimization.g.. Papernot et al. Huang et al. e. (2015). (2016).g.g. before having a good under- standing of deep reinforcement learning. (2014). Garnelo et al. and open source algorithm implementations. (2017) • DeepForest. e. It is essential to have a good understanding of reinforcement learning. (2017a). Tran et al. Reed and de Freitas (2016) • relational reasoning.. Vinyals et al.g. journals and workshops. (2017) • quantum RL. surveys. The skill to efficiently select the best information becomes essential. (2017). Audiffren et al. Watters et al.g. Kirkpatrick et al.g. (2016). Zhou and Feng (2017) • deep probabilistic programming. arXiv.1 B OOKS • the definitive and intuitive reinforcement learning book by Richard S. Bello et al. In the current information/social media age. reports. from Twitter. Denil et al. (2016) • program learning. e. Lopez-Paz and Ranzato (2017) – Satinder Singh. research papers. Google+. Liang et al.g. 2016) 37 . e. (2017) • continual learning. blogs..g. Pritzel et al. Cheng et al. (2017). we expect to see an AI agent to do such tasks like intelligently searching and summarizing relevant news. 2017). etc.. 7. we are overwhelmed by information. In an ear of AI. Goldberg.. 2015) • AI safety (Amodei et al. 2017) • natural language processing (NLP) (Hirschberg and Manning. 2017. etc.. Weiss et al. 2012. CS 294: Deep Reinforcement Learning.edu/class/cs234/ • Charles Isbell. Bertsekas and Tsitsiklis. 2016) • Bayesian RL (Ghavamzadeh et al..stanford. Michael Littman and Pushkar Kolhe. 2015) • Monte Carlo tree search (MCTS) (Browne et al. 2012. Jurafsky and Martin. Spring 2017. 2015) • spoken dialogue systems (Deng and Li.. slides (goo. 2016) • artificial intelligence (Russell and Norvig.. reading materials.stanford. 2017. Provost and Fawcett. Simeone.io • Emma Brunskill. 2017) 3 • deep learning (LeCun et al. and was online much earlier.youtube. 2010. 2015) • practical machine learning advices (Domingos... http://cs231n. Udacity: Machine Learning: Rein- forcement Learning. 2017) • semi-supervised learning (Zhu and Goldberg. James et al. CS231n: Convolutional Neural Networks for Visual Recognition.com/user/ProfNandoDF • Fei-Fei Li. Andrej Karpathy and Justin Johnson. 2009) • natural language processing (NLP) (Deng and Liu. 2010) • an operations research oriented RL book (Powell. assignments.github.. http://web. http://www. 2017. 2013) • transfer learning (Taylor and Stone. Grondman et al. 2013. 2015. Bengio. Reinforcement Learning. He and Deng. 2015. 2013) • robotics (Kober et al.. 2012). slides. https://katefvision. 2009. 2016. Garcı̀a and Fernàndez. Smith. Pan and Yang. Kaelbling et al. Reinforcement Learning.4 C OURSES • Richard Sutton. 2013. 2016. Ruslan Satakhutdinov.. 2013. 2008) 7. 38 .. than this brief survey. 2011. 2012) • Markov decision processes (Puterman. 2012. Deep Learning Lectures. 2013. Schmidhuber. 2015. http://rll... Young et al.edu 3 Our overview is much more comprehensive. 1996. Szepesvári. 2015.gl/7BVRkT) • Sergey Levine..incompleteideas.edu/deeprlcourse/ • Katerina Fragkiadaki. Deep Reinforcement Learning and Control. 2013. John Schulman and Chelsea Finn.berkeley. 2009.2 M ORE B OOKS • theoretical RL books (Bertsekas. 2008. Cho. 2009. Geramifard et al. 2005) • machine learning (Bishop. Hastie et al.3 S URVEYS AND R EPORTS • reinforcement learning (Littman. Murphy. video-lectures (goo. Zhou. Haykin. 2012. Kuhn and Johnson.gl/UqaxlO).. https://www. 2009) • game theory (Leyton-Brown and Shoham. 2012) 7. Zinkevich. Wang and Raj. 2015. deep reinforcement learning (Arulkumaran et al. 1996.net/sutton/609%20dropbox/ • David Silver. Spring 2017. 2012.7. 2011) • an edited RL book (Wiering and van Otterlo.gl/eyvLfg • Nando de Freitas. Gelly et al. 2013. 2017. 2017) • efficient processing of deep neural networks (Sze et al.. Hinton et al. CS234: Reinforcement Learning. 2017) • machine learning (Jordan and Mitchell. goo. ICML 2016 – David Silver. The Nuts and Bolts of Deep Reinforcement Learning Research. http://homes.edu/ pabbeel/cs287- fa15/ • Emo Todorov. IJCAI. 2017 • Simons Institute Representation Learning Workshop. SIGIR.com/en-us/research/video/tutorial-introduction-to-reinforcement- learning-with-function-approximation/ • Deep Reinforcement Learning – David Silver.cs.6 C ONFERENCES . KDD. Oxford Deep NLP 2017 course. Nuts and Bolts of Building Applications using Deep Learning. SIGDIAL.stanford. etc. 2nd Multidisciplinary Conference on Reinforcement Learning and De- cision Making (RLDM). 2017 • Simons Institute Interactive Learning Workshop. J OURNALS AND W ORKSHOPS • NIPS: Neural Information Processing Systems • ICML: International Conference on Machine Learning • ICLR: International Conference on Learning Representation • RLDM: Multidisciplinary Conference on Reinforcement Learning and Decision Making • EWRL: European Workshop on Reinforcement Learning • AAAI. 39 . Intelligent control through learning and optimization. http://www. Introduction to Reinforcement Learning with Function Approximation. ACL. http://selfdrivingcars. Edmonton. Deep Reinforcement Learning. 2016. 2017 7. 2016. Deep Learning Summer School. Fall 2015. and Control.5 T UTORIALS • Rich Sutton. Practical Deep Learning For Coders.com/oxford-cs-deepnlp-2017 • Pieter Abbeel. Yannis Assael.edu/∼todorov/courses/amath579/index. NIPS 2016 – Sergey Levine and Chelsea Finn. 2015 • Deep Learning and Reinforcement Learning Summer Schools.fast.edu • Jeremy Howard.net/deeplearning2016 pineau reinforcement learning/ • Andrew Ng. ICML 2017 • John Schulman.microsoft. 2015. EMNLP.edu • Brendan Shillingford.washington. Introduction to Reinforcement Learning. NIPS 2016 • Joelle Pineau. Deep Reinforcement Learning Through Policy Op- timization. Canada. http://course. http://videolectures. Advanced Robotics. http://videolectures. • Richard Socher.net/deeplearning2016 abbeel deep reinforcement/ – Pieter Abbeel and John Schulman. ICRA. https://github. Deep Learning School.berkeley.abdeslam. Robot Learning Seminar.ai • Andrew Ng.mit.net/rldm2015 silver reinforcement learning/ – John Schulman.net/robotlearningseminar • MIT 6.html • Abdeslam Boularias. Decision Making. https://www. 2016. http://videolectures. Deep Learning Specialization https://www. https://people. CVPR.eecs. Deep Re- inforcement Learning Workshop.org/specializations/deep-learning 7. 2017 • Simons Institute Computational Challenges in Machine Learning Workshop. http://cs224d. Deep Learning Summer School. CS224d: Deep Learning for Natural Language Processing.S094: Deep Learning for Self-Driving Cars. NIPS 2016 • Deep Learning Summer School. 2016 – Pieter Abbeel. IROS. Alberta. Chris Dyer.coursera. lightweight and flexible platform for RL research (Tian et al.com/2016/08/01/how-deep-reinforcement-learning-can-help-chatbots/ • Christopher Olah. • ViZDoom is a Doom-based AI research platform for visual RL (Kempka et al. 2015. • OpenAI Universe (https://universe. 40 . • DeepMind released a first-person 3D game platform DeepMind Lab (Beattie et al. a special issue about AI • Deep Reinforcement Learning Workshop.learning to act based on long-term payoffs https://www. e.com/Microsoft/malmo). • OpenAI Gym (https://gym. from Microsoft. Atari games and simulated robots. PAMI. Recently.com/s/513696/deep-learning/ • Berkeley AI Research Blog.8 T ESTBEDS • The Arcade Learning Environment (ALE) (Bellemare et al. MLJ..StarCraft II Learning Environment. and a site for the com- parison and reproduction of results. JAIR.com/davechurchill/commandcenter • ParlAI is a framework for dialogue research. including Atari games..io. http://www. • FAIR TorchCraft (Synnaeve et al. Reinforcement learning explained . a framework for RL development.github. is an AI research and experimentation platform built on top of Minecraft. karpathy.com) is used to turn any program into a Gym environment. JMLR. https://github. Multi-Joint dynamics with Contact.com/deepmind/pysc2 • David Churchill..com/ideas/reinforcement-learning-explained • Li Deng. How deep reinforcement learning can help chatbots https://venturebeat.pub 7. AIJ. esp.io • Reinforcement Learning.. • Twitter open-sourced torch-twrl. https://github. 2016) is a library for Real-Time Strategy (RTS) games such as StarCraft: Brood War. • Baidu Apollo Project. 2014).oreilly. • MuJoCo.. flash games. 2017 issue.com/facebookresearch/ParlAI • ELF. http://bair.com) is a toolkit for the development of RL algorithms. colah.edu/blog/ 7.7 B LOGS • Andrej Karpathy..auto • TORCS is a car racing simulator (Bernhard Wymann et al. CommandCenter: StarCraft 2 AI Bot. survey papers on machine learning/AI • Science.gl/1hkKrb • Junling Hu. consisting of environments. implemented in Python. Deepmind and Blizzard will collaborate to release the Starcraft II AI research environment (goo. • Deepmind PySC2 . The Cyberscientist. • Science Robotics. an extensive. http://apollo.openai.g. https://www.org.technologyreview. open-sourced by Facebook. is a physics engine. IJCAI 2016 • Deep Learning Workshop. NIPS 2016. 2016)..com/s/603501/10-breakthrough- technologies-2017-reinforcement-learning/ • Deep Learning. July 7.mujoco. goo. https://github. ICML 2016 • http://distill.technologyreview. https://www. • Nogueira and Cho (2016) presented WebNav Challenge for Wikipedia links navigation.github. 2017) • Project Malmo (https://github. Universe has already integrated many environments. 2013) is a framework com- posed of Atari 2600 games to develop and evaluate AI agents. self-driving open-source. 2016). Science July 2015.openai. GTA V was added to Universe for self-driving vehicle simulation. browser tasks like Mini World of Bits and real-world browser tasks. etc • Nature May 2015.gl/Ptiwfg).berkeley. https://github. 2016) in TensorFlow.9 A LGORITHM I MPLEMENTATIONS We collect implementations of algorithms. http://bit. https://github.. https://github.com/deepmind/learning-to-learn • Value Iteration Networks (Tamar et al.. http://pemami4911. It may need extensions to accommodate progress in deep learning. http://www. 2015). https://github.com/facebookresearch/darkforestGo • Using Keras and Deep Q-Network to Play FlappyBird. https://github. 2016b).com/2016/10/learning-reinforcement-learning/ • OpenAI Baselines: high-quality implementations of reinforcement learning algorithms. https://yanpanlau.Byte Tank.com/a/deepmind.com/openai/baselines • TensorFlow implementation of Deep Reinforcement Learning papers. https://github. a Torch implementation. either classical ones as in a textbook like Sutton and Barto (2017) or in recent papers. Python code to accompany Sutton & Barto’s RL book and David Silver’s RL course. 2016) in TensorFlow.. https://github... 2017). https://github. 2015). 2017.github.com/ShangtongZhang/reinforcement-learning-an-introduction • Learning Reinforcement Learning (with Code. 2015) is a value-function-based reinforcement learning frame- work for education and research.html • Reinforcement learning with unsupervised auxiliary tasks (Jaderberg et al. https://yanpanlau. https://github. 2017a.io/2016/10/11/Torcs-Keras. https://github.github. https://github.. 2016). http://bit..com/miyosuda/unreal • Learning to communicate with deep multi-agent reinforcement learning.com/matthiasplappert/keras-rl • Code Implementations for NIPS 2016 papers.io/blog/2016/08/21/ddpg-rl.. https://github. the Facebook Go engine (Github). 7. https://github. Exercises and Solutions).. 2016) to play TORCS.wildml.io/2016/07/10/FlappyBird-Keras.. Kaiser et al. https://keon.html • Deep Deterministic Policy Gradient (Lillicrap et al. 2009) is a language-independent software for RL experiments. 2016).com/iassael/torch-bootstrapped-dqn • DarkForest.com/devsisters/DQN-tensorflow • Deep Q Learning with Keras and Gym. https://github.com/carpedm20/deep-rl-tensorflow • Deep reinforcement learning for Keras.ly/2hSaOyx • Benchmark results of various policy optimization algorithms (Duan et al.github...com/rllab/rllab • Tensor2Tensor (T2T) (Vaswani et al. • RLGlue (Tanner and White. 2016) in Tensorflow. 2016) in Tensorflow.ly/2pVIP4i • TensorFlow implementation of DeepMind’s Differential Neural Computers (Graves et al.com/awjuliani/Meta-RL 41 ..com/Mostafa-Samir/DNC-tensorflow • Learning to Learn (Reed and de Freitas. https://github.html • Deep Deterministic Policy Gradients (Lillicrap et al.com/zhongwen/predictron • Meta Reinforcement Learning (Wang et al. 2016). • Shangtong Zhang. • RLPy (Geramifard et al.com/TheAbhiKumar/tensorflow-value-iteration-networks • Tensorflow implementation of the Predictron (Silver et al. https://sites.io/deep-q-learning/ • Deep Exploration via Bootstrapped DQN (Osband et al.google.com/dqn/ • Tensorflow implementation of DQN (Mnih et al..b) • DQN (Mnih et al.com/iassael/learning-to-communicate • Deep Reinforcement Learning: Playing a Racing Game . One direction of future work is to further refine this section. 2017a)  reduce variability and instability with averaged-DQN (Anschel et al. 1992) and target network (Mnih et al. finding optimal policy (classical work) proposed approaches: – Q-learning (Watkins and Dayan.. Nachum et al.edu. 2015) – distributed proximal policy optimization (Heess et al. 2017) • issue: the deadly triad: instability and divergence when combining off-policy.. 1992)  reduce variance of gradient estimate: baseline..cn/files/gcforest. 2017b) 42 . 2016). 2016)  better exploration strategy (Osband et al.b. Mahmood et al.... 2017) – GTD (Sutton et al. • Generative adversarial imitation learning (Ho and Ermon..zip 8 B RIEF S UMMARY We list some RL issues and corresponding proposed approaches covered in this overview. https://github. 2016b) – asynchronous methods (Mnih et al. 2016) – trust region policy optimization (Schulman et al. 2017. 2017). containing an im- plementation of Trust Region Policy Optimization (TRPO) (Schulman et al.. 2000) – actor-critic (Barto et al. 2016)  optimality tightening to accelerate DQN (He et al.. function approximation. 1992.. actor-critic – actor-critic with experience replay (Wang et al. 2015. 2016a)  prioritized experience replay (Schaul et al..com/thuml/transfer-caffe • DeepForest (Zhou and Feng.nju. 2016) • issue: train perception and control jointly end-to-end proposed approaches: – guided policy search (Levine et al. 2016). Schulman et al.. and bootstrapping proposed approaches: – DQN with experience replay (Lin. 1988) • issue: control.. 1983) – SARSA (Sutton and Barto. 2017... as well as some classical work. Gu et al.. 2009a.... Sutton et al. 2017) – dueling architecture (Wang et al.. https://github. advantage function (Williams. 2014) – Emphatic-TD (Sutton et al.com/openai/imitation • Starter code for evolution strategies (Salimans et al. 2017). 2015)  overestimate problem in Q-learning: double DQN (van Hasselt et al. especially for issues and solutions in applications. • issue: prediction. 1992) – policy gradient (Williams. policy evaluation proposed approaches: – temporal difference (TD) learning (Sutton. 2015)...com/openai/evolution-strategies-starter • Transfer learning (Long et al. https://github. http://lamda.. 2016a) • issue: data/sample efficiency proposed approaches: – Q-learning.. 2017. 2017) – combine policy gradient and Q-learning (O’Donoghue et al. Weiss et al. separating from computation proposed approaches: memory – differentiable neural computer (DNC) with external memory (Graves et al. 2016) – learn invariant features to transfer skills (Gupta et al. 2015) – image captioning (Xu et al. 2009.. 2017) • issue: benefit from both labelled and unlabelled data proposed approaches: semi-supervised learning (Zhu and Goldberg.. 2017) • issue: model-free planning proposed approaches: – value iteration networks (Tamar et al.. policy gradient and Q-learning (O’Donoghue et al. 2014) – neural machine translation (Bahdanau et al.. 2016) – predictron (Silver et al.. 2016) • issue: model-based learning proposed approaches: – Dyna-Q (Sutton. 2016) • issue: benefit from non-reward training signals in environments proposed approaches: unsupervised Learning – Horde (Sutton et al. 2009) 43 ... 2017) – generative adversarial networks (GANs) (Goodfellow et al. 1990) – combine model-free and model-based RL (Chebotar et al. 2017) – imitation learning with GANs (Ho and Ermon.. policy gradient with off-policy critic (Gu et al. – PGQ... 2017) • issue: data storage over long time. 2016b) • issue: focus on salient parts proposed approaches: attention – object detection (Mnih et al.. (2016). Lake et al.g. Retrace (Munos et al. 2016) – under-appreciated reward exploration (Nachum et al.. 2016b) • issue: exploration-exploitation tradeoff proposed approaches: – unify count-based exploration and intrinsic motivation (Bellemare et al.... e.... 2014) • issue: learn knowledge from different domains proposed approaches: transfer Learning (Taylor and Stone... 2016).. 2016. Duan et al.. Stadie et al. Pan and Yang. 2017) – learning to learn. Wang et al. Reactor (Gruslys et al. 2017) – Q-Prop. 2017) – return-based off-policy control. 2015) – replace CNN and RNN with attention in sequence modelling (Vaswani et al. 2017) – train dialogue policy jointly with reward model (Su et al.. 2017) – learn to navigate with unsupervised auxiliary learning (Mirowski et al. 2016) – variational information maximizing exploration (Houthooft et al... 2011) – unsupervised reinforcement and auxiliary learning (Jaderberg et al. 2000) – learn from demonstration (Hester et al. 2017) – deep exploration via bootstrapped DQN (Osband et al. (2015) • issue: reward function not available proposed approaches: – imitation learning – inverse RL (Ng and Russell.. 2010. (2016). 2017. and. Koch et al. six important mechanisms – attention and memory. multi-agent RL.. 2015). 2016) – integrate temporal abstraction with intrinsic motivation (Kulkarni et al. Tamar et al. and exploration. 44 . (2017a). 2015) • issue: learn. this overview is incomplete. computer vision. Kottur et al. 2016a) • issue: neural networks architecture design proposed approaches: – neural architecture search (Bello et al. reinforcement learning. Ravi and Larochelle. (2016b) 9 D ISCUSSIONS It is both the best and the worst of times for the field of deep RL. 2017b. (2016). exciting new methods and applications... (2016b).. 2016a). and. 2017) – learn with expert’s trajectories and those may not from experts (Audiffren et al.. 2015. 2016) – stochastic neural networks for hierarchical RL (Florensa et al. reward.. Vaswani et al. spoken dialogue systems (Su et al. We also discuss background of machine learning. 2017) – learn a flexible RNN model to handle a family of RL tasks (Duan et al. 2017. e. model. Zoph and Le. 1999). 2017. 2017) – lifelong learning with hierarchical RL (Tessler et al.. plan- ning. 2016) • issue: gigantic search space proposed approaches: – integrate supervised learning. 2016. improvements for and applica- tions of deep Q-network (Mnih et al... 2017... As a consequence. and represent knowledge with spatio-temporal abstraction at multiple levels proposed approaches: hierarchical RL (Barto and Mahadevan. Novel architectures and applications using deep RL were recognized in top tier conferences as best papers in 2016: dueling network architectures (Wang et al... Vinyals et al. neural architecture design.. (2017) at EMNLP (short). 2000) – strategic attentive writer to learn macro-actions (Vezhnevets et al. natural language processing.. 2016b) at ICML..g. 2015. policy. Kaiser et al. we attempt to summarize important achievements and discuss potential directions and applications in this amazing field. Silver et al. In this overview. Kaiser et al. MAXQ (Dietterich. finance... healthcare. Lake et al.. plan. – learn with MDPs both with and without reward functions (Finn et al. Gelly and Silver (2007) was the recipient of Test of Time Award at ICML 2017.. 2016. and we expect to see much more and much faster. 2015) and AlphaGo (Silver et al. Baker et al. and Monte-Carlo tree search as in AlphaGo (Silver et al. We have been witnessing breakthroughs.. for the same reason: it has been growing so fast and so enormously. Wang et al. intelligent transportation systems. information extraction (Narasimhan et al. including deep Q-network (Mnih et al.. In 2017.. in the sense of both depth and width. 2016) at EMNLP. 2016) – one/few/zero-shot learning (Duan et al. Industry 4. However.. and reinforce- ment learning. value iteration networks (Tamar et al. transfer learning. deep learning. and learning to learn. robotics. smart grid. 2016b) at ACL (student). and list a collection of RL resources. and twelve applications – games. 2017) – new architectures. business management.0. we summarize six core elements – value function. 2016) at NIPS. Johnson et al. 2003) – options (Sutton et al. (2017).. We have seen breakthroughs about deep RL. There have been many extensions to. 2017) • issue: adapt rapidly to new tasks proposed approaches: learning to learn – learn to optimize (Li and Malik. Bacon et al. unsupervised learning. Wang et al.. (2017) at AAAI (student). the following were recognized as best papers. hierarchical RL. and computer systems. is more mature and well-accepted. intuitive psy- chology.One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel. dual learning for machine translation (He et al. 2017). Automation. https://www. and has been validated by products and market.. In contrast.Exciting achievements abound: differentiable neural computer (Graves et al.g.g.ly/2qpehcd. reinforcement learning and deep learning have been making steady progress even during the last AI winter. e. The coverage of AI by premier journals like Nature and Science and the launch of Science Robotics illustrate the apparent importance of AI. USA. See also the recent AI Frontiers Conference. 2016). robotics. conducting leading research in deep reinforcement learning. Stanford University.... Sci- ence Robotics launched in 2016. In fact.com. e.. Nature in May 2015 and Science in July 2015 featured survey papers on machine learning/AI. Policy optimization approaches have been gaining traction. having a better understanding of how deep learning works is helpful for deep learning and general machine learning community. in this third wave of AI. generative adversarial imitation learning (Ho and Ermon..g. There are works in this direction as well as for interpretability of deep learning as we list in Section 6. as we have already seen from its many achievements. in many. However. learning to learn. 2016a). Executive Office of the President. for stronger AI. Science has a special issue on July 7. See also Peter Norvig’s perspective at http://bit.. attention. will deeply influence deep learning. This is the renaissance of reinforcement learn- ing (Krakovsky. 2017). spoken dialogue systems. scalability. and neural architecture design (Zoph and Le. recently opened its first ever international AI research office in Alberta. causal model. and learning to learn. asynchronous methods (Mnih et al. data effi- ciency. guided pol- icy search (Levine et al. speed. RL has lots of (potential. e. We will see both deep learning and reinforcement learning prospering in the coming years and beyond. like intuitive physics. so it is not clear how it works. Deepmind. New learning mechanisms have emerged. It is interesting to mention that when Pro- fessor Rich Sutton started working in the University of Alberta in 2003. A popular criticism about deep learning is that it is a blackbox. e. 2016). It is worthwhile to envision deep RL considering perspectives of government. Deep learning. academia and industry on AI. It is important to investigate comments/criticisms. yet few products so far. neural architecture design. like stability. etc. It is interesting to mention that NIPS 2017 main conference was sold out after opening for registration for two weeks.aifrontiers. Canada. e. machine translation. and safety. diverse applications. and more new mechanisms will be emerging.g. using transfer/unsupervised/semi-supervised learning to improve the quality and speed of learn- ing. simplicity. may still need better algorithms. 2017 about AI on The Cyberscientist. Machine Learning and Data Fuel the Future of Productivity by The Goldman Sachs Group. Value function is central to reinforcement learning. computer vision. Reinforcement learning. Reinforcement learning was among MIT Technology Review 10 Breakthrough Technologies in 2017. co-locating with the major research center for reinforcement learning led by Rich Sutton. Deep learning was among MIT Technology Review 10 Breakthrough Technologies in 2013. promising) applications. and act in real time (Lake et al. We have been witnessing the dramatic development of deep learning in both academia and industry in the last few years. and AI. convergence. etc..g. It is essential to consider issues of learning models.. and the economy. he named his lab RLAI: Reinforcement Learning and Artificial Intelligence. robustness. may still need products and market validation. 2016a).. will have deeper influences. and this list is boundless. as a more general learning and decision making paradigm. etc. rather. 2016). it is probably the right time to nurture. Artificial Intelligence. unsupervised reinforcement and auxiliary learning (Jaderberg et al. accuracy. from conginitive science. 2016). Inc. machine learning. in deep Q-network and its many exten- tions. Artificial Intelligence and Life in 2030 . and artificial intelligence in gen- eral. educate and lead the market. has ”conquered” speech recog- nition. interpretability. Deep learning has made many achievements.. and now NLP. This should not be the reason not to accept deep learning. 45 . 2016).. compositionality. Fong. A. Hai Fang. A. Jinke Li. and Bottou. T. E. and Suleman. Csaba Szepesvári. Leibo. R EFERENCES Abbeel. Yao- liang Yu.. P. Rich Sutton. Lazaric. Aravind Lakshminarayanan. In the International Joint Conference on Artificial Intelligence (IJCAI). Apprenticeship learning via inverse reinforcement learning. David Silver.. M. M. Hua He... In the International Conference on Machine Learning (ICML). K. and de Freitas.. and Tan... Anschel. Osband. C. and Shimkin. Qing Yu. ArXiv e-prints. ArXiv e-prints. Wolski. Hoffman. B. Maximum entropy semi- supervised inverse reinforcement learning.. G. S. A survey of robot learning from demonstration. Modular multitask reinforcement learning with policy sketches. J. Bhairav Mehta. N. Welinder. Ba. V. D. Klein.. J. Azar. A. He. (2017). M. Junling Hu. Steinhardt. and Ng.. D. J. Deisenroth. Andreas. Andrychowicz... Contextual Explanation Networks.. Denil.. Audiffren. P. Asri. Al-Shedivat. C. Shilling- ford. R. strategies. Olah.. M. N. M. and Browning. (2017). M. attendants of various seminars and webinars. M.. Dale Schuurmans.... Lin. Arulkumaran. (2016). L.. (2016). and Levine. Powell. P. J. and Ghavamzadeh. A. P.. Anderson. Hinton. W. Veloso. (2017). P.. Using fast weights to attend to the recent past.. W. Amodei. (2016). R.... Proceedings of the IEEE. and Munos.. (2011). A. N. J. a seminar at MIT on AlphaGo: Key Techniques and Applications. (2014).ACKOWLEDGEMENT We appreciate comments from Baochun Bai.. O. 99(6):1098–1115. In the International Conference on Machine Learning (ICML). and Bharath.-P. M. Ruitong Huang.. A Brief Survey of Deep Reinforcement Learning. In the International Conference on Machine Learning (ICML)... Christiano. Abbeel. J. Colmenarejo. Niyato. J. In the Annual Conference on Neural Information Processing Systems (NIPS).. in particular. ArXiv e-prints. 46 . A. Hindsight Experience Replay. and an AI seminar at the University of Alberta on Deep Rein- forcement Learning: An Overview.. ArXiv e-prints. M.. S.. ArXiv e-prints. and Zaremba. Minimax regret bounds for reinforcement learning. Argall. W. E.. Dubey. Any remaining issues and errors are our own. Arjovsky. Pfau. S. IEEE Communications Surveys & Tutorials. Baram. and Mané. 57(5):469–483. Valko. Z. M. Robotics and Autonomous Systems. D.. A sequence-to-sequence model for user simulation in spoken dialogue systems. (2017). Alsheikh. W. Cameron Upright. (2017). Averaged-DQN: Variance reduction and stabi- lization for deep reinforcement learning. D. McGrew. S. Andrychowicz. B. A. and Scott. E... H. S. and Xing. Concrete Problems in AI Safety. G. B. M. (2004). Chintala. J. Machine learning in wireless sensor networks: Algorithms. R. Tobin.. K.. Mnih. P. N.. Boulanger. Chernova.. Brundage. B... (2009). J. Lihong Li. (2016). Kan Deng. F. D. A. M. Learning to learn by gradient descent by gradient descent. Schneider. (2017). 16(4):1996–2018. M. B. Y. I. (2017). In the International Conference on Machine Learning (ICML). Adaptive stochastic control for the smart grid. Wasserstein GAN. D. G.. In the Annual Conference on Neural Information Processing Systems (NIPS).. Schaul. Schulman. L... Yi Wan. Arash Tavakoli.. Ray. and Ionescu.. (2015). and applications. In Annual Meeting of the International Speech Communication Association (INTERSPEECH).. A. ArXiv e-prints. Bello. Naddaf.. Brockschmidt. and Bengio. Naik. Q. (2017). L. G. Bazzan. Gaffney. (2017).... B. S. Veness. ArXiv e-prints.. W. and Munos... J. Baker. V. Danihelka. DeepMind Lab. and Raskar.. In the International Conference on Learning Representations (ICLR).. Hoyer. and Klügl. M. Nowozin. Goyal. (2016). Brakel. In the International Conference on Learning Representations (ICLR). Hassabis. A. Y... S. K. Gupta. Norouzi. Bellemare. M.. (2017). R. Neural Combinatorial Optimiza- tion with Reinforcement Learning. and Precup. Sadik. (2014).. Q. P. J. S. Balog. J.. (2016).. and Kavukcuoglu. Beattie. (2017). Baird. Bahdanau. N. V. M. Le. In the International Conference on Learning Representations (ICLR). H. I. Leibo. M.. S. The arcade learning environment: An evaluation platform for general agents. Infrastructure for Usable Machine Learning: The Stanford DAWN Project. (2017). (1983). O. S. I. Mnih. (2017). J. R. S. Lowe. A.. A. Bacon... In the International Conference on Learning Representations (ICLR). M. V. and Munos. W. and Bowling. and Bengio. Barto. C. Clemons. Pineau. Morgan & Claypool.. S. Gaunt.. Pham. Y.. A.. J. and Petersen. K.. V. V. S.. G. A.. Neural optimizer search with reinforcement learning. S. S. 13:835–846. (1995). Multiple object recognition with visual attention. In the AAAI Conference on Artificial Intelligence (AAAI). J.. (2015). Introduction to Intelligent Systems in Traffic and Transportation. Recent advances in hierarchical reinforcement learning. R. Srinivasan. and Tarlow. Journal of Artificial Intelligence Research. Unifying count-based exploration and intrinsic motivation... Cain.. C. Mohamed. In the Annual Conference on Neural Information Processing Systems (NIPS). A.. Babaeizadeh. T. S. C. Zoph. and Zaharia. J. Designing neural network architectures using reinforcement learning. D. (2017)... Bailis. Valdés. Dabney. D.. S. L. Wainwright.. M. Vasudevan. A. Neuronlike elements that can solve difficult learning control problems. The Cramer Distance as a Solution to Biased Wasserstein Gradients. Bellemare. L.. K. (2014). York. In the International Conference on Machine Learning (ICML).. Schrittwieser. Anderson. Schaul.. P. Lefrancq. Ward.. Discrete Event Dynamic Systems. D. Olukoton..Ba. King.. 47:253– 279. (2017). and Mahadevan.. Saxton.. G. In the International Conference on Learning Representations (ICLR). An actor-critic algorithm for sequence prediction.. Green. Reinforcement learn- ing through asynchronous advantage actor-critic on a gpu. Bello. W. D. M.. K.. G. Neural machine translation by jointly learning to align and translate.. Dabney.. In the International Conference on Machine Learning (ICML). (2016). Z. 13(4):341–379. Re.. R. R. B. Man... In the International Conference on Machine Learning (ICML). A.. G. A distributional perspective on reinforcement learning. H. Bellemare. G. Cant. Bolton. J. IEEE Transactions on Systems. and Bengio.. P.. (2017). Bellemare. I. D. B. Küttler. S. Ostrovski. I. Frosio. Legg.. D.. Deepcoder: Learning to write programs. Bahdanau. and Kautz. K.. ArXiv e-prints. Lakshminarayanan. T. Teplyashin. and Cybernetics. Xu.. and Munos..-L. R.. A. F. and Anderson.. M. D. and Le. Barto. Courville. In the International Conference on Learning Representations (ICLR). M. Y. Cho. M.. 47 . Tyree. M. Harb.. (2003).. M.. Sutton.. (2013).. ArXiv e-prints. Residual algorithms: Reinforcement learning with function approximation.... H.. The option-critic architecture. G.. IEEE Transactions on Computational Intelligence and AI in Games. 2(1):1–127. (2009). (1996).. Machine Learning.. 114(33):8689–8692. 48 . (2017). D. Learning deep architectures for ai. W. Cai. Foundations and trends in R Machine Learn- ing. Athena Scientific. (2017). B. Jackel.. L. S. R.. Zhang.. N. A. Testa.... ”http://www. S. Yeres. Bishop. Deng. J. TORCS. C. Jackel. L. E. Convex Optimization. (2017). Firner. L.. Y. J. P. (2004). Babuska. Y.. Man. J. (2012). J.Part C: Applications and Reviews. and Vandenberghe. A. D. Whitehouse. S. P.. Brandt. Guionneau.. Perez. and Colton. A. A. L.. 18(3):831–873.. Learning end-to-end goal-oriented dialog. Burch.. B. Dimitrakakis. BEGAN: Boundary Equilibrium Generative Ad- versarial Networks. A. J. N. Cambridge University Press. Schumm. P.. and Stroud. PNAS.. X. K. 38(2).. 4(1):1–43. D. and Weston. Briot. Science. M. Busoniu. ArXiv e-prints. A comprehensive survey of multiagent rein- forcement learning. Bojarski. J. C. R.. (2005). Massachusetts.. Dynamic programming and optimal control (Vol. Lucas. M. (2017). Bowling. and Song. M.... Powley. Dworakowski. (2017). L.. The Review of Financial Studies. Springer. S. Athena Scientific. Celikyilmaz. F. Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car. C. Bengio. M. A. Flepp. Blei. R. (1996). ArXiv e-prints. D. Browne.. (2014). Monfort. Scaffolding Networks: Incremental Learn- ing and Teaching Through Questioning.. and Weston.. and Rémi Coulom. M. USA. D. and Tammelin. Li. (2011). Pattern Recognition and Machine Learning. O. P. (2017). and Barto. M. L.. P. G. End to End Learning for Self-Driving Cars. (2012). A simulation approach to dynamic portfolio choice with an application to learning about return predictability. D. U. Bojarski.. P.. D.. Berthelot. Bordes. and Wang. In the Inter- national Conference on Machine Learning (ICML). J.. G. B. J. D.. Heads-up limit hold’em poker is solved. and Metz.. S.. (2017). Bertsekas. (2017). (2008). Choromanski. Deep Learning Techniques for Music Generation - A Survey. Making neural programming architectures generalize via recursion. C.Bengio. G. K.. I. C.. Neuro-Dynamic Programming. Goyal. (2015). (2016). Bradtke.. Zhao. Tavener.... Johanson. P. Santa-Clara. P.. Samothrakis.. A survey of Monte Carlo tree search methods.-L.. 22(1-3):33–57. T. D. D. and Tsitsiklis.. Solving the quantum many-body problem with artificial neural networks. M. U. Goyal. Y..torcs. ArXiv e-prints. Curriculum learning. Bertsekas.. Bernhard Wymann. Zhang. Collobert.. D. Carleo. and Smyth. In the International Conference on Learning Representations (ICLR). Firner. R. Boyd. Boureau. D. Choromanska.org”. B. and Pachet. The Open Racing Car Simulator. Cowling. ArXiv e-prints. E.. II. S. Science.. 4th Edition: Approxi- mate Dynamic Programming). Science and data science. 355(6325):602–606.. and Zieba. J. S. and Muller.. Shin. Linear least-squares algorithms for temporal difference learning. ArXiv e-prints. Louradour. Hadjeres. E. 347(6218):145–149. Muller. L. Rohlfshagen. (2009). IEEE Transactions on Systems. and Schutter. and Troyer.-P. M. and Cybernetics . J. In the International Conference on Learning Representations (ICLR). . W. Chebotar. D.. K.. K. Frostig.. Wayne. G. Z. Legg. Kalchbrenner. J. M. Associative long short-term memory. S. G. L.. (2016a). He. Celikyilmaz. Wu.. F.. Natural Language Understanding with Distributed Representation. Y... O.. (2016). Hewlett.. Y. M.. and Amodei. M. Ghadermarzy. and Berant. (2016). W. and Deng.. J. Good Semi-supervised Learning that Requires a Bad GAN. Annual Review of Statistics and Its Application. and Levine. In Annual Meeting of the International Speech Communication Association (INTERSPEECH).. Sun. and Murphy. The Game Imitation: Deep Supervised Convolutional Networks for Quick Video Game AI. C.. (2016). G. K. A.. M. S. Yang.Chakraborty. Chen... In the Annual Conference on Neural Information Processing Systems (NIPS). Leike. Zacharias Holland. Daniely. and Ronagh... G. ArXiv e-prints. A. and Salakhutdinov.. Hakkani-Tür. S. Dynamic treatment regimes. A.. T.. Gulcehre. A. Deep rein- forcement learning from human preferences. K.-N. F. S. B. R. Moura. In the International Conference on Machine Learning (ICML). Levit. Cho. L. Morgan & Claypool Publishers. (2016). X. S.. M.. ArXiv e-prints. 1:447–464. and Levine. Brown. Y. Com- bining model-based and model-free updates for trajectory-centric reinforcement learning. Schaal. and Bengio. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. S. Sukhatme. and Yi. Yahya. Y. (2017). Huang. E. L. 49 .. Z. S.. Y. ArXiv e-prints. ArXiv e-prints.. D. Yang. S.. ArXiv e-prints. F.. Gao. Jaderberg. W. H. S. (2016). (2016b)... Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks. D. D. J. Martic. (2017)... van Merrienboer... Z. J. and Deng. He. M. Path integral guided policy search. (2015). (2017). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. M. Lee. In the Association for Computational Linguistics annual meeting (ACL).. Gao. Crawford.. Y. A. In the International Conference on Machine Learning (ICML). Uria. Das. A... (2017). A. P. (2017).. S. and Kavukcuoglu.. J.... Reinforcement Learning Using Quantum Boltzmann Machines.. W. Y. N. W. Gao. Bougares. Y. Understanding Synthetic Gradients and Decoupled Neural Interfaces. Hernandez-Garcia.. D. (2016). Li. Lifelong Machine Learning.. F. G.. Oberoi. B. and Liu... and Singer.. Z. Uszkoreit. B. R. (2017).. Cheng. Cho. P. (2017). Kalakrishnan. and Batra. J. (2017). Z. Schwenk. J. Świrszcz. D. Hakkani-Tur.. Chebotar.. K.. Schaal. Czarnecki.. (2014). J. Multi-step Reinforcement Learning: A Unifying Algorithm. and Sutton. Osindero. V. P... Tur.-S. In Conference on Empirical Methods in Natural Language Processing (EMNLP). G. Chen. In the Association for Computational Linguistics annual meeting (ACL). ArXiv e-prints. Coarse-to-fine question answering for long documents. (2014).-N. Hausman. ArXiv e-prints. ArXiv e-prints. De Asis.. Vinyals. Christiano. I. Semi-supervised learning for neural machine translation.. Unsupervised Learning of Predictors from Unpaired Input-Output Samples. D. and Deng. Polosukhin. A. B.. and Graves. R. Tur. He. H.... Chen. Kottur. Chen. ArXiv e-prints. and Liu. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. N. (2016). J. J. S. Xu. Cohen. Dai... Chen. Lacoste.. Zhang.. Choi. A. Danihelka.. ArXiv e-prints. ArXiv e-prints. Learning phrase representations using RNN encoder-decoder for statistical machine translation. B. I. V.. and Liu.. In the International Conference on Learning Representations (ICLR).. (2017). Schulman. Xiao. (2017). Multi-task learning for multiple language translation. B. (2017)... Hierarchical reinforcement learning with the MAXQ value function de- composition. S. Three generations of spoken dialogue systems (bots).. Du. G. https://www. Deep direct reinforcement learning for financial signal representation and trading. and Li. Y. Erez. X. and Zaremba. sched- uled August 2017). Y. Ho. Andrychowicz. L. B. L. Machine learning paradigms for speech recognition: An overview. D. S. Deng. (2014). J. P. L. P. P. (2013).. ArXiv e-prints. Yu. Deng. Saxton. Programmable Agents.. Y. Benchmarking deep rein- forcement learning for continuous control. H.net/AIFrontiers/ li-deng-three-generations-of-spoken-dialogue-systems-bots.. Communications of the ACM. He. Bao. Schulman. Dosovitskiy. D. T. Chen. ArXiv e-prints. B. Cabi.. 55(10):78–87.. J. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large- scale application on downtown toronto. (2013).. L.. 50 . T.. Degris.. Kong. Chen.. Stochastic variance reduction methods for policy evaluation. Chen. J. Sunehag. G. IEEE Transactions on Intelligent Transportation Systems. and Wang.. Chen. Lillicrap.. T. L. G. Ahmed. Kulkarni. and Abbeel. H.. In the International Conference on Learning Representations (ICLR). (2016).. In the Association for Compu- tational Linguistics annual meeting (ACL). J.. Y. Wu. Y. (2017). X.. Dulac-Arnold. S. Bartlett. W. T. T. Doshi-Velez.. Denil. Y. Dietterich. A few useful things to know about machine learning. Deep reinforcement learning in large discrete action spaces. Stadie. (2017). M. 13(1):227–303. D. RL2 : Fast Reinforcement Learning via Slow Reinforcement Learning. F. L. T. Mann. Abbeel. A survey on policy search for robotics. van Hasselt. D. Hunt. 2:1–142. and Dai.. F. and Peters. Gao. P. and Coppin... and Kim.. Learning to act by predicting the future. 21(5):1060–1089. Deng. X. J.slideshare.. and Dong. (2000). In the International Conference on Machine Learning (ICML). and Language Processing. H.. C. J. (2016). Dhingra. Duan. Houthooft. F.. In the International Conference on Machine Learning (ICML). Ren.. L. Deng. R. P... talk at AI Frontiers Conference... (2017). Sutskever. and Koltun. and Deng.-N.. Gómez Colmenarejo. S. Domingos. B.. and de Freitas. Abdulhai. Battaglia. Sutskever. I.. L.. (2016). W. P.Deisenroth... Now Publishers Inc. Li. ArXiv e-prints. P. Denil. Deng. Y.. T.. Neumann.. M. Agrawal. X. One-Shot Imitation Learning. P. (2012).. M. (2017). End-to-end reinforcement learning of dialogue agents for information access.. Foundations and Trend in Robotics. Towards A Rigorous Science of Interpretable Machine Learn- ing. Li.. A.. (2015).. Dong.. L. Weber. (2017). H. P... Q. Y. N.. B. In the International Conference on Machine Learning (ICML). Evans. 14(3):1140–1150. and Abbeel. (2013). I. and de Freitas. M. (2016). Springer. Deep Learning in Natural Language Processing (edited book. Duan.. IEEE Transactions on Audio. Z. Learning to perform physics experiments via deep reinforcement learning... and Zhou. ArXiv e-prints. (2017). IEEE Transactions on Neural Networks and Learning Systems. Li. R. J. D. In the Association for Computational Linguistics annual meeting (ACL). El-Tantawy. J. N. and Abdelgawad. Deep Learning: Methods and Applications. S. Duan. Schneider. Journal of Artificial Intelligence Research. Speech. . (2016). and Whiteson.. Abbeel. Xue... P. R. P.. C. In the Annual Conference on Neural Information Processing Systems (NIPS).. Y. C. J.. Smart grid . Y.. 51 . repeat: Fast scene understanding with generative models. D. (2017). ArXiv e-prints. 14(4):944–980. K. and Lempitsky.. In NIPS 2016 Workshop on Adversarial Train- ing. E. B. Torr. and energy-based models. F. (2017). C. S. M. (2016). M. A. and Abbeel. Foerster.. Fatemi.. M. S.. Tassa.... (2016). Garcı̀a. V. and Abbeel.. Fu. Pritzel. S. A. T... X. Policy networks with two- stage training for dialogue systems... V. (2016). and Whiteson. Banarse. W. Foerster. (2017).. Generalizing skills with semi-supervised reinforcement learning. J. Learning to communicate with deep multi-agent reinforcement learning. Whitney. and Yang. L. Ustinova.. M. Stabilising experience replay for deep multi-agent reinforcement learning. M. and Shanahan.. infer. Munos. Firoiu... A connection between GANs.Eric.. Farquhar. G.. M. and Legg. Foerster. In the Annual Conference on Neural Information Processing Systems (NIPS). D. J. In the International Conference on Learning Representations (ICLR). K. He. S. Hassabis... D. IEEE Communications Surveys Tutorials. and Levine.. J. C. G. with Deep Reinforcement Learning. N. and Tenenbaum. S. Asri. B. and Suleman. Farquhar. Ha.. (2017).. J. Blundell. T. In the International Conference on Learning Representations (ICLR). N. Finn. Kavukcuoglu. A. H. Finn. M. The Journal of Machine Learning Research. Finn.. (2005). D.. Domain-adversarial training of neural networks.. P. S. Finn... Mnih.. Fernando. Szepesvári.. N. and Wierstra.... Fang... Blundell. Germain. C. and Fernàndez. Yu. C. In the International Conference on Machine Learning (ICML). Stochastic neural networks for hierarchical reinforce- ment learning. (2017). Tree-based batch mode reinforcement learning. P. Eslami. O... P.. Duan. inverse reinforcement learning. Misra. N. ArXiv e-prints. H. Osband. 6:503–556. Deep visual foresight for planning robot motion. Abbeel.. F. F. Graves. and Whiteson. and Wehenkel. S. (2017).the new and improved power grid: A survey. Ernst. H. Guided cost learning: Deep inverse optimal control via policy optimization. J. Attend. de Freitas. Menick. Counterfactual Multi- Agent Policy Gradients. Weber. and Levine. Fortunato. M. Beating the World’s Best at Super Smash Bros. ArXiv e-prints. S. A. J. D. G... I. E. Nardelli. Garnelo. Levine. In the International Conference on Machine Learning (ICML). Christiano. ArXiv e-prints. Larochelle. (2012). Y. Journal of Machine Learning Research. C.. In the Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL). T. (2016a). 17(59):1–35. Kohli... ArXiv e-prints.. S. (2016). Pietquin. Ajakan. P.. L. Noisy Networks for Exploration. C.. Gheshlaghi Azar. ArXiv e-prints. V.. (2016b). Marchand. Heess. P. A comprehensive survey on safe reinforcement learning. The Journal of Machine Learning Research.. Y. PathNet: Evolution Channels Gradient Descent in Super Neural Networks. S. Towards Deep Symbolic Reinforcement Learning. C. Afouras. K. Y. and Hinton. Arulkumaran.. H. J. A... Geurts. and Levine. Rusu. (2016). Assael. G.. D. (2015). (2017). and Manning.. S. Florensa... D. E. Nardelli. M. P. Laviolette.. ArXiv e-prints. P. D. Schulz. 16:1437–1480. A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue. (2017). Zwols.. Piot. Ganin. Bengio. N. Klein. L. ArXiv e-prints. G. Summerfield.. S. C. Graves.. Goldberg. Courville. Roy. 15(4):1761–1777. Y. The Reactor: A Sample-Efficient Actor-Critic Architecture. Hermann. Goldberg.. S... Zwols. T. D. (2016). G. G. Nature. (2013). Foundations and Trends in Machine Learning. Ozair. P.. T... Communications of the ACM. Y. Neural Turing Machines. 538:471–476. A. In the International Conference on Machine Learning (ICML).. Auli. G.. IEEE Communications Surveys Tutorials. G. Y. Teytaud. K. Harley.. and Szepesvári. and How.. Rezende. and Dauphin. ArXiv e-prints. A.Gavrilovska. D. Ramalho. Fonteneau. (2015). 52 . Gelly.. Munos. nech Badia. Danihelka.. Grabska-Barwińska. 6(4):375–451. Morgan & Clay- pool Publishers. R. Mirza. and Hassabis... and Bengio. Foundations and Trends in Machine Learning. Graves... A. (2012). K. (2012).. Convolutional Sequence to Sequence Learning. Dann. Kavukcuoglu. S. A.. A. IEEE Transactions on Systems. Journal of Machine Learning Research. Silver. Grondman. and Babuška. G. P. and Kosorok. and Danihelka.. D. 40(1):529–560. E. Cain. Automated Curriculum Learning for Neural Networks.. A. Geramifard. Y. Mannor.. and DaSilva. J. M. O. A. Deep Learning. Grangier. (2017). (2012). R. I. Combining online and offline knowledge in uct. (2017). and Tamar. Sebag. A. J. Menick. M. Col- menarejo. J. Bellemare. and Silver.. S.. In the Annual Conference on Neural Infor- mation Processing Systems (NIPS). 8(5-6):359–483.. Gehring... Dabney. Goodfellow. G. Man. D. R. L. M. In The 20th World Congress of the International Federation of Automatic Control. (2014). King. M.. A survey of actor-critic rein- forcement learning: Standard and natural policy gradients. Y... Hybrid computing using a neural network with dynamic external memory. Graves. Ostrovski.. A. Annals of Statistics.. M. ArXiv e-prints. MIT Press. L.... K. (2017)... D. D... Gheshlaghi Azar. Agapiou. Gelly. Graves.. A tutorial on linear function approximators for dynamic programming and reinforcement learning. (2015).. A. Bellemare.. Wayne. J. V.. Kocsis.. Reynolds. Macaluso. M. Chowdhary. .. Glavic. J. Walsh. and Munos. M.. Danihelka. S. C. (2014). 55(3):106–113. I. (2013). Busoniu. and Ernst. Blunsom. (2016). Pouget-Abadie. M... H. Gregor. Neural Network Methods for Natural Language Processing. I. and How. (2015). I. Gruslys. Yarats. I. In the International Conference on Machine Learning (ICML). 42(6):1291–1307.. C. Geramifard.. Reinforcement learning for electric power system decision and control: Past considerations and perspectives. Learning and reasoning in cognitive radio networks. Grefenstette.. G. Pineau. Xu. P.. (2017). Y.. L.. Warde-Farley. Bayesian reinforcement learning: a survey. P. page 2672?2680... Atanasovski. and Wierstra.. The grand challenge of computer go: Monte carlo tree search and extensions. Part C (Applications and Reviews). (2017). A.. Rlpy: A value-function- based reinforcement learning framework for education and research. K.. (2007). Generative adversarial nets. NIPS 2016 Tutorial: Generative Adversarial Networks. D. Wayne. A.. A. Schoenauer. I. and Kavukcuoglu. W. I. D. Tellex. and Cybernetics. (2017). M.. J.. N. M.. D. J. I. Lopes.. Q-learning with censored data. H. R. M. T. R. J. R. and Courville. A. ArXiv e-prints. Goodfellow.. A.. ArXiv e-prints.. B.. S.. Draw: A recurrent neural network for image generation. 16:1573–1578. Ghavamzadeh. M. Goodfellow.. F. P... Ghahramani. Abbeel. In the International Conference on Learning Representations (ICLR). and Prediction. In the International Conference on Learning Representations (ICLR). Continuous deep Q-learning with model- based acceleration. M.. Haykin. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes. I. O. Prentice Hall. In the AAAI Conference on Artificial Intelligence (AAAI). and Ostendorf. He. and Deng. In the International Conference on Learning Representations (ICLR). and Riedl. D. S.. Y. Pasupat. 53 . Springer.. In the International Conference on Machine Learning (ICML). He. Wang. In Conference on Empirical Methods in Natural Language Processing (EMNLP). S. M. J. and Levine. and Liang. Gu. Z. and Peng. Abbeel. (2017). P. T. (2009).... T. Li.. L. Learning invariant feature spaces to transfer skills with reinforcement learning. ArXiv e-prints. Sutskever.. I. J. J. S.. P. In the International Conference on Learning Rep- resentations (ICLR). Holly. Tibshirani. H... He. Improved Training of Wasserstein GANs.-Y. H. Liu. Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations. Hadfield-Menell. J. T. Liu. X. A. Chen. S. M. K. Li.. A. B. (2017). J. J. S.. E. and Bengio. Schwing. (2016c). N..-Y.. C. ArXiv e-prints. and Levine. Haarnoja. Reinforcement learning with deep energy- based policies. T. Y. (2015). (2016b). (2016a). J.. ArXiv e-prints. R. L. Chandar. L... Deep recurrent Q-learning for partially observable MDPs. Gulrajani... Tang. and Russell. The Elements of Statistical Learning: Data Mining. (2016). Neural Networks and Learning Machines (third edition).. In the Annual Conference on Neural Information Processing Systems (NIPS). X.. Z. Gupta.. E. ArXiv e-prints. Liu. Deep reinforcement learning with a natural language action space.. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. Lillicrap. Mao.. Y. S. (2017a). Lillicrap. P. T. W. In the Association for Computational Linguistics annual meeting (ACL). R. M. (2017). Harrison.Gu. G.... S. and Levine.. (2017). Guu.. L. (2016). C. A. Liu. (2008).. S. E.. Turner. He. P. Gao. (2017). Deep reinforcement learning with a combinatorial action space for predicting popular reddit threads.. and Levine. and Levine. and Friedman. J. F... (2017). In the Association for Computational Linguistics annual meeting (ACL). W. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. Deng.. and Stone. Hastie. Inference. Y. Cognitive radio: brain-empowered wireless communications. (2016a).. L. A.. Yu. (2005). Ostendorf. Ahmed. P. In the International Conference on Machine Learning (ICML).. Gu. Devin. In the Annual Conference on Neural Information Processing Systems (NIPS). K.. Dragan. Abbeel. and Courville.. Gao. Ehsan. Cho.... 23(2):201–220... Gulcehre. (2016). V. Chen.. M. S. Han. T. Lillicrap. S.. Qin.. T.. U. He.. Q-Prop: Sample- efficient policy gradient with an off-policy critic.. He. S. Deep compression: Compressing deep neural net- works with pruning. Xia. Dual learning for machine translation. (2016b). IEEE Journal on Selected Areas in Communications.. J.. S. trained quantization and Huffman coding. and Dally. D. and Ma.. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. Cooperative inverse reinforce- ment learning. Hausknecht. S. S. Dumoulin. Haykin. Arjovsky. . X. and Precup. 5.. (2015). 82. In NIPS 2016 Deep Reinforcement Learning Workshop. P.. Osband. Heaton. T. Sendonaris.. and Manning. S.. Yu. Advances in natural language processing. Gupta.. Deep neural networks for acoustic modeling in speech recognition. Papernot. Leibo. Mnih. Henaff. Z. Czarnecki. (2017). ArXiv e-prints. D. Y. J. (2016). O. and Stone. B. D. Vanhoucke. No. Hinton. and Ermon.. M. Heinrich. I.. V. T. E. Deep semantic role labeling: What works and what’s next. Piot. S. A. Artificial Intelligence. Chen. He. N. Sainath. Z. 101. R. Nguyen. Huang. (2012). Jaitly.. and Abbeel. A.. K. Erez.. Lanctot. ArXiv e-prints. (2017). On Unifying Deep Generative Models. Merel.. T. R. Y. D... Schaul. J. G.. Agapiou. Deep reinforcement learning from self-play in imperfect- information games. M. Lemmon. Ho. Whitney. and Silver. Hull. Hu.. J.. Hendricks. Proceedings of the IEEE — Vol. Tassa. Model-free imitation learning with policy optimization.. (2013). and Abbeel. W. TB. C. C. D. Salakhutdinov. M. Y. J. Geng. Options. and Rohrbach. (2017).. Lee. Adversarial Attacks on Neural Network Policies. P. B. Ho.. Hirschberg.. M. G. 101(5):1116–1135. M. T. T. and Kavukcuoglu. Islam. Dulac-Arnold. J.. Science. Silver.. and Kingsbury. L. H. V. B. T.. Prentice Hall. Wang. Hester... May 2013. D. 247:170–86. and Gruslys. Applied Stochastic Models in Business and Industry. In the International Conference on Machine Learning (ICML). C. J. B. Henderson.. ArXiv e-prints. (2016). Yang. ArXiv e-prints.He.. T.. Huk Park.. and Xing. J. Goodfellow. Futures and Other Derivatives (9th edition). Heess.. D.. and LeCun.. ArXiv e-prints. (2017). (2016). Wayne.. Lewis. P. (2017).. N. Hester. Schulman. Model-Based Planning in Discrete Action Spaces. M. Speech-centric information processing: An optimization-oriented approach. P. D. Duan. ArXiv e-prints. F.. (2017). Eslami..... Schiele... Duan. and Witte. Deng. J. (2014). Vecerik. G... and Silver. (2017). 349(6245):261–266. Riedmiller. N. A.. L. and Abbeel.. Learning from Demonstrations for Real World Reinforcement Learning. Dahl. E. Schaul. and Ermon. Z.. L. (2017b).... Reproducibility of benchmarked deep reinforcement learning tasks for continuous control... Leibo. N. P. S. J. IEEE Signal Processing Magazine.. N. Intrinsically motivated model learning for developing curious robots. Darrell. Polson. Senior.. and Deng. Generative adversarial imitation learning. L.. P.. Jaderberg. P. J. Pietquin. (2017). Z. Deep learning for finance: deep portfolios. W. J. D. 54 . Vime: Variational information maximizing exploration.. R.. Houthooft. (2016). D.. L. Emergence of Locomotion Behaviours in Rich Environments. Z. In the Annual Conference on Neural Information Processing Systems (NIPS). M. . Automatic Goal Generation for Reinforce- ment Learning Agents... (2017). Turck.. K.. Z. In the Annual Conference on Neural Information Processing Systems (NIPS). F.. and Zettlemoyer.. X. G. Akata. D... A. In the Association for Computational Linguistics annual meeting (ACL). ArXiv e-prints. (2016). A. S.. M. Sriram. Atten- tive Explanations: Justifying Decisions and Pointing to the Evidence. Held. Reinforcement learning with unsupervised auxiliary tasks. rahman Mohamed. J. Florensa.. Y. G. Gomrokchi. In the International Confer- ence on Learning Representations (ICLR). K. J. In ICML 2017 Reproducibility in Ma- chine Learning Workshop. (2016).. A. I. J. X. ArXiv e-prints. P.. ArXiv e-prints. R. Ł. L.. Jordan. G. Chen. Kansky. M.. 55 . S. Eldawy. M.. P. In the International Conference on Learning Representations (ICLR). Witten. Gu. perspectives. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics.. S... Zisserman.. and Dean.. (2016). Grave. M. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Agarwal. Springer. (1996). and Kleindienst. Roy... N. S... (2017). and Schapire. N. S. James. Bachrach. Kaelbling. Hastie. (2017). Y. (2013). Karpathy. N. Turner. Kaiser. T.. M. P. Prentice Hall. (2015). Kaiser. and Kavukcuoglu. and Fei-Fei. Text Understanding with the Atten- tion Sum Reader Network. A.. Corrado. One Model To Learn Them All. Z. Johnson. Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. Kandasamy. Bajgar. Can active memory replace attention? In the Annual Conference on Neural Information Processing Systems (NIPS). T. Parmar. L.. and Risi. 349(6245):255–260. Dorfman. A. J. and Carter. I. Simonyan. A. ArXiv e-prints. (2015).. E. (2002). and Risi. Mély. M. Le.. and Tibshirani.. Krishnamurthy. Lázaro-Gredilla. Tuning recurrent neural networks with reinforcement learning. T.. D. M.. An Introduction to Statistical Learning with Applications in R. Jaques. Visualizing and understanding recurrent networks. M. Batch policy gradient methods for improving neural conversation models.. Learning to Remember Rare Events. (2016).. (2013). Langford.. X. and Uszkoreit.. L. J. H. O.. S. J. Learning macromanagement in starcraft from replays using deep learning. Jones. N.. A. T. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation.. M.. Reinforcement learning: A survey. K. Kadlec. J.. Recurrent continuous translation models. Wu. Kakade. D. Schuster. In the Annual Conference on Neural Information Processing Systems (NIPS). Thorat. R. A natural policy gradient. and Blunsom. D.. Hughes.Jaderberg. and prospects. A. In Conference on Empirical Methods in Natural Language Processing (EMNLP). J.. Sci- ence. (2017). Joulin... Kalchbrenner. J... M. P. and Mikolov. R. Vaswani. Krikun. (2016). A. Sidor. In the International Conference on Machine Learning (ICML).. N. Journal of Artificial Intelligence Research. Machine learning: Trends. D. Phoenix. K. V. (2016)... A. Y... Watten- berg. In the Annual Conference on Neural Information Processing Systems (NIPS).. and George. A. L. Gomez. L. Tomioka. Silver. R. 4:237–285. (2016). and Martin. draft). ArXiv e-prints.. Spatial transformer networks.. (2017). and Mitchell. Bag of tricks for efficient text clas- sification. R. Schmid. Jurafsky. L. S. In the International Conference on Learning Representations (ICLR). Justesen. A. and Bengio. (2017b).. and Bengio. D. Jiang.. (2017). In IEEE Conference on Computational Intelligence and Games (CIG). N. (2017). Speech and Language Processing (3rd ed. K. D. Kaiser.. M. Lou. Togelius. (2017). E... A. Shazeer. N... Bontrager.. N.. D.. Justesen.. (2017a). N. In ICLR 2016 Workshop. Deep Learning for Video Game Playing. Bojanowski. J.. M. N. and Moore. Viégas.. Q. S. F. Littman. O. Johnson. S. Submitted to Int’l Conference on Learning Representations. E. ArXiv e-prints. Nachum.. G. and Eck.. K. Tarlow. Sutskever. R.. W. J. Overcoming catastrophic forgetting in neural networks. 34:2767–2787. (2016). M. B. Rabinowitz. 247:313–335.. (2013).. R. R. (2015). Deng. Luciw. PNAS. Imagenet classification with deep convo- lutional neural networks. Semi-supervised learning with deep generative models. ViZDoom: A Doom- based AI research platform for visual reinforcement learning. Reinforcement renaissance. A.. (2017). (2017). massoud Farahmand. and Liang. A. and Rush. S.. and Precup. T. I.. and Tenenbaum. M. C. Learning from limited demon- strations.. (2017). (2017). A. R. D. E. A new softmax operator for reinforcement learning.. A. Pineau. G. D... K. Understanding black-box predictions via influence functions. and Lo. Adaptive Treatment Strategies in Practice: Plan- ning Trials and Analyzing Data for Personalized Medicine... In the International Conference on Machine Learning (ICML). M. Human-level concept learning through probabilistic program induction. W.. Krakovsky. A.. Science. A. 56 . (2015).. N. Hassabis. (2012). and Hadsell. Kompella. Kempka. 350(6266):1332–1338. M. Kober. Quan.. and Tenenbaum. In the International Conference on Machine Learning (ICML). Saeedi.. In the Annual Conference on Neural Information Processing Systems (NIPS). 59(8):12–14. A. ArXiv e-prints.. (2016). M. Kottur.. T.. D.. Runc. R.. J. G.. and Moodie.. S. Unterthiner. M.. Kumaran. Kingma. Desjardins. B. Clopath. Mayr. M.. 114(13):3521– 3526. Mohamed... In the Annual Conference on Neural Information Processing Systems (NIPS). Y.. S. G. Natural language does not emerge ’naturally’ in multi-agent dialog. G. P. Applied Predictive Modeling. A. (2014). M. A. Artificial Intelligence. Khandani. In the Annual Conference on Neural Information Processing Systems (NIPS). (2017). J. Springer. Krizhevsky.. and Jasḱowski.. J. M.. R. A. Kosorok. In the Annual Conference on Neural Information Processing Systems (NIPS).. (2016). E.. International Journal of Robotics Research. D.. J. Koh. Lake. D. R. Consumer credit-risk models via machine- learning algorithms. Zemel.. Toczek.. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Stollenga. J. P. Klambauer.. S. J. Ramalho. Bagnell. Milan. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In the International Conference on Machine Learning (ICML). A. B. Rezende. M. Veness.. Senellart. Kulkarni. Lee... P. and Schmidhuber. J. T. Rusu. J. 32(11):1238–1278. Wydmuch. Reinforcement learning in robotics: A survey. E. and Hinton. ArXiv e-prints. J.. G. Kirkpatrick. Moura. E... Salakhutdinov. Kim. In IEEE Conference on Computa- tional Intelligence and Games. (2015). and Salakhutdinov. and Hochreiter. D. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. J. K. A.. J. Kuhn. Kim. Pascanu. V. (2014). W. J. (2017). Grabska-Barwinska. Koch. J. and Peters. Communications of the ACM. (2013). and Batra.. (2010). J. Y. K. M. D. ASA-SIAM Series on Statistics and Applied Probability. and Johnson... M. Self-Normalizing Neural Networks. L. Siamese neural networks for one-shot image recognition. B. G. Narasimhan.Kavosh and Littman. Kim. and Welling. Journal of Banking & Finance.. M. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. R. Klein. (2017). Tenenbaum. M. J. (2016a). Barzilay. Li. Levine. Levine. Gao.. Chopra.. M. Parikh.. Li. and Hinton. W.. B. G. Le. Li. Y. Monga. and Shoham.. End-to-End Task-Completion Neural Dialogue Systems. V.. In the International Conference on Learning Representations (ICLR). Devin.. Marecki. A. A contextual-bandit approach to person- alized news article recommendation. The Journal of Machine Learning Research. Multi-agent reinforce- ment learning in sequential social dilemmas... Li.. S. R.. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. (2012). J. D.. A. A. (2017). In Conference on Empirical Methods in Natural Language Processing (EMNLP). M. Krizhevsky.. G. Ranzato. Chen. S. Playing FPS Games with Deep Reinforcement Learning. D. Lei. H. Learning through dialogue interactions by asking questions. J. J. 57 .. M. Dialogue learning with human-in-the-loop.. Learning to Decode for Future Success. L. Ullman. and Jurafsky. V. Deal or no deal? end-to-end learning for negotiation dialogues. and Schapire.. T.. W. A. Monroe..Lake. In FAIR. Darrell.. Y. (2017). W. ArXiv e-prints. J. In the Annual Conference on Neural Information Processing Systems (NIPS). T. Building high-level features using large scale unsupervised learning. ArXiv e-prints. (2016b).. (2016). In the International Conference on Autonomous Agents & Multiagent Systems (AAMAS). and Batra.. and Weston. Morgan & Claypool Publishers. (2015). D. A. and Jurafsky. Bengio. E. R.. Deep reinforcement learning for dialogue generation. and Quillen.. P. B. and Graepel... D. Pastor. Miller. Professor forcing: A new algorithm for training recurrent networks.. In the International World Wide Web Conference (WWW). Learning to optimize. (2017). (2016c). and Jurafsky. and Gao. Dauphin. Monroe. M. Monroe. (2008). Leyton-Brown. Zhang. Galley. (2016). S. Langford. A. X.. T. (2017).. T. Y.. W. J. Monroe.. Li. Lee.. S. and Ng. Zhang.. A Simple.. Goyal. Ranzato. and Weston. K. Y. L.. Fast Diverse Decoding Algorithm for Neural Generation. M.. T.. (2010). Li... Yarats.-N... D.. 17:1–40. (2017b).. and Jurafsky. Dean. Finn. J. and Chaplot.. Essentials of Game Theory: A Concise. (2016a). Li. J. and Jaakkola. N. ArXiv e-prints.. Building machines that learn and think like people. J. Q. P. C. In Conference on Empirical Methods in Natural Language Processing (EMNLP).. Z.. S... Li. and Abbeel.. M. D. In the International Conference on Learning Representations (ICLR). R. D. M. J. In the International Conference on Learning Representations (ICLR). (2017a). P. 521:436–444.. Ranzato. In the International Conference on Machine Learning (ICML). K. and Bengio. J. Li. Y. W. Lanctot.. Learning visual servoing with deep features and trust region fitted Q-iteration. A. and Abbeel. Courville. Rationalizing neural predictions. Chen. Deep learning. Nature. S. K. Corrado. D. (2016).. Zambaldi. J. Li. Understanding Neural Networks through Representa- tion Erasure. and Gershman.. J.. ArXiv e-prints. LeCun. In the International Conference on Learning Representations (ICLR). J. Lamb. ArXiv e-prints. Chu. D. H. Multidisciplinary Introduction. S. Lample. Y. and Malik. Lewis. A. Ritter. X. Behavioral and Brain Sciences.. A. Y. (2017b). Chopra.. (2017a). 24:1–101.. Miller. J. ArXiv e-prints. S. S. Y... D. End-to-end training of deep visuomotor policies. J. Leibo. G. Levine. J. (2016).. (2016b). and Synnaeve. D.. (2017). K. and Murphy. Silver. A. Y. Z. The Mythos of Model Interpretability. J.. Lipton. (2009). In the International Conference on Autonomous Agents & Multiagent Systems (AAMAS). Berant. A. Morgan & Claypool Publishers.. (2015). Hunt. Li. Gao. L. Ye.. (2016). (2016). Xu. Journal of Portfolio Management. F. Unsupervised Sequence Classification using Sequential Output Statistics. P.. Valuing American options by simulation: a simple least-squares approach... Y. J. N.. Gao.. Lipton. Liang. ArXiv e-prints. C.. Self-improving reactive agents based on reinforcement learning. J. F. Le. ArXiv e-prints. Li. In 37th IEEE International Conference on Distributed Computing (ICDCS 2017). M. Guadarrama. 30:15–29. X. Erez.. M. C. J.. and Deng.. Zhu.. Gao. Continuous control with deep reinforcement learning. Forbus. (2016). In the International Conference on Learning Representations (ICLR)... Cao. S.. D. A User Simulator for Task-Completion Dialogues.. E. V. L. Liu. Long. (2016).Li.. and Chen. Machado. G. (2017). Stardata: A starcraft ai research dataset. (2015). J. J. Lo. I. State of the art control of atari games using shallow reinforcement learning. D.. Szepesvári. Li.. A hierarchical frame- work of cloud resource allocation and power management using deep reinforcement learning. Tang. Y.. Learning exercise policies for American options. (2016d). Y. (2017a). (2016). Heess. Machine learning. T.. M. and Bowling. Z. Li. J. and Deng. L.... Talvitie.. M. Improved Image Captioning via Policy Gradient optimization of SPIDEr. Z. Wang. Pritzel. J. In the Annual Conference on Neural Information Processing Systems (NIPS). and Schwartz. Y.-N.. Li. J. Dhingra... and He.. Neural symbolic machines: Learn- ing semantic parsers on freebase with weak supervision. Liu.. Z. Li... Unsupervised domain adaptation with residual transfer networks. (2016). M. (1992).. J. S. Liang. Khalidov. and Wang. In the Association for Computational Linguistics annual meeting (ACL). S. planning and teaching. D. B. Learning transferable features with deep adaptation networks. He. Z. K. 58 . (2012). Q. Q. L. (2015). 8(3):293–321. Lillicrap. Liang. (2017).. H.. J. and Jordan... Chen. ArXiv e-prints. 14(1):113–147. L. W. Le.. Tassa.. Liu. N. In AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE). C. A. Y. and Schuurmans. Wang. M. X. and Lao.. J. X. In the Association for Computational Linguistics annual meeting (ACL).. Qiu. Chen. ArXiv e-prints.. Z. L. N.. B. ArXiv e-prints.. Reinforcement learning improves behaviour from evaluative feedback... Deng.. Q. L. Lipton. Sentiment Analysis and Opinion Mining. S. J. and Lao. Y. J. D. C. Ahmed. Recurrent Reinforcement Learning: A Hybrid Approach. (2017b). and Jordan. K. Lin. J. Littman. 521:445–451.. ArXiv e-prints.. (2001). Na- ture... Longstaff. N. The Review of Financial Studies. Lin. N. C. Liu. Li. C. Efficient Exploration for Dialogue Policy Learning with BBQ Networks & Replay Buffer Spiking. Berant. Xu..-J. Z.. M. X. L. and Wierstra. E. T. In International Conference on Artificial Intelligence and Statistics (AISTATS09). In the International Conference on Machine Learning (ICML). Forbus. Lin. I. Gehring.. The Adaptive Markets Hypothesis: Market efficiency from an evolutionary perspective. Long. Zhu. (2004).. Neural symbolic machines: Learn- ing semantic parsers on freebase with weak supervision. C.. M.. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. G. Cohen. Y. E.. Lau. J.. Rodriguez-Natal. Dex-Net 2. Lopez-Paz. Introduction to Information Retrieval... Speech. (2017). In the Annual Conference on Neural Information Processing Systems (NIPS). H. V. A. Oxford University Press. L... ArXiv e-prints. Deep Network Guided Proof Search. X. Alarcón. C.... Dauphin. Y. Hibbett. J. Y. Mesnil. J. M... Xie. Knowledge-Defined Networking. and Romera-Paredes. ArXiv e-prints. and Schütze. J.. and Sutskever... D. A. Barkai. IEEE/ACM Transactions on Audio.. In ACM Workshop on Hot Topics in Networks (HotNets). Szegedy. and Zweig. H. (2016).. Doan. and Schulman. Manning. (2016). Learning human behaviors from motion capture by adversarial imitation. J. (2017).. Heck. Solé.. Mestres. K. M. Makelov. Wayne. G. Gradient Episodic Memory for Continuum Learning. C. M. Learning Online Alignments with Contin- uous Rewards Policy Gradient. and Socher. G. Srinivasan. Matiisen. S. R. T. Walrand. (2017). (2016). J. Towards Deep Learning Models Resistant to Adversarial Attacks. A. A. J. K.. P.. N. van Hasselt. Lemmon. and Vladu. Maurer. Li. O. ArXiv e-prints. Teacher-Student Curriculum Learning. Liang.. B. In Robotics: Science and Systems (RSS).. ArXiv e-prints. Schmidt. Mannion...0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics... H. The Journal of Machine Learning Research. Z. Wu. H. Luenberger. Luo. Cam- bridge University Press. Tamar... Mao.. G. Madry.. ArXiv e-prints.. TB. Muntés. Y. Machado. F. R. G. (1997). Z. (2017). Lowe.. Yao. and Goldberg. A. Abbeel. P. Tassa. J. C.. S. Alizadeh... Raghavan. and Cabellos.. 59 . and Schumann R. C. Weighted importance sampling for off-policy learning with linear function approximation. (2014). Least Squares Generative Adversarial Networks. I. Using recurrent neural networks for slot filling in spoken language understanding. Latapie. M... S. Laskey. Autonomic Road Transport Support Systems. M. C. Ermagan. J. and Mordatch. Carner. (2017). I. Harb. C. J. and Heess. Mahmood.. D. Aparicio Ojea. R. S. (2017). Rana. A. Parikh. Pontil. (2008). Irving... S. M. Y. S. (2016). and Kandula. Bellemare.. F. E. ArXiv e-prints. (2015). Tur. Hakkani-Tür. R. edited by McCluskey.. Wang. N. K... X.. Müller... J. ArXiv e-prints. G. 17(81):1–32.. P.. A.. Kotsialos. (2017). J. Duggan. Niyaz. Cham. Tsipras.. Menache..... K. Deng. ArXiv e-prints... Resource management with deep reinforcement learning. In the International Conference on Machine Learning (ICML). (2016). 23(3):530–539. Chiu. Evans. Y. P. D.. Oliver..-C. D.. Liu. D... Mao.... A.. (2016). Xiong. F.. Yu. Meyer. I. ArXiv e-prints. ArXiv e-prints. A. H.. and Language Processing. G. Merel.. Investment Science. V.. D. (2017)..... R. M. Maino.. Mahler. pages 47–66. X. and Wang. An experimental review of reinforcement learn- ing algorithms for adaptive traffic signal control. A.. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.. and Howley. Coras. T. Q. C. Barlet-Ros. and Ranzato. (2016).. J. L. and Kaliszyk. A. R... Estrada. and Sutton. Bengio. The benefit of multitask representation learning. Klügl. Jaitly. D. He... D. J. T. L. Mar̀uf... Cassar. A Laplacian framework for option dis- covery in reinforcement learning. D. and Bowling. M. G. Lu.. Springer International Publishing.Loos. ArXiv e-prints. In the In- ternational Conference on Machine Learning (ICML). Zhang. S. N.. Samy Bengio. Fidjeland.Mikolov.. Recurrent models of visual attention. Stepleton. D. and Bowling. R. Burch.. Jiang. Silver. D. Science. Lillicrap. Lisý. G. Graves. Moravčı́k.... In the International Conference on Learning Representations (ICLR). Silver.. and Dean. T. (2013). J... K.. D. M. R... F. R. Neural Models for Information Retrieval. T. Sadik. Bellemare. Q. Q.. ArXiv e-prints.. (2016). and Hadsell. S.. K. D. Moody.. A. Device placement optimization with reinforcement learning. Learning to trade via direct reinforcement. Machine Learning: A Probabilistic Perspective. and Schuurmans. (2017). M. Artificial Intelligence. V. Alcicek.. T. Badia. ArXiv e-prints... (2015). In the Annual Conference on Neural Information Processing Sys- tems (NIPS). S. I. Ostrovski. O..... In the Annual Conference on Neural Information Processing Systems (NIPS). H.. N. Pascanu.. O. V. Bard. C. Mo. Pham. Legg. K.. (2014). (2017). and Kavukcuoglu. 518(7540):529–533. G.. D.. Legg. T.. A. D.. Mitra. A.. Munos.. (2012). S. Computer go. T. and Bellemare.. Xu. S. Li. C. S. Panneershelvam. Nachum. Petersen. Kavukcuoglu. A. A. Murphy.. P. Müller. (2017). M. K. and Silver.. Nachum. M. IEEE Transactions on Neural Networks. A. and Hassabis. Ballard. Le. Kumaran. K... Veness. Deep learning for healthcare: review.. (2016). R. Soyer. M.. N.. Goroshin.. Kumar.. G. N. and Moham- mad Norouzi. K. T. Corrado. J.. D. Monroe. Mnih. Mirhoseini. Chen... V. Wang.. Norouzi... Schmid. pages 1–11. and Craswell. Harley. S. H.. D. P. Graves. C. Personalizing a Dialogue System with Transfer Learning. B. X. V. J. Rusu.. Communications of the ACM. and Kavukcuoglu. In the International Conference on Machine Learning (ICML). Waugh. K. V. (2017). A. (2015). M. Safe and efficient off- policy reinforcement learning... 60 . In the International Conference on Learning Representations (ICLR). Y. R. D. King.. M... Wierstra. 12(4):875–889. Improving policy gradient by exploring under-appreciated rewards. Human-level control through deep reinforcement learning. (2017)... F. Viola. K. Heess. D. and Schuurmans. (2017)...... Banino.. V. R. A. M.. and Saffell. Harutyunyan. M. Petersen. Explanation in Artificial Intelligence: Insights from the Social Sciences. Beattie.. Asynchronous methods for deep reinforcement learning. Wang. M. Y. M. Steiner. L. P. (2001). J. Miotto.. P. A. Kavukcuoglu. Morrill. Nair. De Maria. (2017). M. Miller.. A. Blackwell. Jo- hanson.. D.. Graves.. 60(6):12–14. Larsen.. V. J. opportunities and challenges.. The MIT Press. Fearon. Norouzi.. Antonoglou.. (2016). J. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. B. Mnih. Deep learning takes on translation. Mnih. Zhou. Denil. N. P. K. A. Srinivasan. Learning to navigate in complex environments.. Efficient estimation of word representations in vector space. G.. Beattie. and Dudley. Riedmiller. Mnih. Sifre. A. K.. Nature. Li. 134(1-2):145–179... (2017). Suleyman.. In the Annual Conference on Neural Information Processing Systems (NIPS). In the International Conference on Learning Representations (ICLR). T. Bridging the gap between value and policy based reinforcement learning. Mirowski. A.. M.. K.. Mirza. A. R. H.. Davis. M. Briefings in Bioinformatics. Kavukcuoglu. (2017). Massively parallel methods for deep reinforcement learning.. and Yang. In ICML 2015 Deep Learning Workshop.. D. (2002).. Kumaran. and Cho... Leahy... R. Goodfellow. Pazis. K. Chockalingam. K. (1997). A. Deep exploration via bootstrapped DQN. P. Nogueira.. A. and Cho. Nedić.. Ú.. Nogueira.. S. and Singh. S. D. (2017). I.. G. Singh. van den Oord. A survey on transfer learning. Richoux. PGQ: Combining policy gradient and Q-learning. In the Annual Conference on Neural Information Processing Systems (NIPS).. Q. R. R. B. Count-Based Exploration with Neural Density Models.. Ontañón.. Narasimhan. K. Pritzel. (2017). IEEE Transactions on Knowledge and Data Engineering.. (2017). R. H. G. and action in minecraft. (2015). N. R. A. R.. J. Semi-supervised knowledge transfer for deep learning from private training data. In the International Conference on Machine Learning (ICML). and Preuss. Yala. Bruton. J. (2017). Ostrovski. J. and Lee. and Munos. H. Enhancing q-learning for optimal asset allocation. and Lee.. In the International Conference on Machine Learning (ICML). Control of memory. Blundell.. Neyshabur. Ng.. M. Salakhutdinov. C. (2017). Guo. K. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Munos. and Roy. In the Annual Conference on Neural Information Processing Systems (NIPS). and Sivic. H. Big data in manufacturing: a systematic mapping study. T.Narasimhan. J. K. Tomioka.. Lewis. Churchill. and Yang. Algorithms for inverse reinforcement learning.. In the International Conference on Machine Learning (ICML). Osband.. Oquab. J. Discrete Event Dynamic Systems: Theory and Applications. In the International Conference on Learning Representations (ICLR). M. (2003). Kulkarni. Is object localization for free? – weakly- supervised learning with convolutional neural networks. ArXiv e-prints. B.. 22(10):1345 – 1359. A survey of real-time strategy game ai research and competition in starcraft. R. (2016). S. Kavukcuoglu.. R. S. ArXiv e-prints. S. N. J. S. and Russell. T. J. P. K. Papernot. J.. O’Donovan. I. (2016). ArXiv e-prints. and Barzilay.. active perception. Improving information extraction by acquiring external evidence with reinforcement learning. and O’Sullivan. V. 5(4):293–311. (2016).... and Mnih. F. Neuneier. and Vian.. Singh. Pan. O’Donoghue. and Talwar. (2017). Bottou. B. Erlingsson... (2015). How. Lee. K. IEEE Transactions on Computational Intelligence and AI in Games. Action-conditional video prediction using deep networks in atari games. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. Oh. D. L. Uriarte.. A.. Oh. Synnaeve. S. In the Annual Conference on Neural Information Processing Systems (NIPS). V. (2010). Omidshafiei.. In Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP).. G. Amato. J. Value Prediction Network. X. R. V. Laptev.. M. and Bertsekas. P. Language understanding for text-based games using deep reinforcement learning. 61 . ArXiv e-prints. (2015)... K. (2015). 13:79–110. Least squares policy evaluation algorithms with linear func- tion approximation. (2017).. M. ArXiv e-prints. and Srebro. R... Task-Oriented Query Reformulation with Reinforcement Learn- ing. A. Abadi.. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).. (2016). In the International Conference on Learning Representations (ICLR)... and Barzilay. End-to-End Goal-Driven Web Navigation. (2013). Oh. Journal of Big Data. Bellemare. (2000). Geometry of Optimization and Implicit Regularization in Deep Learning. A. 2(20). C. D. I. G. A.. Vecerik. L. Jie.. R. Pérez-D’Arpino. Erez. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Tang.. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. Y. L.0: an adversarial machine learning library. Goodfellow. I. (2017). and Ilie-Zudor. rahman Mohamed. Celikyilmaz. Zhou.. Q. R. Li. R. and Wang. J.. Hafner. Connecting Generative Adversarial Networks and Actor-Critic Methods..0..Papernot. T. Peng. Sheatsley.. ArXiv e-prints. Xiong. E. V. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Lee. Pritzel. and Wong... Pei. and Fawcett. M. Yang. Journal of Ambient Intelligence and Smart Environments. C.. Learning to Generate Reviews and Discover- ing Sentiment. D... Phua. cleverhans v1. Parisotto. M. Y. C.. B. C. Paulus. Composite task-completion dialogue system via hierarchical deep reinforcement learning.. Pfau. Robust adversarial reinforcement learning. Wen. R.. Marcus..0.. 62 . R.. Ba.. A. Lillicrap. Jozefowicz. (2017). Pinto. Srinivasan. Markov decision processes : discrete stochastic dynamic programming.. Yang. (2010). A. Lee... Popov. (2017). Lampe. O’Reilly Media. R. (2017). W. In the International Conference on Learning Representations (ICLR). K... (2017). 9(3):287–298.. and Sutskever. J. M. (2005). D.. Preuveneers. and McDaniel. R. (2013). L. D. (2017a). Sukthankar. In the International Conference on Machine Learning (ICML)... T. E. and Szepesári. P. Reinforced video captioning with entailment rewards. research challenges and opportunities in industry 4.. J.. Z. K. A. ArXiv e-prints. ArXiv e-prints. (2016). Approximate Dynamic Programming: Solving the curses of dimensionality (2nd Edition). C-learn: Learning geometric constraints from demon- strations for multi-step manipulation in shared autonomy.. Davidson.. Y. J. A. and Salakhutdinov. and Vinyals. (2017). R. Wiley-Interscience.. L. Puigdomènech. P... L. Radford. Yuan. Barth-Maron. Data-efficient Deep Reinforcement Learning for Dexterous Manipulation. and Kohli. Neuro- symbolic program synthesis.. N.. Neural Episodic Control. Heess. J. I.. Vinyals.. Li. A Comprehensive Survey of Data Mining-based Fraud Detection Research. R.. Smith. In the International Conference on Machine Learning (ICML). C. R. (2011). E.. (2017). O. The intelligent industry of the future: A survey on emerg- ing trends. Tassa. (2016). (2017b). B. Puterman......-F. Data Science for Business.. In IEEE International Conference on Robotics and Automation (ICRA).. and Shah. D. (2017). ArXiv e-prints. O. In the International Conference on Learning Representations (ICLR). and Riedmiller. ArXiv e-prints. Singh. L. (2016). Peng. S. A. Cumulative prospect theory meets reinforcement learning: Prediction and control. (2016). T. M. P. X. and Bansal. C. S. Long. B. and Socher. F. John Wiley and Sons. S. A.. H. ArXiv e-prints. ArXiv e-prints.. Actor-mimic: Deep multitask and transfer reinforcement learning. ArXiv e-prints.. ArXiv e-prints. and Blundell. M. D. S. Fu. (2017). Cao. Pasunuru. In Conference on Empirical Methods in Natural Language Processing (EMNLP).. and Jana. A Deep Reinforced Model for Abstractive Summa- rization. Gao. K. Hassabis. R.. and Gupta. and Gayler. Parisotto. (2017). N. T. I.. J. Powell. C.. Li. Provost. Wierstra. Uria. Feinman. Y. Prashanth. Khapra. Attend. Ribeiro. B. I. Schaul... and Silver. Chen. IEEE Intelligent Systems. (2015).. 29(4):82–87. (2016).. D. R.. M. J. S. M. The IBM 2016 English Conversational Telephone Speech Recognition System. and Chen. T. P. and Ravindran.. In the International Conference on Learning Representations (ICLR). S. PP(99):1–11. ArXiv e-prints. Universal value function approximators. J. ArXiv e-prints. Optimization as a model for few-shot learning. Pascanu. Barrett. Vandael. Ganguli. End-to-end Differentiable Proving. R. Schulman. Neural Networks. D. (2017).. P. (2009). An Overview of Multi-Task Learning in Deep Neural Networks. Sercu. Schaul. M. Ho. Artificial Intelligence: A Modern Approach (3rd edition).. the International Conference on Learning Representations (ICLR). Salakhutdinov.-K. a talk at Deep Learn- ing School. Reed. I. Claessens. M. Antonoglou. (2016). (2017). and de Freitas.... Gregor. S. Res- idential demand response of thermostatically controlled loads using batch reinforcement learning. M. Chopra.bayareadlschool. and Zaremba. (2016).. H. Survey of Expres- sivity in Deep Neural Networks. T. Singh. Raposo. 61:85– 117. Pearson. Kleinberg. In the International Conference on Machine Learning (ICML). G. Saon. In Annual Meeting of the International Speech Commu- nication Association (INTERSPEECH). (2017). S. ArXiv e-prints.. F. Rajendran.. N. A $3 trillion challenge to computational scientists: Transforming healthcare deliv- ery. J. Schutter. (2017). and Lillicrap. Poole. Russell. In the Interna- tional Conference on Learning Representations (ICLR). Deep learning in neural networks: An overview.first experiences with a data efficient neural rein- forcement learning method. ArXiv e-prints. A.. Schmidhuber.. J.com/watch?v= rK6bchqeaN8. P. Ravi.. P. Sandholm.. Evolution Strategies as a Scalable Alterna- tive to Reinforcement Learning... (2016). and Sutskever. X. Ruelens... Riedmiller. C. Babuška. R. adapt and transfer: Attentive deep architecture for adaptive transfer from multiple sources in the same domain..org. In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). S. 347(6218):122–123. and Belmans. Prioritized experience replay. T.. H. D. T.. (2015). X. ArXiv e-prints. J. J. D. Auli.. T.youtube. and Silver. R. Solving imperfect-information games. Horgan. J. (2016). Ruder.. Ranzato. Science. (2015)..Raghu. Lakshminarayanan. Rocktäschel.. and Norvig. G. S. IEEE Transactions on Smart Grid. Rennie. Saria. A.. In European Conference on Machine Learning (ECML). and Larochelle. T.. K. T. 63 . ”why should i trust you?” explaining the pre- dictions of any classifier. Malinowski. https://www. J. (2005). (2016).. (2014). Abbeel. Neural programmer-interpreters. Sequence level training with recurrent neural networks. (2017). B. B.. (2016). M. Equivalence Between Policy Gradients and Soft Q-Learning. T.. (2017). W. and Riedel. Santoro. B. P. Battaglia. S.. D. https://www. In the International Conference on Learning Representations (ICLR). S. Neural fitted Q iteration .. D. (2016). S. S. and Sohl-Dickstein. S.. M.. Salimans. T. Quan. Foundations of unsupervised deep learning. and Guestrin. M. In the International Conference on Learning Representations (ICLR). A simple neural network module for relational reasoning. ArXiv e-prints. J. (2017). and Kuo. T. Game-Theoretic. Schrittwieser. ArXiv e-prints. I. Panneershelvam. Cambridge University Press. (2017)... Reichert. Huang. J. D. (2007). (2017). Germain. Charlin. Chandar. Moritz. C. Preuss. A... and Shammah. Shen. (2016b). A. J. Michalski. (2016a). S. 529(7587):484–489. V. M.. M. I. Dulac-Arnold. D. N. and Waller. L. S.. Lin. S... Subramanian. S. Machine Learning. P. (2009). D. I. M.-S.. M.. what is the question? Artificial Intelligence. N. L. Stroup.. Reasonet: Learning to stop reading in machine comprehension.. If multi-agent learning is the answer. Sankar. D. Sharma. J. (2017). B. Learning to repeat: Fine grained action repetition for deep reinforcement learning. I. M.. and Murphy. S. In the International Conference on Machine Learning (ICML).. Rabinowitz.. (2017). Suhubdy. S. M. In NIPS 2016 Deep Reinforcement Learning Workshop. and Riedmiller. Hakkani-Tür. N..... S. Lakshminarayanan. et al. Gao. N. Learning to Plan Chemical Syntheses. M. A. G. Silver.. R. P.. D... Schaul. Shamir.. and Bengio. Deep reinforcement learning.. The predictron: End-to-end learning and planning.. Segler. Wierstra. In the International Conference on Learning Representations (ICLR). W. Levine.. (2013). R. Lowe. S. a tutorial at ICML 2016. L. van Hasselt. (2015)... In the International Conference on Machine Learning (ICML).. P. A. G. and Zinkevich. T. (2016). R. Degris..... Barker.. Ke. Guez.. Y. In the Association for Computational Linguistics annual meeting (ACL). (2015). Harley. H. Antonoglou. J. G.cc/ 2016/tutorials/deep_rl_tutorial. and Ravindran. Y. H. Guez. and Tishby. V.. D.. M. Newnham. C. S.. and Leyton-Brown. R. E. (2017). In the International Conference on Machine Learning (ICML). Serban. A survey of available corpora for building data-driven dialogue systems. (2017).Schulman. Weller. Huang.. abs/1512. Nature. Zhang. M.... S.. A. and Degris. Silver. Jordan.. Interactive reinforcement learning for task-oriented dialogue management. Trust region policy optimization. R.. P. Concurrent reinforce- ment learning from customer interactions. M. and Chen. J. L. In- forming sequential clinical decision-making through reinforcement learning: an empirical study. and Grenager... and Heck. Barreto.pdf. L. J. http://icml. D. Serban. and Abbeel. Shwartz-Ziv. V... Silver. S. Silver. K. Kim. S. 171:365–377. Pieper. D. ArXiv e-prints.. T. and Pineau. (2011)... A. In the Annual Conference on Neural Information Processing Systems (NIPS)... ArXiv e-prints. 64 . (2017). P.05742...... In NIPS 2016 Deep Learning for Action and Interaction Workshop.. 84:109–136. Y. Sifre. S.. Failures of Gradient-Based Deep Learn- ing. D. In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Silver. T. O. Schuurmans. (2016). Lanctot. and Logical Foundations. arXiv e-prints. Maddison. Shortreed. A. (2016). Heess. Y. Deep learning games... Laber. D. T. A Deep Reinforcement Learning Chatbot. V. (2014). S. Lever. Pineau. Nguyen. Mudumba. Lizotte. M.. Powers. A. ArXiv e-prints. Mastering the game of go with deep neural networks and tree search. Hessel. J. T. Shoham.. Shah.. J. Van Den Driessche. D. M. Deterministic policy gradient algorithms. Sotelo. Pineau. Z. Multiagent Systems: Algorithmic. T. Interactive learning for acquisition of grounded verb semantics towards human-robot communication. D.. Shoham. and Chai. J. J. and McFall. Opening the Black Box of Deep Neural Networks via Information.. Shalev-Shwartz. J. She. de Brebisson... D. S. J. Su. R. C. E. N. 65 . R. J. and reacting based on approxi- mating dynamic programming.-H. Pennington. Huang. ArXiv e-prints. D.. 3(1):9–44.. In Conference on Empirical Methods in Natural Language Processing (EMNLP).net/sutton/609%20dropbox/. Bhatnagar. Third person imitation learning. O. R. and White.. S. N. M. and Young. Learning multiagent communication with back- propagation. (2016b). A. C. S. and Le. Snell... (2015). S. Rojas-Barahona. R. planning.. In the International Conference on Machine Learning (ICML). L.. (2017).. and Fergus. N. L. In NIPS 2016 Deep Reinforcement Learning Workshop. Q. End-to-end optimization of goal-driven and visually grounded dialogue systems. S. MIT Press. S. Rojas-Barahona. M. In the International Conference on Machine Learning (ICML). Vandyke. Smith. Learning to predict by the methods of temporal differences. R. Sutton. T.. Guided deep reinforcement learning for additive manufacturing control application... Reinforcement learning for artificial intelligence.-H. R. (2016a). K. G. Scalable and sustainable deep learning via randomized hash- ing.. F. H. Sutton. A. Surana. Strub. P. in preparation). (2009a). (2014). (2016). course slides. The Journal of Machine Learning Research. H. In the Annual Conference on Neural Information Processing Systems (NIPS)... An emphatic approach to the problem of off-policy temporal-difference learning. ArXiv e-prints. C. T. and Sutskever. Socher.. (2017). S. Fast gradient-descent methods for temporal-difference learning with linear function approximation... Wen. R. Prototypical Networks for Few-shot Learning.. S. (1990). Semi-supervised re- cursive autoencoders for predicting sentiment distributions.-H. Mrksic. R. (2017). http://www. In the Association for Computational Linguistics annual meeting (ACL). R. 17:1–29. Gasǐć. C. (1998).. Vinyals. P. Continuously Learning Neural Dialogue Management. and Barto. Ultes. In the Annual Conference on Neural Information Processing Systems (NIPS). A. Sutton... D.. G. Gasic. and Fergus.. B. Integrated architectures for learning. and Reddy. R. S. O. (2016). A. Socher. (2017). C. Precup. K.. Perelygin. R. Y. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Wu. Mrksǐć. and Barto. A. Ultes.. Sutton.... Sukhbaatar. E. Reinforcement Learning: An Introduction.-H. R. Ng. and Potts. (2017).. Courville. Su.. A. Manning. A.. (2017). On-line active reward learning for policy optimisation in spoken dialogue systems. End-to-end memory networks. Swersky. R. and Young.. Spring. J.. (2017). Sutton. incompleteideas. J.. Abbeel.. I. Sukhbaatar. ArXiv e-prints. Piot. D.. (2016). and Manning. Szepesvári. Sutskever. K. O. J. Sutton. Sutton. Silver. L. A Brief Introduction to Machine Learning for Engineers. Chuang. P.. Wen.. Maei. S.. A. Szlam. ArXiv e-prints. Machine Learning. Mary. and Pietquin. Weston. M.bank. (2016). I. Stadie. S. Ng. B. Sarkar.... and Zemel. MIT Press. In the Annual Conference on Neural Information Processing Systems (NIPS). Vandyke. de Vries. and Wiewiora. (1988). ArXiv e-prints. H. R. S. Reinforcement Learning: An Introduction (2nd Edition. Sequence to sequence learning with neural networks. A. Recur- sive deep models for semantic compositionality over a sentiment tree. V. S. R.... In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). In the Interna- tional Conference on Learning Representations (ICLR). S. Best Practices for Applying Deep Learning to Novel Applications.. (2013). (2011). S. D.. J... A. and Shrivastava.Simeone. Mahmood. Taylor.. In the AAAI Conference on Artificial Intelligence (AAAI). Y. Trischler. E.. B. J. Value iteration networks. G. M. A. W. Zahavy. G. S.. Regression methods for pricing complex American-style options. Sutton. (2016). Richoux. T.. 42(5):674–690. (2017). A deep hierarchical approach to lifelong learning in minecraft. ArXiv e-prints. Auvolat. Saurous.. M. An analysis of temporal-difference learning with function approximation.. S.. Y. T. and Mannor.. B. R. M. Tamar.. (2009).. S. 6(2):215–219. (2016). C. J. Chen.. X. In the Annual Conference on Neural Information Processing Systems (NIPS). and Precup. Delp. Pilarski. D. and Singh. J. S.-J. (2001).. P... J. (2017).. Neural Computation. (2000). Personalized ad recommendation systems for life-time value optimization with guarantees.. and Maei. Lin. Ye. Tessler. E.. V. P. Y. (1994). and Suleman. (2009b).. Singh.. M. Thomas. S. Episodic exploration for deep de- terministic policies: An application to StarCraft micromanagement tasks. and White. Synnaeve. Morgan & Claypool. Efficient Processing of Deep Neural Net- works: A Tutorial and Survey. N. Givony. S. M. Policy gradient methods for rein- forcement learning with function approximation. Z. Szepesvári. TorchCraft: a Library for Machine Learning Research on Real-Time Strategy Games. (2017). ArXiv e-prints. 112(1-2):181–211. (2017)... Synnaeve. and Blei. IEEE Transactions on Neural Networks. and Van Roy. Wu. Z.. Journal of Machine Learning Research. B. Tesauro. A.. S. S. A. Z.. D. (2011). RL-Glue : Language-independent software for reinforcement- learning experiments. T. Y. 10:1633–1685.. Brevdo. and Emer. T. Modayil.. In the Annual Conference on Neural Information Processing Systems (NIPS). A.. S. IEEE Transactions on Automatic Control. ELF: An Extensive. McAllester. D. Degris. H. . White. and Usunier. S.. G. (2017). M.. and Mansour. Chintala.. (2015). S. R. Szepesvári. Tsitsiklis. K. Sze. (2016). D... In Conference on Empirical Methods in Natural Language Processing (EMNLP). and Chintala. R. R. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. and Abbeel. Tsitsiklis. In the Annual Conference on Neural Information Processing Systems (NIPS).. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Theocharous. and Van Roy. Precup. 10:2133–2136. S. Shang.. Sutton. A. R. C.. (2009). and Ghavamzadeh. Journal of Machine Learning Research. 66 . achieves master-level play. Sutton.Sutton. G..... Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. D. C. In the International Conference on Learning Representations (ICLR). N. N.. (1997). Levine. F. Deep probabilistic programming. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).-H. P. R. N.. Algorithms for Reinforcement Learning. Mankowitz. Hoffman. ArXiv e-prints. L. proc. G... TD-Gammon. of 10th. Gong. Tian. Lightweight and Flexible Research Platform for Real-time Strategy Games. P. In the International Conference on Learning Representations (ICLR). Natural language comprehension with the epireader. J. D. Y. N.. D. Wu.. Transfer learning for reinforcement learning domains: A survey. a self-teaching backgammon program. K. (1999). Yuan. A. Thomas. Murphy. and Zitnick. Usunier. Nardelli. Artificial Intelligence. Lacroix. and Stone. (2010). Tanner... Tran.. Q. In the International Joint Conference on Artificial Intelligence (IJCAI). Yang. Lin. 12(4):694–703. (2017). R. T.. Embed to control: A locally linear latent dynamics model for control from raw images... (2016a). and Dayan. O. T. L. H. In the Annual Conference on Neural Information Processing Systems (NIPS). (2017). K. Laroche. Uszkoreit. V. A.. Fortunato.. Blundell. In NIPS’16 Workshop on Learning. Lillicrap. F. (2016b). Attention is all you need.. (2017). Vezhnevets. R. M. P.. Wei. and Silver. Hybrid reward architecture for reinforcement learning. and Heess. J. V. Chang. In the Annual Conference on Neural Information Processing Systems (NIPS). Silver. (2017). (2015). Watter. Guez. and Zhou.. (2016).... M. Wang. Mnih. Du- eling network architectures for deep reinforcement learning.. M. Z. X. Graves.. Watkins. .van der Pol. J. D. and Kavukcuoglu. Z... J. D.. Heess. M. J. J. (1992).. Vaswani. Parmar. Vinyals. W. (2017). K. Sample efficient actor-critic with experience replay. Gomez. J.. Kaiser... I. Learning values across many orders of magnitude. Fatemi. Osindero. Machine Learning. J.. O. (2016). (2016). and Manning. N... Blundell. S. ArXiv e-prints.... Wang.. and Botvinick. D. Vinyals. Wang. N. Mnih... Z. T. On the Origin of Deep Learning. H. 67 . W. Yang.. N. O. van Hasselt. Wang. Leibo. Munos. and Riedmiller... Pointer networks.. C. Strategic attentive writer for learning macro-actions. M. P.. ArXiv e-prints. J.. and Gao. A. Liang.. S. van Hasselt. (2016b). B. In the Annual Conference on Neural Information Processing Systems (NIPS). Gated self-matching networks for reading comprehension and question answering. Wang. Z. S. In the Association for Computational Linguistics annual meeting (ACL). B. D. D.. Hessel. (2017b). In the AAAI Conference on Artificial Intelligence (AAAI). Z. K. In the International Conference on Machine Learning (ICML). ArXiv e-prints. A. Deep reinforcement learning with double Q- learning. and Tsang. Boedecker. H. M. E. J. N. J. C.. N... N. I. S. Q-learning.. Schaul.. (2016a). T. N. N. M. In the International Conference on Learning Representations (ICLR). and Polo- sukhin. van Hasselt.. Merel. Bapst. Kumaran. G. Tirumala. (2017).. A. Lanctot.. Munos. C. In the Annual Conference on Neural Information Processing Systems (NIPS).. F.. Inference and Control of Multi-Agent Systems. A. Wang. H... Wang.. A. and de Freitas.. Schaul.. and de Freitas. M... N... M. de Freitas.. Wayne. Kavukcuoglu. Jones.. D. Guez. Romoff. Wang... Kurth-Nelson. and Jaitly. Learning to reinforcement learn. Shazeer.. W. Mnih. R. Agapiou. Learning language games through interaction.. V. Heess. Wang.. N.. Jaderberg. Kavukcuoglu. (2015). Reed. K... A. A.. Springenberg. R. C. Hessel. N. (2017a). J. 8:279–292. Wang. ArXiv e-prints. and Raj.. Osindero. Feudal networks for hierarchical reinforcement learning. Vezhnevets. Robust Imitation of Diverse Behaviors. Vinyals.. V. and Oliehoek. In the Association for Computational Linguistics annual meeting (ACL).. In the Annual Conference on Neural Information Processing Systems (NIPS). D. and Kavukcuoglu. Barnes. and Silver. S. T. M. Soyer.. C. van Seijen. In the Annual Conference on Neural Information Processing Systems (NIPS). L. H. In the International Confer- ence on Machine Learning (ICML). Beyond Monte Carlo Tree Search: Playing Go with Deep Alternative Neural Network and Long-Term Evaluation. H. (2017).. Matching net- works for one shot learning. and Wierstra. H. In the Annual Conference on Neural Information Processing Systems (NIPS). Coordinated deep reinforcement learners for traffic light control. S. . S. Hadsell. Gasic. A. D. (2016). and Ba.. N.. Young. E. J.. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. Xiong. 6(5):2312–2324.... M.. In the International Conference on Learning Representations (ICLR). Visual Interaction Networks. Machine Learning.. J. Hughes. M.. Pascanu. V. Optimal demand response using device-based rein- forcement learning.. (2016). Y. D. R.. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning.. T. 8(3):229–256.. R.. Srebro. N... A survey of transfer learning. and Tian. Liao. Wu.. Wu. Zhong. Bapst. and Zweig. X.. Gouws. ArXiv e-prints. In the International Conference on Learning Representations (ICLR). A. Adversarial Neural Machine Translation. Vandyke.. (2017).. Q. and Young. R. Kaiser. and Chen... In the Association for Computational Linguistics annual meeting (ACL).. Heess. Jaitly. and Zoran. IEEE Transactions on Smart Grid. (2017). Williams. Q. Kurian. A network-based end-to-end trainable task-oriented dialogue system.. Wu. C.. J.. Asadi.. T. Battaglia. Z. T. Training agent for first-person shooter game with actor-critic curriculum learning. Wen. Z. Wen. N. Chopra.-H. (2015). and Young. M. Williams. W. Journal of Big Data. ArXiv e-prints. Dynamic coattention networks for question answer- ing. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. The Marginal Value of Adaptive Gradient Methods in Machine Learning. Wiering.. Y.. L.. (2016). N. T.. K. J. Gasic. Ultes.. In the International Conference on Autonomous Agents & Multiagent Systems (AAMAS). H. A.. and Bordes. Weber.. M. C.. M. Rojas-Barahona.. Wu. Quan. (2017)... Liu. Y.. N. Khoshgoftaar. M. and Dean.. Norouzi. and Liu. Mrksic. G. Weston.. Su. P. K.. Corrado. O’Neill. Mansimov. L. Distral: Robust multitask reinforcement learning. (2015a). M. 3(9). Lai. Z. and van Otterlo. Schuster. D. Kudo.. Wang.. Kazawa. Macherey.. Qin. G. Xia. M. Wu.. (2015b). Shah. M.. In the International Conference on Learning Representations (ICLR). (1992). White. S. ArXiv e-prints. P. In Pro- ceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL). R. F. J.. Chen. (2017a). M. M.-H. Cao. D.. (2016)... 68 . Su. Mrksic. Klingner... A... Semantically con- ditioned LSTM-based natural language generation for spoken dialogue systems.. Zhao.. S.. Rudnick. (2017). J. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech. N.. and Socher.. Macherey. A. S. Chorowski. and Maei. Google’s neural machine translation system: Bridging the gap between human and machine translation. Williams. R. R. In the Annual Conference on Neural Information Processing Systems (NIPS).Watters. Y. Y. and Zweig. J. O. Stern. ArXiv e-prints. and Recht. Stevens.. V... D.-H. A. J. H. Y. Reinforcement Learning: State-of-the-Art (edited book). and White. Wilson.. J. G. and Pascanu.. Patil.. T. Y.. and Wang. Gao. Weiss.... S.. ArXiv e-prints. N.. Springer. Johnson. T. L. Czarnecki. Roelofs. M. ArXiv e-prints.-H. (2017). J. Kirkpatrick. Krikun. Le... Investigating practical linear temporal difference learning. Tian. W. Y. Smith. V. J. Vinyals. R.. Kato. J. ArXiv e-prints. D. (2017). Tacchetti. D. J.. M. S. M. P. W. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning. B. Riesa. K. C. (2017). R.-Y. (2017). Vandyke. (2017).. (2012). Y. Grosse. Whye Teh. J. T.. Wen. Memory networks. K. Weiss.... L. G. Stolcke. W.. and Wang.. End-to-End Joint Learning of Natural Language Understanding and Dialogue Manager. Seide. Y.. Crook. ArXiv e-prints. In the Annual Conference on Neural Information Processing Systems (NIPS). C. J. W. Yu. P. Deng. Grefenstette. D. Sutton. S. M.. Yang. Hu. A. 69 . Hoang. Understanding deep learning requires rethinking generalization. Szepesvari. H. B. Zhang. R. Li. Chebotar. J.. T. Wu. J. X. (2017). Blunsom. O. and Ling. Li. In Conference on Empirical Methods in Natural Language Process- ing (EMNLP). Gašić.. J. Reinforcement Learning Neural Turing Machines .. K. and Li.. W. (2017)... In the International Conference on Learning Representations (ICLR).. and Cohen. Zaremba. Zagoruyko.. X. Xiong. (2016). Salakhutdinov. IEEE Transactions on Industrial Informatics. Interact and Talk: Learning to Speak via Interaction.-N. (2014). X. Huang. A.. J. W. Gao. W. L. Y.. (2014). Paying more attention to attention: Improving the per- formance of convolutional neural networks via attention transfer. B. In the Annual Conference on Neural Information Processing Systems (NIPS). Y. Yahya. and Stolcke. R. and Yu. 101(5):1160–1179. (2017). Show. W. W. Kiros.. W. POMDP-based statistical spoken dialogue systems: a review. and Deng.. In the International Conference on Machine Learning (ICML). Yang. A. X.. C.. Kalakrishnan.. Universal option models.. (draft). R. and Vinyals. W. Xu. Thomson.. How transferable are features in deep neural networks? In the Annual Conference on Neural Information Processing Systems (NIPS). Ba. Artificial Intelligence and Games. attend and tell: Neural image caption generation with visual attention. Collective robot reinforce- ment learning with distributed asynchronous guided policy search. Clune. Zhang. A general projection property for dis- tribution families. S. S. (2017). Xiong. S. Zhang. Internet of things in industries: A survey.. and Togelius.-L.. L. Y. Yang. Stacked Attention Networks for Image Question Answering. Hakkani-Tur. R. The microsoft 2016 conversational speech recognition system. PROC IEEE. and Lipson. ArXiv e-prints. J. (2017c).. C. Bengio. In The IEEE International Con- ference on Acoustics.. and Xu.Xiong. and Zweig.. E. H. (2017). R. S. W. Seltzer. The Microsoft 2017 Conversational Speech Recognition System. Xu. Courville. (2015). X.. N. Speech and Signal Processing (ICASSP). D. Seqgan: Sequence generative adversarial nets with policy gradient. Huang.. Salakhutdinov... L. In the AAAI Conference on Artificial Intelligence (AAAI). (2009)... Dyer.. ArXiv e-prints.. C. H. Semi-supervised qa with generative domain-adaptive nets. In the International Conference on Learning Representations (ICLR). Yogatama... Recht. J... K. J. (2013). S. and Smola. W. J. (2015). Gao. ArXiv e-prints. G. L... J.. A. D. (2016). S.. Young. Alleva. Yosinski. Droppo... Y. M. J. Z... and Komodakis. Y. He. Deeppath: A reinforcement learning method for knowledge graph reasoning. I. Yu. Hardt. Y. Modayil. and Levine. Yu.. P. Y. D.. J. D.... (2017). He. Cho... Z. Listen. and Schuurmans. F. (2017a).. Wang. Yannakakis... Yu. Droppo. Szepesvári. Bengio.. In the International Conference on Learning Representations (ICLR). M. L. L. In the Association for Computational Linguistics annual meeting (ACL). and Bengio.. (2017b).. and Sutskever. Learning to compose words into sentences with reinforcement learning. A. N.. Chen. Zemel. H.. and Bhatnagar. 10(4):2233–2243. (2015). ArXiv e-prints. ArXiv e-prints. (2014). S. D. F. Li.. A... M. (2017).. Yao. G.Revised. and Williams.. Towards end-to-end learning for dialog state tracking and man- agement using deep reinforcement learning. ArXiv e-prints.. A. Pezeshki. (2017). Kim. Wallace.. Z. In the Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL). M.. T. C. Gupta. B... Introduction to semi-supervised learning.. J.. B..-L.. Vasudevan. and Lapata.. M. ArXiv e-prints. M. Zhao. Rules of Machine Learning: Best Practices for ML Engineering. Zinkevich. Target-driven visual navigation in indoor scenes using deep reinforcement learning. T. Z.. Sun. Y. Zoph. Zhong. Q. C... Zhou. and Le. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. ArXiv e- prints.. Y. Angert. (2017). Zhang. E. Zhang.. and Lease. (2017).... M. R. Zhang. and Feng. Z. M. (2016). Yan. Zoph...... A. H. Q.org/rules of ml/rules of ml. E. V. ArXiv e-prints. C. ArXiv e-prints..-L. Braylan. X.. Y. http://martin. Brakel. (2016).zinkevich. (2017)... Lim. M.. A.. Luan. S.. (2009). T. Li. and Goldberg. Thanh Nguyen. Y. Xu. Huang. (2017). Y.. and Courville. V. In the International Conference on Learning Representations (ICLR)... Banner. Shlens. In IEEE International Conference on Robotics and Automation (ICRA). A. and Liu. V. THUMT: An Open Source Toolkit for Neural Machine Translation.Zhang. V. and Le. and Liu.. and Liu. ArXiv e-prints. B.-H.-H. and Farhadi. 70 . Khetan.pdf. Q. J. Mottaghi. D. McDonnell. M. J. S.-F. B. (2017). J. Cheng. Emotional Chatting Machine: Emo- tional Conversation Generation with Internal and External Memory. Machine Learning (in Chinese). (2016). Mustafizur Rahman. M. Tsinghua University Press. (2017). Dang. Morgan & Claypool. ArXiv e-prints. X. B. P. China. Neural architecture search with reinforcement learning. Shen. J. Deep Forest: Towards An Alternative to Deep Neural Networks.. Zhang. In Conference on Empirical Methods in Natural Language Processing (EMNLP). J. F.. Beijing. A. McNamara. L. Learning Transferable Architectures for Scalable Image Recognition. Chang... and Eskenazi. (2017). H. Zhu. (2017b). Zhang. Practical Network Blocks Design with Q-Learning. Ding.. Zhu. A. Yoshua Bengio. B. H. H. X. Neural Information Retrieval: A Literature Review. Y. Sentence simplification with deep reinforcement learning.. (2017c). Kolve. A. Zhou. Zhu. Zhou. Documents Similar To 1701.07274Skip carouselcarousel previouscarousel nextCaffe PaperData Science in BriefActive One Shot Learning 2016QMDP-Net- Deep Learning for Planning Under Partial ObservabilityThe 9 Deep Learning Papers You Need to Know About 3Deep Learning Tutorial and Recent Trends by Song HanUsing Pre-Trained ModelsMT14_NeuralMT.pdfdl a surveyMachine Learning Syllabus PDFCfm ParametersDeepRLDecoupling Wide-Area Networks From Forward-Error Correction in SmalltalkCS224d-Lecture1ENERGY AWARE REINFORCEMENT LEARNING NETWORK SECURITY1. a Detailed Approach to Reinforcement Learning - A Semi-batch Reactor Case Study 201300b7d51532c0bf3ceb000000.pdfLearning What to Valueicaps-weng-v1live-301-1562-jair.pdf1-s2.0-S0925231213004888-main2 1 LearningMachine Theory of MindStock Prices Forecast Using Radial Basis Function Neural Network12428-55644-1-PBActivity Based Costing in High‐Variety ProductionEetop.cn_dIY Deep Learning for Vision- a Hands-On Tutorial With CaffeRefinery Ops Planning804512Decentralized, Wireless Algorithms for SymmetricMore From Diego Alejandro Gomez MosqueraSkip carouselcarousel previouscarousel nextvaloresyantivalores-090621131829-phpapp02TradiccionesCircular Decimo[1]8977927 Robbie Williams Feel Piano Partitura Sheet Music Noten Partition SpartitiFooter MenuBack To TopAboutAbout ScribdPressOur blogJoin our team!Contact UsJoin todayInvite FriendsGiftsLegalTermsPrivacyCopyrightSupportHelp / FAQAccessibilityPurchase helpAdChoicesPublishersSocial MediaCopyright © 2018 Scribd Inc. .Browse Books.Site Directory.Site Language: English中文EspañolالعربيةPortuguês日本語DeutschFrançaisTurkceРусский языкTiếng việtJęzyk polskiBahasa indonesiaYou're Reading a Free PreviewDownloadClose DialogAre you sure?This action might not be possible to undo. Are you sure you want to continue?CANCELOK
Copyright © 2025 DOKUMEN.SITE Inc.