References
[1] Sigaud, O., Garcia, F.. Markov Decision Processes in Artificial Intelligence [M]. John Wiley & Sons, 2013.
[2] Surhone, L.M., Tennoe, M.T., Henssonow, S.F.. Partially Observable Markov Decision Process [M]. Betascript Publishing, 2010.
[3] Sutton, R.S., Barto, A.G.. Reinforcement Learning: An Introduction [M]. Cambridge: MIT Press, 1998.
[4] Szepesvári, C.. Algorithms for Reinforcement Learning [J]. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010, 4(1).
[5] Guo, X.. 深入浅出强化学习:原理入门 [M]. Beijing: Publishing House of Electronics Industry, 2018.
[6] Watkins, C., Dayan, P.. Q-learning [J]. Machine Learning, 1992, 8(3-4): 279-292.
[7] Rummery, G.A., Niranjan, M.. On-Line Q-Learning Using Connectionist Systems [R]. Cambridge: Cambridge University Engineering Department, 1994.
[8] Lagoudakis, M.G., Parr, R.. Least-Squares Policy Iteration [J]. Journal of Machine Learning Research, 2003, 4(6): 1107-1149.
[9] Silver, D., Schrittwieser, J., Simonyan, K., et al. Mastering the Game of Go Without Human Knowledge [J]. Nature, 2017, 550(7676): 354-359.
[10] Liu, J., Gao, F., Luo, X.. 基于值函数和策略梯度的深度强化学习综述 [J]. 计算机学报 (Chinese Journal of Computers), 2019, 42(6): 1406-1438.
[11] Van Hasselt, H., Guez, A., Silver, D.. Deep Reinforcement Learning with Double Q-Learning [C]. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[12] Hester, T., Vecerik, M., Pietquin, O., et al. Learning from Demonstrations for Real World Reinforcement Learning [J]. arXiv preprint arXiv:1704.03732, 2017.
[13] Raghu, A., Komorowski, M., Celi, L.A., et al. Continuous State-Space Models for Optimal Sepsis Treatment: A Deep Reinforcement Learning Approach [J]. arXiv preprint arXiv:1705.08422, 2017.
[14] Jaques, N., Gu, S., Bahdanau, D., et al. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-Control [C]. In Proceedings of the 34th International Conference on Machine Learning, 2017, 70: 1645-1654.
[15] Gu, S., Lillicrap, T., Sutskever, I., et al. Continuous Deep Q-Learning with Model-Based Acceleration [C]. In Proceedings of the 33rd International Conference on Machine Learning, 2016: 2829-2838.
[16] Ng, A.Y., Jordan, M.I.. PEGASUS: A Policy Search Method for Large MDPs and POMDPs [C]. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, 2000.
[17] Williams, R.J.. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning [J]. Machine Learning, 1992, 8(3-4): 229-256.
[18] Kakade, S.. A Natural Policy Gradient [C]. In Advances in Neural Information Processing Systems 14. Cambridge: MIT Press, 2001: 1531-1538.
[19] Peters, J., Schaal, S.. Reinforcement Learning for Operational Space Control [C]. In Proceedings of the IEEE International Conference on Robotics and Automation, 2007.
[20] Amari, S.. Natural Gradient Works Efficiently in Learning [J]. Neural Computation, 1998, 10(2): 251-276.
[21] Deisenroth, M.P., Neumann, G., Peters, J.. A Survey on Policy Search for Robotics [J]. Foundations and Trends in Robotics, 2013, 2(1-2): 1-142.
[22] Kober, J., Peters, J.. Policy Search for Motor Primitives in Robotics [J]. Machine Learning, 2011, 84(1): 171-203.
[23] Dayan, P., Hinton, G.E.. Using Expectation-Maximization for Reinforcement Learning [J]. Neural Computation, 1997, 9(2): 271-278.
[24] Casas, N.. Deep Deterministic Policy Gradient for Urban Traffic Light Control [J]. arXiv preprint arXiv:1703.09035, 2017.
[25] Schulman, J., Levine, S., Moritz, P., et al. Trust Region Policy Optimization [C]. In Proceedings of the 32nd International Conference on Machine Learning, 2015: 1889-1897.
[26] Pérez-Cruz, F.. Kullback-Leibler Divergence Estimation of Continuous Distributions [C]. In Proceedings of the 2008 IEEE International Symposium on Information Theory, 2008: 1666-1670.
[27] Schulman, J., Wolski, F., Dhariwal, P., et al. Proximal Policy Optimization Algorithms [J]. arXiv preprint arXiv:1707.06347, 2017.
[28] Zhao, T., Hachiya, H., Niu, G., et al. Analysis and Improvement of Policy Gradient Estimation [J]. Neural Networks, 2012, 26(2): 118-129.
[29] Zhao, T., Hachiya, H., Niu, G., et al. Analysis and Improvement of Policy Gradient Estimation [C]. In Advances in Neural Information Processing Systems (NIPS), 2011, 24: 262-270.
[30] Zhao, T., Hachiya, H., et al. Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration [J]. Neural Computation, 2013, 25: 1512-1547.
[31] Zhao, T., Niu, G., Xie, N., et al. Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation [C]. In Proceedings of the 7th Asian Conference on Machine Learning (ACML 2015), 2015, 45: 333-348.
[32] Sehnke, F., Zhao, T.. Baseline-Free Sampling in Parameter Exploring Policy Gradients: Super Symmetric PGPE [A]. In Artificial Neural Networks (Springer Series in Bio-/Neuroinformatics), 2015: 271-293.