Statistical Policy Search Reinforcement Learning Methods and Applications

References

[1] Sigaud, O., Garcia, F. Markov Decision Processes in Artificial Intelligence [M]. John Wiley & Sons, 2013.

[2] Surhone, L. M., Tennoe, M. T., Henssonow, S. F. Partially Observable Markov Decision Process [M]. Betascript Publishing, 2010.

[3] Sutton, R. S., Barto, A. G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998.

[4] Szepesvári, C. Algorithms for Reinforcement Learning [J]. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010, 4(1).

[5] Guo, X. 深入浅出强化学习:原理入门 [M]. Beijing: Publishing House of Electronics Industry, 2018. (in Chinese)

[6] Watkins, C., Dayan, P. Q-learning [J]. Machine Learning, 1992, 8(3-4): 279-292.

[7] Rummery, G. A., Niranjan, M. On-Line Q-Learning Using Connectionist Systems [R]. Cambridge University Engineering Department, Technical Report, 1994.

[8] Lagoudakis, M. G., Parr, R. Least-Squares Policy Iteration [J]. Journal of Machine Learning Research, 2003, 4: 1107-1149.

[9] Silver, D., Schrittwieser, J., Simonyan, K., et al. Mastering the Game of Go without Human Knowledge [J]. Nature, 2017, 550(7676): 354-359.

[10] Liu, J., Gao, F., Luo, X. 基于值函数和策略梯度的深度强化学习综述 [J]. Chinese Journal of Computers, 2019, 42(6): 1406-1438. (in Chinese)

[11] Van Hasselt, H., Guez, A., Silver, D. Deep Reinforcement Learning with Double Q-learning [C]. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[12] Hester, T., Vecerik, M., Pietquin, O., et al. Learning from Demonstrations for Real World Reinforcement Learning [J]. arXiv preprint arXiv:1704.03732, 2017.

[13] Raghu, A., Komorowski, M., Celi, L. A., et al. Continuous State-Space Models for Optimal Sepsis Treatment: A Deep Reinforcement Learning Approach [J]. arXiv preprint arXiv:1705.08422, 2017.

[14] Jaques, N., Gu, S., Bahdanau, D., et al. Sequence Tutor: Conservative Fine-tuning of Sequence Generation Models with KL-control [C]. Proceedings of the 34th International Conference on Machine Learning, 2017, 70: 1645-1654.

[15] Gu, S., Lillicrap, T., Sutskever, I., et al. Continuous Deep Q-Learning with Model-Based Acceleration [C]. Proceedings of the 33rd International Conference on Machine Learning, 2016: 2829-2838.

[16] Ng, A. Y., Jordan, M. I. PEGASUS: A Policy Search Method for Large MDPs and POMDPs [C]. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 2000.

[17] Williams, R. J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning [J]. Machine Learning, 1992, 8(3-4): 229-256.

[18] Kakade, S. A Natural Policy Gradient [C]. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002: 1531-1538.

[19] Peters, J., Schaal, S. Reinforcement Learning for Operational Space Control [C]. Proceedings of the 2007 IEEE International Conference on Robotics and Automation, 2007.

[20] Amari, S. Natural Gradient Works Efficiently in Learning [J]. Neural Computation, 1998, 10(2): 251-276.

[21] Deisenroth, M. P., Neumann, G., Peters, J. A Survey on Policy Search for Robotics [J]. Foundations and Trends in Robotics, 2013, 2(1-2): 1-142.

[22] Kober, J., Peters, J. Policy Search for Motor Primitives in Robotics [J]. Machine Learning, 2011, 84(1-2): 171-203.

[23] Dayan, P., Hinton, G. E. Using Expectation-Maximization for Reinforcement Learning [J]. Neural Computation, 1997, 9(2): 271-278.

[24] Casas, N. Deep Deterministic Policy Gradient for Urban Traffic Light Control [J]. arXiv preprint arXiv:1703.09035, 2017.

[25] Schulman, J., Levine, S., Moritz, P., et al. Trust Region Policy Optimization [C]. Proceedings of the 32nd International Conference on Machine Learning, 2015: 1889-1897.

[26] Pérez-Cruz, F. Kullback-Leibler Divergence Estimation of Continuous Distributions [C]. 2008 IEEE International Symposium on Information Theory, 2008: 1666-1670.

[27] Schulman, J., Wolski, F., Dhariwal, P., et al. Proximal Policy Optimization Algorithms [J]. arXiv preprint arXiv:1707.06347, 2017.

[28] Zhao, T., Hachiya, H., Niu, G., et al. Analysis and Improvement of Policy Gradient Estimation [J]. Neural Networks, 2012, 26: 118-129.

[29] Zhao, T., Hachiya, H., Niu, G., et al. Analysis and Improvement of Policy Gradient Estimation [C]. Advances in Neural Information Processing Systems (NIPS), 2011, 24: 262-270.

[30] Zhao, T., Hachiya, H., et al. Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration [J]. Neural Computation, 2013, 25(6): 1512-1547.

[31] Zhao, T., Niu, G., Xie, N., et al. Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation [C]. Proceedings of the 7th Asian Conference on Machine Learning (ACML 2015), 2015, 45: 333-348.

[32] Sehnke, F., Zhao, T. Baseline-Free Sampling in Parameter Exploring Policy Gradients: Super Symmetric PGPE [M]. Artificial Neural Networks. Springer Series in Bio-/Neuroinformatics, 2015: 271-293.