References
[1] Schacter,D.,Gilbert,D.,Wegner,D.,et al.Psychology:European Edition[M].Worth Publishers,2011.
[2] Mitchell,T.M..The Discipline of Machine Learning[R].Technical Report CMU-ML-06-108,Carnegie Mellon University,2006.
[3] Murphy,K.P..Machine Learning:A Probabilistic Perspective[M].MIT Press,Cambridge,MA,2012.
[4] Bishop,C.M..Pattern Recognition and Machine Learning (Information Science and Statistics)[M].Secaucus,NJ,USA:Springer-Verlag New York,Inc.,2006.
[5] Sutton,R.S.,Barto,A.G..Reinforcement Learning:An Introduction[M].Cambridge,MA:MIT Press,1998.
[6] Kaelbling,L.P.,Littman,M.L.,and Moore,A.W..Reinforcement Learning:A Survey[J].Journal of Artificial Intelligence Research,1996,4:237-285.
[7] Poole,D.,Mackworth,A.K..Artificial Intelligence:Foundations of Computational Agents[M].Cambridge University Press,2010.
[8] Kirk,D.E..Optimal Control Theory:An Introduction[M].Dover Publications,2004.
[9] Bertsekas,D.P..Dynamic Programming and Optimal Control:2nd Edition[M].Athena Scientific,1995.
[10] Sutton,R.S.,Barto,A.G.,and Williams,R.J..Reinforcement Learning is Direct Adaptive Optimal Control[J].IEEE Control Systems Magazine,1992,12(2):19-22.
[11] Busoniu,L.,Babuška,R.,De Schutter,B.,et al.Reinforcement Learning and Dynamic Programming Using Function Approximators[M].CRC Press,Inc.,2010.
[12] Chen Chunlin.Autonomous Learning and Navigation Control of Mobile Robots Based on Reinforcement Learning[D].Hefei:University of Science and Technology of China,2006.
[13] Peters,J.,Schaal,S..Policy Gradient Methods for Robotics[C].In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,2006:2219-2225.
[14] Tesauro,G..TD-Gammon,a Self-Teaching Backgammon Program,Achieves Master-Level Play[J].Neural Computation,1994,6(2):215-219.
[15] Abe,N.,Kowalczyk,M.,Domick,M.,et al.Optimizing Debt Collections Using Constrained Reinforcement Learning[C].16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,2010:75.
[16] Williams,J.D.,Young,S..Partially Observable Markov Decision Processes for Spoken Dialog Systems[J].Computer Speech and Language,2007,21(2):393-422.
[17] Li Qiong,Guo Yufeng,Jiang Yanhuang.An Intelligent I/O Scheduling Algorithm Based on Reinforcement Learning[J].Computer Engineering & Science,2010(7):58-61.
[18] Zhang Shuiping.Application of On-policy Reinforcement Learning Algorithms to Optimal AGC Control of Interconnected Power Grids[D].Guangzhou:South China University of Technology,2013.
[19] Liu Zhiyong,Ma Fengwei.Online Reinforcement Learning Control of Urban Traffic Signals[C].The 26th Chinese Control Conference,2007.
[20] Zu Linan.Research on Autonomous Cooperative Control and Reinforcement Learning for Multi-robot Systems[D].Changchun:Jilin University,2007.
[21] Chen Xin,Wei Haijun,Wu Min,et al.Multi-agent Tracking Learning in Continuous Spaces Based on Gaussian Regression[J].Acta Automatica Sinica,2013,39(12):2021-2031.
[22] Lee,D.,Choi,M.,and Bang,H..Model-Free Linear Quadratic Tracking Control for Unmanned Helicopters Using Reinforcement Learning[C].5th International Conference on Automation,Robotics and Applications (ICARA),2011.
[23] Valasek,J.,Doebbler,J.,Tandale,M.D.,et al.Improved Adaptive-Reinforcement Learning Control for Morphing Unmanned Air Vehicles[J].IEEE Transactions on Systems Man and Cybernetics Part B,2008,38(4):1014-1020.
[24] Crespo,A.,Li,W.,and Timoszczuk,A.P..ATFM Computational Agent Based on Reinforcement Learning Aggregating Human Expert Experience[C].IEEE Forum on Integrated and Sustainable Transportation Systems,2011.
[25] Xie,N.,Hachiya,H.,and Sugiyama,M..Artist Agent:A Reinforcement Learning Approach to Automatic Stroke Generation in Oriental Ink Painting[J].IEICE Transactions on Information and Systems,2012,E96-D(5).
[26] Silver,D.,Huang,A.,Maddison,C.J.,et al.Mastering the Game of Go with Deep Neural Networks and Tree Search[J].Nature,2016,529 (7587):484-489.
[27] Thrun,S.,Burgard,W.,and Fox,D..Probabilistic Robotics (Intelligent Robotics and Autonomous Agents)[M].The MIT Press,2005.
[28] Kober,J.,Bagnell,J.A.,and Peters,J..Reinforcement Learning in Robotics:A Survey[J].International Journal of Robotics Research,2013.
[29] Deisenroth,M.P.,Neumann,G.,and Peters,J..A Survey on Policy Search for Robotics[J].Foundations and Trends in Robotics,2013,2(1-2):1-142.
[30] Cheng,G.,Hyon,S.H.,Morimoto,J.,et al.CB:A Humanoid Research Platform for Exploring Neuroscience[J].Advanced Robotics,2007,21(10):1097-1114.
[31] Watkins,C.,Dayan,P..Q-learning[J].Machine Learning,1992,8(3-4):279-292.
[32] Sutton,R.S..Learning to Predict by the Methods of Temporal Differences[J].Machine Learning,1988,3(1):9-44.
[33] Rummery,G.A.,Niranjan,M..On-Line Q-Learning Using Connectionist Systems[R].Technical Report CUED/F-INFENG/TR 166,Cambridge University Engineering Department,1994.
[34] Gao Yang,Chen Shifu,Lu Xin.A Survey of Reinforcement Learning Research[J].Acta Automatica Sinica,2004,30(1):86-100.
[35] Jiang Guofei,Gao Huiqi,Wu Cangpu.Convergence Analysis of Grid Discretization Methods in Q-learning[J].Control Theory & Applications,1999,16(2):194-198.
[36] Jiang Guofei,Wu Cangpu.Inverted Pendulum Control Based on Q-learning and BP Neural Networks[J].Acta Automatica Sinica,1998,24(5):662-666.
[37] Lagoudakis,M.G.,Parr,R..Least-Squares Policy Iteration[J].Journal of Machine Learning Research,2003,4(6):1107-1149.
[38] Chen Xingguo.Research on Reinforcement Learning Algorithms Based on Value Function Estimation[D].Nanjing:Nanjing University,2013.
[39] Sugiyama,M.,Hachiya,H.,Towell,C.,et al.Geodesic Gaussian Kernels for Value Function Approximation[J].Autonomous Robots,2008,25(3):287-304.
[40] Hachiya,H.,Akiyama,T.,Sugiyama,M.,et al.Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning[J].Neural Networks,2009,22(10):1399-1410.
[41] Akiyama,T.,Hachiya,H.,Sugiyama,M..Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning[J].Neural Networks,2010,23(5):639-648.
[42] Sugiyama,M.,Hachiya,H.,Kashima,H.,et al.Least Absolute Policy Iteration--A Robust Approach to Value Function Approximation[J].IEICE Transactions on Information and Systems,2010,E93-D(9):2555-2565.
[43] Schaal,S.,Peters,J.,Nakanishi,J.,et al.Learning Movement Primitives[J].Springer Tracts in Advanced Robotics.Siena,Italy:Springer,2004.
[44] Bagnell,J.A.,Schneider,J.G..Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods[C].IEEE International Conference on Robotics and Automation,2001.
[45] Kober,J.,Peters,J..Policy Search for Motor Primitives in Robotics[J].Machine Learning,2011,84(1):171-203.
[46] Ng,A.Y.,Kim,H.J.,Jordan,M.I.,et al.Autonomous Helicopter Flight Via Reinforcement Learning[J].Advances in Neural Information Processing Systems,2004,16.
[47] Ng,A.Y.,Jordan,M.I..PEGASUS:A Policy Search Method for Large MDPs and POMDPs[C].In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence,2000:406-415.
[48] Sehnke,F.,Osendorfer,C.,Rückstieß,T.,et al.Parameter-exploring Policy Gradients[J].Neural Networks,2010,23(4):551-559.
[49] Williams,R.J..Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning[J].Machine Learning,1992,8(3-4):229-256.
[50] Kakade,S..A Natural Policy Gradient[J].Advances in Neural Information Processing Systems(NIPS),2002.
[51] Dayan,P.,Hinton,G.E..Using Expectation-maximization for Reinforcement Learning[J].Neural Computation,1997,9(2):271-278.
[52] Peters,J.,Schaal,S..Natural Actor-Critic[J].Neurocomputing,2008,71(7-9):1180-1190.
[53] Barto,A.G.,Mahadevan,S..Recent Advances in Hierarchical Reinforcement Learning[J].Discrete Event Dynamic Systems,2003,13(1-2):341-379.
[54] Zhou Wenji,Yu Yang.A Survey of Hierarchical Reinforcement Learning[J].CAAI Transactions on Intelligent Systems,2017,12(5):590-594.
[55] Du Wei,Ding Shifei.A Survey of Multi-agent Reinforcement Learning[J].Computer Science,2019,46(8):1-8.
[56] Liu Quan,Zhai Jianwei,Zhang Zongchang,et al.A Survey of Deep Reinforcement Learning[J].Chinese Journal of Computers,2018,41(1):1-27.
[57] Zhao Dongbin,Shao Kun,Zhu Yuanheng,et al.A Survey of Deep Reinforcement Learning:Discussions on the Development of Computer Go[J].Control Theory & Applications,2016,33(6):701-717.
[58] Osa,T.,Pajarinen,J.,Neumann,G.,et al.An Algorithmic Perspective on Imitation Learning[J].Foundations and Trends in Robotics,2018,7(1-2):1-179.
[59] Sermanet,P.,Xu,K.,and Levine,S..Unsupervised Perceptual Rewards for Imitation Learning[J].arXiv preprint arXiv:1612.06699,2016.
[60] Maeda,G.J.,Neumann,G.,Ewerton,M.,et al.Probabilistic Movement Primitives for Coordination of Multiple Human-robot Collaborative Tasks[J].Autonomous Robots,2017,41(3):593-612.
[61] Zhang Kaifeng,Yu Yang.A Survey of Learning-from-Demonstration Methods Based on Inverse Reinforcement Learning[J].Journal of Computer Research and Development,2019,56(2):254-261.
[62] Li Shuailong,Zhang Huiwen,Zhou Weijia.A Survey of Imitation Learning Methods and Their Applications in Robotics[J].Computer Engineering and Applications,2019,55(4):22-35.
[63] Pan,S.J.,Yang,Q..A Survey on Transfer Learning[J].IEEE Transactions on Knowledge and Data Engineering,2009,22(10):1345-1359.
[64] Wang Hao,Gao Yang,Chen Xingguo.Transfer in Reinforcement Learning:Methods and Progress[J].Acta Electronica Sinica,2008,36(S1):39-43.
[65] Finn,C.,Abbeel,P.,and Levine,S..Model-agnostic Meta-learning for Fast Adaptation of Deep Networks[C].In Proceedings of the 34th International Conference on Machine Learning,2017:1126-1135.
[66] Todorov,E.,Erez,T.,and Tassa,Y..MuJoCo:A Physics Engine for Model-based Control[C].2012 IEEE/RSJ International Conference on Intelligent Robots and Systems,2012:5026-5033.