Reinforcement learning method and related device

A reinforcement learning technology, applied in the field of machine learning, that models the mutual influence between actions. It addresses the low learning efficiency of complex policies and improves learning efficiency.

Pending Publication Date: 2021-04-06
UNIV OF SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0003] Existing methods for learning stochastic policies suffer from several deficiencies, the main one being low learning efficiency for complex policies.

Examples

Embodiment Construction

[0037] As mentioned in the background section, prior-art methods for learning stochastic policies mainly include Soft Actor-Critic, Soft Q-learning, and Path Consistency Learning. All of them use a reinforcement learning framework based on entropy regularization. In this type of framework, the agent maximizes an additional entropy regularization term on top of the cumulative reward. Commonly used entropy terms include the Shannon entropy and the Tsallis entropy. The former can improve the sample efficiency of policy learning, that is, it learns better policies from fewer samples; the solution obtained with the latter is closer to the optimal solution of the original reinforcement learning problem.
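As a rough illustration of the entropy-regularized objective described in [0037], the sketch below scores a discrete policy by its expected one-step reward plus a weighted entropy bonus. The function names, the weight `alpha`, the Tsallis index `q`, and the toy rewards are assumptions chosen for illustration; they do not come from the patent text. Note that both entropy terms need the action probabilities themselves.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) * log pi(a), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q(pi) = (1 - sum_a pi(a)^q) / (q - 1), for q != 1."""
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def regularized_objective(probs, rewards, alpha, entropy_fn):
    """Expected one-step reward plus alpha times an entropy bonus."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + alpha * entropy_fn(probs)

# Toy one-step problem (hypothetical numbers): three actions, two nearly tied.
rewards = [1.0, 0.9, 0.0]
deterministic = [1.0, 0.0, 0.0]   # always picks the best action
stochastic = [0.5, 0.5, 0.0]      # spreads mass over the two good actions

for name, fn in (("Shannon", shannon_entropy), ("Tsallis", tsallis_entropy)):
    det = regularized_objective(deterministic, rewards, 0.5, fn)
    sto = regularized_objective(stochastic, rewards, 0.5, fn)
    print(f"{name}: deterministic={det:.4f}, stochastic={sto:.4f}")
```

With these toy numbers the deterministic policy has zero entropy under either definition, so the entropy bonus tips the objective in favor of the stochastic policy even though its expected reward is slightly lower.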

[0038] However, entropy regularization often faces a dilemma: either the policy representation is kept simple, or the training process becomes complex and inefficient. The general form of existing r...

Abstract

The invention discloses a reinforcement learning method and a related device. The regularization term used by the method is sample-based: it includes an additional reward given to the target agent for executing different actions, defines the degree of usefulness of exploration behaviors, and models the mutual influence between executed actions. Because no probability density function needs to be computed when the sample-based regularization term is used, reinforcement learning efficiency is improved, particularly for complex policies. In addition, the sample-based regularization term can exploit geometric information of the action space and is compatible with a wider range of policy structures.
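The idea of a regularizer computed from sampled actions alone can be sketched as follows. This is not the patent's regularization term, whose exact form is not given in this excerpt; it is a hypothetical stand-in (mean pairwise Euclidean distance between sampled actions) chosen only because it shares the stated properties: it uses the geometry of the action space and never evaluates a probability density. The policy `sample_policy` and its `spread` parameter are likewise assumptions for illustration.

```python
import math
import random

def pairwise_distance_bonus(actions):
    """Mean Euclidean distance over all distinct pairs of sampled actions."""
    n = len(actions)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(actions[i], actions[j])
    return total / (n * (n - 1) / 2)

def sample_policy(state, spread, n_samples, rng):
    """Hypothetical implicit policy: squashed state plus scaled noise.

    Actions are only sampled; their density is never computed.
    """
    samples = []
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in state]
        samples.append([math.tanh(s + spread * z) for s, z in zip(state, noise)])
    return samples

rng = random.Random(0)
state = [0.2, -0.1]
narrow = sample_policy(state, spread=0.0, n_samples=8, rng=rng)  # deterministic
wide = sample_policy(state, spread=1.0, n_samples=8, rng=rng)    # stochastic

bonus_narrow = pairwise_distance_bonus(narrow)
bonus_wide = pairwise_distance_bonus(wide)
print(bonus_narrow, bonus_wide)
```

In this sketch a policy that always emits the same action earns no bonus, while one whose sampled actions are spread out earns a positive bonus, so adding the bonus to the reward would favor stochastic behavior without requiring a tractable density.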

Description

Technical field

[0001] The present application relates to the technical field of machine learning and, more specifically, to a reinforcement learning method and related devices.

Background

[0002] Reinforcement learning (RL) has achieved great success in fields such as video games and robot control. The goal of reinforcement learning is to find an optimal policy by maximizing the cumulative reward, which usually leads to a deterministic policy. In practice, however, a stochastic policy often performs better than a deterministic one. For example, on new tasks stochastic policies tend to allow better exploration; moreover, stochastic policies tend to be more robust than deterministic ones when unexpected situations occur.

[0003] Existing methods for learning stochastic policies suffer from several deficiencies, the main one being low learning efficiency for complex policies.

Conten...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06N3/08, G06N3/04
CPC: G06N3/08, G06N3/045
Inventors: 王杰, 李厚强, 周祺, 匡宇飞
Owner: UNIV OF SCI & TECH OF CHINA