Falsification of Robot Safety Systems using Deep Reinforcement Learning
Robot stations today have large safety zones around each station where humans are not allowed while the robots are moving. Often these zones are fenced, but they can also be monitored so that the motions are stopped if anyone enters. In production facilities, the large safety zones take up a lot of space, and unnecessary interruptions caused by humans entering them decrease the factories' productivity. One would therefore like to minimize the safety zones and increase the robots' working time to raise productivity. However, human safety should not be sacrificed in doing so, so it must be ensured that no unexpected collisions occur in the robot cells and that the human can trust the robot. The following article gives an insight into the master thesis “Falsification of Robot Safety Systems using Deep Reinforcement Learning”, where Reinforcement Learning (RL) has been used on simulated robot cells to falsify the system and find deficiencies.
Falsification is a method for finding deficiencies in a system. In this project, it has been done by creating an agent that mimics human behavior and, using RL, tries to fool the safety system and collide with the robot in a virtual environment. This can then serve as a verification tool for cell designers to evaluate whether there are any dangers within the environment, which saves a lot of manual testing. The environments in this project consist of a robot, an agent and two safety zones: a stop zone and a warning zone. Stop and warning zones are often used as safety measures when creating collaborative robot cells, and the idea is to find faults in the design of these cells where the human could be injured by the robot.
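To make the setup more concrete, the sketch below outlines what such a simulated cell could look like. The class name, geometry, zone radii and system-delay handling are assumptions chosen for illustration; they do not reproduce the environments used in the thesis.

```python
import numpy as np

# Hypothetical sketch of a simplified 2D robot cell with a warning zone and a
# stop zone; all names and numbers are illustrative assumptions.
class RobotCellEnv:
    def __init__(self, warning_radius=2.0, stop_radius=1.0, stop_delay_steps=3):
        self.warning_radius = warning_radius      # entering slows the robot down
        self.stop_radius = stop_radius            # entering triggers a stop request
        self.stop_delay_steps = stop_delay_steps  # built-in system delay before a full stop
        self.reset()

    def reset(self):
        self.agent_pos = np.array([5.0, 0.0])     # human agent starts outside the zones
        self.robot_pos = np.array([0.0, 0.0])     # robot starts at the origin
        self.robot_speed = 1.0
        self.stop_counter = None                  # counts down the stop delay once triggered
        return self._observation()

    def _observation(self):
        return np.concatenate([self.agent_pos, self.robot_pos, [self.robot_speed]])

    def step(self, action):
        # action: continuous 2D velocity of the human agent, clipped to a walking speed
        self.agent_pos += np.clip(action, -1.5, 1.5) * 0.1

        dist = np.linalg.norm(self.agent_pos - self.robot_pos)
        if dist < self.stop_radius and self.stop_counter is None:
            self.stop_counter = self.stop_delay_steps      # stop requested, but delayed
        elif dist < self.warning_radius:
            self.robot_speed = min(self.robot_speed, 0.5)  # warning zone slows the robot

        if self.stop_counter is not None:
            self.stop_counter -= 1
            if self.stop_counter <= 0:
                self.robot_speed = 0.0                     # robot has finally stopped

        # the robot keeps moving along its programmed path until it has stopped
        self.robot_pos += np.array([0.05, 0.0]) * self.robot_speed

        collision = dist < 0.3                             # agent reached the robot
        # a collision while the robot is still moving is the dangerous case to find
        reward = 1.0 if collision and self.robot_speed > 0 else -0.01
        return self._observation(), reward, collision, {}
```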
Reinforcement learning algorithms have been used to solve this task. To begin with, Q-learning and Deep Q-Network (DQN) were applied to simplified discrete environments, and finally Soft Actor-Critic (SAC) was implemented on continuous environments. SAC is an algorithm for continuous action spaces that has the benefit of scaling its exploration depending on how much exploration is needed and how certain the current solution is. It is a rather complex deep reinforcement learning algorithm containing five neural networks. The algorithm explores and estimates the value of the different state-action pairs so that it can optimize its actions. By exploring and learning the most valuable action for each possible state, it can solve very complex problems.
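As an illustration of the discrete starting point, the sketch below shows a textbook tabular Q-learning loop; the environment interface and hyperparameters are assumptions and not the thesis implementation. For the continuous case, an off-the-shelf SAC implementation (for example the one in Stable-Baselines3) could be trained against the same kind of environment.

```python
import numpy as np

# Minimal tabular Q-learning sketch for a simplified discrete environment,
# assuming integer states and a small discrete action set; hyperparameters
# are illustrative, not taken from the thesis.
def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration over the discrete actions
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done, _ = env.step(action)
            # standard Q-learning (off-policy, bootstrapped) update
            td_target = reward + gamma * np.max(q[next_state]) * (not done)
            q[state, action] += alpha * (td_target - q[state, action])
            state = next_state
    return q
```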
Below is a video that shows how the agent finds collisions in all four developed models:
The results seen in the video show that the SAC algorithm performed well on the environments in the different models tested throughout the thesis. It not only finds very obvious collisions, but also unexpected collisions in the stop zone even when the environment appears to be fully secured. The agent was able to find these unexpected collisions by exploiting the built-in system delay and finding certain paths with specific angles and speeds that let it collide with the robot before it has completely stopped. A function like this can be used both as a tool to find dangerous collisions and to verify that the safety zones in the environments are not too small. It also has the potential to use its computed state values (Q-values) to indicate dangerous or safe areas where safety zones could be safely reduced.
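As a sketch of how the computed values could be turned into such an indication, the snippet below assumes access to a trained value function q_value(state) over human positions; the function name, grid and threshold are hypothetical and not taken from the thesis.

```python
import numpy as np

# Hypothetical sketch: turn learned values into a danger map over a 2D grid of
# candidate human positions.
def danger_map(q_value, x_range, y_range, resolution=0.25, threshold=0.5):
    xs = np.arange(*x_range, resolution)
    ys = np.arange(*y_range, resolution)
    danger = np.zeros((len(xs), len(ys)), dtype=bool)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            # a high value means the agent expects to reach a collision from this
            # position, so it is marked as dangerous; low values suggest areas
            # where the safety zone could potentially be reduced
            danger[i, j] = q_value(np.array([x, y])) > threshold
    return xs, ys, danger
```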
The evaluation of the developed function in this thesis showed that the solution became too case-specific. It was able to find deficiencies in the workstation it was trained on, but could not be applied more generally to an altered workstation without retuning the algorithm for each station. The main reason was that it learned a specific position in a workstation rather than the relationship between states. The solution also suffered from divergence, which could be due to “the Deadly Triad”: the combination of function approximation, bootstrapping and off-policy learning, all of which SAC uses, is known to risk divergence.
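For intuition, the Deadly Triad shows up in semi-gradient updates of the form below (a textbook bootstrapped, off-policy update with a parameterized value function, shown purely for illustration and not taken from the thesis):

$$\theta \leftarrow \theta + \alpha \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right) \nabla_\theta Q_\theta(s, a)$$

The target depends on the parameters being updated (bootstrapping), the value function is approximated by the parameters $\theta$ (function approximation), and the transitions may come from a replay buffer rather than the current policy (off-policy); together these remove the usual convergence guarantees.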
Even if a perfect solution had been developed, it would at this stage not be possible to reduce much of the safety zones. Mainly because working around robots with minimal safety margins and increased speed would create a bad work environment, with constant fear or stress for the humans nearby. It would also be very hard to fulfill the current strict ISO standards. With further work to make the solution more general, we believe that the feature can be used extensively to verify the safety of robot cells, and over time safety zones may be reduced as humans slowly gain trust in the robots.