Engineers help artificial intelligence to learn more safely in the real world

UNIVERSITY PARK, Pa. – Penn State researchers are looking for a safer and more efficient way to use machine learning in the real world. Using a simulated high-rise office building, they developed and tested a new reinforcement learning algorithm that aims to improve energy consumption and occupant comfort in a real-world environment.
Greg Pavlak, assistant professor of architectural engineering at Penn State, presented the results of the paper he co-authored, “Constrained Differentiable Cross-Entropy Method for Safe Model-Based Reinforcement Learning,” at the Association for Computing Machinery International Conference on Systems for Energy-Efficient Built Environments (BuildSys), which was held November 9-10 in Boston.
“Reinforcement learning agents explore their environments to learn optimal actions through trial and error,” Pavlak said. “Due to challenges in simulating the complexities of the real world, there is a growing trend to train reinforcement learning agents directly in the real world instead of in simulation.”
However, implementing reinforcement learning in real-world settings presents its own challenges, according to the researchers.
“Two critical requirements for real-world reinforcement learning are efficient learning and safety considerations,” said Sam Mottahedi, co-author of the paper, who was a Penn State doctoral student in architectural engineering when the study was conducted. “Some reinforcement learning systems require millions of interactions and multiple years to learn the optimal policy, which is not practical in real-world scenarios. In addition, there is the potential for them to make bad decisions that generate undesirable results or lead to unsafe outcomes.”
This concern led the researchers to ask the question: How do we develop algorithms that enable these types of reinforcement learning agents to learn safely in the real world without making very bad decisions that cause things to break or people to get hurt?
The researchers used an existing model-based reinforcement learning approach to train their model to make decisions. This artificial intelligence agent – the control algorithm – uses trial and error to interact with the environment, which for their project was a building.
“The safety critical factor of our research was at least not breaking anything in the building and ensuring that the occupants are always comfortable,” Pavlak said. “While we don’t have to worry about someone getting hit by a car, which is a concern for reinforcement learning in self-driving cars, we do have to worry about building equipment operating limitations.”
The researchers wanted to minimize energy consumption without compromising thermal comfort, which is rated on a scale from -3, too cold, to +3, too hot. If the control algorithm took an action that pushed comfort outside the -0.5 to +0.5 range, it was penalized. The control algorithm was able to keep comfort within that range, which is an accepted standard in the construction industry.
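The comfort penalty described above can be sketched as a simple function of the comfort rating. This is a minimal illustration, not the researchers' implementation: the function name, the linear shape of the penalty and the `weight` parameter are all assumptions.

```python
def comfort_penalty(comfort_index: float, limit: float = 0.5, weight: float = 1.0) -> float:
    """Return 0 inside the acceptable band and a linearly growing penalty outside it.

    comfort_index is a rating on the -3 (too cold) to +3 (too hot) scale.
    The linear penalty and the weight are illustrative assumptions.
    """
    if not -3.0 <= comfort_index <= 3.0:
        raise ValueError("comfort index is expected on the -3 to +3 scale")
    # Distance beyond the +/- limit; zero when the rating is inside the band.
    excess = max(0.0, abs(comfort_index) - limit)
    return weight * excess
```

Under these assumptions, a rating of 0.2 costs nothing, while a rating of -2.0 sits 1.5 units outside the band and is penalized in proportion.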
“If the controller is tuned to, for example, find the best energy consumption, it will be rewarded for achieving this good behavior,” Pavlak said. “Alternatively, if it does something that increases energy consumption, it will be penalized for bad behavior. This trial-and-error approach reinforces learning by gathering information so the controller can decide what to do next.”
For this project, the researchers simulated a large office building in a Chicago climate zone. In a real 30-story building, equipment problems could involve anything with a large motor, such as the chillers used to cool the building.
“Big motors don’t like to cycle quickly,” Pavlak said. “For example, a large chiller may be turned on once a day and turned off once a day – two events in total – to avoid damaging the equipment. If our agent’s actions resulted in more than two cooling events in one day, it will be penalized.”
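The two-events-per-day limit can be sketched as a count of on/off transitions in a daily chiller schedule. The function names, the boolean schedule format and the penalty weight are hypothetical, chosen only to illustrate the idea in the quote.

```python
from typing import Sequence


def count_switch_events(chiller_on: Sequence[bool], initially_on: bool = False) -> int:
    """Count how many times the chiller changes state over one day's schedule."""
    events = 0
    previous = initially_on
    for state in chiller_on:
        if state != previous:
            events += 1
        previous = state
    return events


def cycling_penalty(chiller_on: Sequence[bool], max_events: int = 2, weight: float = 1.0) -> float:
    """Penalize every switching event beyond the allowed daily maximum (assumed linear)."""
    extra = max(0, count_switch_events(chiller_on) - max_events)
    return weight * extra


# One hourly schedule that turns on once and off once, and one that short-cycles.
normal_day = [False] * 8 + [True] * 10 + [False] * 6
short_cycling = [False, True, False, True, False, True]
```

With these example schedules, `normal_day` has exactly two events and is not penalized, while `short_cycling` has five events, three more than allowed.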
The researchers compared their model-based approach to other common reinforcement learning approaches, including the use of a model-free algorithm. A model-based agent can plan its actions because it can predict its reward. A model-free agent must actually perform the action in order to learn from it.
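To illustrate how a model-based agent can plan before acting, the sketch below uses the plain cross-entropy method, not the constrained differentiable variant named in the paper's title. The one-line building model, the temperature band and the soft penalty weight are toy assumptions invented for this example.

```python
import random
import statistics
from typing import List


def toy_model(temp: float, cooling: float) -> float:
    """Hypothetical one-step model: the zone warms 1 degree per step; full cooling removes 2."""
    return temp + 1.0 - 2.0 * cooling


def plan_cost(temp: float, plan: List[float], setpoint: float = 23.0,
              band: float = 1.0, penalty_weight: float = 10.0) -> float:
    """Predicted energy use plus a soft penalty for leaving the comfort band."""
    cost = 0.0
    for cooling in plan:
        cooling = min(1.0, max(0.0, cooling))  # clip to the valid actuator range
        temp = toy_model(temp, cooling)        # predict, do not act
        cost += cooling                        # energy proxy
        cost += penalty_weight * max(0.0, abs(temp - setpoint) - band)
    return cost


def cem_plan(temp: float, horizon: int = 5, samples: int = 200, elites: int = 20,
             iterations: int = 10, seed: int = 0) -> List[float]:
    """Plain cross-entropy method: sample plans, keep the cheapest, refit, repeat."""
    rng = random.Random(seed)
    means = [0.5] * horizon
    stds = [0.3] * horizon
    for _ in range(iterations):
        candidates = [[rng.gauss(m, s) for m, s in zip(means, stds)]
                      for _ in range(samples)]
        candidates.sort(key=lambda plan: plan_cost(temp, plan))
        best = candidates[:elites]
        means = [statistics.fmean([p[t] for p in best]) for t in range(horizon)]
        stds = [statistics.pstdev([p[t] for p in best]) + 1e-3 for t in range(horizon)]
    return [min(1.0, max(0.0, m)) for m in means]
```

Every candidate plan is scored against the model's predictions, so no sampled plan is ever applied to the building during planning. A model-free agent has no such predictor and has to try actions in the environment to learn their consequences.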
“The model-free algorithm tends to work well, but violates some of the safety constraints,” Pavlak said. “It also takes much more time to learn good behavior, sometimes years or tens of years.”
The researchers’ model learned about 50 times faster than a traditional model-free method, achieving in a month what the other approach took years to do. And because of the way the researchers included the safety factors, their model had fewer – sometimes zero – violations of the safety-critical constraints.
According to Pavlak, adding safety constraints makes reinforcement learning a game of balancing trade-offs. The agent can minimize energy consumption, which is good behavior, by turning off the power completely. However, doing so will negatively affect occupant comfort, which is bad behavior.
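The trade-off can be made concrete with a toy objective that adds a comfort penalty to energy use. All numbers, the candidate actions and the penalty weight below are hypothetical and are not taken from the study.

```python
def objective(energy_kwh: float, comfort_index: float,
              comfort_limit: float = 0.5, penalty_weight: float = 100.0) -> float:
    """Lower is better: energy use plus a weighted comfort-violation penalty (assumed form)."""
    violation = max(0.0, abs(comfort_index) - comfort_limit)
    return energy_kwh + penalty_weight * violation


# Hypothetical outcomes for two candidate actions.
candidates = {
    "power off": {"energy_kwh": 0.0, "comfort_index": 2.0},
    "moderate cooling": {"energy_kwh": 40.0, "comfort_index": 0.3},
}

best = min(candidates, key=lambda name: objective(**candidates[name]))
```

With these made-up values, turning the power off uses no energy but scores 150 because of the comfort violation, while moderate cooling scores 40 and is selected. A smaller penalty weight would tip the balance back toward saving energy.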