Assignment for the course Topics on Intelligent Systems (MSc in Artificial Intelligence, 1st semester)
The aim of this project was to replicate and enhance a paper at the intersection of Multi-Agent Systems and Machine Learning. The chosen paper was "Deep Multi-Agent Reinforcement Learning for Highway On-Ramp Merging in Mixed Traffic". The project can be viewed in the following GitHub repository: MARL_CAVs
The paper explores the application of multi-agent reinforcement learning (MARL) to address the challenges of highway on-ramp merging in mixed traffic scenarios involving autonomous vehicles (AVs) and human-driven vehicles (HDVs). A decentralized MARL framework is proposed, enabling AVs to learn safe and efficient merging policies while adapting to dynamic HDV behaviors. The problem is modeled in a gym-like simulation environment with varying traffic densities (easy, medium, hard). Key components include a state space capturing vehicle positions, speeds, and distances; an action space comprising acceleration, deceleration, and lane changes; and a reward function incentivizing safety and efficiency. To enhance learning and safety, the framework incorporates parameter sharing for scalability, action masking to exclude unsafe actions, and a priority-based safety supervisor to manage lane changes.
- Simulated Scenario: The model was trained in a highway environment with two main lanes and one merging lane, replicating real-world traffic conditions.
- Action Masking: Unsafe actions, such as leaving the highway boundaries, were disabled during training to ensure the agents' safety and keep the focus on valid maneuvers (see the sketch after this list).
- Curriculum Learning: While mentioned as a potential enhancement, this technique—progressively introducing more complex scenarios—was not implemented in this study.
- Priority-Based Safety Supervisor: This innovative mechanism evaluated the safety of lane-changing actions by predicting potential collisions up to 8 steps ahead, ensuring that only safe actions were executed.
- Shared Network: A shared network structure allowed the agents to learn collectively, improving training efficiency and leveraging shared experiences among them.
These components contributed to creating a robust framework for training autonomous vehicles to merge safely and efficiently in challenging highway scenarios.
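To make the action-masking component concrete, here is a minimal sketch of the idea (not the authors' exact code), assuming a discrete action space in the style of highway-env:

```python
import numpy as np

# Hypothetical discrete actions, mirroring highway-env's convention.
LANE_LEFT, IDLE, LANE_RIGHT, FASTER, SLOWER = range(5)

def mask_unsafe_actions(action_scores, lane_index, num_lanes):
    """Disable maneuvers that would leave the highway boundaries.

    action_scores: one score (e.g. logit or Q-value) per action.
    lane_index: current lane, 0 = leftmost.
    """
    masked = np.asarray(action_scores, dtype=float).copy()
    if lane_index == 0:
        masked[LANE_LEFT] = -np.inf    # no lane to the left
    if lane_index == num_lanes - 1:
        masked[LANE_RIGHT] = -np.inf   # no lane to the right
    return masked

# The policy then samples or takes the argmax over the masked scores,
# so off-road lane changes can never be selected.
```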
The goal of our approach was to recreate and enhance the study presented in the paper, incorporating additional features and modifications to improve learning efficiency, safety, and adaptability in highway merging scenarios.
While the paper mentioned curriculum learning, it was not implemented in the original study. We introduced this technique to facilitate faster and safer learning by gradually increasing the difficulty levels (a scheduling sketch follows the list):
- Difficulty Level 1: Controlling 1 to 3 vehicles in a simple environment.
- Difficulty Level 2: Controlling 2 to 4 vehicles, increasing complexity.
- Difficulty Level 3: Controlling 4 to 6 vehicles, representing more challenging scenarios.
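A minimal sketch of how such a schedule can be wired up; the per-level episode threshold and the `make_env` helper are illustrative assumptions, not the repository's actual API:

```python
import random

# Difficulty level -> (min, max) number of controlled vehicles, per the list above.
CURRICULUM = {1: (1, 3), 2: (2, 4), 3: (4, 6)}

def current_level(episode, episodes_per_level=500):
    # Advance one level every `episodes_per_level` episodes (value is illustrative).
    return min(3, 1 + episode // episodes_per_level)

def sample_num_controlled(episode):
    low, high = CURRICULUM[current_level(episode)]
    return random.randint(low, high)

# e.g. env = make_env(num_controlled=sample_num_controlled(episode))  # hypothetical helper
```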
We opted for the Proximal Policy Optimization (PPO) algorithm, as it showed superior baseline performance compared to A2C in the paper’s benchmarks. PPO is known for its stability and efficiency in reinforcement learning tasks.
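For reference, PPO optimizes the clipped surrogate objective, where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ is the probability ratio and $\hat{A}_t$ the advantage estimate:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]$$

The clipping keeps each policy update small, which is the source of the stability mentioned above.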
We experimented with different reward adjustments to evaluate their impact on agent performance (a combined reward sketch follows the list):
- Collision Penalty: Adjusted from -200 to -100 and -1000 to study the influence on safety.
- Speed Incentive: Increased from 1 to 2 and 4, promoting faster but controlled driving.
- Penalty for Staying in the Merging Lane: Modified from 4 to 2 and 6 to encourage quicker merging.
- Proximity Penalty: Incentivized maintaining a safe distance from the car in front; the distance term was multiplied by 4, and variations of 2 and 6 were tested to optimize behavior.
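As a sketch of how these terms could combine into a single per-step reward (the weights shown are the baseline values above; the function and argument names are ours, not the repository's):

```python
def shaped_reward(crashed, speed, v_min, v_max, on_merge_lane, merge_progress,
                  headway_cost, w_collision=200.0, w_speed=1.0,
                  w_merge=4.0, w_headway=4.0):
    """Weighted sum of the four shaped terms described above.

    merge_progress: grows toward 1 as the vehicle nears the end of the ramp.
    headway_cost: positive when the gap to the leading vehicle is unsafely small.
    """
    reward = 0.0
    reward -= w_collision * float(crashed)                     # collision penalty
    reward += w_speed * (speed - v_min) / (v_max - v_min)      # high-speed incentive
    reward -= w_merge * merge_progress * float(on_merge_lane)  # lingering on the ramp
    reward -= w_headway * headway_cost                         # proximity penalty
    return reward
```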
To expand the study, we explored different highway configurations (see the config sketch after this list):
- Single Lane: Simplifying the environment to focus on merging behavior.
- Three-Lane Highway: Increasing complexity for more realistic scenarios.
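If the environment follows highway-env's config-dict pattern, switching lane counts is a one-key change; the key name below is an assumption about this customized environment, not verified against the repository:

```python
# Hypothetical config overrides, assuming a highway-env-style config dict.
ENV_VARIANTS = {
    "single_lane": {"lanes_count": 1},  # one main lane plus the merge ramp
    "baseline":    {"lanes_count": 2},  # setup used in the paper
    "three_lane":  {"lanes_count": 3},  # more room for merging traffic
}
# env.configure(ENV_VARIANTS["three_lane"])  # exact key may differ in MARL_CAVs
```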
We analyzed the agents' performance based on the following metrics (computed as in the sketch after this list):
- Average Speed: To evaluate efficiency.
- Collision Rates: To measure safety and identify reward structures yielding optimal results.
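Both metrics were tracked per training episode; a small helper showing the computation we have in mind (the episode-log structure is an assumption):

```python
def summarize(episodes):
    """episodes: list of dicts like {"speeds": [v0, v1, ...], "crashed": bool}."""
    avg_speed = sum(sum(e["speeds"]) / len(e["speeds"]) for e in episodes) / len(episodes)
    collision_rate = sum(e["crashed"] for e in episodes) / len(episodes)
    return avg_speed, collision_rate
```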
Behavioral Variations in Non-Controlled Vehicles:
To test the adaptability of our model, we introduced diverse driving styles for vehicles not controlled by the agent. These behaviors were defined as follows (a parameter sketch follows the list):
- Aggressive Behavior:
- Higher desired maximum acceleration.
- Higher deceleration capability to minimize unnecessary braking.
- Reduced desired distance to the vehicle in front.
- Shorter time gap to the vehicle ahead.
- Lower minimum acceleration gain required to initiate a lane change.
- Increased maximum braking force imposed on other vehicles during lane changes.
- Shorter time allowed to reach maximum acceleration.
- Cautious Behavior:
- Lower desired maximum acceleration.
- Lower deceleration capability to avoid abrupt stops.
- Increased desired distance to the vehicle in front.
- Higher time gap to the vehicle ahead.
- Higher minimum acceleration gain required to initiate a lane change.
- Reduced braking force imposed on other vehicles during lane changes.
- Longer time allowed to reach maximum acceleration.
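These profiles map naturally onto the IDM (car-following) and MOBIL (lane-changing) parameters used by highway-env-style simulators. The sketch below uses highway-env's attribute names; the values are illustrative examples, not the exact ones we used:

```python
# Illustrative IDM/MOBIL parameter sets (attribute names follow highway-env's
# IDMVehicle; the values are examples, not our tuned ones).
AGGRESSIVE = {
    "COMFORT_ACC_MAX": 5.0,                  # higher desired max acceleration [m/s²]
    "COMFORT_ACC_MIN": -7.0,                 # stronger deceleration capability [m/s²]
    "DISTANCE_WANTED": 3.0,                  # shorter desired gap to the leader [m]
    "TIME_WANTED": 1.0,                      # shorter time gap [s]
    "LANE_CHANGE_MIN_ACC_GAIN": 0.1,         # changes lane for small gains [m/s²]
    "LANE_CHANGE_MAX_BRAKING_IMPOSED": 3.0,  # imposes harder braking on others [m/s²]
}
CAUTIOUS = {
    "COMFORT_ACC_MAX": 2.0,
    "COMFORT_ACC_MIN": -3.0,
    "DISTANCE_WANTED": 8.0,
    "TIME_WANTED": 2.5,
    "LANE_CHANGE_MIN_ACC_GAIN": 0.5,
    "LANE_CHANGE_MAX_BRAKING_IMPOSED": 1.0,
}

# Applied by overriding class attributes, e.g.:
# for name, value in AGGRESSIVE.items():
#     setattr(IDMVehicle, name, value)
```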
Creative Idea – Delivery Vehicles ("Couriers"):
Inspired by real-world delivery objectives, we designed a scenario simulating vehicles with delivery tasks, prioritizing speed and efficiency. These "couriers" exhibited the following characteristics (a configuration sketch follows the list):
- Rewards and Behavior Adjustments:
- Increased high-speed reward (from 1 to 3) to incentivize faster movement.
- Shorter time allowed to reach maximum acceleration, enabling quicker starts.
- Faster reaction times for lane changes, optimizing their ability to navigate traffic.
- Increased allowed speed (from 30 to 40), reflecting the urgency of reaching their destination.
- The agents aimed to achieve high speeds and efficient maneuvers while minimizing disruptions to surrounding traffic, striking a balance between aggressive and safe driving.
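A sketch of the courier profile as overrides on the default setup; the key names are ours, while the reward and speed values are the ones listed above:

```python
# Hypothetical courier profile (key names are illustrative, not the repository's).
COURIER_OVERRIDES = {
    "high_speed_reward": 3,    # raised from 1
    "max_speed": 40,           # raised from 30 [m/s]
    "lane_change_delay": 0.5,  # quicker lane-change reactions [s] (value illustrative)
    "acc_ramp_time": 0.2,      # reaches max acceleration sooner [s] (value illustrative)
}
```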
Our chosen model achieved a higher average velocity across all vehicles and, in terms of collisions, exhibited only low peaks, achieving results comparable to the paper's model. Overall, MAPPO demonstrated the best performance thanks to its ability to increase average velocity, maintain traffic flow, and reduce the number of collisions.
A low collision penalty (100) led to better results in both graphs. As the penalty increased, the average velocity decreased. The collision rate was very low with the lower penalty, and the other values also performed well, producing very few collisions. Based on this, we chose a penalty value of 100 for further testing.
As expected, a higher speed reward resulted in a higher average velocity, while the collision rate remained low. This suggests that higher speeds do not necessarily lead to more collisions. Based on the results, we chose a reward value of 4 for further testing.
Both adjustments led to improved average velocity, but the lower penalty value produced the best result. The number of collisions was lower even with reduced penalization for vehicles driving close to each other. Based on these results, we selected a penalty value of 2 for further testing.
With the increase in the penalty, the cars achieved a higher average velocity, likely due to the need to merge onto the highway faster and then speed up. Both penalty adjustments resulted in fewer collisions. Once again, we selected the penalty value that yielded the best results, in this case 6.
Here, we compared the paper's model with our approach across different environments. For MAPPO, our approach used the optimal rewards identified through reward shaping, while the paper's model retained its original rewards.
Our approach demonstrated better results in terms of average speed. In this environment with only one highway lane, the merge was performed effectively, maintaining the traffic flow. The collision rate fluctuated significantly during training but ultimately achieved a result similar to that of the paper's model.
MAPPO initially showed promising results but experienced a drastic decline in average speed towards the end. In terms of collisions, both models fluctuated during training but ultimately achieved similar outcomes.
To investigate the drastic drop in average speed, we explored potential explanations. One possibility is that the rewards identified earlier were effective individually but, when combined, could lead to outcomes like this one. Another explanation is that the model became confident in its decisions up to that point and began exploring alternative sets of actions.
MAPPO exhibited a substantial increase in average speed, demonstrating strong adaptability to a different scenario. The collision rate was low, as expected, since the increased space likely kept the right lane clearer for merging vehicles.
Both behaviour changes yielded good results, with higher average speeds and lower collision rates. As expected, the model with more aggressive vehicles experienced more collisions than the one with cautious vehicles. However, the model adapted to the aggressive behaviour and ultimately achieved better results than the original.
The courier model started well but ultimately achieved a lower average speed than the original. In terms of collisions, it experienced a few, which, while not optimal, were expected given the changes made to prioritize faster arrival at the destination.
Overall, this approach requires further refinement, both in the reward structure and in the new methods, to more effectively simulate courier behaviour.
```bash
# Create the conda environment and install dependencies
conda create -n marl_cav python=3.6 -y
conda activate marl_cav
conda install pytorch==1.7.0 torchvision==0.8.1 torchaudio==0.7.0 -c pytorch
pip install -r requirements.txt
```

Train a model:

```bash
python run_xxx.py
```

Evaluate a trained model:

```bash
python run_xxx.py --option eva --model-dir path_to_model
```

Replace `path_to_model` with the path to the directory containing your trained model.