Our paper introducing a new CAN dataset is now available in Nature: Scientific Data.
The dataset contains 26 recordings of benign network traffic, amounting to more than 2.5 hours of traffic. We performed two attacks (injection and modification) with different configurations multiple times on each benign trace to create a comprehensive set of traffic logs. The dataset structure was explicitly designed with machine learning applications in mind.
Overview
Despite their known security shortcomings, Controller Area Networks are widely used in modern vehicles. Research in the field has already proposed several solutions to increase the security of CAN networks, such as using anomaly detection methods to identify attacks. Modern anomaly detection procedures typically use machine learning solutions that require a large amount of data to be trained. This paper presents a novel CAN dataset specifically collected and generated to support the development of machine learning based anomaly detection systems. Our dataset contains 26 recordings of benign network traffic, amounting to more than 2.5 hours of traffic. We performed two types of attack on the benign data to create an attacked dataset representing most of the attacks previously proposed in the academic literature. As a novelty, we performed all attacks in two versions, modifying either one or two signals simultaneously. Along with the raw data, we also publish the source code used to generate the attacks to allow easy customization and extension of the dataset.
Official citation
Gazdag, A., Ferenc, R. & Buttyán, L. CrySyS dataset of CAN traffic logs containing fabrication and masquerade attacks. Sci Data 10, 903 (2023). https://doi.org/10.1038/s41597-023-02716-9
Dataset creation
We captured multiple hours of traffic in various traffic scenarios to create a benign dataset. To create realistic attacked traces, we chose two approaches to perform attacks. On the one hand, we built a testbed with a physical CAN network to execute attacks affecting the message repetition times (message injection attacks). On the other hand, we developed an attack simulator to calculate the effect of timing-indifferent attacks by modifying only the data part of the CAN messages in the simulator (message modification attacks). This hybrid generation approach results in a scalable but still realistic solution. An overview of our data collection and generation process can be seen here:
Besides the previously shown anomaly patterns, where the attacker modifies a single signal, we introduce a new modification of the benign signals: double attacks, where the same (or different) attack takes place simultaneously against two CAN signals. Our goal with these anomalies is to enable more thorough testing of detection systems designed to exploit system-wide communication information, such as signal correlations. We performed all our attacks in single-signal and double-signal modes.
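One way to see why double attacks can be harder for correlation-based detectors is the toy sketch below. It is only an illustration under our own assumptions: the signal values, the step sizes, and the helper name pearson are hypothetical and are not taken from the dataset or from the published generator code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Made-up benign samples: speed and engine revolution rise together.
speed = [10, 20, 30, 40, 50]
rpm = [1000, 1500, 2000, 2500, 3000]

# Growing downward shift applied to each signal (hypothetical step sizes).
speed_attacked = [v - 25 * (i + 1) for i, v in enumerate(speed)]
rpm_attacked = [v - 1250 * (i + 1) for i, v in enumerate(rpm)]

benign_corr = pearson(speed, rpm)                    # both signals untouched
single_corr = pearson(speed_attacked, rpm)           # only speed is attacked
double_corr = pearson(speed_attacked, rpm_attacked)  # both signals attacked

print(f"benign: {benign_corr:+.2f}")
print(f"single: {single_corr:+.2f}")
print(f"double: {double_corr:+.2f}")
```

In this toy case the single-signal attack flips the sign of the correlation between the two signals, while the double attack leaves the pairwise correlation looking benign. Real traces are noisier, so this should be read as motivation rather than as a detection method.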
Attack strategies
We chose two signals as the target of our tests: the vehicle speed and the engine revolution signals (see the next figure). We found these signals in the CAN communication using manual reverse engineering steps and validated our findings with the method presented by Lestyán et al.
We defined six signal modification strategies that we performed during both the fabrication and the masquerade attacks. Furthermore, we executed each attack first against a single signal, then against two signals simultaneously. This wide range of attacks covers many strategies, allowing for a thorough evaluation of defense mechanisms. The chosen signal modification strategies are the following:
– CONST: The attacker replaces the CAN signal values with a constant in every message.
– REPLAY: The attacker replaces a CAN signal value with a previously captured value from the traffic. This attack takes twice as long compared to the others: first, the attacker records the signal values, then in the second half of the attack, it replays them.
– POS-OFFSET: The attacker adds a constant value to the CAN signal in each message.
– NEG-OFFSET: The attacker subtracts a constant value from the CAN signal in each message.
– ADD-INCR: The attacker adds a continuously incrementing value to the CAN signal in each message. This causes a slow but growing shift away from the original value.
– ADD-DECR: The attacker subtracts a continuously decrementing value in each message from the CAN signal. This causes a slow but growing shift away from the original value.
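The six strategies can be sketched on a plain list of decoded signal values as follows. This is a minimal illustration, not the published generator: the function name apply_attack and the param argument are hypothetical, NEG-OFFSET and ADD-DECR are assumed to shift the signal downward as their names suggest, and REPLAY is assumed to record during the first half of the attack window and replay during the second half.

```python
from typing import List, Sequence

def apply_attack(values: Sequence[float], strategy: str, param: float = 1.0) -> List[float]:
    """Return attacked copies of the signal values inside one attack window.

    `param` is the constant, the offset, or the per-message step,
    depending on the strategy (a hypothetical parameterization).
    """
    n = len(values)
    if strategy == "CONST":
        # Every message carries the same constant value.
        return [param] * n
    if strategy == "REPLAY":
        # First half: values pass through while being recorded.
        # Second half: the recorded values are sent again.
        half = n // 2
        recorded = list(values[:half])
        if half == 0:
            return list(values)
        replayed = [recorded[i % half] for i in range(n - half)]
        return recorded + replayed
    if strategy == "POS-OFFSET":
        return [v + param for v in values]
    if strategy == "NEG-OFFSET":
        return [v - param for v in values]
    if strategy == "ADD-INCR":
        # The added amount grows with every message.
        return [v + param * (i + 1) for i, v in enumerate(values)]
    if strategy == "ADD-DECR":
        # The subtracted amount grows with every message.
        return [v - param * (i + 1) for i, v in enumerate(values)]
    raise ValueError(f"unknown strategy: {strategy}")

# Made-up signal values for demonstration only.
window = [10, 20, 30, 40]
for name in ("CONST", "REPLAY", "POS-OFFSET", "NEG-OFFSET", "ADD-INCR", "ADD-DECR"):
    print(name, apply_attack(window, name, 3))
```

A double attack in this sketch would simply call apply_attack on two signal series over the same window, with the same or with different strategies.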
Data records
The dataset is available on Figshare.
Code availability
The source code used for the dataset generation is open source, which allows others to extend or modify the dataset. Fabrication attack generation requires a few easily accessible hardware components, while the masquerade attacks can be generated on any general-purpose computer.
Furthermore, all code for the visualizations is also available in the repository.