CAN We Trust Your Results? A Cross-Dataset Study of Automotive IDS Evaluation

Our new paper is now available here.
You can cite it like this: B. Koltai, G. Ács and A. Gazdag, CAN We Trust Your Results? A Cross-Dataset Study of Automotive IDS Evaluation, Euro S&P – ACSW26, 2026.

Modern vehicles are no longer just mechanical machines. They are complex computer systems on wheels, with many electronic control units constantly talking to each other. One of the main communication systems inside vehicles is the Controller Area Network, or CAN bus.
The CAN bus was designed to be fast and reliable, but not necessarily secure. As cars have become more connected, this has become a serious concern. If an attacker gains access to the vehicle network, they may try to inject fake messages, block legitimate ones, or modify vehicle signals. This is why many researchers have worked on intrusion detection systems, or IDSs, for automotive networks.
These systems are meant to detect suspicious activity on the CAN bus. But there is an important question behind this work:

If an IDS performs well in one study, can we trust that it will also work well in another vehicle, another dataset, or another attack scenario? In other words, are some current methods too dependent on the specific datasets they were evaluated on?

The problem with comparing IDS results

Many CAN intrusion detection methods report very strong results. In some cases, the numbers look almost perfect. But these results are often measured on a single dataset, under one specific experimental setup.
That matters because CAN datasets can be very different from each other. They may come from different vehicles, driving conditions, attack implementations, or data collection methods. Some datasets are collected from real vehicles, while others are generated or processed in different ways.
As a result, a detection method may appear very effective on one dataset but perform much worse on another. This does not always mean the method is bad. It may mean that the evaluation setup does not show the full picture.
This is especially important in security. A model can learn patterns that are specific to one dataset instead of learning what an attack really looks like. In that case, the results may look strong but may not transfer well to other conditions.

This is a known problem in machine learning, often discussed in terms of data drift or concept drift. (Useful material on the topic can be found, for example, here.)

What we built

To study this problem, we created a unified benchmarking framework for CAN intrusion detection systems.
The framework brings different datasets into a common format, provides a shared interface for IDS methods, and applies a consistent evaluation process. This makes it easier to compare different detection approaches under the same conditions.
The goal was not only to see which method performs best, but also to see whether performance remains stable when the dataset changes.
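To make the idea concrete, here is a minimal sketch of what a common frame format, a shared detector interface, and a cross-dataset evaluation loop could look like. It is not the framework from the paper: the names CanFrame, Detector, KnownIdDetector and cross_dataset_eval are hypothetical, the detector is a deliberately naive toy, and the two "vehicles" are a few hand-written frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class CanFrame:
    """One CAN message in a hypothetical common format."""
    timestamp: float  # seconds since start of capture
    can_id: int       # arbitration ID
    data: bytes       # payload (up to 8 bytes for classic CAN)
    is_attack: bool   # ground-truth label


class Detector(Protocol):
    """Hypothetical shared interface that every IDS method implements."""
    def fit(self, frames: List[CanFrame]) -> None: ...
    def predict(self, frames: List[CanFrame]) -> List[bool]: ...


class KnownIdDetector:
    """Toy detector: flags every frame whose ID was not seen in benign training traffic."""
    def __init__(self) -> None:
        self.known_ids: set = set()

    def fit(self, frames: List[CanFrame]) -> None:
        self.known_ids = {f.can_id for f in frames if not f.is_attack}

    def predict(self, frames: List[CanFrame]) -> List[bool]:
        return [f.can_id not in self.known_ids for f in frames]


def f1_score(truth: List[bool], pred: List[bool]) -> float:
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


Dataset = Tuple[List[CanFrame], List[CanFrame]]  # (train split, test split)


def cross_dataset_eval(make_detector: Callable[[], Detector],
                       datasets: Dict[str, Dataset]) -> Dict[Tuple[str, str], float]:
    """Train on each dataset, test on every dataset, return F1 per (train, test) pair."""
    results = {}
    for train_name, (train, _) in datasets.items():
        detector = make_detector()
        detector.fit(train)
        for test_name, (_, test) in datasets.items():
            truth = [f.is_attack for f in test]
            results[(train_name, test_name)] = f1_score(truth, detector.predict(test))
    return results


def frame(t: float, can_id: int, attack: bool = False) -> CanFrame:
    return CanFrame(t, can_id, b"\x00", attack)


# Hand-written toy traffic: two vehicles with disjoint benign IDs, same injected ID.
datasets = {
    "A": ([frame(0.0, 0x100), frame(0.1, 0x200)],
          [frame(0.0, 0x100), frame(0.1, 0x200), frame(0.2, 0x000, attack=True)]),
    "B": ([frame(0.0, 0x300), frame(0.1, 0x400)],
          [frame(0.0, 0x300), frame(0.1, 0x400), frame(0.2, 0x000, attack=True)]),
}
results = cross_dataset_eval(KnownIdDetector, datasets)
for pair, score in sorted(results.items()):
    print(pair, round(score, 2))
```

Even in this toy setting, the detector looks perfect when trained and tested on the same vehicle and degrades when the test vehicle uses different benign IDs, because it has learned the normal traffic of one vehicle rather than the attack itself.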

What we found

The main finding is clear: IDS performance can change a lot from one dataset to another.

This can be both a dataset problem and a method problem. Some methods achieved very high performance on the datasets where they were originally evaluated. But when tested on other datasets, their performance sometimes dropped significantly. The ranking of methods also changed across datasets, meaning that the “best” method in one setting was not always the best in another.
This means that results from a single dataset should be interpreted carefully. A strong result on one dataset does not automatically prove that the method will work well in other vehicles or under different attack conditions.
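One simple way to check for this kind of instability is to rank the methods separately on each dataset and look at how much each method's score moves. The sketch below assumes a plain score table as input. The method names, dataset names and numbers are invented placeholders to exercise the code; they are not results from the study.

```python
from typing import Dict, List

# Placeholder values only: hypothetical names, invented numbers.
scores: Dict[str, Dict[str, float]] = {
    "dataset_1": {"method_a": 0.99, "method_b": 0.90, "method_c": 0.85},
    "dataset_2": {"method_a": 0.60, "method_b": 0.88, "method_c": 0.70},
}


def rank_methods(per_method: Dict[str, float]) -> List[str]:
    """Return method names ordered from best to worst score."""
    return sorted(per_method, key=per_method.get, reverse=True)


def score_spread(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return, per method, the gap between its best and worst score across datasets."""
    methods = next(iter(table.values())).keys()
    return {
        m: max(row[m] for row in table.values()) - min(row[m] for row in table.values())
        for m in methods
    }


rankings = {name: rank_methods(row) for name, row in scores.items()}
spread = score_spread(scores)
ranking_is_stable = len({tuple(r) for r in rankings.values()}) == 1

for name, order in rankings.items():
    print(name, order)
print("stable ranking:", ranking_is_stable)
```

A large spread for a method, or a ranking that differs between datasets, is a signal that a single-dataset result should not be read as a general statement about that method.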
We also found that different attack types are not equally easy to detect. High-volume injection attacks, such as denial-of-service attacks, were generally easier to detect. More subtle attacks, such as replay or suspension attacks, were much harder. This suggests that one IDS method alone may not be enough to cover the full range of possible attacks.
Another important result came from comparing vehicle environments. Even when the same attack implementations were used, performance still varied between vehicles. This shows that normal CAN traffic patterns can strongly influence detection results (and that a method can overfit to these patterns).

Why this matters

For researchers, these findings show that evaluating a CAN IDS on only one dataset can be misleading. It may hide weaknesses that only appear when the method is tested in a different environment.
For industry, the message is also practical. Automotive intrusion detection should not be treated as a single machine learning model that is trained once and assumed to work everywhere. A more reliable approach is to think of IDS methods as parts of a larger security system.
Different detectors may be good at detecting different kinds of attacks. Lightweight methods can help catch simple attacks early, while more advanced methods can focus on more complex cases. Combining complementary approaches may lead to more robust and practical protection.
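As a rough illustration of combining complementary detectors, the sketch below pairs two toy heuristics and raises an alert if either one fires on a window of traffic. Everything here is an assumption made for the example: the function names, the thresholds, and the choice of a per-ID frame count (for high-volume injection) and a missing-ID check (for suspension) are not the methods evaluated in the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set, Tuple

Frame = Tuple[float, int]  # (timestamp in seconds, CAN ID)
WindowDetector = Callable[[Sequence[Frame]], bool]


def make_rate_detector(max_frames_per_id: int) -> WindowDetector:
    """Lightweight check: alert if any single ID appears too often in the window."""
    def detect(window: Sequence[Frame]) -> bool:
        counts = Counter(can_id for _, can_id in window)
        return any(c > max_frames_per_id for c in counts.values())
    return detect


def make_missing_id_detector(expected_ids: Set[int]) -> WindowDetector:
    """Alert if an ID that normally appears in every window is absent."""
    def detect(window: Sequence[Frame]) -> bool:
        seen = {can_id for _, can_id in window}
        return not expected_ids <= seen
    return detect


def combine_any(detectors: List[WindowDetector]) -> WindowDetector:
    """Alert if at least one of the given detectors alerts."""
    def detect(window: Sequence[Frame]) -> bool:
        return any(d(window) for d in detectors)
    return detect


# Hand-written toy windows.
normal = [(0.00, 0x100), (0.01, 0x200), (0.10, 0x100), (0.11, 0x200)]
flood = normal + [(0.02 + i * 0.001, 0x000) for i in range(20)]  # injected ID 0x000
suspended = [(0.00, 0x100), (0.10, 0x100)]                       # ID 0x200 is silent

rate = make_rate_detector(max_frames_per_id=5)
missing = make_missing_id_detector({0x100, 0x200})
combined = combine_any([rate, missing])

for name, window in [("normal", normal), ("flood", flood), ("suspended", suspended)]:
    print(name, rate(window), missing(window), combined(window))
```

In this toy example each heuristic misses the attack that the other one catches, while the combination flags both and stays quiet on normal traffic. A real deployment would also have to weigh the extra false alarms that an OR-combination can introduce.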

The main takeaway

Our study shows that CAN intrusion detection results depend strongly on the dataset, vehicle environment, and attack implementation used during evaluation. This does not mean that existing IDS methods are not useful. It means that they need to be evaluated more carefully. Cross-dataset benchmarking gives a more realistic view of how well a method generalizes. It helps reveal whether a detector is learning real attack behavior or only patterns that are specific to one dataset.