CESNET-TimeSeries24: A Novel Dataset for Network Traffic Analysis and Forecasting
The CESNET-TimeSeries24 dataset is a new resource for network traffic analysis, anomaly detection, and forecasting. Built from traffic on the real-world CESNET3 ISP network, it spans 40 weeks of data from over 275,000 active IP addresses, giving researchers and practitioners a broad and varied sample of real network behavior.
Why CESNET-TimeSeries24?
Anomaly detection and traffic forecasting are increasingly important for network security, and both depend on high-quality datasets. Most available datasets are either synthetic or too short to support robust evaluation. CESNET-TimeSeries24 addresses this gap by providing:
- Real-world data: Derived from live ISP network traffic.
- Diverse behaviors: Includes office computers, NATs, servers, Wi-Fi routers, honeypots, and even video game consoles.
- Rich anomaly representation: Covers point, contextual, collective, and trend anomalies.
- Scalability: Designed for forecasting tasks at various aggregation levels (10 minutes, 1 hour, 1 day).
Dataset Highlights
- Volume and Scope:
  - 66 billion IP flows
  - 4 trillion packets
  - 3.7 petabytes of data
- Comprehensive Anonymization:
  - Strict privacy protocols were followed, ensuring no identifiable information is present.
- Time Series Aggregation:
  - Aggregated at multiple intervals: 10 minutes, 1 hour, and 1 day.
  - Metrics include the number of flows, packets, bytes, unique destination IPs, transport layer ports, and much more.
- Data Hierarchies:
  - Time series available at three levels:
    - Individual IP addresses
    - Institutions
    - Institutional subnets
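To illustrate how the three aggregation intervals relate to each other, here is a minimal sketch that rolls 10-minute points up into hourly and daily buckets. The dataset already ships all three intervals, so this is only an illustration. The field names (`n_flows`, `n_packets`, `n_bytes`) and the in-memory layout are assumptions made for the example, not the dataset's actual schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def reaggregate(series, bucket):
    """Sum additive counters from fine-grained points into coarser buckets.

    series: list of (timestamp, {counter_name: value}) pairs
    bucket: timedelta giving the target interval, e.g. 1 hour or 1 day
    """
    totals = defaultdict(lambda: defaultdict(int))
    for ts, counters in series:
        # Floor the timestamp to the start of the bucket it falls into.
        bucket_start = ts - (ts - EPOCH) % bucket
        for name, value in counters.items():
            totals[bucket_start][name] += value
    return {start: dict(vals) for start, vals in sorted(totals.items())}


# Two hours of synthetic 10-minute points (hypothetical field names).
t0 = datetime(2024, 1, 1)
ten_min = [
    (t0 + timedelta(minutes=10 * i), {"n_flows": 1, "n_packets": 10, "n_bytes": 1000})
    for i in range(12)
]

hourly = reaggregate(ten_min, timedelta(hours=1))
daily = reaggregate(ten_min, timedelta(days=1))
print(hourly)
print(daily)
```

Only additive counters such as flows, packets, and bytes can be summed this way. Unique counts (for example, unique destination IPs) cannot be derived by adding up finer windows, because the same address may appear in several of them.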
Key Features
Metrics
The dataset offers a variety of metrics for each aggregation interval, such as:
- Volumetric Metrics: Number of flows, packets, bytes.
- Unique Metrics: Unique destination IPs, ASNs, and ports.
- Directional Ratios: Ratios for inbound/outbound packets and bytes.
- Averages: Flow duration and Time to Live (TTL).
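The metric families above can be sketched as a function that summarizes the flows seen in one aggregation window. This is a hedged illustration only: the `Flow` record, its fields, and the definition of the directional ratio (outbound share of all packets) are assumptions for the example and may differ from how the dataset computes its metrics.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """Hypothetical flow record for one monitored IP address."""
    dst_ip: str
    dst_port: int
    packets: int
    bytes: int
    duration_s: float
    outbound: bool  # True if the monitored address sent the flow


def window_metrics(flows):
    """Compute volumetric, unique, ratio, and average metrics for one window."""
    n_packets = sum(f.packets for f in flows)
    out_packets = sum(f.packets for f in flows if f.outbound)
    return {
        # Volumetric metrics
        "n_flows": len(flows),
        "n_packets": n_packets,
        "n_bytes": sum(f.bytes for f in flows),
        # Unique metrics
        "n_dest_ip": len({f.dst_ip for f in flows}),
        "n_dest_ports": len({f.dst_port for f in flows}),
        # Directional ratio (assumed definition: outbound / total packets)
        "dir_ratio_packets": out_packets / n_packets if n_packets else 0.0,
        # Averages
        "avg_duration": sum(f.duration_s for f in flows) / len(flows) if flows else 0.0,
    }


flows = [
    Flow("192.0.2.1", 443, 10, 5000, 1.5, True),
    Flow("192.0.2.1", 53, 2, 200, 0.5, True),
    Flow("198.51.100.7", 443, 8, 12000, 4.0, False),
]
metrics = window_metrics(flows)
print(metrics)
```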
Anomaly Coverage
CESNET-TimeSeries24 includes:
- Point Anomalies: Isolated data points deviating from patterns.
- Contextual Anomalies: Data points that are unusual only in their context, such as the time of day.
- Collective Anomalies: Subsequence or pattern-based deviations.
- Trend Anomalies: Unexpected shifts in data trends.
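As a concrete example of the simplest category, point anomalies, the sketch below flags values that lie far from the median of the preceding window, scaled by the median absolute deviation (MAD). It is one possible baseline, not the method used to build or validate the dataset, and the `window` and `threshold` defaults are arbitrary choices for the illustration.

```python
from statistics import median


def point_anomalies(values, window=12, threshold=5.0):
    """Return indices whose value is far from the median of the preceding window."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = median(history)
        mad = median(abs(v - med) for v in history)
        # 1.4826 * MAD approximates the standard deviation for normal data;
        # fall back to 1.0 when the window is perfectly flat.
        scale = 1.4826 * mad if mad > 0 else 1.0
        if abs(values[i] - med) / scale > threshold:
            flagged.append(i)
    return flagged


# Synthetic flow counts with a single spike at index 13.
values = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 102, 100, 500, 101, 98]
print(point_anomalies(values))  # -> [13]
```

Median and MAD are used instead of mean and standard deviation so that the spike does not distort the baseline for the points that follow it. Collective and trend anomalies need detectors that look at subsequences or level shifts rather than single points.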
Validation and Usability
The dataset has been validated for two main tasks:
- Anomaly Detection: Using unsupervised and supervised approaches.
- Traffic Forecasting: Suited to time-series models such as SARIMA and LSTM.
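Before fitting models such as SARIMA or an LSTM, it helps to have a trivial reference forecast to compare against. The sketch below is a seasonal-naive baseline, a much simpler technique than either model: it repeats the last full season of observations. The function name and parameters are illustrative. With hourly aggregation, a daily season would be 24 points and a weekly season 168.

```python
def seasonal_naive(history, season_length, horizon):
    """Forecast by repeating the last full season of observations."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


# Toy series with a season of 4 points; forecast the next 6 points.
history = [10, 20, 30, 40, 12, 22, 32, 42]
forecast = seasonal_naive(history, season_length=4, horizon=6)
print(forecast)  # -> [12, 22, 32, 42, 12, 22]
```

A model that cannot beat this baseline on a given time series is probably not capturing anything beyond the daily or weekly cycle.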
Example Usage
To demonstrate the dataset's utility, we provide an example repository on GitHub: CESNET-TimeSeries24-Example.
The repository includes:
- Sample workflows for time-series forecasting.
- Preprocessing scripts to handle gaps and aggregation.
- Evaluation with standard metrics such as RMSE, SMAPE, and R².
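Two of the steps listed above, gap handling and forecast evaluation, can be sketched with the standard library alone. This is not the code from the example repository. Filling missing intervals with zero assumes that a gap means no traffic was observed, which may not suit every metric, and SMAPE has several variants in the literature; the one below divides by the mean of the absolute actual and predicted values.

```python
import math
from datetime import datetime, timedelta


def fill_gaps(points, step, fill_value=0):
    """Return a regular series, inserting fill_value where a timestamp is missing."""
    if not points:
        return []
    known = dict(points)
    ts, end = min(known), max(known)
    filled = []
    while ts <= end:
        filled.append((ts, known.get(ts, fill_value)))
        ts += step
    return filled


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def smape(actual, predicted):
    """Symmetric MAPE in percent; terms with a zero denominator count as 0."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        if denom:
            total += abs(a - p) / denom
    return 100 * total / len(actual)


def r2(actual, predicted):
    """Coefficient of determination; undefined when actual is constant."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# A 10-minute series with two missing intervals (00:20 and 00:30).
t0 = datetime(2024, 1, 1)
sparse = [(t0, 5), (t0 + timedelta(minutes=10), 7), (t0 + timedelta(minutes=40), 3)]
regular = fill_gaps(sparse, timedelta(minutes=10))

actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
print(rmse(actual, predicted), smape(actual, predicted), r2(actual, predicted))
```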
Get the Dataset
The dataset is available for download at Zenodo.
File Structure
The dataset is organized into:
- Time series grouped by aggregation intervals.
- Identifiers for IP addresses, institutions, and subnets.
- Supplementary files detailing weekends, holidays, and other metadata.
Applications
CESNET-TimeSeries24 is well suited to:
- Anomaly Detection: Detect previously unseen (zero-day) attacks from changes in historical behavior.
- Traffic Forecasting: Optimize resource allocation in networks.
- Security Research: Evaluate and compare machine learning models on real-world data.
Acknowledgments
This dataset was developed by researchers from CESNET and the Czech Technical University in Prague, and is the result of collaboration between academia and industry.
For questions or collaboration opportunities, feel free to reach out or contribute to the GitHub repository.