CICI: RSSD: LLMDaL — LLM-Driven Data Labeling for Training Machine Learning Models

NSF-funded project under the Cybersecurity Innovation for Cyberinfrastructure (CICI): Reference Scientific Security Datasets (RSSD) program.

Project Overview

The LLMDaL project leverages generative Artificial Intelligence (AI) and data from the AmLight international research and education (R&E) network to provide an essential, previously unavailable building block for automated network defense. The growing complexity and sophistication of modern networks are driving the need for automated cybersecurity and management. Operators of critical infrastructure must increasingly rely on AI to cope with the sheer scale of information and the growing use of AI by adversaries. However, effective AI defenses depend on both the quantity and quality of training data. The lack of high-quality, labeled datasets from production environments presents a significant barrier. Without access to such datasets, advanced models often remain untested in real-world scenarios and fail to learn the complexity and uncertainty of production environments, which limits their effectiveness. Consequently, AI models essential for critical infrastructure defense are likely to fail when deployed.

LLMDaL uses Large Language Models (LLMs) to automatically label packet-level data collected from AmLight, which is maintained at Florida International University. The technical, financial, and privacy challenges of providing such data remain substantial. To label this real-world data accurately and quickly, open-source LLMs are fine-tuned using data gathered from AmLight, along with known threat signatures and expert-annotated cybersecurity events. The resulting labels are validated through a Retrieval-Augmented Self-Refinement process, cross-checking with an ensemble of LLMs, and human-in-the-loop verification. LLMDaL fills a critical gap in automating dataset labeling, enabling effective testing of AI models for real-world environments. LLMDaL will release AmLight datasets in batches to reflect the evolving threat landscape.
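As an illustration only, the ensemble cross-check and human-in-the-loop steps described above could be combined roughly as follows. This is a minimal sketch and not the project's implementation: the function name `cross_check`, the `LabelDecision` type, the 0.75 agreement threshold, and the example label strings are all hypothetical. It assumes each model in the ensemble has already proposed one label for a given record.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabelDecision:
    label: Optional[str]       # agreed label, or None when the record is escalated
    agreement: float           # fraction of models that voted for the top label
    needs_human_review: bool   # True when the ensemble did not agree strongly enough


def cross_check(votes: List[str], min_agreement: float = 0.75) -> LabelDecision:
    """Majority vote over the labels proposed by several models for one record.

    The threshold is a hypothetical choice; records below it are routed
    to a human reviewer instead of being labeled automatically.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    top_label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement >= min_agreement:
        return LabelDecision(top_label, agreement, False)
    return LabelDecision(None, agreement, True)


if __name__ == "__main__":
    # Three of four hypothetical models agree, so the label is accepted.
    print(cross_check(["port_scan", "port_scan", "port_scan", "benign"]))
    # No clear majority, so the record is escalated for human review.
    print(cross_check(["port_scan", "benign", "ddos"]))
```

In a sketch like this, the threshold controls the trade-off between how many records are labeled automatically and how many are sent to human reviewers.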

Goals

  • Automate the labeling of real-world, packet-level network data drawn from a production R&E backbone.
  • Fine-tune open-source large language models (LLMs) using AmLight network traffic, threat-signature corpora, and expert-annotated security events.
  • Validate generated labels through Retrieval-Augmented Self-Refinement, ensemble cross-verification across multiple LLMs, and human-in-the-loop review.
  • Release labeled datasets in periodic batches to enable the research community to evaluate models against both current and emerging network threats.

Team

Principal Investigator

Co-Principal Investigators

Senior Project Personnel

Graduate Researchers

  • Bigyan Karki — Virginia Commonwealth University
  • Md Tasnim Jawad — Florida International University

Data and Resources

We are committed to open science and will make our datasets and resources publicly available where appropriate. LLMDaL will release labeled datasets curated from the AmLight R&E network in periodic batches that reflect the evolving threat landscape.

Contact

For general inquiries about LLMDaL, datasets, or research collaboration, please contact:

Principal Investigator: Kemal Akkaya - akkayak@vcu.edu

Senior Project Personnel: Abdulhadi Sahin - absahin@fiu.edu

Partners