high · experimental · Linux · AI/ML · T1565.001

LLM Training Or Fine-Tune Data Files Modified

Detects LLM service processes modifying training dataset files (JSONL, Parquet, CSV, Arrow) in training or fine-tuning directories. Outside an approved training pipeline, modification of training data at runtime is a strong indicator of data poisoning.
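The matching logic described above can be sketched in Python as follows. This is a minimal illustration, not the rule's actual query: the extension set, the directory markers, the `/opt/llm/` image prefix, and the `file_modify` event type are assumptions inferred from the description and the sample log, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed values: the rule names the formats and directory types,
# but not the exact extensions, directory names, or process paths.
DATASET_EXTENSIONS = {".jsonl", ".parquet", ".csv", ".arrow"}
TRAINING_DIR_MARKERS = {"training", "fine-tuning", "finetuning", "fine_tune"}
LLM_IMAGE_PREFIXES = ("/opt/llm/",)
WRITE_EVENT_TYPES = {"file_modify"}


def is_training_data_modification(event: dict) -> bool:
    """Return True if an LLM service process modified a dataset file
    located under a training or fine-tuning directory."""
    if event.get("event_type") not in WRITE_EVENT_TYPES:
        return False
    if not event.get("image", "").startswith(LLM_IMAGE_PREFIXES):
        return False
    path = PurePosixPath(event.get("target_filename", ""))
    if path.suffix.lower() not in DATASET_EXTENSIONS:
        return False
    # Match on any parent directory component, not on the file name itself.
    return any(part.lower() in TRAINING_DIR_MARKERS for part in path.parent.parts)
```

Matching on whole path components rather than substrings avoids flagging unrelated paths that merely contain the word "training" inside a longer name.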

Updated Jan 15, 2025 · Detection Engineering Team

llm · data-poisoning · linux · training-data · owasp-llm04

Problem Statement

Modifying training or fine-tuning datasets at runtime can cause subsequent model updates to produce biased, backdoored, or attacker-aligned outputs, representing a persistent and difficult-to-detect form of model compromise.

Sample Logs

{"timestamp":"2025-01-15T06:20:33Z","computer_name":"llm-host-01","user":"llm_svc","image":"/opt/llm/app/data_processor.py","target_filename":"/opt/llm/datasets/training/instructions.jsonl","event_type":"file_modify"}

Required Fields

image

target_filename

user

computer_name
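Before evaluating the rule, it can help to confirm that each event actually carries the required fields. The helper below is a small sketch; the function name `missing_required_fields` is hypothetical, and treating empty strings as missing is an assumption.

```python
REQUIRED_FIELDS = ("image", "target_filename", "user", "computer_name")


def missing_required_fields(event: dict) -> list:
    """Return the required field names that are absent or empty in the event."""
    return [name for name in REQUIRED_FIELDS if not event.get(name)]
```

An event for which this returns a non-empty list cannot be evaluated reliably and is better routed to a data-quality check than silently dropped.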

False Positives

· Approved active learning or online fine-tuning pipelines that update training data

Tuning Guidance

Training data directories should be immutable during inference. Any write event outside a designated training window should alert.
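One way to express the training-window condition is sketched below. The window times are hypothetical placeholders (a single daily 01:00 to 04:00 UTC slot), the function names are illustrative, and the sketch assumes event timestamps are ISO 8601 strings like the one in the sample log.

```python
from datetime import datetime, time, timezone

# Hypothetical daily training window in UTC; replace with the real schedule.
# Windows that cross midnight are not handled by this sketch.
TRAINING_WINDOWS = [(time(1, 0), time(4, 0))]


def parse_timestamp(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def in_training_window(ts: str, windows=TRAINING_WINDOWS) -> bool:
    """Return True if the timestamp falls inside any designated window."""
    t = parse_timestamp(ts).time()
    return any(start <= t < end for start, end in windows)


def should_alert(event: dict, windows=TRAINING_WINDOWS) -> bool:
    """Alert on write events that occur outside every training window."""
    return not in_training_window(event["timestamp"], windows)
```

With the placeholder window above, a write at 06:20:33 UTC falls outside the window and would alert, while a write at 02:00:00 UTC would not.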