high · experimental · Linux · AI/ML · T1565.001

LLM Dataset Replaced From Temporary Or User Home Path

Detects file copy, move, or rsync operations that replace training or embedding datasets with files sourced from temporary or user home directories. Staging a file in one of these locations and then writing it over a dataset is a common pattern in data poisoning attacks.

Updated Jan 15, 2025 · Detection Engineering Team

llm · data-poisoning · linux · dataset-replace · owasp-llm04

Problem Statement

Overwriting a training dataset with a file from a staging area is often the final step of a data poisoning attack. Detecting this file movement gives defenders a chance to intercept the poisoned data before the next model fine-tuning run uses it.

Sample Logs

{"timestamp":"2025-01-15T04:02:55Z","computer_name":"llm-host-03","user":"opc","image":"/bin/cp","command_line":"cp /tmp/poisoned_data.jsonl /opt/llm/datasets/training/instructions.jsonl"}

Required Fields

image

command_line

user

computer_name
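
The matching logic described above can be sketched in Python against the image and command_line fields. This is a minimal illustration, not the rule's actual implementation: the tool list, the staging prefixes, and the /opt/llm/datasets/ root (taken from the sample log) are assumptions, and the function name is hypothetical.

```python
import shlex
from pathlib import PurePosixPath

# Assumed values for illustration; adjust to the environment.
COPY_TOOLS = {"cp", "mv", "rsync"}
STAGING_PREFIXES = ("/tmp/", "/var/tmp/", "/home/", "/root/")
DATASET_PREFIXES = ("/opt/llm/datasets/",)  # inferred from the sample log


def is_suspicious_replace(event: dict) -> bool:
    """Return True when a copy tool writes a staged file into a dataset path."""
    tool = PurePosixPath(event.get("image", "")).name
    if tool not in COPY_TOOLS:
        return False
    try:
        # Skip argv[0]; the rest are options and paths.
        args = shlex.split(event.get("command_line", ""))[1:]
    except ValueError:
        return False  # unbalanced quotes; cannot parse reliably
    # Simplification: drop anything that looks like a flag and treat the
    # remainder as sources followed by one destination. Options that take
    # a value (such as cp -t) and relative paths are not handled here.
    paths = [a for a in args if not a.startswith("-")]
    if len(paths) < 2:
        return False
    *sources, dest = paths
    return dest.startswith(DATASET_PREFIXES) and any(
        s.startswith(STAGING_PREFIXES) for s in sources
    )


if __name__ == "__main__":
    sample = {
        "image": "/bin/cp",
        "command_line": "cp /tmp/poisoned_data.jsonl "
                        "/opt/llm/datasets/training/instructions.jsonl",
    }
    print(is_suspicious_replace(sample))  # True
```

Prefix matching on absolute paths keeps the sketch short. A production rule would likely also need to resolve relative paths and symlinks, which process-creation logs alone may not allow.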

False Positives

· Legitimate data preparation scripts that stage files in /tmp before moving them to dataset paths

Tuning Guidance

Cross-reference the source file with known approved data pipeline outputs. Alert on any replacement of core instruction-tuning or RLHF datasets.
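
One way to cross-reference a source file with approved pipeline outputs is to compare its SHA-256 digest against a set of digests published by the pipeline. The sketch below assumes such a set exists; the function names and the idea of a digest allowlist are hypothetical and not part of the rule itself.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large datasets are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_approved_source(path, approved_hashes: set) -> bool:
    """Return True only if the file's digest is in the approved set.

    approved_hashes is a hypothetical allowlist of hex digests that the
    data pipeline would have to publish for its outputs.
    """
    try:
        return sha256_of(path) in approved_hashes
    except OSError:
        # A missing or unreadable source cannot be verified, so it is
        # treated as not approved.
        return False
```

An alert whose source file is approved could be downgraded or suppressed, while replacements of core instruction-tuning or RLHF datasets would still alert regardless. The check only works if the source file still exists when the check runs, which is not guaranteed for mv operations.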