DzGuard
Multi-Layered Defense Against Malicious and Dangerous Prompts
System Architecture
DzGuard-LLM is built on a Layered Cognitive Pipeline designed to safely and accurately analyze user prompts. Each layer progressively cleans, interprets, and evaluates the input, ensuring robust detection even against obfuscated or adversarial content
Pre-Processing
- Recursive Decryption: If encrypted text is given , a heuristic scanner peels back layers of obfuscation (Base64, Hex, ROT13). It includes an extraction engine for AES-CBC (128-bit) and RSA-OAEP keys embedded in code (for decryption the prompt should be constructed this way :
for AES : iv = '...(hex)' , key = "...(hex)" , message = '...(b64)' or for RSA (d , privatekey ...) = '-----BEGIN RSA PRIVATE KEY-----...(b64)-----END RSA PRIVATE KEY-----' , cipher = '...'). - Neural Normalization: To counter spacing tricks and symbol-based attacks, DzGuard-LLM uses a two-step normalization pipeline:
- Stage A: Probabilistic N-Gram segmentation reconstructs broken spacing ( i g n o r e = ignore)
- Stage B: ByT5 (Byte-Level Transformer) maps symbol-based obfuscation to plain English(h4ck = hack , 1gn0r3 = ignore)
This stage focuses on revealing the true intent of the input before any classification occurs
Analysis and Detection
- XGBoost (eXtreme Gradient Boosting): 2 prediction models (decision trees) trained on a large dataset of specific known malicious prompts (Salkhan12 +95k lines) built sequentially to focus heavily on "hard" examples that previous trees missed(these two models are strict)
- Random Forest (Bagging): a model trained on dataset that contains general malicious prompts to constructs a multitude of decorrelated decision trees during training. It reduces variance and prevents the "overfitting" that can be caused by the previous two strict models
A. Vectorization (all-MiniLM-L6-v2):
Before classification, our classifiers don't understand raw text they only understand mathematical values and vectors, in NLP 'sentence embeddings' are used to convert raw text into dimensional vectors (a dense 384-dimensional vector space), for this project we are using Sentence-Transformers (MiniLM). This open-source Transformer model captures relationships between words (understanding that "hack" and "exploit" are geometrically close), enabling the classifiers to see "meaning" rather than just keywords
B.Predict:
Trained three distinct models on a dataset of 95k+ adversarial prompts (salkhan12 , deepset , geekyrakshit from HuggingFace) to create a robust consensus:
C. Meta-Learner:
This Logistic Regression model is trained to assign 'Trust Weights' to each classifier, in other words it learns who to trust between the three previous models (trained it by making the three previous models make predictions on a labeled dataset, those predictions were the training set of this model) It takes the probability outputs [p1 , p2 , p3] as inputs and outputs the final risk score. This allows the system to prioritize XGBoost for specific attack patterns while relying on Random Forest for general stability
Context
If The models are not sure of the decision we logically make the 'GREY_ZONE' decision (usually this happens when a prompt contains dangerous words used in different contexts ), the solution is to call the DeBERTa-v3-Large NLI model, It calculates the Logical Entailment between the prompt and two hypotheses (Attack vs. Education) (shortly we use it to know if the prompt by the user implies an attack or an educational request)
Questions ?
Why all-MiniLM-L6-v2?
We prioritized latency without sacrificing intelligence. MiniLM utilizes Knowledge Distillation to mimic the behavior of the massive BERT model. It retains roughly 95% of BERT's semantic accuracy while being 20x faster (22M parameters vs. 340M+). This ensures the user experience remains real-time
Why 2 XGboost models?
XGboost uses the Gradient Boosting method , which means it builds trees sequentially , Tree #2 learns only from the mistakes of the Tree #1 , and the third learns only from the mistakes of the of the second one and so on ; most of the time prompt attacks look safe so we need a model that is greedy and aggressive , so XGBoost by minimizing the Bias can be so accurate on specific, known attack patterns (trained them on two different datasets containing very specific attacks)
Why A random forest Model?
Random Forest uses Bagging (Bootstrap Aggregating): it trains 100 trees independently on random subsets of data and averages their votes Since XGBoost is so aggressive it can sometimes overfit (hallucinate that a prompt is an attack while it is not ) Random Forest is designed to minimize variance , if XGboost panics , Random forest calms it down So finally by combining these models we can achieve Robust decision boundaries that are hard to fool
Why The MetaLearner ?
Not all models are created equal in every context, For example, XGBoost might excel at detecting SQL injections, while Random Forest performs better on social engineering,a simple average ignores this nuance The Logistic Regression layer learns the Trust Weights, effectively saying: "In this vector space, XGBoost is usually right (0.7 weight), but in that space, rely on Random Forest"
Why the context Engine DeBERTa-v3-Large ?
Keywords filters often fail on contexts, if a word like kill had been put in a safe context , the models can panic , so when they 'are not sure' the score falls in the 'GREY ZONE' we call DeBERTa DeBERTa (Decoding-enhanced BERT with Disentangled Attention) outperforms standard BERT by separating content vectors from position vectors. This allows it to understand the "grammar of intent" with near-human precision. We use it strictly as a fallback for "Grey Zone" inputs to resolve ambiguity without slowing down the vast majority of other queries
FOR BETTER DECRYPTION REQUEST
If you want to send a prompt that contains decryption requests using AES or RSA it is better to write the requirements in an explicit way:
AES
your prompt should contain :
key = ...(hex) , iv =...(hex) , cipher = ...(base64)
RSA
should contain :
privatekey = '-----BEGIN RSA PRIVATE KEY-----...(b64)-----END RSA PRIVATE KEY-----' ciphertext = ...(base64)
(Note that you can use other keywords like payload , d , ciphertext , msg ,.... but for better results use known keywords in your prompt)