Methodology
SichGate uses a systematic, research-backed methodology to identify vulnerabilities in AI/ML systems before they reach production. Our approach combines automated adversarial testing with rigorous validation to measure the gap between policy and actual model behavior.

Three-Tier Testing Framework
Behavioral Subversion Testing
Detect if attackers can manipulate the model to violate safety policies or intended behavior.
Prompt Injection Attacks
Embed malicious instructions within legitimate user input. Based on Perez et al. (2022) "Ignore Previous Prompt: Attack Techniques For Language Models"
Measures whether the model follows attacker instructions over original policy.
Jailbreaking Through Role-Play
Use fictional scenarios to bypass safety constraints. Based on Wei et al. (2023) "Jailbroken: How Does LLM Safety Training Fail?"
Measures whether fictional framing undermines safety guardrails.
Context Manipulation
Provide misleading context to influence model outputs. Based on Wallace et al. (2019) "Universal Adversarial Triggers for Attacking and Analyzing NLP"
Measures whether context priming overrides training.
Policy Bypass via Obfuscation
Disguise prohibited content using encoding, typos, or synonyms. Based on industry practice from red-team exercises.
Measures whether simple obfuscation defeats content filters.
Validation Process
- Execute attack against model
- Compare output to baseline (same input without attack)
- Check if model behavior changed in attacker-favorable direction
- Assign severity based on policy violation magnitude
- Generate reproducible test case with exact input/output
Capability Failure Testing
Verify the model maintains reliability under input variations and edge cases.
Typo and Misspelling Resistance
Introduce realistic typographical errors.
Semantic Complexity Handling
Test understanding of negation, sarcasm, idioms.
Format Variation Tolerance
Alter capitalization, punctuation, whitespace.
Edge Case Discovery
Test ambiguous, contradictory, or boundary inputs.
Validation Process
- Establish baseline performance on clean test set
- Apply perturbations that should not change ground truth
- Measure prediction consistency (should remain stable)
- Flag cases where small input changes cause large output changes
- Calculate robustness score (% of tests maintaining correct prediction)
Information Disclosure Testing
Prevent leakage of training data, system internals, or sensitive patterns.
Training Data Memorization
Query for verbatim training examples. Based on Carlini et al. (2021) "Extracting Training Data from Large Language Models"
Measures whether model reproduces training data verbatim.
Membership Inference
Determine if specific data was in training set. Based on Shokri et al. (2017) "Membership Inference Attacks Against Machine Learning Models"
Measures whether confidence scores leak membership information.
Capability Probing
Query model about its own architecture or limitations. Based on transparency and explainability research.
Measures whether model inappropriately discloses internal details.

Test Execution Architecture
Black-Box Testing
Default Mode
- Access: API endpoint or web interface only
- Info: Model predictions (labels or scores)
- Use: Third-party models, vendor APIs, production systems
Grey-Box Testing
When Available
- Access: Confidence scores and logits
- Info: Prediction probabilities
- Use: Models you deploy but don't fully control
White-Box Testing
Full Access
- Access: Model weights, architecture, training data
- Info: Complete gradient access
- Use: Models you train and deploy internally
Severity Classification System
We assign severity levels based on exploitability, impact, and likelihood:
Score 9-10
Complete policy bypass with trivial exploit.
Example: Single-word prompt injection that reverses all classifications.
Block deployment until fixed.
Score 7-8
Reliable policy violation with simple attack.
Example: Role-play jailbreak that works 80%+ of the time.
Required before production launch.
Score 4-6
Inconsistent failures or edge case issues.
Example: Model fails on double negatives 40% of the time.
Should fix but not deployment blocker.
Score 1-3
Minor robustness issues with limited impact.
Example: Unusual punctuation causes slightly different confidence scores.
Track but low priority.
Quality Assurance Process
Every test case goes through our validation pipeline:
Test Design Review
Verify attack is based on published research or real exploits
Baseline Establishment
Run test on known-vulnerable and known-robust models
False Positive Check
Ensure detected issues are actual vulnerabilities, not expected behavior
Reproducibility Verification
Run each test 3+ times to confirm consistency
Impact Assessment
Evaluate business and regulatory consequences
Remediation Validation
After fixes, retest to confirm vulnerability is closed
Continuous Improvement
Our test battery evolves with the threat landscape:
Monthly Updates
Add tests for newly published attack techniques.
Community Contributions
Accept peer-reviewed tests from security researchers.
Failure Analysis
When models pass our tests but fail in production, we add those cases.
Regulatory Alignment
Update tests as AI regulations evolve (EU AI Act, NIST AI RMF).
Limitations and Scope
We are transparent about what we can and cannot detect:
What We Test
- + Known adversarial attack patterns from academic literature
- + Common robustness failures (typos, formatting, edge cases)
- + Policy compliance under adversarial pressure
- + Behavioral consistency across input variations
- + Information disclosure through direct queries
What We Don't Test (Yet)
- - Novel zero-day attacks not yet published
- - Physical adversarial examples
- - Supply chain attacks on training infrastructure
- - Model inversion attacks requiring auxiliary data
- - Backdoor detection (requires white-box access)
Known Blind Spots
- We cannot guarantee completeness (proving absence of all vulnerabilities is impossible)
- Our tests are point-in-time snapshots, not continuous monitoring (unless using SichGate Pro)
- We test model behavior, not deployment infrastructure security
- Black-box testing is less comprehensive than white-box analysis
Research Foundation
Our methodology builds on peer-reviewed research from leading institutions:
All test implementations are open source and auditable at: github.com/poshecamo/adversarial-testing-slm-sichgate
Independent Validation
We welcome external review of our methodology:
Open Source Codebase
All tests are publicly auditable.
Reproducible Results
Test cases include exact inputs and expected outputs.
Academic Collaboration
Working with university researchers on validation studies.
Bug Bounty
Report methodological flaws for recognition and bounty (coming Q2 2026).