What is the QnA model?

The QnA model is the conversational backbone of Securitron. While the Review model filters findings and the Security Assistant generates remediation guidance, the QnA model handles the interactive layer — answering follow-up questions, explaining scan results in plain English, and helping engineers understand the security implications of specific code patterns.

It is purpose-built for security Q&A: trained on a corpus of vulnerability documentation, CVE descriptions, OWASP guidelines, and AquilaX scan data. It understands the language of application security and maintains context across multi-turn conversations.

Model ID: AquilaX-AI/QnA, available on HuggingFace. Architecture: instruction-tuned transformer. Context window: 8,192 tokens. Minimum hardware: ≥4GB RAM, CPU-optimised via quantisation.

Technical specifications

  • Context window: 8,192 tokens — sufficient for full scan reports and multi-turn conversations about complex vulnerabilities
  • Max output: 512 tokens per response — keeps answers concise and actionable
  • History limit: 5 conversational exchanges maintained in context — prevents context overflow on long sessions
  • Minimum hardware: ≥4GB RAM — runs on standard developer machines without GPU requirement
  • Quantisation: CPU-optimised quantisation for fast inference without CUDA

Chat message format

The QnA model uses a standard instruction-tuned chat template with <|im_start|> and <|im_end|> delimiters. The conversation structure follows the ChatML format:

# System prompt (injected automatically by the AquilaX platform)
<|im_start|>system
You are Securitron, AquilaX's AI security assistant. You help engineers understand
vulnerability findings, explain security concepts, and provide remediation guidance.
Be precise, technical, and actionable. Reference CVEs and CWEs where relevant.
<|im_end|>

# User turn
<|im_start|>user
I have a CWE-89 SQL injection finding in my Python Flask app on line 42.
The code concatenates user_id directly into a query string.
What's the risk and how do I fix it?
<|im_end|>

# Assistant response
<|im_start|>assistant
CWE-89 SQL Injection is a critical risk...
<|im_end|>

The conversation history (up to 5 prior exchanges) is prepended to each new request to maintain context. Older exchanges beyond the 5-exchange window are dropped to stay within the 8,192 token context limit.
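
If you are assembling requests yourself rather than going through the AquilaX platform, the trimming is straightforward to implement client-side. The trim_history helper below is an illustrative sketch, not part of an official SDK:

def trim_history(messages, max_exchanges=5):
    """Keep the system prompt plus the last `max_exchanges` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # One exchange = a user turn followed by an assistant turn
    return system + turns[-(max_exchanges * 2):]

# Before each new request, drop anything older than the window
history = trim_history(history)
history.append({"role": "user", "content": "Does this also apply to NoSQL databases?"})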

Running inference locally

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AquilaX-AI/QnA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu"   # CPU-optimised — no GPU required
)

# Build the conversation messages
messages = [
    {"role": "system", "content": "You are Securitron, AquilaX's AI security assistant."},
    {"role": "user",   "content": "What is the difference between SAST and DAST?"}
]

# Apply the ChatML template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)

Streaming output

For interactive chat interfaces, the QnA model supports token-by-token streaming via HuggingFace's TextStreamer. This provides a live typewriter effect in the Securitron UI rather than waiting for the full response to generate.

from transformers import TextStreamer

# Use TextStreamer for live token output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    streamer=streamer,   # Tokens printed as they are generated
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

Streaming in the Securitron UI: The AquilaX dashboard uses server-sent events (SSE) to stream QnA responses in real time. Responses begin appearing within milliseconds of the request, improving perceived latency significantly versus batch inference.
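
The dashboard's server code isn't shown here, but one common way to bridge HuggingFace generation into an SSE stream is TextIteratorStreamer, which exposes tokens as a Python iterator instead of printing them. The sse_events generator below is an illustrative sketch, not the AquilaX implementation:

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Generation runs in a background thread; the streamer yields tokens as they arrive
generation_kwargs = dict(
    **inputs,
    max_new_tokens=512,
    streamer=streamer,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
Thread(target=model.generate, kwargs=generation_kwargs).start()

def sse_events():
    # Each token chunk becomes one SSE "data:" frame for the browser's EventSource
    for token in streamer:
        yield f"data: {token}\n\n"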

Example conversation

You: My scan found 3 critical SQL injection findings. Which should I fix first?

Securitron: Prioritise findings in externally-exposed endpoints first: database query methods called directly from HTTP request handlers. Then look for findings where the injected parameter is unvalidated user input versus system-controlled values. Share the specific findings and I can give you a ranked remediation order.

You: Here's the first one: src/api/users.py line 42, cursor.execute("SELECT * FROM users WHERE email = " + request.args.get('email'))

Securitron: This is high priority: direct string concatenation of an HTTP query parameter into a raw SQL statement. Fix: replace it with a parameterised query: cursor.execute("SELECT * FROM users WHERE email = %s", (request.args.get('email'),)). This eliminates the injection vector entirely and requires no sanitisation logic.
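
Securitron's suggested fix, distilled into a minimal runnable sketch. The Flask route and sqlite3 backend below are illustrative, and the DB-API placeholder varies by driver (%s for psycopg2/MySQL, ? for sqlite3):

from flask import Flask, request, jsonify
import sqlite3

app = Flask(__name__)

@app.route("/users")
def get_user():
    email = request.args.get("email")
    conn = sqlite3.connect("app.db")
    conn.row_factory = sqlite3.Row
    # The driver binds `email` as data, never as executable SQL
    row = conn.execute("SELECT * FROM users WHERE email = ?", (email,)).fetchone()
    return jsonify(dict(row)) if row else ("Not found", 404)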

API access

The Securitron QnA model is available via the AquilaX REST API at: https://developers.aquilax.ai/api-reference/genai/securitron

The API accepts a JSON body with a messages array following the ChatML structure. The conversation history (up to 5 exchanges) should be passed on each request. The API returns a streaming SSE response or a complete JSON response depending on the stream parameter.
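
As a sketch, a non-streaming request might look like the following; the endpoint path, auth header, and exact field names here are assumptions, so confirm them against the API reference linked above:

import requests

# Illustrative request shape -- the endpoint URL and auth scheme are placeholders
resp = requests.post(
    "https://api.aquilax.ai/...",   # see the API reference for the real endpoint
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "messages": [
            {"role": "user", "content": "What is the difference between SAST and DAST?"}
        ],
        "stream": False,   # True requests an SSE stream instead of a single JSON body
    },
)
print(resp.json())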

Context injection: When using the Securitron chat in the AquilaX dashboard, your current scan results are automatically injected into the system prompt. The model has full access to your findings, severity distribution, and affected files without you needing to copy-paste them into the chat.