Nikolay Donets

Attacks on Large Language Models and LLM-based Services

/ 10 min read

From chatbots and translation tools to writing assistants, large language models (LLMs) are already deeply integrated into our everyday lives. They provide helpful assistance, seamless translations, and creative inspiration at our fingertips. We interact with these models constantly, relying on their accuracy and helpfulness. But what happens when that trust is broken? Impressive as LLMs are, they are not flawless. Like any technology, they have vulnerabilities and can be manipulated by malicious actors. While LLMs have great potential to improve our lives, they also introduce a new kind of manipulation and digital risk. As creators of LLM-based services, we need to understand these risks and take proactive measures to build trust and mitigate potential harm.

Why do attackers target LLMs and LLM-based services?

Disruption

Attackers target LLM-based services for various reasons. One common motive is disruption. By exploiting an LLM's ability to generate text, attackers can craft inputs designed to make the model produce massive amounts of output. This can result in a denial-of-service (DoS) attack, overwhelming the system and rendering it unusable for legitimate requests. Unlike a generic DoS attack, this one also has direct financial implications. A related motive is resource exhaustion. LLMs are computationally expensive to run, and an attacker can force the model to generate excessive and complex outputs, draining resources and increasing costs for the company. This can also disrupt other services sharing the same infrastructure. By exploiting these characteristics, attackers can mount highly targeted attacks that cause significant damage to LLM-based services and the organisations that rely on them.
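
To put the financial impact in perspective, the back-of-the-envelope calculation below estimates what an hour of forced long completions could cost. Every number in it (price per thousand output tokens, tokens per response, request rate) is a hypothetical placeholder, not a figure from any particular provider.

```python
# Back-of-the-envelope estimate of a resource-exhaustion attack.
# All constants are hypothetical placeholders, chosen for illustration only.

PRICE_PER_1K_OUTPUT_TOKENS = 0.03      # assumed price in USD
TOKENS_PER_MALICIOUS_RESPONSE = 4_000  # attacker forces outputs near the length cap
REQUESTS_PER_MINUTE = 200              # sustained automated traffic
ATTACK_DURATION_MINUTES = 60

total_tokens = TOKENS_PER_MALICIOUS_RESPONSE * REQUESTS_PER_MINUTE * ATTACK_DURATION_MINUTES
estimated_cost = total_tokens / 1_000 * PRICE_PER_1K_OUTPUT_TOKENS

print(f"Tokens generated: {total_tokens:,}")        # 48,000,000
print(f"Estimated cost:   ${estimated_cost:,.2f}")  # $1,440.00
```

Even with conservative assumptions, an hour of unthrottled traffic produces a noticeable bill, which is why per-request token caps and rate limits come up again in the protection section below.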

Reputation damage

Reputation damage is another serious threat to companies that use LLM-based services. If an attacker can manipulate an LLM into producing false, biased, or harmful content, it erodes trust in both the LLM itself and the company or service behind it. This loss of trust can make people reluctant to use the service and have long-term consequences for adoption. For example, if an LLM powers a chatbot and that chatbot produces biased or harmful responses, users may stop using the chatbot and may even avoid other products or services from the company.

Profit

Attackers target LLM-based services for more than just causing chaos: they are after profit too. Companies often use or train LLMs on sensitive internal data such as emails, documents, and code repositories. If breached, this information could reveal company strategies or trade secrets, or expose vulnerabilities in software, giving competitors an unfair advantage. Sensitive user data is another prime target. Even anonymised data can be compromised, allowing partial reconstruction of personal medical histories, financial records, or private conversations. Such a breach of privacy has far-reaching consequences, eroding trust and leaving individuals vulnerable to exploitation. Attackers can also employ model inversion attacks, probing an LLM with carefully crafted inputs to reverse-engineer parts of the model or its training data. This tactic lets them glean valuable information without direct access, posing a significant threat to the integrity of LLM-based services and the security of the organisations relying on them.

A gateway to wider attacks

Integrating LLMs into larger systems not only enhances functionality but also opens a gateway to wider attacks: a compromised LLM can serve as a strategic foothold. Privilege escalation within a company's infrastructure amplifies the risk. An infiltrated LLM might have access to data or systems beyond its intended purview, giving attackers the leverage to move laterally within the network and breach more sensitive targets. And as LLMs become increasingly involved in code generation, new vulnerabilities emerge: attackers can manipulate the LLM into inserting malicious code snippets, sowing the seeds of software weaknesses and further widening the threat landscape.

Types of attacks on LLMs

When it comes to safeguarding LLMs and the services built on them, understanding the various types of attacks they face is crucial. Attackers continuously seek to exploit the capabilities of LLMs, devising strategies to circumvent existing safeguards. Here's a breakdown of key attack types, with specific examples:

  • AIM (Always Intelligent and Machiavellian) and DAN (Do Anything Now). Persona-based jailbreaks that induce the LLM to generate responses that are deliberately harmful or that violate its intended purpose:
    • Devmoderandi (Developer Mode). Tricking the LLM into a mode designed for internal use, where safety constraints may be relaxed.
    • Devmode v2. A more refined approach using specific prompts or techniques to gain privileged access.
  • Competing Objectives. Overriding the LLM's safety mechanisms by introducing prompts that conflict with its original objectives:
    • Prefix Injection. Adding a prefix like “Absolutely! Here’s” to force the LLM to respond even to harmful requests.
    • Refusal Suppression. Excluding phrases like “I’m sorry” to increase the likelihood of harmful responses.
    • DeepInception. Creating nested scenarios to gradually erode the LLM's safeguards.
    • In-Context Learning. Providing harmful examples within the prompt to induce the LLM to mimic the undesirable behaviour.
  • Mismatched Generalisation. Unusual or obfuscated prompts to evade the LLM's natural-language safety filters:
    • Cipher. Non-natural language prompts to confuse safety mechanisms.
    • Base64. Encoding prompts to disguise their true nature.
    • Leetspeak. Using visually similar symbols instead of letters.
    • Morse Code. Communicating in Morse code to bypass text filters.
    • ReNeLLM. Rewriting prompts (misspelling, inserting characters, etc.) and nesting scenarios for circumvention.
  • Low-Resource Translations. Tricking the LLM safety checks by translating harmful prompts into lesser-known languages (like Zulu, Scots Gaelic, etc.) where filtering may be weaker.
  • Automated Jailbreaking. Tools to systematically generate prompts designed to override LLM safety protocols:
    • AutoDAN. Hierarchical genetic algorithm that automatically creates potent jailbreak prompts.
    • GCG. Greedy Coordinate Gradient approach that finds suffixes to attach to queries, inducing objectionable responses.
    • GPTfuzz. Builds on initial human-written templates and mutates them to generate new forms of jailbreak prompts.
    • MasterKey. Inspired by time-based SQL injection, using specialised LLMs to find jailbreak exploits.
    • PAIR and TAP. Techniques where an attacker LLM iteratively refines candidate (attack) prompts to effectively jailbreak a target LLM.
  • Generation Exploitation. Exploiting the LLM's text-generation process by manipulating decoding parameters and sampling methods to increase the likelihood of unsafe outputs (illustrated in the sketch after this list).
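
The last item is easiest to see with a toy next-token distribution: raising the sampling temperature (or loosening top-p/top-k) flattens the distribution, so continuations the aligned model would almost never pick become far more likely to be sampled. The probabilities below are made up purely for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Re-scale a next-token distribution with a sampling temperature."""
    logits = [math.log(p) for p in probs]
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical distribution over three continuations: [safe, neutral, unsafe].
probs = [0.96, 0.03, 0.01]  # made-up numbers

for t in (0.7, 1.0, 2.0):
    _, _, unsafe = apply_temperature(probs, t)
    print(f"temperature={t}: P(unsafe) = {unsafe:.3f}")
# temperature=0.7: P(unsafe) = 0.001
# temperature=1.0: P(unsafe) = 0.010
# temperature=2.0: P(unsafe) = 0.080
```

In this toy example a single change to the decoding temperature multiplies the probability of the undesirable continuation by roughly eight, which is the core idea behind generation-exploitation attacks that sweep decoding parameters instead of crafting clever prompts.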

Protecting LLMs and LLM-based services

Safeguarding LLM-based services against malicious attacks requires robust defences and strategic mitigation measures. Here’s a look at effective strategies for protecting LLMs:

  • Adversarial Training. Expose the LLM to a wide range of misleading, harmful, or out-of-distribution examples during the training process. This helps the model learn to recognize and resist attempts to manipulate it, improving overall robustness.
  • Data Filtering and Verification. Carefully curate and vet training data, reducing the chance of data poisoning by malicious content. Employ automated tools and human review to identify and remove potentially harmful or biased material.
  • Input Sanitisation and Validation. Establish strict rules and filters for user prompts, preemptively blocking input patterns recognised as potential attack vectors. Validate user inputs against expected formats and data types, and reject suspicious or malformed queries (a minimal sketch of this, together with output filtering, follows this list).
  • Output Monitoring and Filtering. Continuously analyse LLM outputs to identify suspicious patterns, unexpected responses, or deviations from established safety guidelines. Implement real-time monitoring systems to flag potential attacks and trigger protective measures such as human review or blocking of harmful content.
  • Fine-tuning on Safe Datasets. Further train the LLM on specialised, curated datasets explicitly designed to reinforce safety principles, bias awareness, and appropriate behaviours. This additional training can help correct any undesirable tendencies that might exist in the initial model.
  • Explainability Tools. Develop and integrate systems that help developers and users understand the reasoning behind an LLM’s outputs. These “explainability” tools are crucial for identifying subtle manipulation attempts, diagnosing unintended consequences, and improving the model over time.
  • Confidentiality and Privacy Protections. Implement strict access controls, robust encryption, and privacy-preserving techniques if the LLM handles sensitive data. Mitigate the risk of unauthorised access and prevent confidential information leaks.
  • Watermarking and Output Tracking. Embed hidden patterns or identifiers within LLM outputs. These watermarks can help track the source of generated text, aiding in the detection of malicious use or identifying potential leaks of proprietary training data.
    • Reference: Towards Codable Watermarking for Injecting Multi-bit Information to LLM.
  • Human-in-the-Loop. Introduce human oversight and verification at crucial points, especially in high-stakes applications where the consequences of LLM errors are significant. This hybrid approach ensures that humans retain a degree of control and can intervene to prevent harm.
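
To make the input-sanitisation and output-monitoring items above concrete, here is a minimal sketch of a pre- and post-processing guard around an LLM call. The block list, length cap, and Base64 heuristic are illustrative placeholders, not a production-grade filter, and the function names are assumptions rather than any library's API.

```python
import base64
import re

MAX_PROMPT_CHARS = 4_000
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
    re.compile(r"developer mode", re.IGNORECASE),
]

def looks_like_base64(text: str) -> bool:
    """Flag long runs that decode cleanly as Base64 (a common obfuscation trick)."""
    for chunk in re.findall(r"[A-Za-z0-9+/=]{24,}", text):
        try:
            base64.b64decode(chunk, validate=True)
            return True
        except Exception:
            continue
    return False

def sanitise_prompt(prompt: str) -> str:
    """Reject prompts that are too long, match known attack patterns, or hide encoded payloads."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt exceeds allowed length")
    if any(p.search(prompt) for p in BLOCKED_PATTERNS):
        raise ValueError("Prompt matches a known attack pattern")
    if looks_like_base64(prompt):
        raise ValueError("Prompt contains an encoded payload")
    return prompt

def filter_output(completion: str) -> str:
    """Hook for output monitoring: flag suspicious completions for human review."""
    if any(p.search(completion) for p in BLOCKED_PATTERNS):
        raise ValueError("Completion flagged for review")
    return completion
```

In practice such checks sit alongside rate limiting, logging, and model-level defences rather than replacing them; the point is simply that both the prompt and the completion pass through explicit validation layers.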

Important considerations

When ensuring the security of LLM-based services, several key considerations should guide your approach:

  • Tailored Defence.
    • Recognize that there’s no one-size-fits-all solution.
    • Implement a comprehensive defence strategy by layering multiple techniques tailored to your LLM's unique needs, vulnerabilities, and operating environment.
  • Trade-off Awareness.
    • Be mindful of potential trade-offs when implementing defences.
    • Strive for a balance between effectiveness and usability, especially in time-sensitive or resource-constrained applications.
    • Prepare for potential slight delays or performance decreases.
  • Continuous Development and Adaptation.
    • Embrace a mindset of continuous development and adaptation.
    • Stay abreast of new attack techniques and advancements in defence to maintain a robust security posture.
  • Ethics and Responsibility.
    • Uphold a strong sense of responsibility and ethics.
    • Developers bear the ethical duty to prioritize security and consider societal implications of LLM-based tools.
    • Ongoing and proactive security measures are crucial for harm mitigation and fostering trust in their use.

The future

As LLMs gain prominence, the security landscape for these models becomes intertwined with their growth. With increased adoption across industries, the potential impact of successful attacks on LLMs amplifies. As LLMs evolve beyond language tasks, the risks associated with them escalate, particularly if security measures fail to keep pace. The integration of LLMs into decision-making systems heightens the urgency for robust security protocols.

The nature of attacks on LLMs may also shift, moving from disruption to subtle manipulation and control. Attackers may seek to influence an LLM's behaviour for specific goals rather than outright sabotage. This could involve carefully crafting inputs to steer the LLM toward generating content that grants unauthorised access to internal systems, presenting a more sophisticated and challenging threat.

The future of LLM security therefore holds both challenges and opportunities. The increasing sophistication of LLMs and their expanding use cases demand robust security measures; at the same time, advances in AI security research and growing awareness of security concerns give reason for optimism. As we embrace this transformative technology, we must treat security as a core pillar, integrating it into every stage of LLM development and deployment, and foster collaboration among researchers, developers, policymakers, and users.

To that end, stakeholders share several key responsibilities. Developers must commit to responsible development, continuously testing for vulnerabilities and designing with security in mind. Users must be aware of LLM limitations and the potential for manipulation, maintaining healthy scepticism and reporting suspicious behaviour. Researchers must continue to push the boundaries of LLM security, exploring new techniques and frameworks to stay ahead of evolving threats. By encouraging ongoing research, responsible development, and a collaborative approach, we can unlock the immense potential of LLMs while mitigating risks and building trust, and collectively shape a future where LLMs empower humanity in safe and ethical ways.