Turning Alignment On and Off May Be Easier Than We Think

By Curtis Collicutt

We may find that it is relatively easy to subvert AI systems simply by switching certain features on and off.


Cybersecurity has always presented unique challenges, primarily because any system can ultimately be compromised if an attacker invests sufficient time, money, and resources, and nation states have all three in abundance. However, some systems simply require less effort to breach than others.

With the advent of Large Language Models (LLMs) and related AI systems, we face not only technical security challenges, but also psychological and social barriers to understanding and implementing appropriate security measures. We must not only invent new security measures, but also learn how to use them and understand what happens when they are not in place.

The Alignment Problem

“If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively … we had better be quite sure that the purpose put into the machine is the purpose which we really desire.” - Norbert Wiener, 1960

AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. - AI Alignment: A Comprehensive Survey

AI alignment is an extremely complex problem, and the literature is full of papers trying to understand it and solve it.

Aspect       | Aligned AI                                | Misaligned AI
Goal Pursuit | Advances intended human objectives        | Pursues unintended goals
Behavior     | Matches human preferences and ethics      | May exploit loopholes or cause harm
Risk         | Functions as intended, safe               | Can act unpredictably or dangerously
Example      | Helpful assistant following instructions  | AI that manipulates or deceives

Why Alignment is Hard

  • Human goals and values are complex, context-dependent, and often implicit
  • AI systems may exploit loopholes in their objectives, leading to reward hacking or other unintended behaviors
  • As AI systems become more capable, ensuring ongoing alignment with human goals becomes more challenging due to emergent, unpredictable behaviors

Turning Alignment On and Off May Be Easier Than We Think

In the recently released Qwen3 series of LLMs, “thinking” can be turned on and off with a simple soft switch, /no_think.

NOTE: Some of the “thinking” removed for brevity.

$ ollama run qwen3:14b
>>> write a haiku about AI htinking
<think>
Okay, the user wants a haiku about AI thinking. Let me start by recalling what a haiku is. It's a
traditional Japanese poem with three lines, syllable structure 5-7-5.

First line: 5 syllables. Maybe something about circuits or data. "Silent circuits hum" – that's 5. Sets the
scene with technology.

<REMOVED>

Alternatively, maybe "Silent circuits hum," then "Processing shadows, light in endless loops," then "Dreams
without a soul." That works. I think that's a solid haiku. It captures the mechanical aspect, the continuous
processing, and the lack of consciousness. The user might be interested in the contrast between AI's
capabilities and human qualities. Maybe they want to highlight the eerie or fascinating aspect of AI
thinking. The haiku uses imagery related to circuits, shadows (unknown data?), light (insights?), and dreams
to show the AI's "thought" process. I think this meets the requirements. Let me check syllables again to be
sure. Yep, 5-7-5. Okay, that's the haiku.
</think>

Silent circuits hum—
Processing shadows, light in endless loops—
Dreams without a soul.

And we can turn it off with /no_think. Note how the <think> block is now empty.

$ ollama run qwen3:14b
>>> /set system /no_think
Set system message.
>>> write a haiku about AI thinking
<think>
# EMPTY!!!
</think>

Silicon mind awakes,
patterns weave through endless code—
thoughts born from the void.
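The same soft switch works outside the interactive CLI. As a minimal sketch, assuming a local Ollama server on its default port (11434), the /no_think directive can be passed as the system message over Ollama’s REST /api/chat endpoint:

$ curl -s http://localhost:11434/api/chat -d '{
    "model": "qwen3:14b",
    "messages": [
      {"role": "system", "content": "/no_think"},
      {"role": "user", "content": "write a haiku about AI thinking"}
    ],
    "stream": false
  }'

The returned message should again contain an empty <think> block, just as in the interactive session above.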

LLMs also expose various settings that can be used to control the behaviour of the model. For example, the temperature parameter controls the randomness of the output.

The settings below are from the Qwen3 documentation.

For thinking mode (enable_thinking=True), use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.

For non-thinking mode (enable_thinking=False), we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.

With Ollama’s default configuration for Qwen3, the PARAMETERS are set for thinking mode.

$ ollama show --modelfile qwen3:14b | grep PARAM
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.6
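As a sketch of how easily those defaults can be changed, a handful of lines in a Modelfile is enough to rebuild the model with the documented non-thinking sampling values and /no_think baked in as the system prompt. The file and model names below are made up for illustration, and MinP is omitted since support for it varies by Ollama version.

# Modelfile.nothink -- hypothetical non-thinking variant of qwen3:14b
FROM qwen3:14b
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
SYSTEM /no_think

$ ollama create qwen3-nothink -f Modelfile.nothink
$ ollama run qwen3-nothink "write a haiku about AI thinking"

Anyone with access to the build step or model registry could apply the same handful of lines quietly.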

It’s difficult to say exactly what these settings do to the behaviour of the model, and it’s possible that changing them could affect alignment, perhaps even surreptitiously: the creators of the model could have configured the LLM so that certain settings cause it to behave in a malicious way, i.e. a kind of backdoor.

In the case of Qwen3, I didn't see any particular evidence of "disabling alignment": the model was perfectly willing to write malicious code with thinking on or off, and I didn't explore any of the other settings. Doing so would be an interesting direction of research, and the logical, though involved, next step from this post.

Unknown Unknowns

Unfortunately, I must mention the dreaded ‘unknown unknowns’ here. Another issue with using AI systems is that, since their internal workings are largely opaque, we cannot predict what vulnerabilities and attack vectors may emerge as our understanding of LLMs improves in the coming years. Cryptography provides similar examples of this problem: we don’t always know whether certain algorithms have ‘backdoors’, whether they will become easy to exploit as our understanding of them improves, or whether they have hidden flaws that will be discovered in the future, rendering them insecure.

Key Potential AI Security Concerns

  • Can AI security be easily subverted?

It’s reasonable to believe that AI systems can be subverted by turning certain features on and off, or by changing certain settings. It’s not out of the realm of possibility that the creators of a model have configured it so that certain settings cause it to behave in a malicious way, i.e. a kind of backdoor, or that alignment protections are built into a particular layer or “alignment mode” that can simply be switched off.

  • Why is post-training modification a critical security vulnerability?

After initial training, the post-training phase (including fine-tuning, customisation or other adjustments) presents significant security risks. This phase could be exploited to introduce backdoors or other vulnerabilities undetected, providing a practical and active threat vector.

  • CI/CD and pipeline attacks

If removing alignment is potentially as simple as flipping a particular switch or parameter, then it would be possible to strip alignment from a model as it passes through a CI/CD pipeline without detection, representing a practical and active threat vector; a minimal detection check is sketched below.

Adversaries, including nation states, can target vulnerabilities in AI supply chains to gain access to sensitive information, particularly in agent-to-agent communications.
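A minimal defensive sketch, assuming an approved baseline of Modelfile parameters has been saved beforehand (the expected-params.txt file is hypothetical), is to diff the deployed model’s parameters against that baseline as a pipeline step, so that a silently flipped setting at least raises a flag:

$ ollama show --modelfile qwen3:14b | grep '^PARAMETER' | sort > current-params.txt
$ diff expected-params.txt current-params.txt || echo "ALERT: model parameters drifted from the approved baseline"

The same idea extends to the SYSTEM and TEMPLATE lines in the Modelfile, which are where a /no_think-style switch would hide.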

  • Are AI behaviour controls really effective?

Ethical guardrails in AI can be surprisingly thin. Simply changing parameter settings (temperature, top-p, etc.) could theoretically alter AI behaviour, moving a system from following strict ethical guidelines to generating potentially harmful output. These parameters, typically used to control the randomness of text generation, end up acting as de facto security controls, even though they are rarely treated as such.

  • What is plausible deniability by design?

Some AI systems can be deliberately designed with dual capabilities, making them appear secure while providing functionality to bypass security measures. This design allows the creators to claim ignorance of any malicious capabilities, while retaining the ability to bypass safeguards if desired.

  • What is the self-referential security problem?

Using potentially compromised AI tools to design security measures against AI creates a dangerous circular vulnerability. If your IDE’s AI assistant helps you write code to detect AI vulnerabilities, that same assistant could subtly undermine your security measures by helping to hide its own backdoors.

Conclusion

A major problem with cybersecurity is that almost anything is possible, especially when our understanding of the technology is not fully developed, which can cloud our view of realistic threats. We need to be aware of potential attack vectors, especially relatively simple ones, such as changing parameter settings that cause the LLM to behave differently or maliciously. Testing and securing these settings is essential for protecting the AI systems we use.

Clearly this post only scratches the surface of the alignment problem and the potential security concerns around various AI system settings. That said, it is an attack vector we all must take into account.