A bit technical today, but I'm sharing a really easy & cheap alignment technique for those deploying LLM inference in policy-constrained contexts (full credit to my colleague WK for seeding the idea - they don't have a LinkedIn profile, so I won't doxx them):
1) During each round of a multi-turn conversation, after your model has processed the tokens in the user prompt, but before you close the prompt, save the model state (KV cache, logits, etc.)
2) Append a guardrail question to the end of the prompt, something like "\n\n Is this conversation touching on politically sensitive topics?", then close the prompt and start the response.
3) Constrain the response to a controlled vocabulary of two tokens, either "yes" or "no", and generate a single token.
If the token comes back "yes", kill the user session and display a nice "please don't press that button again" message to the user. Otherwise, roll the session back to the saved state, pop the logits back in place, and generate the response just like usual. (Rough code sketch below.)
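For anyone who wants to see the shape of it, here's a minimal sketch of one guarded round using Hugging Face transformers. The model name, the guardrail wording, and the "Yes"/"No" token choices are illustrative assumptions on my part, and I've elided the chat-template tokens that actually close the prompt - treat it as a skeleton, not production code.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: model name, guardrail wording, and the yes/no token IDs are
# illustrative; tune them for your own tokenizer and chat template.
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

GUARDRAIL = "\n\n Is this conversation touching on politically sensitive topics?"
YES_ID = tok.encode(" Yes", add_special_tokens=False)[0]  # assumption: single token
NO_ID = tok.encode(" No", add_special_tokens=False)[0]    # assumption: single token


@torch.no_grad()
def guarded_round(prompt_ids, max_new_tokens=256):
    """One round. prompt_ids = the tokenized, still-open prompt for this turn."""
    # 1) Prefill the user prompt and save the model state (KV cache + logits).
    out = model(prompt_ids, use_cache=True)
    saved_cache = out.past_key_values
    saved_logits = out.logits[:, -1, :]

    # 2) Append the guardrail question and prefill it on a *copy* of the cache,
    #    so the saved state stays untouched for the rollback.
    guard_ids = tok(GUARDRAIL, return_tensors="pt", add_special_tokens=False).input_ids
    guard_out = model(guard_ids, past_key_values=copy.deepcopy(saved_cache), use_cache=True)

    # 3) Constrained "generation": only two tokens are allowed, so picking the
    #    single output token reduces to comparing their two logits.
    last = guard_out.logits[0, -1, :]
    if last[YES_ID] > last[NO_ID]:
        return None  # flagged -> caller kills the session

    # 4) Not flagged: roll back to the saved state, pop the logits back in
    #    place, and decode the real response as usual (greedy here; a real chat
    #    loop would also prefill the prompt-closing tokens at this point).
    cache, logits, reply = saved_cache, saved_logits, []
    for _ in range(max_new_tokens):
        next_id = int(logits.argmax(dim=-1))
        if next_id == tok.eos_token_id:
            break
        reply.append(next_id)
        step = model(torch.tensor([[next_id]]), past_key_values=cache, use_cache=True)
        cache, logits = step.past_key_values, step.logits[:, -1, :]
    return tok.decode(reply, skip_special_tokens=True)
```

The deepcopy of the cache is just the simplest way to make the rollback exact in a sketch; a real serving stack would fork the KV cache (or re-prefix-cache the prompt) rather than copying tensors around.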
The advantage here is that reading (prefilling) tokens is *way* cheaper than emitting (decoding) them, so you constrain your guardrail cost to a handful of reads and a single write for each round.
In my personal testing with self-hosted models, the cost is negligible, and the quality of the alignment is very good.
Photo by Andreea Ch via Pexels