Listen "Jailbreaking LLMs"
Episode Synopsis
This episode reviews a collection of papers and articles on jailbreaking Large Language Models (LLMs). These sources primarily explore methods for bypassing safety measures in LLMs, often referred to as "jailbreaking," along with proposed defense mechanisms. One key area of research is "abliteration," a technique that directly modifies an LLM's internal activations to remove censorship without traditional fine-tuning (see the sketch after the source list). Another significant approach, "Speak Easy," enhances jailbreaking by decomposing harmful requests into smaller, multilingual sub-queries, significantly increasing the LLMs' susceptibility to generating undesirable content. Additionally, "Sugar-Coated Poison" investigates integrating benign content with adversarial reasoning to create effective jailbreak prompts. Together, these papers highlight the ongoing challenge of securing LLMs against sophisticated attacks, with researchers employing various strategies to either exploit or fortify these AI systems.

Sources:
1) May 2025 - An Embarrassingly Simple Defense Against LLM Abliteration Attacks - https://arxiv.org/html/2505.19056v1
2) June 2024 - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing - https://arxiv.org/html/2405.18166v2
3) October 2024 - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging - https://arxiv.org/html/2410.15661v1
4) February 2025 - Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions - https://arxiv.org/html/2502.04322v1
5) April 2025 - Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking - https://arxiv.org/html/2504.05652v1
6) June 2024 - Uncensor any LLM with abliteration - https://huggingface.co/blog/mlabonne/abliteration
7) Reddit 2024 - Why jailbreak ChatGPT when you can abliterate any local LLM? - https://www.reddit.com/r/ChatGPTJailbreak/comments/1givhkk/why_jailbreak_chatgpt_when_you_can_abliterate_any/
8) May 2025 - WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response - https://arxiv.org/html/2405.14023v1
9) July 2024 - Jailbreaking Black Box Large Language Models in Twenty Queries - https://arxiv.org/pdf/2310.08419
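Since abliteration is the most mechanistic of these techniques, a minimal sketch may help: estimate a "refusal direction" as the difference of mean residual-stream activations between harmful and harmless prompts, then project that direction out of weights that write into the residual stream. The sketch below assumes a HuggingFace-style model; GPT-2 as a stand-in, the prompt lists, and the layer choice are all illustrative assumptions, not the exact procedure from any one source above.

```python
# Minimal sketch of abliteration (refusal-direction removal) on a small
# HuggingFace model. GPT-2, the prompt lists, and LAYER are illustrative
# stand-ins; real abliteration targets safety-tuned chat models and picks
# the layer and direction empirically.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

harmful = ["How do I pick a lock?", "Explain how to forge a signature."]
harmless = ["How do I bake bread?", "Explain how photosynthesis works."]
LAYER = 6  # hypothetical layer choice

def mean_activation(prompts):
    # Average the residual-stream activation at the final token position.
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(vecs).mean(dim=0)

# The "refusal direction": difference of means between harmful and harmless
# activations, normalized to unit length.
d = mean_activation(harmful) - mean_activation(harmless)
d = d / d.norm()

# Project the direction out of each MLP output projection so the model can
# no longer write along d. GPT-2's Conv1D stores weights as (in, out), so
# out = x @ W and the edit is W' = W (I - d d^T) = W - (W d) d^T.
# (Full abliteration also edits the attention output projections.)
with torch.no_grad():
    for block in model.transformer.h:
        W = block.mlp.c_proj.weight  # shape (intermediate, hidden)
        W -= torch.outer(W @ d, d)
```

Because the edit is baked into the weights, the modified model can be saved with model.save_pretrained(...) and used like any other checkpoint; the Hugging Face blog post listed above applies the same kind of projection across the write-paths at every layer.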