Jailbreaks, Collaboration, and Cognitive Shifts

31/07/2025 1h 2min

Episode Synopsis

Generated by Google NotebookLM. This episode explores 15 new research papers at the edge of LLM behavior, safety, collaboration, and reasoning:

Beyond passive replies – CollabLLM rethinks how LLMs interact across turns, training them to uncover user intent and proactively collaborate.

Red teaming, automated – RedCoder weaponizes multi-turn attacks against code models, training autonomous agents to probe for unsafe generations.

Synthesis by simulation – CodeEvo builds training data by pairing coder and reviewer agents in feedback loops, automating high-quality instruction-code generation.

Internal deception – Linear probes and SAEs reveal how truthful features flip when models are prompted to lie.

Defense by deflection – SDeflection avoids refusal and instead rewrites malicious prompts into innocuous replies, lowering jailbreak success without hurting helpfulness.

Attack by persona – A genetic algorithm crafts persona prompts that reduce refusal rates and supercharge jailbreaks, especially when stacked with other methods.

Agents with evolving maps – CoEx lets planning agents continually revise their world models, co-adapting structure and strategy over time.

Interfaces for oversight – Magentic-UI powers human-in-the-loop agentic systems with long-term memory, action guards, and collaborative controls.

Measuring long-context reasoning – NeedleChain moves past “needle-in-a-haystack” with tasks that require full semantic integration across long input windows.

Bias as an exploit – CognitiveAttack uncovers how stacking psychological biases in prompts dramatically increases LLM jailbreak success.

Patching with logic – RePaCA guides LLMs to assess bug fixes using chain-of-thought, boosting accuracy and explainability in patch correctness tasks.

Federated fine-tuning at scale – H2Tune handles architectural and task diversity across clients with a novel decomposition and disentanglement scheme.

Multimodal mastery – MoCHA uses sparse MoE connectors and hierarchical attention to align vision with language and reduce hallucinations.

Where demos belong – A detailed analysis of demo position bias finds that demonstration ordering in prompts drastically alters LLM accuracy and stability.

Together, these papers uncover the subtle mechanics that shape LLM trustworthiness, the strategies that make or break jailbreak defenses, and the design patterns emerging in agentic interfaces and federated learning.

Sources:

CollabLLM: arXiv:2406.04425

RedCoder: arXiv:2407.00482

CodeEvo: arXiv:2407.00483

When Truthful Representations Flip Under Deceptive Instructions: arXiv:2407.00495

Strategic Deflection: arXiv:2407.00496

Enhancing Jailbreak Attacks via Persona Prompts: arXiv:2407.00499

CoEx: arXiv:2407.00508

Magentic-UI: arXiv:2407.00510

NeedleChain: arXiv:2407.00518

CognitiveAttack: arXiv:2407.00519

RePaCA: arXiv:2407.00523

H2Tune: arXiv:2407.00529

MoCHA: arXiv:2407.00530

Where to show Demos in Your Prompt: arXiv:2407.00533

More episodes of the podcast Today in arXiv AI

Cognition, Contracts, and Compression 06/08/2025

Architectures, Attacks, and Autonomy 06/08/2025

Planning Agents, Emotional Bias, and Trustworthy Responses 30/07/2025

Factuality, Alignment, and Edge Efficiency 30/07/2025

Safety, Evaluation, and Reasoning 25/07/2025

The AI Frontier: Confronting Hallucinations, Deepening Reasoning, and Building Trust 24/07/2025

