arXiv Cluster Highlights

• Steerable Visual Representations
  Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest.

• Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
  Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability.

• FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
  Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision.

• Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
  Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction.

• Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
  Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy violating content.

• SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
  Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT).

• SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
  Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability.


