Disentangle Light and Color: Content-Aware, Text-Guided Retouching with LHSI & log-RGB
The color representation you choose, and where you put the “knobs”, determine how well AI can edit, correct, or synthesize images. Choices like the color space, whether edits are applied uniformly or locally, and how user preferences are given (numbers vs words) change both quality and control.
Why this matters (plain language): picture an image as a stage. The actors are shapes, the costumes are colors, and the stage lights are illumination. Some tools recolor everyone the same way (a blanket dye). Better tools let you change costumes differently for each actor depending on where they stand or how the light hits them. Picking the right color math (color space) and the right control signals (local masks or text labels) is the difference between a plasticky edit and something that looks natural.
Content-adaptive retouching (CA-ATP): instead of applying one color change everywhere, CA-ATP uses a small palette of preset color mappings called basis curves and learns a matching weight map for each curve. Think: several recipe cards (basis curves) and a stencil (weight maps) that tells the model how much of each recipe to use at each pixel. That makes the system:
- Content-aware: the same pixel color can be transformed differently depending on whether it sits in the sky or on skin.
- Controllable: an attribute-text module turns style words (e.g., warm, punchy, cinematic) into embeddings that steer the retouching so the result matches user intent.
- Practical: combines visual features and text features via a multimodal model so you can guide edits with simple descriptive language instead of fiddly sliders.
What’s new under the hood: basis curves + pixel-wise weight maps = spatially varying curve mapping; attribute-to-text prediction = explicit, human-friendly guidance; integrating them gives adaptive, user-steerable retouching that beats uniform pixel-wise mappings on benchmarks.
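The basis-curve idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the CA-ATP implementation: the function name `apply_basis_curves`, the sampled-curve representation, and the hand-set weight maps are all assumptions made for the example. In the method described above, the weight maps would be predicted by a network and steered by text embeddings.

```python
import numpy as np

def apply_basis_curves(image, curves, weights):
    """Blend K tone curves with per-pixel weights (illustrative sketch).

    image:   (H, W, 3) float array with values in [0, 1]
    curves:  (K, N) array; row i samples curve i at N evenly spaced inputs in [0, 1]
    weights: (K, H, W) non-negative per-pixel weights, one map per curve
    """
    k, n = curves.shape
    grid = np.linspace(0.0, 1.0, n)
    # Normalize so the weights at each pixel sum to 1.
    total = np.clip(weights.sum(axis=0, keepdims=True), 1e-8, None)
    w = weights / total
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(k):
        # Evaluate curve i at every pixel value by linear interpolation.
        mapped = np.interp(image, grid, curves[i])
        # Weight the mapped image by this curve's spatial map.
        out += w[i][..., None] * mapped
    return out

# Two hypothetical basis curves: identity and a brightening square-root curve.
grid = np.linspace(0.0, 1.0, 257)
curves = np.stack([grid, np.sqrt(grid)])

# A flat image: every pixel has the same color.
image = np.full((4, 4, 3), 0.25)

# Hand-set weights: identity on the left half, brightening on the right half.
weights = np.zeros((2, 4, 4))
weights[0, :, :2] = 1.0
weights[1, :, 2:] = 1.0

result = apply_basis_curves(image, curves, weights)
```

Every pixel starts at 0.25, yet the left half stays unchanged while the right half is brightened. That is the content-aware behavior a single global curve cannot produce.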
How generative models encode color (Stable Diffusion example): internal compressed representations (latents) aren’t random blobs — they organize things. Analyses show color tends to live on circular, opponent-like axes (picture the color wheel), mainly in certain latent channels (called c_3 and c_4), while brightness and shape sit in other channels (c_1 and c_2). Practical consequences:
- Targeted edits: you can nudge color without wrecking shape by operating on the color channels.
- Interpretable knobs: principal component analysis (PCA) finds main directions of color variation, so editing becomes more predictable.
- Design hint: models naturally learn efficient, opponent-like color encodings similar to human vision — useful when designing editing tools or disentangled representations.
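The bullets above suggest a simple experiment, sketched here on a synthetic array. The assumptions are loud ones: the latent is taken to have shape (4, H, W), the color channels c_3 and c_4 are taken to be zero-based indices 2 and 3, and rotating that plane is used as a stand-in for a hue-like shift. Whether a given rotation produces a clean hue change in a real model has to be checked by decoding the edited latent, which this sketch does not do.

```python
import numpy as np

def rotate_color_channels(latent, angle_rad, color_idx=(2, 3)):
    """Rotate the plane spanned by two latent channels; leave the others untouched."""
    a, b = color_idx
    out = latent.copy()
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    out[a] = c * latent[a] - s * latent[b]
    out[b] = s * latent[a] + c * latent[b]
    return out

def principal_directions(latent, color_idx=(2, 3)):
    """PCA over the selected channels: returns (components, explained variances)."""
    idx = list(color_idx)
    x = latent[idx].reshape(len(idx), -1).T        # (H*W, number of channels)
    x = x - x.mean(axis=0, keepdims=True)          # center before PCA
    _, sing, vt = np.linalg.svd(x, full_matrices=False)
    variances = sing ** 2 / (x.shape[0] - 1)
    return vt, variances

# Synthetic stand-in for a diffusion latent (not produced by a real model).
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8, 8))

edited = rotate_color_channels(latent, np.pi / 2)
components, variances = principal_directions(latent)
```

Because only the two assumed color channels are modified, the channels assumed to carry brightness and shape are bit-for-bit identical before and after the edit.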
White balance (WB) and the need for better color spaces: many post-ISP fixes happen in sRGB because the raw camera files are often unavailable. sRGB applies a nonlinear transfer curve (gamma) and spreads brightness and chroma across all three channels, which makes it hard to correct color casts robustly. A perception-inspired Learnable HSI (LHSI) color space treats color as a cylinder: hue is the angle, saturation the radius, and intensity the height. LHSI adds learnable parameters to strengthen the separation between lightness and chroma, and a specialized Mamba-based network is trained in that space for WB correction. Result: more reliable color recovery when RAW is unavailable, especially under tricky lighting.
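To make the cylinder picture concrete, here is a fixed (non-learnable) cylindrical transform in NumPy. It is a sketch under stated assumptions, not the LHSI definition: the function names are invented for the example, the radius is plain chroma rather than textbook HSI saturation so that the transform inverts exactly, and the learnable parameters mentioned above are left out because the text does not specify them.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def rgb_to_cylinder(rgb):
    """Map RGB (..., 3) to (hue angle, chroma radius, intensity height)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    # Two opponent-style chroma axes that vanish for neutral grays.
    alpha = r - 0.5 * (g + b)
    beta = 0.5 * SQRT3 * (g - b)
    hue = np.arctan2(beta, alpha)
    chroma = np.hypot(alpha, beta)
    return hue, chroma, intensity

def cylinder_to_rgb(hue, chroma, intensity):
    """Exact inverse of rgb_to_cylinder."""
    alpha = chroma * np.cos(hue)
    beta = chroma * np.sin(hue)
    r = intensity + (2.0 / 3.0) * alpha
    g = intensity - alpha / 3.0 + beta / SQRT3
    b = intensity - alpha / 3.0 - beta / SQRT3
    return np.stack([r, g, b], axis=-1)

# A neutral gray patch has zero chroma; a color cast shows up as a nonzero radius.
gray = np.full((2, 2, 3), 0.4)
cast = gray * np.array([1.2, 1.0, 0.8])   # hypothetical warm cast

_, chroma_gray, _ = rgb_to_cylinder(gray)
hue_cast, chroma_cast, intensity_cast = rgb_to_cylinder(cast)

# Toy correction: shrink the radius to zero while keeping the height.
neutralized = cylinder_to_rgb(hue_cast, np.zeros_like(chroma_cast), intensity_cast)
```

In this representation a color cast lives in the angle and radius while brightness lives in the height, so a correction can act on chroma without touching intensity. Real white balance is of course not just "set chroma to zero"; the toy correction only shows why the separation is convenient.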
Why log RGB helps for view synthesis (NeRF): illumination acts multiplicatively, so an observed pixel value is roughly the surface reflectance times the light intensity. Taking logs turns that multiplication into addition, which makes it easier for networks to separate light from material. Experiments training NeRFs in log-RGB space show:
- Better rendering quality across scenes, especially in low light.
- More compact, stable representations across network sizes and variants.
- Improved robustness with input images of the same bit depth (no extra data needed).
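The multiplication-to-addition argument is easy to check numerically. The sketch below assumes the input is already linear RGB and uses a hypothetical clamp `eps` to keep the logarithm finite at zero; neither the function names nor the clamp value come from the NeRF work itself.

```python
import numpy as np

def to_log_rgb(linear_rgb, eps=1e-4):
    """Log-encode linear RGB; values below eps are clamped so log(0) never occurs."""
    return np.log(np.maximum(linear_rgb, eps))

def from_log_rgb(log_rgb):
    """Invert the log encoding (exact for values that were not clamped)."""
    return np.exp(log_rgb)

# Hypothetical surface reflectance, seen under a bright and a dim light.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.05, 1.0, size=(4, 4, 3))
bright = reflectance * 1.0
dim = reflectance * 0.1

# In log space the two renderings differ by one constant: log(1.0 / 0.1).
offset = to_log_rgb(bright) - to_log_rgb(dim)
```

Every entry of `offset` equals log(10), regardless of the underlying reflectance. A tenfold change in lighting becomes a single additive shift, which is a much simpler pattern for a network to factor out than a per-pixel scaling.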
Big picture theme: whether you’re retouching a photo, correcting white balance, or teaching a renderer to reproduce a scene, two design ideas repeatedly help:
- Separate light from color: pick or learn a color space (log, linear, cylindrical HSI/LHSI) that disentangles illumination and chromaticity.
- Make adjustments content-aware and controllable: use local weight maps, basis mappings, or text-guided embeddings so similar pixels aren’t forced into identical edits.
Practical takeaways (for non-experts):
- If you want natural retouching: prefer content-aware curve mappings plus text-style controls instead of a single global curve.
- If white balance looks off and you only have sRGB files: methods built around HSI-like spaces (LHSI) are more robust than trying to fix sRGB directly.
- If training scene representation models (NeRF): try training in log-RGB or linear spaces rather than sRGB to make illumination/reflectance easier to learn.
- If you want to edit generated images: look for latent channels that align with color axes — changing those often edits color without distorting shape.
Quick how-to recipes:
- Make an auto-retouch feature: build a small bank of basis curves and a lightweight network that predicts pixel-wise weights; add a text encoder so users can type "warm" or "moody."
- Fix white balance on JPEGs: convert to a cylindrical HSI-like space, apply a learnable chroma/luminance separation, then map back to sRGB.
- Train a view-synthesis model for low light: convert the training images to log-RGB and supervise the model in that space, which turns multiplicative lighting effects into additive ones and improves stability.
- Edit color in diffusion outputs: analyze latent channels (PCA helps), find the ones aligned with color, and modify them while keeping others intact.
Cheat sheet: separate light from color (use log or HSI-like spaces), edit with local, content-aware mappings, and let users steer style with simple text attributes — that combination keeps edits both powerful and predictable.
Caveats and open problems: disentanglement isn’t perfect — editing one axis may still affect others. Log transforms require careful handling of zeros and noise. Text-guided style depends on the multimodal model’s vocabulary and training data. Still, these approaches provide stronger, more intuitive control than older uniform methods.
Want the code or further reading? The LHSI white-balance project provides a public implementation (see its GitHub for experiments and code): https://github.com/YangCheng58/WB_Color_Space.
Related Papers
- arXiv:2512.09580v1
Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images,…
- arXiv:2512.09477v1
Recent advances in diffusion-based generative models have achieved remarkable visual fidelity, yet a detailed understanding of how specific perceptual attributes - such as color and shape - are intern…
- arXiv:2512.09383v1
White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene's true colors. Currently, sRGB-based WB e…
- arXiv:2512.09375v1
Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which…
