Round Embeddings, Smarter Positional Encodings, and Attention as a Kernel
Why representation learning needs geometric order: make embeddings easy to use, keep the geometry that matters, and reason about attention as a kernel.
LeJEPA — force embeddings to look like a nice round cloud
Many modern self-supervised methods try to make two views of the same input produce embeddings that can predict each other. LeJEPA gives a clean theory and a practical fix for those embeddings: aim for an isotropic Gaussian — think of the embedding cloud as a spherical ball, not a squashed or lopsided blob.
- Why a sphere? If embeddings are zero-mean with identity covariance (an isotropic Gaussian), then no direction is artificially more important than another. That makes simple predictors (like linear classifiers) work well, avoids collapse, and provably minimizes certain prediction errors under JEPA-style objectives.
- How to get there cheaply: Sketched Isotropic Gaussian Regularization (SIGReg) estimates covariance with a small random projection ("sketch") and penalizes deviations from identity. Sketching keeps memory and compute costs linear, so the overhead stays tiny (a simplified illustration follows this list).
- Practical wins: a single regularization hyperparameter, no teacher–student or stop-gradient tricks, stable across architectures (ResNets, ViTs, ConvNets) and datasets, and simple distributed implementation (~50 lines). Empirically strong results (example: ViT-H/14 reaching ~79% linear accuracy on ImageNet-1k pretrained with LeJEPA).
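To make the sketching idea concrete, here is a minimal, illustrative penalty in PyTorch. It is not the paper's exact SIGReg objective: it only matches the first two moments (mean and variance) of random 1D projections against a standard Gaussian, and the function name and `num_slices` parameter are placeholders, but it shows why random projections keep the cost linear in the embedding dimension.

```python
import torch

def sketched_isotropy_penalty(z: torch.Tensor, num_slices: int = 64) -> torch.Tensor:
    """Illustrative SIGReg-style penalty (moment matching only, not the paper's exact test).

    Projects embeddings onto random unit directions and penalizes each 1D
    projection for deviating from zero mean / unit variance, which is what
    an isotropic Gaussian would give along every direction.
    """
    n, d = z.shape
    # Random "sketch" directions on the unit sphere, resampled every call.
    dirs = torch.randn(d, num_slices, device=z.device, dtype=z.dtype)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)
    proj = z @ dirs                              # (n, num_slices) 1D projections
    mean = proj.mean(dim=0)                      # ~0 under isotropy
    var = proj.var(dim=0, unbiased=False)        # ~1 under isotropy
    return (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

# Usage: add to a JEPA-style prediction loss with a single weight, e.g.
# loss = prediction_loss + lambda_reg * sketched_isotropy_penalty(embeddings)
```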
Positional encodings — you can be relative, or you can be cleverer
Transformers need to know "where" tokens or pixels are. Rotary Positional Encoding (RoPE) rotates embedding coordinates to encode relative positions; it became popular because it keeps attention sensitive to relative offsets.
- RoPE is mathematically neat in 1D: for sequences, RoPE is one of the most general ways to get relative-equivariant position embeddings, meaning attention scores depend only on the relative offset between tokens (a minimal 1D implementation follows this list).
- Mixed RoPE extends this to higher dimensions — but only if rotation generators commute. Commutative generators behave like addition (order doesn’t matter). For many real rotations, order does matter, so Mixed RoPE forces a restrictive assumption.
- Spherical RoPE relaxes that restriction. It lets the underlying rotation generators be non-commutative (order matters). Surprisingly, in vision tasks Spherical RoPE performs as well as or better than strictly equivariant versions, suggesting strict relative equivariance is not always necessary for visual data.
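For reference, here is a minimal 1D RoPE sketch in PyTorch, assuming the standard pairwise-rotation formulation; the function name and the `base` default are illustrative. Mixed and Spherical RoPE differ in how per-axis rotations are combined for multi-dimensional positions, which this 1D sketch does not cover.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal 1D RoPE: rotate consecutive coordinate pairs by position-dependent angles.

    x: (..., seq, dim) with even dim; pos: (seq,) positions.
    The dot product <rope_1d(q, m), rope_1d(k, n)> then depends only on m - n,
    which is the relative-equivariance property discussed above.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = pos.to(x.dtype)[:, None] * freqs[None, :]      # (seq, half) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                     # split dims into pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                    # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```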
Point clouds — measure direction cleverly, remain rotation-invariant
3D point clouds challenge models because arbitrary rotations change coordinates but not the object. DiPVNet attacks this by building features that are both direction-aware and rotation-invariant.
- Local idea (L2DP operator): compute learnable dot-products between a center point and its neighbors to capture directional selectivity while keeping the result invariant to rigid rotations of the whole cloud (a toy example of why dot products give this invariance follows this list).
- Global idea (DASFT): projecting the cloud onto many spherical sampling directions is equivalent to a direction-aware spherical Fourier transform, a compact global signature that doesn't change when you rotate the object.
- Why it helps: dot products naturally measure direction; alignment with the sampled directions gives a stable, multiscale descriptor that is provably rotation-invariant and robust to noise and large rotations. This yields state-of-the-art results on point-cloud classification and segmentation benchmarks.
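The rotation-invariance argument is easy to verify in code. The toy function below is not DiPVNet's L2DP operator (which adds learnable weighting); it simply computes the Gram matrix of relative offsets around each center, which encodes how neighbors are oriented relative to one another yet is unchanged by any global rotation.

```python
import torch

def local_dot_product_features(centers: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """Toy rotation-invariant local features via dot products (not the exact L2DP operator).

    centers:   (n, 3) center points
    neighbors: (n, k, 3) k neighbors per center
    Returns (n, k, k) Gram matrices of relative offsets. A global rotation R
    maps every offset u to Ru, and <Ru, Rv> = <u, v>, so the features are
    unchanged when the whole cloud is rotated.
    """
    offsets = neighbors - centers[:, None, :]        # (n, k, 3) neighbor offsets
    return offsets @ offsets.transpose(1, 2)         # (n, k, k) pairwise dot products

# Sanity check: rotating every point by the same orthogonal R leaves the features unchanged.
# R, _ = torch.linalg.qr(torch.randn(3, 3))
# assert torch.allclose(local_dot_product_features(centers @ R.T, neighbors @ R.T),
#                       local_dot_product_features(centers, neighbors), atol=1e-5)
```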
Transformers as kernels on continuous fields — a unifying viewpoint
Instead of treating positions as discrete indices, view token or pixel features as samples of a function on a continuous manifold (a line for text, a plane for images, a sphere for 3D data). From this view:
- Positional encodings become part of the function embedding on the manifold.
- Attention is a kernel integral operator: it mixes values using a position-dependent kernel (the attention weights), so designing attention amounts to designing a kernel (a minimal sketch follows this list).
- This field-theoretic view connects the math behind positional encodings, attention, and classical kernel theory — which helps explain why certain positional encodings (like RoPE variants) interact better with attention and how geometric priors should be encoded.
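A tiny sketch makes the kernel view concrete. The example below substitutes a hypothetical Gaussian kernel over continuous positions for learned queries and keys, purely to show that the attention output is a normalized, position-dependent weighting of the values; replacing the kernel with exp(<q_i, k_j> / sqrt(d)) recovers standard softmax attention. The function name and `lengthscale` parameter are illustrative.

```python
import torch

def kernel_attention(values: torch.Tensor, positions: torch.Tensor,
                     lengthscale: float = 1.0) -> torch.Tensor:
    """Attention written as a (discretized) kernel integral operator.

    values:    (n, d) value vectors living at continuous positions
    positions: (n, p) coordinates on the manifold (line, plane, sphere chart, ...)
    Output_i = sum_j K(x_i, x_j) v_j / sum_j K(x_i, x_j), i.e. a normalized
    kernel smoother; softmax attention has the same form with
    K(x_i, x_j) = exp(<q_i, k_j> / sqrt(d)), so positional encodings shape K.
    """
    sq_dist = torch.cdist(positions, positions) ** 2          # (n, n) pairwise squared distances
    kernel = torch.exp(-sq_dist / (2.0 * lengthscale ** 2))   # Gaussian kernel = locality prior
    weights = kernel / kernel.sum(dim=-1, keepdim=True)       # row-normalize, like softmax
    return weights @ values                                    # kernel "integral" as a finite sum
```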
How these ideas fit together in plain terms
- Make the embedding space well-behaved (isotropic) so downstream linear tasks are easy and training is stable — LeJEPA gives a provable target and an efficient way to nudge embeddings there.
- Choose positional encodings to match the geometry of your data — but don’t assume strict relative equivariance is always needed. Non-commutative constructions (Spherical RoPE) can work better for images.
- For 3D data prefer direction-aware features that aggregate dot-products across directions; harmonics and spherical transforms give compact, rotation-invariant signatures (DiPVNet style).
- Think of attention as a kernel: positional encodings shape that kernel. Use this to reason about what positional design and feature distributions you want.
Actionable tips for engineers (simple, low-friction)
- Prefer distributional regularization: encourage embeddings to be roughly isotropic with a small penalty rather than complex tricks (stop-grad, momentum teachers). Sketching keeps this cheap.
- Try non-strict positional encodings for vision: don’t be married to relative-equivariance; test Spherical or mixed encodings for better generalization and speed trade-offs.
- For 3D tasks: use dot-product-based local operators and spherical sampling—they give rotation robustness without huge overhead.
- When tuning models: favor methods with fewer hyperparameters (LeJEPA style) — easier to scale and reproduce across datasets and architectures.
- Design attention like a kernel: if you want locality, smoothness, or scale invariance, pick positional encodings and attention forms that produce the corresponding kernel behavior.
Bottom line: make embeddings geometrically sensible (a "round" distribution), encode positions in a way that matches your data’s geometry (and don’t over-constrain equivariance), and treat attention as kernel design — these simple, theory-backed moves give more stable, efficient, and robust models.
Quick glossary
- Isotropic Gaussian: zero mean, identity covariance — the embedding cloud looks like a sphere.
- Sketching: using small random projections to estimate large matrices cheaply.
- Equivariance vs invariance: equivariant outputs change predictably under input transforms; invariant outputs don’t change.
- RoPE / Mixed / Spherical RoPE: families of positional encodings based on rotating embedding coordinates; they differ in how they handle multi-dimensional positions and whether rotations must commute.
- DASFT: direction-aware spherical Fourier transform — a global, rotation-invariant descriptor for point clouds.
These advances share a pattern: pick simple mathematical targets for representations (isotropy, rotation invariance, kernel shapes) and implement them with light, efficient, and provable tools (SIGReg/sketching, Spherical RoPE, dot-product + spherical sampling). That combination tends to beat brittle engineering hacks while being easier to scale and reason about.
Related Papers
- arXiv:2511.08544v1: Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and th…
- arXiv:2511.08368v1: Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing spurring recent progress towards generalizing RoPE to highe…
- arXiv:2511.08240v1: Point cloud processing has become a cornerstone technology in many 3D vision tasks. However, arbitrary rotations introduce variations in point cloud orientations, posing a long-standing challenge for …
- arXiv:2511.08243v1: The Transformer architecture has achieved tremendous success in natural language processing, computer vision, and scientific computing through its self-attention mechanism. However, its core component…
