Status: Concept proposal (unvalidated).
More materials (work-in-progress): https://osf.io/85z3u/files
TL;DR
We propose a method to teach AI something like intuition:
- Know when to zoom — use more representational detail only when it’s worth the cost.
- Know where to jump — predict a better direction that only appears at higher-dimensional detail.
The approach combines Cross-Dimensional Contrastive Learning (CDCL), Dimensional Distance Profiles (DDP), and a fold-aware jump policy that triggers proactive cross-scale moves when low-dimensional geometry hints there’s a better solution “across a fold.”
Note: This is a theoretical proposal; experiments are forthcoming.
Part 1 — A human-friendly explanation
Intuition, in one picture
You spot a friend across a plaza. First comes gist (cheap, fast). If unsure, you zoom (more detail). Sometimes a tiny cue makes you jump — you just know it’s them without scanning every face.
We propose training AI to do that:
- Zoom only when cheap signals suggest it will help.
- Jump early when low-dimensional patterns indicate the problem will “untangle” at higher dimension.
Why this might help
Embeddings can reshuffle neighbors when you change dimension (e.g., 256 → 1024). Systems either recompute a lot (slow/expensive) or risk missing long-tail matches. Stabilizing geometry across dimensions and adding fold-aware jumps could keep quality while avoiding brute force.
Part 2 — How it works (still friendly)
The ladder
Use a fixed dimensional “ladder,” e.g., 128 → 256 → 512 → 1024, with a versioned basis so geometry is comparable across steps (and auditable).
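For concreteness, a minimal configuration sketch of such a ladder; the field names below are illustrative, not a fixed schema:

```python
# Illustrative ladder/basis configuration (hypothetical field names).
LADDER_CONFIG = {
    "ladder": [128, 256, 512, 1024],  # served dimensions, low to high
    "full_dim": 1024,                 # D, the full embedding dimension
    "basis_version": "v1",            # versioned basis so geometry stays auditable
}
```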
DDP: Dimensional Distance Profiles
For any pair (A, B), track distance vs. dimension. Positives should stay flat as you zoom; negatives should separate.
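A minimal sketch of how a DDP could be computed, assuming unit-norm embeddings already expressed in the shared basis (the function name and default ladder are illustrative):

```python
import numpy as np

def ddp(z_a: np.ndarray, z_b: np.ndarray, ladder=(128, 256, 512, 1024)):
    """Dimensional Distance Profile: cosine distance between two embeddings,
    evaluated on each prefix of the dimension ladder."""
    profile = []
    for d in ladder:
        a, b = z_a[:d], z_b[:d]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        profile.append(1.0 - cos)  # distance at depth d
    return profile
```

A roughly flat profile is what we want for positives; a profile that climbs as d grows indicates the pair separates under more detail.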
The zoom gate (policy)
From cheap, low-d signals — margin, stability-delta (how much rankings changed since last step), and residual headroom — estimate the expected value of zooming minus its cost (latency/compute). Zoom only if it’s worth it.
Controller (v1.5): we use a contextual bandit to pick stay/zoom from cheap signals and a compute-cost prior. A hysteresis deadband ($\tau_{\text{lo}} < \tau_{\text{hi}}$) prevents thrash. We run a lightweight pilot-zoom on ~5–10% of the shortlist to estimate the expected gain before full escalation. With conformal abstention, the system can say “I don’t know” (or escalate) when non-conformity exceeds a threshold $t_\alpha$.
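To make the deadband concrete, here is a hedged sketch of a stay/zoom gate; it substitutes a hand-written score for the trained bandit, and the thresholds and weights are placeholders:

```python
def zoom_gate(margin: float, stability_delta: float, headroom: float, cost: float,
              last_action: str = "stay", tau_lo: float = 0.15, tau_hi: float = 0.35) -> str:
    """Hysteresis gate for stay/zoom (illustrative heuristic, not the trained bandit)."""
    # Cheap proxy for expected gain: an uncertain ranking, unstable neighbors,
    # and plenty of residual energy all argue for zooming.
    expected_gain = 0.5 * (1.0 - margin) + 0.3 * stability_delta + 0.2 * headroom
    score = expected_gain - cost
    if score >= tau_hi:
        return "zoom"
    if score <= tau_lo:
        return "stay"
    return last_action  # inside the deadband: keep the previous decision to avoid thrash
```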
The fold-aware jump (proactive)
Sometimes the low-d view suggests a topological fold: rapid neighbor swaps, DDP curvature, or “meet-again” behavior. When those cues appear, predict a small residual direction in the higher-d space and jump before local search fails.
Part 3 — Show me the math (for practitioners & reviewers)
Notation
- Encoder: $f_\theta(x)\in\mathbb{R}^{D}$ outputs the full-dimensional embedding. We use the unit-norm version $z = f_\theta(x)/\lVert f_\theta(x)\rVert$.
- Fixed orthonormal basis: $B=[b_1,\dots,b_D]\in\mathbb{R}^{D\times D}$ (columns are orthonormal).
- Prefix / residual at depth $d$: $z^{(d)} = B_{1:d}^{\top} z$, $r^{(d)} = z - B_{1:d}\,z^{(d)}$. If $d=D$ then $r^{(D)}=0$.
- Optional normalized $d$-view: $\hat z^{(d)} = z^{(d)}/\lVert z^{(d)}\rVert$ (use for contrastive losses only).
- Residual headroom: $\rho(d) = \lVert r^{(d)}\rVert^{2} = 1 - \lVert z^{(d)}\rVert^{2}$.
- Cosine (certificate space) at depth $d$: $c_d(A,B) = \langle z_A^{(d)}, z_B^{(d)}\rangle$, with the full-depth cosine $c_D(A,B) = \langle z_A, z_B\rangle$.
- Production distance (score) — ARD-RBF: with per-dimension length scales $\ell_1,\dots,\ell_D > 0$ and difference $\delta = z_A - z_B$, define $s(A,B) = \exp\!\big(-\tfrac{1}{2}\sum_j \delta_j^{2}/\ell_j^{2}\big)$.
- DDP (Dimensional Distance Profile): the curve $d \mapsto \mathrm{dist}_d(A,B)$ under the chosen metric.
- Positives/negatives for a query $q$: $\mathcal{P}(q)$ / $\mathcal{N}(q)$.
- Depth ladder: $\mathcal{D} = \{d_1 < \dots < d_m\}$ denotes the served dimensions (e.g., $\{128, 256, 512, 1024\}$).
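A small sketch of the prefix/headroom computation under these definitions, assuming a unit-norm embedding and an orthonormal basis (names are illustrative):

```python
import numpy as np

def prefix_views(z: np.ndarray, basis: np.ndarray, ladder=(128, 256, 512, 1024)):
    """Prefix coordinates and residual headroom at each ladder depth.

    z     : unit-norm embedding in R^D
    basis : D x D matrix whose columns are the fixed, versioned orthonormal basis
    Returns {d: (z_d, headroom)} with headroom = 1 - ||z_d||^2.
    """
    coords = basis.T @ z                          # coordinates of z in the shared basis
    views = {}
    for d in ladder:
        z_d = coords[:d]
        views[d] = (z_d, float(1.0 - z_d @ z_d))  # energy left beyond depth d
    return views
```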
Early-stop certificate (cosine domain)
By Cauchy–Schwarz on the residuals, the full-D cosine lies in an interval around the depth-$d$ cosine:
$c_D(A,B) \in \big[\, c_d(A,B) - \sqrt{\rho_A(d)\,\rho_B(d)},\;\; c_d(A,B) + \sqrt{\rho_A(d)\,\rho_B(d)} \,\big].$
Certification rule: at depth $d$, certify winner $A^{\star}$ over runner-up $A'$ if the winner’s lower bound beats the runner-up’s upper bound:
$c_d(q,A^{\star}) - \sqrt{\rho_q(d)\,\rho_{A^{\star}}(d)} \;>\; c_d(q,A') + \sqrt{\rho_q(d)\,\rho_{A'}(d)}.$
(We score with ARD-RBF $s$, but we certify STOP/ZOOM in cosine.)
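A minimal sketch of the rule, assuming the headroom values $\rho(\cdot)$ from the notation above are already available:

```python
import math

def certified(c_winner: float, c_runner: float,
              rho_q: float, rho_winner: float, rho_runner: float) -> bool:
    """Early-stop certificate at depth d: the winner's worst-case full-D cosine
    must still beat the runner-up's best case."""
    lower_winner = c_winner - math.sqrt(rho_q * rho_winner)
    upper_runner = c_runner + math.sqrt(rho_q * rho_runner)
    return lower_winner > upper_runner
```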
Training: CDCL + DDP shaping
Goal: learn geometry that is (i) contrastive at each depth and (ii) stable/smooth across depths so low-d works fast and high-d refines.
Definitions used below: $c_d(q,x)$ is the depth-$d$ cosine between a query $q$ and an item $x$; $\mathcal{P}(q)$ / $\mathcal{N}(q)$ are its positives / negatives from the notation above.
Total loss:
$\mathcal{L} \;=\; \sum_{d\in\mathcal{D}} \mathcal{L}^{(d)}_{\mathrm{con}} \;+\; \lambda_{\mathrm{flat}}\,\mathcal{L}_{\mathrm{flat}} \;+\; \lambda_{\mathrm{sep}}\,\mathcal{L}_{\mathrm{sep}} \;+\; \lambda_{\mathrm{smooth}}\,\mathcal{L}_{\mathrm{smooth}},$
where $\mathcal{L}^{(d)}_{\mathrm{con}}$ is a per-depth contrastive term, $\mathcal{L}_{\mathrm{flat}}$ keeps positive DDPs flat across depths, $\mathcal{L}_{\mathrm{sep}}$ pushes negative DDPs apart by at least the margin $m$, and $\mathcal{L}_{\mathrm{smooth}}$ penalizes large changes between adjacent depths.
Notes: $m$ is the negative-pair margin; $\lambda_{\mathrm{flat}}, \lambda_{\mathrm{sep}}, \lambda_{\mathrm{smooth}}$ are scalar weights; expectations are over the current batch unless stated.
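One possible instantiation of the DDP-shaping terms, as a hedged sketch (the per-depth contrastive term would be added separately; the weights and margin are placeholders):

```python
import torch
import torch.nn.functional as F

def ddp_shaping_loss(cos_by_depth: torch.Tensor, is_positive: torch.Tensor,
                     margin: float = 0.2, w_flat: float = 1.0,
                     w_sep: float = 1.0, w_smooth: float = 0.1) -> torch.Tensor:
    """DDP-shaping sketch.

    cos_by_depth : [n_pairs, n_depths] cosine at each ladder depth
    is_positive  : [n_pairs] boolean mask for positive pairs
    """
    zero = cos_by_depth.new_tensor(0.0)
    pos, neg = cos_by_depth[is_positive], cos_by_depth[~is_positive]
    # Positives: keep the profile flat across depths (low variance).
    flat = pos.var(dim=1).mean() if pos.numel() else zero
    # Negatives: keep similarity below (1 - margin) at every depth.
    sep = F.relu(neg - (1.0 - margin)).mean() if neg.numel() else zero
    # All pairs: small change between adjacent depths.
    smooth = (cos_by_depth[:, 1:] - cos_by_depth[:, :-1]).pow(2).mean()
    return w_flat * flat + w_sep * sep + w_smooth * smooth
```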
Zoom policy (Expected Value of Computation)
Idea: start cheap (low-d). If quick cues predict a gain from “zooming,” pay the cost to go deeper.
Cost model: $C(d \to d')$ captures the extra latency/compute of escalating from depth $d$ to $d'$ (e.g., proportional to FLOPs or wall-time).
Cheap features: $\phi(q,d)$ collects low-cost signals at the current depth — margin, stability-delta across depths, residual headroom $\rho(d)$, local density, anisotropy. Zoom when the predicted gain exceeds the cost, $\widehat{\mathbb{E}}[\Delta Q \mid \phi] > C(d \to d')$. In practice we walk a ladder (e.g., 128→256→512→1024) and stop when the cosine certificate above proves the order, or thresholds pass; otherwise we escalate.
Serving controller (v1.5): a contextual bandit chooses stay vs. zoom using $\phi(q,d)$ and a compute prior. We apply hysteresis with thresholds $\tau_{\text{lo}} < \tau_{\text{hi}}$ to prevent thrash (see Part 2).
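A hedged sketch of the ladder walk, with the scorer, certificate, cost model, and gain estimator passed in as placeholder callables (none of these names are fixed by the proposal):

```python
def walk_ladder(candidates, score_at, certify_at, cost_of, estimate_gain,
                ladder=(128, 256, 512, 1024), pilot_fraction=0.1):
    """Walk the depth ladder: score cheaply, stop when certified, else decide
    whether escalation is worth its cost via a small pilot-zoom.

    score_at(d, items)    -> items ranked at depth d
    certify_at(d, ranked) -> True if the cosine certificate proves the order
    cost_of(d)            -> incremental cost of escalating to depth d
    estimate_gain(before, after) -> predicted quality gain from escalating
    """
    ranked = score_at(ladder[0], candidates)
    for i, d in enumerate(ladder):
        if i > 0:
            ranked = score_at(d, ranked)
        if certify_at(d, ranked):
            return ranked, d                      # certified: stop early
        if i + 1 < len(ladder):
            # Pilot-zoom ~5-10% of the shortlist to estimate the expected gain.
            pilot = ranked[: max(1, int(len(ranked) * pilot_fraction))]
            pilot_next = score_at(ladder[i + 1], pilot)
            if estimate_gain(pilot, pilot_next) <= cost_of(ladder[i + 1]):
                return ranked, d                  # not worth paying for the zoom
    return ranked, ladder[-1]
```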
Cross-scale lift (jump)
When low-d looks “braided,” predict a residual direction in the higher-d subspace and leap there.
Lift reward: a small head predicts a residual $\hat r$ in the higher-d subspace from the low-d view, and the jump is scored by
$R(\hat r) \;=\; \mathrm{sep}\big(z_q + \hat r,\; t,\; \mathcal{C}\big) \;-\; \beta\,\lVert \hat r\rVert_{0},$
where $t$ is the target (or class prototype) in $D$ dims; $\mathcal{C}$ is the competitor set; $\beta$ controls residual sparsity; and $\lVert\cdot\rVert_{0}$ counts non-zeros in $\hat r$.
The reward balances target separation in full $D$ against the sparsity/size of the predicted residual.
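A hedged sketch of the lift-and-score step, with the lift head and competitor set as placeholders (zero-padding the low-d prefix into full D is an illustrative choice, not specified above):

```python
import numpy as np

def lift_and_score(z_q_low: np.ndarray, lift_head, target_full: np.ndarray,
                   competitors_full: np.ndarray, beta: float = 0.01) -> float:
    """Predict a sparse residual from the low-d view, add it in full D,
    and score separation from the target against the competitors.

    lift_head        : callable low-d -> full-D residual (a small learned model)
    target_full      : D-dim target (or class prototype)
    competitors_full : [n_comp, D] competitor embeddings
    """
    residual = lift_head(z_q_low)                  # predicted jump direction
    z_lifted = np.zeros_like(target_full)
    z_lifted[: z_q_low.shape[0]] = z_q_low         # embed the low-d prefix in full D
    z_lifted = z_lifted + residual
    z_lifted /= np.linalg.norm(z_lifted) + 1e-12

    sep = float(z_lifted @ target_full) - float(np.max(competitors_full @ z_lifted))
    return sep - beta * np.count_nonzero(residual)  # separation minus sparsity cost
```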
Fold-aware proactive trigger
Cheap signals that a better ordering “over the fold” likely exists:
- DDP curvature: $\kappa(d) = \bar\delta(d+1) - 2\bar\delta(d) + \bar\delta(d-1)$ (second difference of the mean distance).
- Neighbor-swap rate (braiding): $1 - \tau_K(d, d')$ (Kendall $\tau$ between neighbor orders across depths).
- Meet-again / reconvergence score (diverge then agree again in higher-d).
- Residual headroom: $\rho(d)$ (energy still left beyond depth $d$).
- Hubness skew around the query (unstable hubs → likely fold).
Definitions: $\bar\delta(d)$ is a batch-mean distance at depth $d$ (often over positives); $\tau_K(d, d')$ is Kendall $\tau$ between top-$K$ neighbor lists at depths $d$ and $d'$.
Fold potential head: a lightweight head maps these cues to a fold score $F \in [0, 1]$.
Decision energy: $F$ is combined with the expected zoom gain and the compute cost into a single decision score $E$.
Policy: jump if $F$ exceeds its threshold or $E$ exceeds its threshold (subject to budget).
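A hedged sketch of how two of these cues and the jump policy could be computed; the thresholds are placeholders and the learned fold head is assumed to supply the score:

```python
import numpy as np
from scipy.stats import kendalltau

def fold_cues(mean_dist_by_depth: np.ndarray, neighbors_low: list,
              neighbors_high: list, headroom: float) -> dict:
    """Cheap fold cues: DDP curvature and a neighbor-swap (braiding) rate.

    mean_dist_by_depth : mean distance at each ladder depth (batch/positive mean)
    neighbors_low/high : top-K neighbor ids at two adjacent depths
    """
    # DDP curvature: largest second difference of the mean-distance curve.
    curvature = float(np.abs(np.diff(mean_dist_by_depth, n=2)).max())
    # Braiding: 1 - Kendall tau between neighbor orders shared across depths.
    common = [n for n in neighbors_low if n in neighbors_high]
    if len(common) >= 2:
        tau, _ = kendalltau([neighbors_low.index(n) for n in common],
                            [neighbors_high.index(n) for n in common])
        tau = 0.0 if np.isnan(tau) else tau
    else:
        tau = 1.0
    return {"curvature": curvature, "swap_rate": 1.0 - tau, "headroom": headroom}

def should_jump(cues: dict, fold_score: float, tau_fold: float = 0.7,
                budget_ok: bool = True) -> bool:
    """Jump when the learned fold score or the raw cues clear illustrative thresholds."""
    strong_cues = cues["swap_rate"] > 0.5 and cues["headroom"] > 0.3
    return budget_ok and (fold_score > tau_fold or strong_cues)
```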
Part 4 — Evaluation plan (what we’ll test if we proceed)
- Datasets: representative public text/code retrieval sets + a synthetic meet-again benchmark.
- Metrics: AUQC (Area-Under Quality-vs-Compute), Budget frontier (p50/p90/p95 quality vs. latency/compute), Zoom-AUC, Thrash@τ (hysteresis reversals), Neighbor-Stability (Kendall-τ across depths), Hubness skew, Recall@K per rung, Jump@K, dims-per-jump, braid/meet-again indices, DDP AUC, cost (re-ranks/tokens, p95 latency).
- Baselines: 1024-only retriever; early-exit escalator (no lift); cross-encoder re-rank; dimensional ensembling; ours (proposed) without vs. with fold-aware jumps.
- Ablations: remove each DDP loss; remove lift; remove fold features one-by-one; ladder/anchor variants; basis swaps with versioning.
- Stats: bootstrap CIs, paired permutation tests, effect sizes.
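As one possible way to compute the AUQC metric listed above (a hedged sketch; the exact normalization used in the evaluation may differ):

```python
import numpy as np

def auqc(compute: np.ndarray, quality: np.ndarray) -> float:
    """Area under the quality-vs-compute curve, normalized by the compute span.

    compute : compute/latency per operating point (any order)
    quality : matching quality metric (e.g., Hit@1)
    """
    order = np.argsort(compute)
    x = np.asarray(compute, float)[order]
    y = np.asarray(quality, float)[order]
    span = x[-1] - x[0]
    if span <= 0:
        return float(y.mean())
    area = float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))  # trapezoidal rule
    return area / span
```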
Ops & CI (regression guards)
- AUQC/Budget frontier: track movement of the frontier release-to-release.
- Neighbor-Stability: mean ± SE of Kendall-τ across depths.
- Zoom-AUC and Thrash@τ to catch oscillation.
- Conformal coverage: empirical coverage vs. target (1−α).
- Index telemetry: HNSW hops, PQ recall@K, GPU util.
Acceptance tests (go/no-go)
- Compute ↓, Quality =/↑: AUQC ↑ or p95 latency ↓ ≥10% at matched Hit@1.
- Stability: Kendall-τ SE ↓ ≥10% (fewer cross-depth neighbor flips).
- Abstention: conformal coverage within ±1.5% of target (1−α).
- Retrieval: final-rung recall@10 ≥0.98× brute-force on a 1M shard; ≥5× speedup with HNSW+slice-PQ.
Part 5 — Potential benefits (if validated)
- Reduced compute by avoiding brute-force high-d on everything.
- Long-tail improvements (polysemy, rare relations) via early fold-aware jumps.
- Ops clarity: explicit budgets, thresholds, versioned bases for audits.
Part 6 — Limitations & risks (up front)
- Fold cues may be noisy → over-jump; needs budget/entropy priors.
- Sensitivity to basis versioning and domain shift in fold statistics.
- As with any retrieval improvement, consider policy/safety filters.
Part 7 — Learn more / follow along
Working draft, LaTeX, configs, and notes will live in the OSF project:
OSF: https://osf.io/85z3u/files
We’ll update that page with code and evaluation artifacts if we move to experiments.