
The Geometry of Intuition in Artificial Minds

A proposal for teaching models when to zoom and where to jump.

Status: Concept proposal (unvalidated).
More materials (work-in-progress): https://osf.io/85z3u/files

TL;DR

We propose a method to teach AI something like intuition:

  1. Know when to zoom — use more representational detail only when it’s worth the cost.
  2. Know where to jump — predict a better direction that only appears at higher-dimensional detail.

The approach combines Cross-Dimensional Contrastive Learning (CDCL), Dimensional Distance Profiles (DDP), and a fold-aware jump policy that triggers proactive cross-scale moves when low-dimensional geometry hints there’s a better solution “across a fold.”

Note: This is a theoretical proposal; experiments are forthcoming.

Part 1 — A human-friendly explanation

Intuition, in one picture

You spot a friend across a plaza. First comes gist (cheap, fast). If unsure, you zoom (more detail). Sometimes a tiny cue makes you jump — you just know it’s them without scanning every face.

We propose training AI to do that:

  • Zoom only when cheap signals suggest it will help.
  • Jump early when low-dimensional patterns indicate the problem will “untangle” at higher dimension.

Why this might help

Embeddings can reshuffle neighbors when you change dimension (e.g., 256 → 1024). Systems either recompute a lot (slow/expensive) or risk missing long-tail matches. Stabilizing geometry across dimensions and adding fold-aware jumps could keep quality while avoiding brute force.

Part 2 — How it works (still friendly)

The ladder

Use a fixed dimensional “ladder,” e.g., 128 → 256 → 512 → 1024, with a versioned basis so geometry is comparable across steps (and auditable).
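
To make "versioned basis" concrete, here is a minimal sketch (assuming NumPy; the function name, the seed, and the hash scheme are illustrative, not part of the proposal): a seeded orthonormal basis plus a recorded version tag keeps the ladder's geometry reproducible and auditable across releases.

```python
import hashlib
import numpy as np

LADDER = (128, 256, 512, 1024)   # served prefix dimensions
D = LADDER[-1]

def make_versioned_basis(D: int, seed: int = 0):
    """Return a fixed D x D orthonormal basis U and a short version hash."""
    rng = np.random.default_rng(seed)
    # QR of a random Gaussian matrix yields orthonormal columns.
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
    version = hashlib.sha256(Q.tobytes()).hexdigest()[:12]
    return Q, version

U, BASIS_VERSION = make_versioned_basis(D)
```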

DDP: Dimensional Distance Profiles

For any pair (A, B), track distance vs. dimension. Positives should stay flat as you zoom; negatives should separate.
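
A minimal DDP sketch, assuming NumPy, with 1 − prefix cosine standing in for the production ARD-RBF distance defined in Part 3 (the random vectors are placeholders for real encoder outputs already rotated into the fixed basis):

```python
import numpy as np

LADDER = (128, 256, 512, 1024)

def ddp(z_a, z_b, ladder=LADDER):
    """Dimensional Distance Profile: distance for one pair at each ladder depth."""
    a = z_a / np.linalg.norm(z_a)
    b = z_b / np.linalg.norm(z_b)
    return {d: 1.0 - float(a[:d] @ b[:d]) for d in ladder}

rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal(1024), rng.standard_normal(1024)
print(ddp(z_a, z_b))   # positives should stay roughly flat; negatives should separate
```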

The zoom gate (policy)

From cheap, low-d signals — margin, stability-delta (how much rankings changed since last step), and residual headroom — estimate the expected value of zooming minus its cost (latency/compute). Zoom only if it’s worth it.

Controller (v1.5): we use a contextual bandit to pick stay/zoom from cheap signals and a compute-cost prior. A hysteresis deadband (τ_lo < τ_hi) prevents thrash. We run a lightweight pilot-zoom on ~5–10% of the shortlist to estimate the expected gain before full escalation. With conformal abstention, the system can say “I don’t know” (or escalate) when non-conformity exceeds a threshold t_α.

The fold-aware jump (proactive)

Sometimes the low-d view suggests a topological fold: rapid neighbor swaps, DDP curvature, or “meet-again” behavior. When those cues appear, predict a small residual direction in the higher-d space and jump before local search fails.

Part 3 — Show me the math (for practitioners & reviewers)

Notation

  • Encoder: f_\theta(x) outputs z_x\in\mathbb{R}^D. We use the unit-norm version \bar z_x=z_x/\|z_x\|_2.
  • Fixed orthonormal basis: U\in\mathbb{R}^{D\times D} (columns are orthonormal).
  • Prefix / residual at depth d: p_x^{(d)}=(U^\top \bar z_x)_{1:d}, r_x^{(d)}=(U^\top \bar z_x)_{d+1:D}. If \|\bar z_x\|_2=1 then \|p_x^{(d)}\|_2^2+\|r_x^{(d)}\|_2^2=1.
  • Optional normalized d-view: \tilde z_x^{(d)}=p_x^{(d)}/\|p_x^{(d)}\|_2 (use for contrastive losses only).
  • Residual headroom: \epsilon_d(x)=\|r_x^{(d)}\|_2^2=1-\|p_x^{(d)}\|_2^2.
  • Cosine (certificate space) at depth d: \cos_d(x,y)=\big(p_x^{(d)}\big)^\top\big(p_y^{(d)}\big), \cos_D(x,y)=\bar z_x^\top \bar z_y.
  • Production distance (score) — ARD-RBF: with \tilde\Delta_i=\mathrm{softclip}\!\big((z_i-z'_i)/\sigma_i\big), \rho^{2(d)}(x,y)=\sum_{i=1}^{d}\tilde\Delta_i^2/\ell_i^2, S^{(d)}=\exp(-\tfrac12\rho^{2(d)}), define \delta^{(d)}(x,y)=1-S^{(d)}.
  • DDP (Dimensional Distance Profile): the curve d\mapsto \delta^{(d)}(x,y) under the chosen metric.
  • Positives/negatives for a query x: \mathcal{P}(x) / \mathcal{N}(x).
  • Depth ladder: \mathcal{D} denotes the served dimensions (e.g., \{128,256,512,1024\}).
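
A minimal sketch of the notation above (assuming NumPy; the randomly seeded basis stands in for whatever versioned basis is actually deployed):

```python
import numpy as np

D = 1024
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((D, D)))   # fixed orthonormal basis

def project(z, d):
    """Prefix p^(d) and residual headroom eps_d for a unit-normalized embedding."""
    z_bar = z / np.linalg.norm(z)
    coords = U.T @ z_bar
    p = coords[:d]
    eps = 1.0 - float(p @ p)          # ||r||^2 = 1 - ||p||^2 since ||z_bar|| = 1
    return p, eps

def cos_d(z_x, z_y, d):
    """Certificate-space cosine at depth d: the plain prefix dot product."""
    p_x, _ = project(z_x, d)
    p_y, _ = project(z_y, d)
    return float(p_x @ p_y)
```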

Early-stop certificate (cosine domain)

By Cauchy–Schwarz on residuals, the full-D cosine lies in

\boxed{\ \cos_D(x,y)\in\big[\ \cos_d(x,y)-\sqrt{\epsilon_d(x)\,\epsilon_d(y)},\ \ \cos_d(x,y)+\sqrt{\epsilon_d(x)\,\epsilon_d(y)}\ \big]\ }

Certification rule: at depth d, certify winner a over runner-up b if \cos_d(x,a)-\sqrt{\epsilon_d(x)\epsilon_d(a)}>\cos_d(x,b)+\sqrt{\epsilon_d(x)\epsilon_d(b)}. (We score with ARD-RBF \delta, but we certify STOP/ZOOM in cosine.)
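
A minimal sketch of the certification rule (assuming NumPy; the cos_d and \epsilon_d inputs are computed as in the notation sketch above):

```python
import numpy as np

def certify(cos_xa, eps_x, eps_a, cos_xb, eps_b):
    """Certify winner a over runner-up b at depth d via the Cauchy-Schwarz interval."""
    lower_a = cos_xa - np.sqrt(eps_x * eps_a)   # worst case for a at full D
    upper_b = cos_xb + np.sqrt(eps_x * eps_b)   # best case for b at full D
    return lower_a > upper_b                    # True => STOP; False => ZOOM
```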

Training: CDCL + DDP shaping

Goal: learn geometry that is (i) contrastive at each depth and (ii) stable/smooth across depths so low-d works fast and high-d refines.

\mathcal{L}_{\mathrm{con}}=\sum_{d\in\mathcal{D}}\mathrm{InfoNCE}\big(\{\tilde z^{(d)}\}\big)

\mathcal{L}_{\mathrm{cons}}=\sum_{d\in\mathcal{D}}\mathrm{E}_{(x,y)}\big|\delta^{(d)}(x,y)-\delta^{(D)}(x,y)\big|

Definitions used below: \delta_{+}^{(d)}(x)=\mathrm{E}_{y\in\mathcal{P}(x)}[\delta^{(d)}(x,y)], \delta_{-}^{(d)}(x)=\mathrm{E}_{y\in\mathcal{N}(x)}[\delta^{(d)}(x,y)].

\mathcal{L}_{\mathrm{flat}}=\tfrac{1}{|\mathcal{D}|}\sum_{d\in\mathcal{D}}\Big(\delta_{+}^{(d)}-\tfrac{1}{|\mathcal{D}|}\sum_{d'\in\mathcal{D}}\delta_{+}^{(d')}\Big)^{2}

\mathcal{L}_{\mathrm{margin}}=\sum_{d\in\mathcal{D}}\mathrm{E}_{x,\ y\in\mathcal{N}(x)}\max\!\left(0,\ m-\delta^{(d)}(x,y)\right)

Total loss: \mathcal{L}=\mathcal{L}_{\mathrm{con}}+\lambda_{\mathrm{cons}}\mathcal{L}_{\mathrm{cons}}+\lambda_{\mathrm{flat}}\mathcal{L}_{\mathrm{flat}}+\lambda_{\mathrm{margin}}\mathcal{L}_{\mathrm{margin}}

Notes: m is the negative-pair margin; \lambda_{\mathrm{cons}},\lambda_{\mathrm{flat}},\lambda_{\mathrm{margin}} are scalar weights; expectations \mathrm{E}[\cdot] are over the current batch unless stated.
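
A minimal sketch of the combined objective, assuming PyTorch; rows of z and z_pos are assumed paired (in-batch negatives elsewhere), 1 − normalized prefix cosine stands in for the ARD-RBF δ, rolled batch rows stand in for mined negatives, and the λ values, margin m, and temperature τ are placeholders rather than recommended settings:

```python
import torch
import torch.nn.functional as F

LADDER = (128, 256, 512, 1024)

def delta(z1, z2, d):
    """Stand-in distance at depth d: 1 - normalized prefix cosine."""
    p1 = F.normalize(z1[:, :d], dim=-1)
    p2 = F.normalize(z2[:, :d], dim=-1)
    return 1.0 - (p1 * p2).sum(-1)

def cdcl_loss(z, z_pos, lam_cons=1.0, lam_flat=0.1, lam_margin=0.1, m=0.5, tau=0.07):
    D = z.shape[1]
    targets = torch.arange(len(z))
    l_con = l_cons = l_margin = 0.0
    pos_means = []
    for d in LADDER:
        p = F.normalize(z[:, :d], dim=-1)
        p_pos = F.normalize(z_pos[:, :d], dim=-1)
        l_con = l_con + F.cross_entropy(p @ p_pos.T / tau, targets)        # InfoNCE at depth d
        l_cons = l_cons + (delta(z, z_pos, d) - delta(z, z_pos, D)).abs().mean()
        l_margin = l_margin + F.relu(m - delta(z, z_pos.roll(1, 0), d)).mean()
        pos_means.append(delta(z, z_pos, d).mean())
    pos_means = torch.stack(pos_means)
    l_flat = ((pos_means - pos_means.mean()) ** 2).mean()                  # flat positive DDP
    return l_con + lam_cons * l_cons + lam_flat * l_flat + lam_margin * l_margin
```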

Zoom policy (Expected Value of Computation)

Idea: start cheap (low-d). If quick cues predict a gain from “zooming,” pay the cost to go deeper.

d^{*}(x)=\arg\max_{d\in\mathcal{D}}\ \mathbf{w}^{\top}\boldsymbol{\phi}_{d}(x)-\lambda\,C(d)

Cost model: C(d) captures extra latency/compute at depth d (e.g., proportional to FLOPs or wall-time).

Here \boldsymbol{\phi}_{d}(x) are cheap features (margin, stability-delta across depths, residual headroom \epsilon_d, local density, anisotropy). In practice we walk a ladder (e.g., 128→256→512→1024) and stop when the cosine certificate above proves the order, or thresholds pass; otherwise escalate.
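
A minimal sketch of the EVC-style depth choice and the ladder walk (assuming NumPy; phi, w, cost, and certified are placeholders supplied by the caller, with certified(d) implementing the cosine certificate above):

```python
import numpy as np

LADDER = (128, 256, 512, 1024)

def choose_depth(phi, w, cost, lam=1.0, ladder=LADDER):
    """Pick the depth maximizing predicted value minus compute cost."""
    scores = [float(w @ phi(d)) - lam * cost(d) for d in ladder]
    return ladder[int(np.argmax(scores))]

def walk_ladder(certified, ladder=LADDER):
    """In practice: climb the ladder and stop as soon as the certificate proves the order."""
    for d in ladder:
        if certified(d):
            return d
    return ladder[-1]
```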

Serving controller (v1.5): a contextual bandit chooses stay vs. zoom using \phi_{d}(x) and a compute prior. We apply hysteresis with thresholds \tau_{\mathrm{lo}}<\tau_{\mathrm{hi}} to avoid reversals, run a pilot-zoom on ~5–10% of the shortlist to estimate \widehat{\Delta\mathrm{Quality}}, and use conformal abstention (threshold t_{\alpha}) to abstain or escalate when uncertain.
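
A minimal sketch of that controller loop for one query (assuming Python; the hysteresis thresholds, the 5% pilot fraction, and the callables are illustrative placeholders, and a learned bandit would replace the hand-set deadband logic):

```python
import random

def decide(margin, nonconformity, pilot_zoom_gain, shortlist,
           tau_lo=0.05, tau_hi=0.15, t_alpha=0.9, pilot_frac=0.05):
    """Return 'stay', 'zoom', or 'abstain' for one query at the current depth."""
    if nonconformity > t_alpha:
        return "abstain"            # conformal abstention: say "I don't know" or escalate
    if margin >= tau_hi:
        return "stay"               # comfortably separated; deadband prevents thrash
    if margin < tau_lo:
        return "zoom"               # clearly under-resolved at this depth
    # Inside the deadband: pilot-zoom a small sample of the shortlist to estimate the gain.
    pilot = random.sample(shortlist, max(1, int(pilot_frac * len(shortlist))))
    return "zoom" if pilot_zoom_gain(pilot) > 0 else "stay"
```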

Cross-scale lift (jump)

When low-d looks “braided,” predict a residual direction in the higher-d subspace and leap there.

R_{d^*}=\mathrm{span}\{u_{d^*+1},\dots,u_D\}

\hat r=g\big(\tilde z^{(d^*)},\,\boldsymbol{\phi}_{d^*}(x)\big)\in R_{d^*}

\tilde z=\mathrm{Norm}\big([\tilde z^{(d^*)};\,\hat r]\big)

R_{\mathrm{jump}}=\left( \cos(\tilde z, b^{(D)})-\max_{c\in\mathcal{C}} \cos(\tilde z, c^{(D)}) \right)-\lambda_{\mathrm{dim}}\ \| \hat r \|_{0}

Where: b^{(D)} is the target (or class prototype) in D dims; \mathcal{C} is the competitor set; \lambda_{\mathrm{dim}} controls residual sparsity; and \| \hat r \|_{0} counts non-zeros in \hat r.

The reward balances target separation in full D against the sparsity/size of the predicted residual.
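
A minimal sketch of the lift and its reward (assuming NumPy; the residual predictor g is reduced to a single linear map W_g, the target and competitor vectors are assumed unit-normalized so dot products equal cosines, and an L1 sum stands in for the non-differentiable \ell_0 count):

```python
import numpy as np

def lift(z_prefix, phi, W_g):
    """Predict a residual direction and form the lifted, renormalized embedding."""
    r_hat = W_g @ np.concatenate([z_prefix, phi])    # g(z^(d*), phi), lives in R_{d*}
    z_lift = np.concatenate([z_prefix, r_hat])
    return z_lift / np.linalg.norm(z_lift), r_hat

def jump_reward(z_lift, target, competitors, r_hat, lam_dim=0.01):
    """Separation from the best competitor in full D, minus a residual-sparsity penalty."""
    sep = float(z_lift @ target) - max(float(z_lift @ c) for c in competitors)
    return sep - lam_dim * float(np.abs(r_hat).sum())   # L1 surrogate for ||r_hat||_0
```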

Fold-aware proactive trigger

Cheap signals that a better ordering “over the fold” likely exists:

  • DDP curvature: \kappa_d=\bar\delta^{(d+1)}-2\bar\delta^{(d)}+\bar\delta^{(d-1)} (second difference of mean distance).
  • Neighbor-swap rate (braiding): \sigma_d=1-\tau_K(d,\,d+\Delta) (Kendall \tau between neighbor orders across depths).
  • Meet-again / reconvergence score (diverge then agree again in higher-d).
  • Residual headroom: \epsilon_d (energy still left beyond depth d).
  • Hubness skew around the query (unstable hubs → likely fold).

Definitions: \bar\delta^{(d)} is a batch mean (often over positives); \tau_K is Kendall \tau between top-K neighbor lists across depths.

Fold potential head: \Psi_{\mathrm{fold}}=\sigma\!\big(\mathbf{w}_f^{\top}\boldsymbol{\phi}^{\mathrm{fold}}_{d}\big)

Decision energy: \Delta U_{\mathrm{jump}}(d\!\to\!d')=\mathrm{E}\big[\mathrm{Gain}_{d'}-\mathrm{Gain}_{d}\mid x,\,\boldsymbol{\phi}_{d}\big]-c_t\,\Delta t(d,d')-c_c\,\mathrm{Cost}(d')

Policy: jump if \Psi_{\mathrm{fold}}>\tau or \Delta U_{\mathrm{jump}}>0 (subject to budget).
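
A minimal sketch of two of the cheap fold signals and the jump policy (assuming NumPy and SciPy; the inputs are a per-depth mean-DDP curve and top-K neighbor id lists at adjacent depths, and the threshold values are placeholders):

```python
import numpy as np
from scipy.stats import kendalltau

def ddp_curvature(mean_delta, i):
    """kappa_d: second difference of the mean DDP curve at ladder index i."""
    return mean_delta[i + 1] - 2 * mean_delta[i] + mean_delta[i - 1]

def swap_rate(neighbors_d, neighbors_d_next):
    """sigma_d: 1 - Kendall tau between top-K neighbor orders at adjacent depths."""
    common = [n for n in neighbors_d if n in neighbors_d_next]
    tau, _ = kendalltau([neighbors_d.index(n) for n in common],
                        [neighbors_d_next.index(n) for n in common])
    return 1.0 - (0.0 if np.isnan(tau) else float(tau))

def should_jump(psi_fold, delta_u_jump, tau_fold=0.7):
    """Policy: jump on strong fold evidence or positive decision energy (budget permitting)."""
    return psi_fold > tau_fold or delta_u_jump > 0
```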

Part 4 — Evaluation plan (what we’ll test if we proceed)

  • Datasets: representative public text/code retrieval sets + a synthetic meet-again benchmark.
  • Metrics: AUQC (Area-Under Quality-vs-Compute), Budget frontier (p50/p90/p95 quality vs. latency/compute), Zoom-AUC, Thrash@τ (hysteresis reversals), Neighbor-Stability (Kendall-τ across depths), Hubness skew, Recall@K per rung, Jump@K, dims-per-jump, braid/meet-again indices, DDP AUC, cost (re-ranks/tokens, p95 latency).
  • Baselines: 1024-only retriever; early-exit escalator (no lift); cross-encoder re-rank; dimensional ensembling; ours (proposed) without vs. with fold-aware jumps.
  • Ablations: remove each DDP loss; remove lift; remove fold features one-by-one; ladder/anchor variants; basis swaps with versioning.
  • Stats: bootstrap CIs, paired permutation tests, effect sizes.
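
To pin down AUQC from the metrics list, a minimal sketch (assuming NumPy; quality is measured at a handful of compute budgets, compute is normalized to [0, 1] so the area is comparable across runs, and the numbers in the example are toy values):

```python
import numpy as np

def auqc(compute, quality):
    """Area under the quality-vs-compute curve (trapezoidal rule over normalized compute)."""
    c = np.asarray(compute, dtype=float)
    q = np.asarray(quality, dtype=float)
    order = np.argsort(c)
    c, q = c[order] / c[order].max(), q[order]
    return float(0.5 * np.sum((c[1:] - c[:-1]) * (q[1:] + q[:-1])))

print(auqc([1, 2, 4, 8], [0.60, 0.72, 0.78, 0.80]))   # toy numbers, illustration only
```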

Ops & CI (regression guards)

  • AUQC/Budget frontier: track movement of the frontier release-to-release.
  • Neighbor-Stability: mean ± SE of Kendall-τ across depths.
  • Zoom-AUC and Thrash@τ to catch oscillation.
  • Conformal coverage: empirical coverage vs. target (1−α).
  • Index telemetry: HNSW hops, PQ recall@K, GPU util.

Acceptance tests (go/no-go)

  1. Compute ↓, Quality =/↑: AUQC ↑ or p95 latency ↓ ≥10% at matched Hit@1.
  2. Stability: Kendall-τ SE ↓ ≥10% (fewer cross-depth neighbor flips).
  3. Abstention: conformal coverage within ±1.5% of target (1−α).
  4. Retrieval: final-rung recall@10 ≥0.98× brute-force on a 1M shard; ≥5× speedup with HNSW+slice-PQ.

Part 5 — Potential benefits (if validated)

  • Reduced compute by avoiding brute-force high-d on everything.
  • Long-tail improvements (polysemy, rare relations) via early fold-aware jumps.
  • Ops clarity: explicit budgets, thresholds, and versioned bases for audits.

Part 6 — Limitations & risks (up front)

  • Fold cues may be noisy → over-jump; needs budget/entropy priors.
  • Sensitivity to basis versioning and domain shift in fold statistics.
  • As with any retrieval improvement, consider policy/safety filters.

Part 7 — Learn more / follow along

Working draft, LaTeX, configs, and notes will live in the OSF project:
OSF: https://osf.io/85z3u/files
We’ll update that page with code and evaluation artifacts if we move to experiments.