Hyperbolic Entailment Cones for Learning Hierarchical Embeddings

Octavian-Eugen Ganea, Gary Becigneul, Thomas Hofmann

ICML'18 submission

Basic Concepts of Differential Geometry

  • Manifold: generalization of a surface in n dimensions
     
  • Tangent space at x: \(T_x M\)
     
  • Riemannian metric: \(g=(g_x(\cdot,\cdot))_{x\in M}\)
     
  • Length of a curve \(\gamma\):
    $$L(\gamma)=\int_{t=0}^1\sqrt{g_{\gamma(t)}(\gamma'(t),\gamma'(t))}dt$$
  • Exponential map: $$\exp_x(v):T_x M\to M$$

https://commons.wikimedia.org/wiki/File:Tangentialvektor.svg

from http://inspirehep.net/record/1355197/plots

Hyperbolic Geometry

from http://mathworld.wolfram.com/PoincareHyperbolicDisk.html

Hyperbolic Geometry

  • Geodesics 
  • Conformality (angle preserving)
  • Metric tensor (gives geodesics, distances, needed for optimization)
  • Distance \(d_{\mathbb D}(x,y) = \cosh^{-1}\left(1+ 2 \frac{\| x-y\|^2}{(1-\|x\|^2) \cdot (1-\|y\|^2)} \right)\)

Poincaré Ball

  • Manifold: \(\mathbb{D}^n=\{x\in\mathbb{R}^n\mid\Vert x\Vert_2<1\}\)
  • Tangent space: \(T_x\mathbb{D}^n=\mathbb{R}^n\)
  • Metric tensor: \(g_x(u,v)=\lambda_x^2 u^T v,\ \lambda_x:=\frac{2}{1-\Vert x\Vert^2}\)
  • Distance \(d_{\mathbb D}(x,y) = \cosh^{-1}\left(1+ 2 \frac{\| x-y\|^2}{(1-\|x\|^2) \cdot (1-\|y\|^2)} \right)\)

Property 1:

  • Volumes grow exponentially with radius, similar to trees in which number of nodes grows exponentially with depth
  • Length of circle of radius \(r\): \(2\pi\sinh(r)\)

Properties of the Poincaré Ball

[1] Christopher De Sa et al., 2018, Representation Tradeoffs for Hyperbolic Embeddings

\frac{d(x,y)}{d(0,x) + d(0,y)}
d(x,y)d(0,x)+d(0,y)\frac{d(x,y)}{d(0,x) + d(0,y)}

Figure taken from [1].

Property 2:

Hyperbolic vs Euclidean Tree Embeddings

[1] https://www.cs.cornell.edu/~cdesa/papers/arxiv2018_hyperbolic.pdf

[2] https://en.wikipedia.org/wiki/Hyperbolic_metric_space

Trees can be embedded with arbitrary low distortions in \(\mathbb D_2\), but NOT in the Euclidean space for any number of dimensions! [1]

 

\text{distortion}(G) = \sum_{x,y \in G, x \neq y}\frac{|d_G(x,y) - d_{M}(x,y)|}{d_G(x,y)}
distortion(G)=x,yG,xydG(x,y)dM(x,y)dG(x,y)\text{distortion}(G) = \sum_{x,y \in G, x \neq y}\frac{|d_G(x,y) - d_{M}(x,y)|}{d_G(x,y)}
  • Graph distortion when embedding in manifold M:
  • Reverse: given \(n\) points on \(\mathbb D\), can we find a tree quasi-isometrically* embedded into these points ?
    • Yes for hyperbolic spaces (Gromov theorem [2])
    • No for Euclidean spaces.
*:\ \forall x_1,...,x_n\in X, \exists \text{weighted tree } T \text{and embedding } f:T\to X, \text{ s.t. } \forall i,j
: x1,...,xnX,weighted tree Tand embedding f:TX, s.t. i,j*:\ \forall x_1,...,x_n\in X, \exists \text{weighted tree } T \text{and embedding } f:T\to X, \text{ s.t. } \forall i,j
\vert d_T(f^{-1}(x_i),f^{-1}(x_j))- d_X(x_i,x_j)\vert=\mathcal{O}(\delta\log(n))
dT(f1(xi),f1(xj))dX(xi,xj)=O(δlog(n))\vert d_T(f^{-1}(x_i),f^{-1}(x_j))- d_X(x_i,x_j)\vert=\mathcal{O}(\delta\log(n))

Hyperbolic Geometry - Math

For \(x\in\mathbb{D}^n\) and \(v\in T_x\mathbb{D}^n\), we have the geodesic satisfying \(\gamma_{x,v}(0)=x,\ \dot{\gamma}_{x,v}(0)=v\),
 

$$\gamma_{x,v}(t)=\dfrac{\left(\lambda_x\cosh(t)+\lambda_x^2\langle x,v\rangle\sinh(t)\right)x+\lambda_x\sinh(t) v}{1+(\lambda_x-1)\cosh(t)+\lambda_x^2\langle x,v\rangle\sinh(t)}$$
and the exponential map

$$\exp_x(v)=\frac{\lambda_x\left(\cosh(\lambda_x\Vert v\Vert)+\langle x,\frac{v}{\Vert v\Vert}\rangle\sinh(\lambda_x\Vert v\Vert)\right)}{1+(\lambda_x-1)\cosh(\lambda_x\Vert v\Vert)+\lambda_x\langle x,\frac{v}{\Vert v\Vert}\rangle\sinh(\lambda_x\Vert v\Vert)}x+$$ $$\frac{\frac{1}{\Vert v\Vert}\sinh(\lambda_x\Vert v\Vert)}{1+(\lambda_x-1)\cosh(\lambda_x\Vert v\Vert)+\lambda_x\langle x,\frac{v}{\Vert v\Vert}\rangle\sinh(\lambda_x\Vert v\Vert)}v$$

Trees/DAGs Models:

 Poincaré Embeddings - M.Nickel et al, 2017

Nickel, M., and Douwe K. "Poincaré embeddings for learning hierarchical representations." NIPS. 2017.

  • does not explicitly model asymmetric relations; relies on heuristic:   score(is-a\((u,v)) = (1 + \alpha (\|u\| - \|v\|)) d(u,v)\)
  • loss:

Modeling Entailment  - Order Embeddings

Vendrov, Ivan, et al. "Order-embeddings of images and language." ICLR 2016

  • entailment region/cone: \(\mathcal{O}_x:=\lbrace y\mid\forall i\in[d],\ y_i\geq x_i\rbrace\)
  • define a partial order on \(\mathbb{R}^d\)

Disadvantages:

  • capacity scales linearly with dimension d
  • heavy  cone intersections
  • disjoint areas of the cones are bounded

Hyperbolic Entailment Cones 

  • Goal 1: generalize entailment regions/cones to any Riemannian manifold
  • Goal 2: derive cones in the hyperbolic space (d-dimensional Poinaré ball)

figure adapted from https://math.stackexchange.com/questions/1407550/what-hyperbolic-space-really-looks-like/1408383

Convex Cones in a Complete Riemannian Manifold

  • Euclidean convex cones: 

$$v_1, v_2 \in S \Longrightarrow \alpha v_1 + \beta v_2 \in S \quad (\forall \alpha, \beta \geq 0)$$

  • Generalize convex cones for complete Riemannian manifolds:$$S \subseteq T_x\mathcal{M}, \quad \frak S_x := \exp_x \left( S \right) \subset\mathcal{M}$$

Angular Cones in the Poincaré Ball

Goal:

  • induce a partial order in \(\mathbb{D}^n\)
  • avoid heavy cone intersections
  • extend in all space directions
  • scale capacity exponentially with dimension

Angular Cones in the Poincaré Ball

Satisfy four intuitive properties:

  • axial symmetry (depends on aperture angle \(\psi\)) \(S_x^\psi := \{ v: \angle(v,x) \le \psi(x) \}, \quad \frak S^\psi_x := \exp_x(S_x^\psi)\)
  • continuous cone aperture function \(\psi\)
  • rotation invariance (only depends on norm of x) $$\exists \tilde{\psi} \quad \text{s.t.} \quad \psi(x) = \tilde{\psi}(\Vert x\Vert)$$
\psi
ψ\psi

Angular Cones in the Poincaré Ball

Satisfy four intuitive properties:

  • (hardest) transitivity of nested angular cones \(\forall x,x' \in \mathbb D^n: \quad x' \in \frak S^{\psi(x)}_x \; \Longrightarrow \; \frak S^{\psi(x')}_{x'} \subseteq \frak S^{\psi(x)}_x\)
\psi
ψ\psi
\psi
ψ\psi

Angular Cones in the Poincaré Ball

Four properties \(\Longrightarrow\) closed form expression of \(\psi\)

Lemma 1: If transitivity holds, then $$\forall x \in \text{Dom}(\psi): \quad \psi(x) \leq \frac{\pi}{2}$$

 

Proof idea:

  • take any \(x' \in \partial \frak S^\psi_x\), the cone border
  • then \(\psi(x')\) cannot be \(> \frac{\pi}{2}\)
  • continuously move \(x'\) towards \(x\)

Angular Cones in the Poincaré Ball

Four properties \(\Longrightarrow\) closed form expression of \(\psi\)

Theorem 2: If transitivity holds, then the function $$h : [0,1) \cap \text{Dom}(\tilde{\psi}) \rightarrow \mathbb R_+, \quad h(r):= \frac{r}{1 - r^2} \sin(\tilde{\psi}(r))$$ is non-increasing.

 

 

Proof idea:

  • $$\forall x\in\mathbb D^n, \forall x' \in \partial \frak S^{\psi(x)}_x$$ transitivity and hyperbolic law of sines in \(\Delta\) Oxx': $$\Longrightarrow \sin(\psi(\|x'\|)) \sinh(\|x'\|_{\mathbb D}) \leq \sin(\psi(\|x\|)) \sinh(\|x\|_{\mathbb D})$$

Angular Cones in the Poincaré Ball

Four properties \(\Longrightarrow\) closed form expression of \(\psi\)

Lemma 1 and Theorem 2 \(\Longrightarrow\) $$\psi : \mathbb{D}^n \setminus \mathcal{B}^n(O,\epsilon) \rightarrow (0,\pi/2)$$ $$x \mapsto \arcsin ( K (1 - \|x\|^2)/\|x\| )$$

or, equivalently (more practically useful):

$$\frak S^{\psi(x)}_x =\bigg\{ y \in \mathbb{D}^n | \quad \Xi(x,y) \leq \arcsin \left( K \frac{1 - \|x\|^2}{\|x\|} \right) \bigg\}$$

where

$$\Xi(x,y)=\arccos \left( \frac{<x,y> (1 + \|x\|^2) - \|x\|^2 (1 + \|y\|^2)}{\|x\| \cdot \|x-y\| \sqrt{1 + \|x\|^2 \|y\|^2 - 2 <x,y>}} \right)$$

Angular Cones in the Poincaré Ball

Four properties \(\Longrightarrow\) closed form expression of \(\psi\)

Similar derivation can be done for Euclidean angular cones

Learning with Entailment Cones

  • Learning from positive and negative pairs of nodes in a DAG
  • loss: $$\mathcal{L}=\sum_{(u,v)\in P} E(u,v)+\sum_{(u',v')\in N}\max(0,\gamma-E(u',v'))$$
  • penalty: \(E(u,v):=\max(0,\Xi(u,v)-\psi(u)),\)

Riemannian Optimization

  • SGD becomes: $$u\leftarrow\exp_u(-\eta\nabla^R_u\mathcal{L}), \quad u \in \mathbb{D}$$
  • Riemannian gradient: $$\nabla^R_u\mathcal{L}=(1/\lambda_u)^2\nabla_u\mathcal{L}, \quad \text{conformal factor } \lambda_u=\frac{2}{1-\|u\|^2}$$
  • fast and efficient approximation exists for the hyperbolic space (i.e. via retractions)

Experiments - Qualitative

  • Left deck: uniform tree of depth 7 and branching factor 3
  • Right deck: WordNet mammal subtree - 4230 edges, 1165 nodes
  • Each left: Poincare embeddings
  • Each right: our hyperbolic cones

Quantitative Experiments - Link Prediction in DAGs

  • Transitive closure of WordNet hierarchy: 82K nodes, 660K edges

Thank you!