Hyperbolic Neural Networks

 

If you’re new to hyperbolic geometry, we recommend that you read our Simple Geometry Initiation.

 

This blogpost presents our work Hyperbolic Neural Networks (arxiv paper, code) (accepted to NIPS’18 with a spotlight presentation).

 

The aim of this work is to make deep learning feasible in hyperbolic space, more specifically in the Poincaré ball.

 

In particular, we’ll explain how to:

  • Perform matrix-vector multiplications, apply non-linearities
  • Adapt certain neural network architectures (basic RNN, GRU, …)
  • Adapt softmax/multinomial logistic regression

 

 

Starter: basic definitions and Möbius operations

 

Remember that in the Poincaré ball model of hyperbolic space, the set we consider is the open ball

\mathbb{D}^n = \lbrace x\in\mathbb{R}^n,\ \parallel x\parallel_2 < 1 \rbrace,

and we measure distances between two points x,y\in\mathbb{D}^n via

d(x,y)=\cosh^{-1}\left(1+\frac{1}{2}\lambda_x\lambda_y\parallel x-y\parallel_2^2\right),

where \cosh(u)=(e^u+e^{-u})/2 is the even part of e^u and \lambda_x=2/(1-\parallel x\parallel_2^2) is called the conformal factor.
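
To make this concrete, here is a minimal numpy sketch of the Poincaré distance (the function names are ours, not from the paper’s code):

    import numpy as np

    def conformal_factor(x):
        # lambda_x = 2 / (1 - ||x||^2)
        return 2.0 / (1.0 - np.dot(x, x))

    def poincare_distance(x, y):
        # d(x, y) = arcosh( 1 + (1/2) * lambda_x * lambda_y * ||x - y||^2 )
        sq_dist = np.dot(x - y, x - y)
        return np.arccosh(1.0 + 0.5 * conformal_factor(x) * conformal_factor(y) * sq_dist)

    x = np.array([0.1, 0.2])     # two points well inside the unit ball
    y = np.array([-0.3, 0.4])
    print(poincare_distance(x, y))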

 

With this definition in hand, you may wonder: can we define basic operations, such as adding two elements or rescaling one by a real number?

 

These have indeed already been defined, and are called Möbius operations.

 

The Möbius addition of y to x is defined by

x\oplus y := \frac{(1+2\langle x,y\rangle+\parallel y\parallel_2^2)x+(1-\parallel x\parallel_2^2)y}{1+2\langle x,y\rangle+\parallel x\parallel_2^2\parallel y\parallel_2^2}

 

Although this operation is neither commutative nor associative, it satisfies some basic desirable properties:

  • The zero vector is still the neutral element: 0\oplus x = x\oplus 0 = x
  • Every element has an inverse, which is simply the opposite: (-x)\oplus x = x\oplus (-x) = 0
  • We also have a left-cancellation law: (-x)\oplus(x\oplus y)=y
  • The border of the ball represents infinity: if \parallel x\parallel_2=1 then x\oplus y=x
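
As a sanity check, here is a minimal numpy sketch of Möbius addition (our own helper names, not the official code), verifying numerically that 0 is neutral and that the left-cancellation law holds:

    import numpy as np

    def mobius_add(x, y):
        # ( (1 + 2<x,y> + ||y||^2) x + (1 - ||x||^2) y ) / ( 1 + 2<x,y> + ||x||^2 ||y||^2 )
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    x = np.array([0.2, -0.1])
    y = np.array([0.4, 0.3])
    print(np.allclose(mobius_add(np.zeros(2), x), x))        # True: 0 is the neutral element
    print(np.allclose(mobius_add(-x, mobius_add(x, y)), y))  # True: left-cancellation law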

 

The Möbius scalar multiplication of x by r\in\mathbb R is defined by

r\otimes x = \tanh(r\tanh^{-1}(\parallel x\parallel_2))\frac{x}{\parallel x\parallel}

Likewise, this operation satisfies a few desirable properties:

  • Associativity: r\otimes (s \otimes x) = (rs)\otimes x
  • n-additions: n\otimes x = x\oplus \cdots \oplus x (n times)
  • Scalar distributivity: (r+s)\otimes x=(r\otimes x)\oplus(s\otimes x)
  • Scaling property: r\otimes x/\parallel r\otimes x\parallel = x/\parallel x\parallel
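
Again a minimal numpy sketch (our own helper, not the official code), checking scalar associativity numerically:

    import numpy as np

    def mobius_scalar_mul(r, x):
        # r (x) x = tanh( r * artanh(||x||) ) * x / ||x||
        nx = np.linalg.norm(x)
        if nx < 1e-15:
            return np.zeros_like(x)
        return np.tanh(r * np.arctanh(nx)) * x / nx

    x = np.array([0.3, -0.2])
    r, s = 0.7, -1.8
    lhs = mobius_scalar_mul(r, mobius_scalar_mul(s, x))
    rhs = mobius_scalar_mul(r * s, x)
    print(np.allclose(lhs, rhs))   # True: associativity r (x) (s (x) x) = (rs) (x) x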

 

Another reason why these definitions make sense is that they provide a simple formula for geodesics.

 

The geodesic between two points x,y\in\mathbb{D}^n is the shortest path connecting them, and is given by

\forall t\in[0,1],\ \ \ \ \gamma_{x\to y}(t)=x\oplus(t\otimes ((-x)\oplus y))

 

Similarly, in Euclidean space, the shortest path between two points x,y\in\mathbb{R}^n is given by

\forall t\in[0,1],\ \ \ \gamma_{x\to y}(t)=x+(t\times((-x)+y))=(1-t)x+ty

 

These formulas look very similar, which argues in favor of these definitions of addition and scalar multiplication in hyperbolic space.
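
Putting the two Möbius operations together, we can sanity-check the geodesic formula numerically: \gamma_{x\to y}(0) should return x and \gamma_{x\to y}(1) should return y. A sketch with our own helper names:

    import numpy as np

    def mobius_add(x, y):
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    def mobius_scalar_mul(r, x):
        nx = np.linalg.norm(x)
        if nx < 1e-15:
            return np.zeros_like(x)
        return np.tanh(r * np.arctanh(nx)) * x / nx

    def geodesic(x, y, t):
        # gamma_{x -> y}(t) = x (+) ( t (x) ((-x) (+) y) )
        return mobius_add(x, mobius_scalar_mul(t, mobius_add(-x, y)))

    x = np.array([0.1, 0.5])
    y = np.array([-0.4, 0.2])
    print(np.allclose(geodesic(x, y, 0.0), x))   # True: the path starts at x
    print(np.allclose(geodesic(x, y, 1.0), y))   # True: the path ends at y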

 

Matrix-vector multiplication in hyperbolic space

 

Our first contribution is the observation that we can re-write the mysterious formula of Möbius scalar multiplication in a more intuitive manner, using two well-known tools from differential geometry: the exponential map and the logarithmic map.

A rigorous definition is outside of the scope of this blogpost, but let’s give a quick intuitive explanation.

A manifold \mathcal M of dimension n is a sort of n-dimensional surface. At each point x\in\mathcal M, we can define its tangent space T_x\mathcal M, an n-dimensional vector space, which is a local, first order approximation of the manifold around x.

Then, if you take a vector v, tangent to \mathcal M at x, and want to move inside the manifold in the direction of v, you have to use a map called the exponential map at x:

\exp_x\ :\ \ \ \ \ T_x\mathcal M\to\mathcal M.

It kind of “folds” the tangent space onto the manifold, locally (see Figure 1).

Figure 1: Illustration showing how the exp map takes points from the tangent space to the manifold. Here, the manifold is the 2D sphere, hence the tangent space is a 2D plane.

 

Now the log map is just its inverse:

\log_x\ :\ \ \ \ \ \mathcal{M}\to T_x\mathcal M.

 

 

We get the following formula:

r\otimes x = \exp_0(r\log_0(x)).

 

This means that the Möbius scalar multiplication of x by r corresponds to

  1. Moving x to the tangent space at 0 of the Poincaré ball using \log_0,
  2. Multiplying this vector by r, since \log_0(x) now lives in the vector space T_0\mathbb{D}^n=\mathbb{R}^n,
  3. Projecting it back onto the manifold using the exp map at 0.

 

This identity can be easily verified by replacing the exp and log maps by their closed-form expressions in the Poincaré ball:

\exp_x(v)=x\oplus\left(\tanh\left(\frac{1}{2}\lambda_x\parallel v\parallel\right)\frac{v}{\parallel v\parallel}\right)

\log_x(y) = \frac{2}{\lambda_x}\tanh^{-1}\left(\parallel (-x)\oplus y\parallel\right) \frac{(-x)\oplus y}{\parallel(-x)\oplus y\parallel}

Note that when x=0, they become much simpler:

\exp_0(v)=\tanh\left(\parallel v\parallel\right)\frac{v}{\parallel v\parallel}

\log_0(v)=\tanh^{-1}\left(\parallel v\parallel\right)\frac{v}{\parallel v\parallel}
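
To make these maps concrete, here is a minimal numpy sketch of \exp_x and \log_x (our own helper names), checking numerically that \log_x inverts \exp_x:

    import numpy as np

    def mobius_add(x, y):
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    def exp_map(x, v):
        # exp_x(v) = x (+) ( tanh( lambda_x ||v|| / 2 ) * v / ||v|| )
        lam = 2.0 / (1.0 - np.dot(x, x))
        nv = np.linalg.norm(v)
        return mobius_add(x, np.tanh(0.5 * lam * nv) * v / nv)

    def log_map(x, y):
        # log_x(y) = (2 / lambda_x) * artanh(||u||) * u / ||u||,  with u = (-x) (+) y
        lam = 2.0 / (1.0 - np.dot(x, x))
        u = mobius_add(-x, y)
        nu = np.linalg.norm(u)
        return (2.0 / lam) * np.arctanh(nu) * u / nu

    x = np.array([0.2, 0.1])             # a base point in the ball
    v = np.array([0.5, -0.3])            # a tangent vector at x
    print(np.allclose(log_map(x, exp_map(x, v)), v))   # True: log_x inverts exp_x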

 

We propose to define matrix-vector multiplications in the Poincaré ball in a similar manner:

M\otimes x\ := \exp_0(M\log_0(x)).

 

It is then easily checked that this operation still satisfies the usual desirable properties:

  • Matrix associativity: M\otimes (N\otimes x)\ = (MN)\otimes x
  • Compatibility with scalar multiplication: M\otimes (r\otimes x)\ = (rM)\otimes x = r\otimes(M\otimes x)
  • Directions are preserved: M\otimes x/\parallel M\otimes x\parallel = Mx/\parallel Mx\parallel for Mx\neq 0
  • Rotations are preserved: M\otimes x= Mx for M\in\mathcal{O}_n(\mathbb{R})
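
Below is a minimal numpy sketch of Möbius matrix-vector multiplication (our own helper names), checking the rotation property numerically:

    import numpy as np

    def exp0(v):
        nv = np.linalg.norm(v)
        return np.tanh(nv) * v / nv if nv > 1e-15 else np.zeros_like(v)

    def log0(x):
        nx = np.linalg.norm(x)
        return np.arctanh(nx) * x / nx if nx > 1e-15 else np.zeros_like(x)

    def mobius_matvec(M, x):
        # M (x) x := exp_0( M log_0(x) )
        return exp0(M @ log0(x))

    theta = 0.3
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])    # a rotation matrix
    x = np.array([0.4, -0.2])
    print(np.allclose(mobius_matvec(R, x), R @ x))     # True: rotations act linearly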

 

 

Applying a non-linearity

 

The previous definitions naturally suggest applying a non-linearity \varphi in the same manner:

  1. Moving x to the tangent space at 0 of the Poincaré ball using \log_0,
  2. Applying the non-linearity to \log_0(x), since it now lives in the vector space T_0\mathbb{D}^n=\mathbb{R}^n,
  3. Projecting it back onto the manifold using the exp map at 0.

 

 

We get

\varphi^{\otimes}(x)\ :=\exp_0(\varphi(\log_0(x))).

 

Note that since addition and multiplication are already non-linear, you might not need any additional non-linearity. Give it a try. In our experiments, we obtained better results by removing it.
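
If you do keep a non-linearity, applying it this way is just one more \exp_0 / \log_0 sandwich. A minimal numpy sketch (our own helper names, not the official code):

    import numpy as np

    def exp0(v):
        nv = np.linalg.norm(v)
        return np.tanh(nv) * v / nv if nv > 1e-15 else np.zeros_like(v)

    def log0(x):
        nx = np.linalg.norm(x)
        return np.arctanh(nx) * x / nx if nx > 1e-15 else np.zeros_like(x)

    def mobius_apply(phi, x):
        # phi^{(x)}(x) := exp_0( phi( log_0(x) ) )
        return exp0(phi(log0(x)))

    x = np.array([0.6, -0.3])
    print(mobius_apply(np.tanh, x))                     # Mobius tanh
    print(mobius_apply(lambda v: np.maximum(v, 0), x))  # Mobius ReLU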

 

Bias translation

 

How should we “translate” a point x\in\mathbb{D}^n by another one b\in\mathbb{D}^n?

 

Since we already have a notion of addition, we can simply take

x\oplus b

 

However, if you’re already familiar with basic differential geometry, you could come up with another definition.

 

Indeed, another natural way to define a translation of x in the direction v\in T_x\mathbb{D}^n, would be to move x along the geodesic with initial speed vector v, using the exp map:

\exp_x(v)

 

So, which definition should we choose?

 

Good news: they are equivalent, thanks to the following formula:

x\oplus b= \exp_x(P_{0\to x}(\log_0(b))),

where P_{0\to x}\ :\ T_0\mathbb{D}^n\to T_x\mathbb{D}^n is the parallel transport (also outside the scope of this post). In short, parallel transport describes how to connect different tangent spaces in a canonical way.

 

As they are equivalent, simply use x\leftarrow x\oplus b as your bias translation.
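
Combining Möbius matrix-vector multiplication, bias translation and the Möbius non-linearity gives a hyperbolic analogue of a feed-forward layer. Below is a minimal numpy sketch with our own helper names; in practice the parameters would live in an autodiff framework and be trained as described in the paper:

    import numpy as np

    def mobius_add(x, y):
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    def exp0(v):
        nv = np.linalg.norm(v)
        return np.tanh(nv) * v / nv if nv > 1e-15 else np.zeros_like(v)

    def log0(x):
        nx = np.linalg.norm(x)
        return np.arctanh(nx) * x / nx if nx > 1e-15 else np.zeros_like(x)

    def hyperbolic_layer(M, b, x, phi=np.tanh):
        # x  ->  phi^{(x)}( (M (x) x) (+) b )
        h = exp0(M @ log0(x))        # Mobius matrix-vector multiplication
        h = mobius_add(h, b)         # bias translation
        return exp0(phi(log0(h)))    # Mobius non-linearity (optional, see above)

    M = np.array([[0.5, -0.2], [0.1, 0.8]])
    b = np.array([0.05, -0.1])       # bias: a point of the Poincare ball
    x = np.array([0.3, 0.4])
    print(hyperbolic_layer(M, b, x))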

 

RNN, GRU

 

It seems like we have all we need to define a simple RNN!

 

A naive RNN is usually defined with some pointwise non-linearity \varphi by the following update equation:

h_{t+1}=\varphi(W h_t + U x_t + b),

where W,\ U are matrices and b is a bias.

 

A natural adaptation using our formulas yields:

h_{t+1} = \varphi^{\otimes}\left(((W\otimes h_t)\oplus(U\otimes x_t))\oplus b\right).
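
A single step of this hyperbolic RNN can be sketched as follows (our own helper names, plain numpy rather than the official code; the input x_t is assumed to already live in the Poincaré ball, e.g. a hyperbolic word embedding):

    import numpy as np

    def mobius_add(x, y):
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    def exp0(v):
        nv = np.linalg.norm(v)
        return np.tanh(nv) * v / nv if nv > 1e-15 else np.zeros_like(v)

    def log0(x):
        nx = np.linalg.norm(x)
        return np.arctanh(nx) * x / nx if nx > 1e-15 else np.zeros_like(x)

    def mobius_matvec(M, x):
        return exp0(M @ log0(x))

    def hyp_rnn_step(W, U, b, h, x, phi=np.tanh):
        # h_{t+1} = phi^{(x)}( ((W (x) h_t) (+) (U (x) x_t)) (+) b )
        pre = mobius_add(mobius_add(mobius_matvec(W, h), mobius_matvec(U, x)), b)
        return exp0(phi(log0(pre)))

    rng = np.random.default_rng(0)
    W, U = 0.1 * rng.standard_normal((2, 2)), 0.1 * rng.standard_normal((2, 2))
    b = np.array([0.02, -0.01])
    h, x = np.zeros(2), np.array([0.3, -0.2])   # initial hidden state, one input point
    print(hyp_rnn_step(W, U, b, h, x))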

 

 

Now can we also adapt the GRU architecture?

 

Yes! Feel free to compare the formulas of the Euclidean and hyperbolic GRUs:

 

Euclidean GRU:

  • r_t = \sigma (W^r h_{t-1} + U^r x_t + b^r)
  • z_t = \sigma (W^z h_{t-1} + U^z x_t + b^z)
  • \tilde{h}_t = \varphi(W(r_t\odot h_{t-1})+U x_t +b)
  • h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde{h}_t

 

Hyperbolic GRU:

  • r_t = \sigma\left(\log_0\left(((W^r\otimes h_{t-1})\oplus(U^r\otimes x_t ))\oplus b^r\right)\right)
  • z_t = \sigma\left(\log_0\left(((W^z\otimes h_{t-1})\oplus(U^z\otimes x_t ))\oplus b^z\right)\right)
  • \tilde{h}_{t} = \varphi^{\otimes}\left((([W\mathrm{diag}(r_t)]\otimes h_{t-1})\oplus (U\otimes x_t))\oplus b \right)
  • h_t = h_{t-1} \oplus\left( \mathrm{diag}(z_t)\otimes((-h_{t-1})\oplus \tilde{h}_t)\right)

 

where \mathrm{diag}(u) is the square diagonal matrix with the u_i's on its diagonal.

For a complete justification of these equations, see Section 3.3 of the paper.
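
To make these equations concrete, here is a minimal numpy sketch of a single hyperbolic GRU step (our own helper names; a real implementation would live in an autodiff framework such as PyTorch or TensorFlow):

    import numpy as np

    def mobius_add(x, y):
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    def exp0(v):
        nv = np.linalg.norm(v)
        return np.tanh(nv) * v / nv if nv > 1e-15 else np.zeros_like(v)

    def log0(x):
        nx = np.linalg.norm(x)
        return np.arctanh(nx) * x / nx if nx > 1e-15 else np.zeros_like(x)

    def mobius_matvec(M, x):
        return exp0(M @ log0(x))

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def hyp_gru_step(Wr, Ur, br, Wz, Uz, bz, W, U, b, h, x, phi=np.tanh):
        # gates are read off in the tangent space at 0, hence the log_0 before sigma
        r = sigmoid(log0(mobius_add(mobius_add(mobius_matvec(Wr, h), mobius_matvec(Ur, x)), br)))
        z = sigmoid(log0(mobius_add(mobius_add(mobius_matvec(Wz, h), mobius_matvec(Uz, x)), bz)))
        # candidate state: the reset gate enters through W diag(r_t)
        h_tilde = exp0(phi(log0(mobius_add(mobius_add(
            mobius_matvec(W @ np.diag(r), h), mobius_matvec(U, x)), b))))
        # update: move from h_{t-1} towards the candidate, gated by z_t
        return mobius_add(h, mobius_matvec(np.diag(z), mobius_add(-h, h_tilde)))

    rng = np.random.default_rng(0)
    Wr, Ur, Wz, Uz, W, U = (0.1 * rng.standard_normal((2, 2)) for _ in range(6))
    br = bz = b = np.zeros(2)
    h, x = np.zeros(2), np.array([0.3, -0.2])   # hidden state at the origin, one input point
    print(hyp_gru_step(Wr, Ur, br, Wz, Uz, bz, W, U, b, h, x))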

 

Remember that Euclidean space has zero curvature and the Poincaré ball has curvature -1? Well, in our paper, we consider the general case of curvature -c for some c>0, and we show that, as c goes continuously from 1 to 0, our general hyperbolic GRU equations recover the standard GRU equations 🙂

 

In order to assess whether these models can actually be trained, we ran some simple experiments.

We evaluate our method on two tasks.
The first is natural language inference, or textual entailment: given two sentences, a premise (e.g. “Little kids A. and B. are playing soccer.”) and a hypothesis (e.g. “Two children are playing outdoors.”), the binary classification task is to predict whether the second sentence can be inferred from the first. This defines a partial order on the sentence space. We test hyperbolic networks on the largest real dataset for this task, SNLI.
We conjecture that the improvements of hyperbolic neural networks are more significant when the underlying data structure is closer to a tree. To test this, we design a proof-of-concept task of detecting noisy prefixes: given two sentences, one has to decide whether the second is a noisy prefix of the first or a random sentence. We build synthetic datasets PREFIX-Z% (for Z being 10, 30 or 50) as follows: for each random first sentence of length at most 20 and a random prefix of it, a positive second sentence is generated by randomly replacing Z% of the words of that prefix, and a negative second sentence of the same length is generated at random. The word vocabulary size is 100, and we generate 500K training, 10K validation and 10K test pairs.
Results are summarized in Table 1.

 

 

On these tasks, we found that hyperbolic RNNs either outperform or are on par with their Euclidean counterparts. However, more empirical investigation is needed.

 

Hyperbolic softmax

 

In order to generalize multinomial logistic regression (MLR, also called softmax regression), we proceed in 3 steps:

  1. We first reformulate softmax probabilities in terms of the distance to a margin hyperplane
  2. Second, we define hyperbolic hyperplanes
  3. Finally, by finding the closed-form formula of the distance between a point and a hyperbolic hyperplane, we derive the final formula.

 

Similarly to the hyperbolic GRU, we show that when the Poincaré ball is continuously flattened to Euclidean space (sending its radius to infinity), hyperbolic softmax converges to Euclidean softmax.

 

The hyperbolic hyperplane centered at p\in\mathbb{D}^n, with normal direction a\in T_p\mathbb{D}^n is given by

\lbrace x\in\mathbb{D}^n,\ \langle (-p)\oplus x, a\rangle=0\rbrace

See Figure 2 for an illustration.

Figure 2: Example of a 2D hyperbolic hyperplane in 3D Poincaré ball. The red point is p. The shown normal axis to the hyperplane through p is parallel to a.

 

 

Have a look at our final closed-form formula.

The hyperbolic softmax probabilities are given by

p(y=k\mid x)\ \propto\ \exp\left(\lambda_{p_k}\parallel a_k\parallel \sinh^{-1}\left(\frac{2\langle (-p_k)\oplus x,\ a_k\rangle}{(1-\parallel (-p_k)\oplus x\parallel^2)\parallel a_k\parallel}\right)\right)

with p_k\in\mathbb{D}^n and a_k\in T_{p_k}\mathbb{D}^n.

You should hence:

  1. optimize p_k with Riemannian optimization,
  2. use the parallel transport identity a_k = P_{0\to p_k}(a'_k) = (\lambda_0/\lambda_{p_k})a'_k = (1-\parallel p_k\parallel^2) a'_k where a'_k\in T_0\mathbb{D}^n is now independent of p_k and can be optimized as a Euclidean parameter.
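
To make this concrete, here is a minimal numpy sketch of the hyperbolic MLR probabilities for fixed parameters (our own helper names; the Riemannian optimization of p_k and the reparametrization of a_k above are left out, we simply evaluate the formula):

    import numpy as np

    def mobius_add(x, y):
        xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

    def hyperbolic_mlr_logit(p_k, a_k, x):
        # lambda_{p_k} ||a_k|| asinh( 2 <(-p_k)(+)x, a_k> / ((1 - ||(-p_k)(+)x||^2) ||a_k||) )
        lam = 2.0 / (1.0 - np.dot(p_k, p_k))
        u = mobius_add(-p_k, x)
        na = np.linalg.norm(a_k)
        return lam * na * np.arcsinh(2.0 * np.dot(u, a_k) / ((1.0 - np.dot(u, u)) * na))

    def hyperbolic_softmax(points, normals, x):
        logits = np.array([hyperbolic_mlr_logit(p, a, x) for p, a in zip(points, normals)])
        e = np.exp(logits - logits.max())        # standard softmax over the logits
        return e / e.sum()

    points = [np.array([0.1, 0.2]), np.array([-0.3, 0.1])]    # one p_k per class
    normals = [np.array([1.0, -0.5]), np.array([0.2, 0.8])]   # one a_k per class
    x = np.array([0.4, -0.1])
    print(hyperbolic_softmax(points, normals, x))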

 

Figure 3 illustrates a comparison between 2-class classification obtained respectively with hyperbolic and Euclidean softmax.

 

Figure 3: Left: hyperbolic margin hyperplane from hyperbolic softmax. Right: Euclidean margin hyperplane from Euclidean softmax.

 

In order to assess whether our hyperbolic adaptation of MLR would yield any practical improvements, we ran a few experiments.

 

Table 2 shows results for binary classification between a subtree of the WordNet hierarchy, and the rest of the vocabulary, after being embedded in the Poincaré ball.
