## Hyperbolic Neural Networks

If you’re new to hyperbolic geometry, we recommend that you read our Simple Geometry Initiation.

This blogpost presents our work Hyperbolic Neural Networks (arxiv paper, code, poster, video), accepted to NIPS’18 with a spotlight presentation.

The aim of this work is to make deep learning feasible in hyperbolic space, more specifically in the Poincaré ball.

In particular, we’ll explain how to :

• Perform matrix-vector multiplications, apply non-linearities
• Adapt certain neural network architectures (basic RNN, GRU, …)

### Starter: basic definitions and Möbius operations

Remember that in the Poincaré ball model of hyperbolic space, the set we consider is the open ball

and we measure distances between two points via

where is the even part of and is called the conformal factor.

With this definition in hand, you may wonder: can we define basic operations, such as adding two elements or rescaling one by a real number?

These have indeed already been defined, and are called Möbius operations.

The Möbius addition of to is defined by

Although this operation is not commutative nor associative, it satisfies some basic desirable

• The zero vector is still the neutral element:
• Every element has an inverse, which is simply the opposite:
• We also have a left-cancellation law:
• The border of the ball represents infinity: if then

The Möbius scalar multiplication of by is defined by

Likewise, this operation satisfies a few

• Associativity:
• Scalar distributivity:
• Scaling property:

Another reason why these definitions make sense, is because they provide simple formulas for

The geodesic between two points is the shortest path connecting them, and is given by

Similarly, in Euclidean space, the shortest path between two points is given by

These formulas look very similar, which pleads in favor of such definitions for addition and scalar multiplication in hyperbolic space.

### Matrix-vector multiplication in hyperbolic space

Our first contribution results in noticing that we can re-write the mysterious formula of Möbius scalar multiplication in a more intuitive manner, using two well-known tools in differential geometry called the

A rigorous definition is outside of the scope of this blogpost, but let’s give a quick intuitive explanation.

A manifold of dimension  is a sort of -dimensional surface. At each point , we can define its tangent space , an -dimensional vector space, which is a local, first order approximation of the manifold around .

Then, if you take a vector , tangent to at , and want to move inside the manifold in the direction of , you have to use a map called the exponential map at :

.

It kind of “folds” the tangent space onto the manifold, locally (see Figure 1).

Now the log map is just its inverse:

.

We get the following formula:

.

This means that the Möbius scalar multiplication of  by  corresponds to

1. Moving to the tangent space at 0 of the Poincaré ball using ,
2. Multiplying this vector by , since is now in the vector space ,
3. Projecting it back on the manifold using the exp map at 0.

This identity can be easily verified by replacing the exp and log maps by their

Note that when , they become much simpler:

We propose to define matrix-vector multiplications in the Poincaré ball in a similar manner:

.

It is then easily checked that this operation still satisfies the usual desirables

• Matrix associativity:
• Compatibility with scalar multiplication:
• Directions are preserved: for
• Rotations are preserved: for

### Applying a non-linearity

The previous definitions naturally suggest to apply non-linearities  in same manner:

1. Moving to the tangent space at 0 of the Poincaré ball using ,
2. Applying the non-linearity to , since it is now in the vector space ,
3. Projecting it back on the manifold using the exp map at 0.

We get

.

Note that since addition and multiplication are already non-linear, you might not need any additional non-linearity. Give it a try. In our experiments, we obtained better results by removing it.

### Bias translation

How should we “translate” a point by another one ?

Since we already have a notion of addition, we can simply take

However, if you’re already familiar with basic differential geometry, you could come up with

Indeed, another natural way to define a translation of in the direction , would be to move along the geodesic with initial speed vector , using the exp map:

So, which definition should we choose?

Good news: they are equivalent, thanks to the following formula:

,

where is the parallel transport (outside of the scope of this post). Quickly, parallel transport explains how to connect different tangent spaces in a somehow canonical way.

As they are equivalent, simply use as your bias translation.

### RNN, GRU

It seems like we have all we need to define a simple RNN!

A naive RNN is usually defined with some pointwise non-linearity  by the following update equation:

,

where are matrices and is a bias.

A natural adaptation using our formulas yields:

.

Now can we also adapt the GRU architecture?

Yes! Feel free to compare the formulas of the Euclidean and hyperbolic GRUs:

where is the square diagonal matrix with the ‘s on its diagonal.

For a complete justification of these equations, see Section 3.3 of the paper.

Remember that Euclidean space has zero curvature and the Poincaré ball has curvature ? Well, in our paper, we consider the general case of curvature for some , and we show that continuously moving from  to , our general h-GRU equations give back the standard GRU equations 🙂

In order to assess whether these models can actually be trained, we ran some simple

We evaluate our method on two tasks.
The first is natural language inference, or textual entailment. Given two sentences, a premise (e.g. “Little kids A. and B. are playing soccer.”) and a hypothesis (e.g. “Two children are playing outdoors.”), the binary classification task is to predict whether the second sentence can be inferred from the first one. This defines a partial order in the sentence space. We test hyperbolic networks on the biggest real dataset for this task: SNLI.
We conjecture that the improvements of hyperbolic neural networks are more significant when the underlying data structure is closer to a tree. To test this, we design a proof-of-concept task of detection of noisy prefixes, i.e. given two sentences, one has to decide if the second sentence is a noisy prefix of the first, or a random sentence. We thus build synthetic datasets PREFIX-Z% (for Z being 10, 30 or 50) as follows: for each random first sentence of random length at most 20 and one random prefix of it, a second positive sentence is generated by randomly replacing Z% of the words of the prefix, and a second negative sentence of same length is randomly generated. Word vocabulary size is 100, and we generate 500K training, 10K validation and 10K test pairs.
Results are summarized in Table 1.

On these tasks, we found that hyperbolic RNN either outperform or are on par with their Euclidean counterparts. However, more empirical investigation is needed.

### Hyperbolic softmax

In order to generalize multinomial logistic regression (MLR, also called softmax regression), we proceed in 3 steps:

1. We first reformulate softmax probabilities with a distance to a margin hyperplane
2. Second, we define hyperbolic hyperplanes
3. Finally, by finding the closed-form formula of the distance between a point and a hyperbolic hyperplane, we derive the final formula.

Similarly as for the hyperbolic GRU, we show that when the Poincaré ball in continuously flattened to Euclidean space (sending its radius to infinity), hyperbolic softmax converges to Euclidean softmax.

The hyperbolic hyperplane centered at , with normal direction is given by

See Figure 2 for an illustration.

Have a look at our final closed formula for the

The hyperbolic softmax probabilities are given by

with and .

You should hence:

1. optimize with Riemannian optimization,
2. use the parallel transport identity where is now independent of and can be optimized as a Euclidean parameter.

Figure 3 illustrates a comparison between 2-class classification obtained respectively with hyperbolic and Euclidean softmax.

In order to assess whether our hyperbolic adaptation of MLR would yield any practical improvements, we ran a few

Table 2 shows results for binary classification between a subtree of the WordNet hierarchy, and the rest of the vocabulary, after being embedded in the Poincaré ball.

## 7 thoughts on “Hyperbolic Neural Networks”

1. Jibril says:

Hello,

I have been reading your paper Hyperbolic Neural Networks with a lot of interest and I would like to reproduce the results from Table 2 (hyperbolic MLR classification on Wordnet noun Hierarchy)

I visited your github page but could only find the code to reproduce the results from Table 1.

Could you provide the code to reproduce your results on Wordnet ?

Thanks

1. Hyperbolic Deep Learning says:

Thanks for your interest! We will release the code soon and come back to you 😉

2. Kangkang Li says:

I have been reading your paper Hyperbolic Neural Networks.It is a good job. but it is hard for me to understand the paper.I know the website is help for student like me to understand your work. But all the formulas of the page are messy code .Could the staff of the website repair it? Thank you.

3. Haithem says:

A very interesting discovery. It will open doors for new trends in DL.

4. Shawn Zhou says:

Can you guys clarify my confusing about your recent two publications? In one of them, titled as “Learning continuous hierarchies in the Lorentz model of hyperbolic geometry”, it was argued that Lorentz model is better than Poincare model. Yet in the second one, titled as “Hyperbolic neural networks”, which is also mainly summarized in this blog, you still use Poincare model. Which one do you suggest me to use? Thanks!

1. Hyperbolic Deep Learning says:

The ICML’18 paper using the Lorentz model is not from us, but from Nickel and Kiela.
Apparently, the Lorentz model has been observed to lead to better numerical stability empirically.
However, it is still unclear to us why this is the case.
We used the Poincare model because we needed to use Ungar’s formalism of Mobius operations, but you can probably re-derive these formula in any model.