
Network of Networks — A Neural-Symbolic Approach to Inverse-Graphics
Michael Kissner

One of the first ideas many people have once they get acquainted with deep learning and neural
networks is, “what if we make a network of neural networks?”. That is a perfectly valid idea, one
that is explored in many different ways. The most common place one finds this kind of approach is
in automated machine learning (AutoML), where small mini-networks are recombined to
find a complete neural architecture that ideally suits some machine learning problem. What I want
to explore in this article is both very similar to AutoML, yet very different. My goal is to convey an
idea, rather than present formulas. If you are interested in the details, you can find them in our
paper “A Neural-Symbolic Architecture for Inverse Graphics Improved by Lifelong Meta-Learning”
(https://arxiv.org/abs/1905.08910).
So, what if we could design a network for computer vision where each node represents some object
and the connections indicate what objects are part of another object? What if this network
encounters a new object and simply adds it as a node to the network? It might look something like
this:

A Network that grows with each new object it sees, by connecting it to other objects that make up its parts.
Sure, in some sense our current convolutional neural networks (CNNs) do something very similar
internally. By retraining them, you’d probably be able to learn additional objects as well. But
why retrain everything, if most of the network isn’t doing anything new? And at what point do we
need to enlarge a neural network to fit more object types? The above image seems to indicate that
there might be an alternative approach.

Generative Grammar
Let’s take a step back. A large step back. Let’s look at the reverse scenario. What if we want to
generate an image based on a single word, such as [House]? Well, we would take a look at what
this [House] is made of. Probably a [Roof] and some [Ground Floor]. And what is the [Ground
Floor] made of? Multiple [Wall]s, a [Door] and [Window]s. Everything is made of something. This
process can be summarized in a generative grammar, and we call each individual thing we put in
brackets a symbol.

A partial parse-tree of a generative grammar for a [House].


Well, that looks awkwardly close to the network we had before, just the other way around… And
we’ll get to that. But first, we’ll explore how to get from [House] to a fully drawn picture of an actual
house. There is a set of extremely primitive graphical elements from which we can draw anything.
The most basic is obviously the pixel, but we won’t go that deep (although you could). Let’s take
something higher level and assume you can draw the entire world using nothing but [Square]s,
[Circle]s and [Triangle]s. What would our house look like then?

A [House] drawn from a starting symbol using triangles and squares.


Cool. Do we have control over this process? Sure! Each element can have attributes that control
how it’s drawn. A [House] has a position, size and rotation which govern the rendering process.
Think of them like the attributes an object has in a game engine. We might as well include more
complex attributes, such as how old or yellow it is, i.e., adjectives.
Further, a [House] can have multiple floors. We can thus decide if a [House] should be drawn
as [Roof][Ground Floor] or as [Roof][Floor][Ground Floor]. These different possibilities are called
the rules of a grammar.
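To make this concrete, here is a minimal sketch in Python of how such a grammar could be represented. The symbol names, attributes and rules are the ones from the example above, but the data structures are my own invention for illustration, not the paper’s:

```python
from dataclasses import dataclass, field

@dataclass
class Symbol:
    """A grammar symbol such as [House], together with its attributes."""
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. position, size, rotation

# Each symbol can expand via several rules; picking one is a choice the
# grammar makes (one floor vs. two floors, etc.).
RULES = {
    "House": [
        ["Roof", "Ground Floor"],           # one-storey house
        ["Roof", "Floor", "Ground Floor"],  # two-storey house
    ],
    "Ground Floor": [["Wall", "Wall", "Door", "Window"]],
}

house = Symbol("House", {"position": (0.5, 0.5), "size": 0.4, "rotation": 0.0})
print(RULES[house.name][0])  # ['Roof', 'Ground Floor']
```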

Primitives
Let’s focus on a [Square] for now. It has attributes of its own, but we still need a way to draw it to
the screen. So, we introduce a rendering function that does this. This is quite easy for the primitives
we chose. We’ll refer to this rendering function as the decoder, as it basically decodes the
attributes into actual pixels.
Obviously, each symbol higher up in the hierarchy also needs some sort of decoder that converts its
own attributes into valid ones for those that follow. For example, the decoder for the rule [House] →
[Roof][Ground Floor] would take the house’s position located at the center and calculate the center
positions for the roof and ground floor.
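As a rough sketch (the half-and-half layout is invented for illustration, not taken from the paper), such a decoder for [House] → [Roof][Ground Floor] might look like this:

```python
def decode_house(house_attrs):
    """Decoder for the rule [House] -> [Roof][Ground Floor]: turns the
    house's attributes into valid attributes for its two parts."""
    x, y = house_attrs["position"]  # the house's position is its center
    size = house_attrs["size"]
    roof = {"position": (x, y + size / 4), "size": size / 2}          # upper half
    ground_floor = {"position": (x, y - size / 4), "size": size / 2}  # lower half
    return roof, ground_floor

roof, ground_floor = decode_house({"position": (0.5, 0.5), "size": 0.4})
```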

Very nice. We can draw images. The exact opposite of what we actually wanted to do… So, let’s flip
the entire thing around.
“Flipping” the Parse-Tree around.
But we must flip a lot more things than just the connections. First, we must invert the decoder, so
that it becomes an encoder.

Primitive Capsules
The encoder of a [Triangle] takes in the pixel values and tries to find the valid attributes for that
triangle. Easy enough. Let’s just take some regression model and plug that in. We can even
generate all the training data ourselves, as we know how to draw the triangle using the attributes.
But that’s not good enough. What if it’s shown a picture of a circle instead of a triangle? The
regression model doesn’t care, it will just produce weird attributes. We need a way to verify that the
image is truly a triangle.
How about, we take the attributes the regression model produced, render a triangle with it and
check if those images agree. If it was a triangle in the first place, we should have agreement,
otherwise not!
So, we are encoding and then decoding… That’s just an auto-encoder! Well, not entirely, as we
hand-crafted the decoder. But still… We then just add a small agreement function to it, which
gives us a probability of how likely it is that the image really is a triangle.
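Putting the pieces together, a primitive capsule boils down to roughly the following. Here `regression_net` and `render_triangle` stand in for the trained regression model and the hand-crafted renderer, and the Gaussian agreement function is just one plausible choice, not the paper’s exact formula:

```python
import numpy as np

def triangle_capsule(image, regression_net, render_triangle, sigma=0.1):
    """Primitive capsule: encode the image, re-render it, measure agreement."""
    attrs = regression_net(image)            # encoder: pixels -> attributes
    reconstruction = render_triangle(attrs)  # hand-crafted decoder: attributes -> pixels
    # Agreement function: how close is the re-rendered triangle to the input?
    distance = np.mean((image - reconstruction) ** 2)
    probability = float(np.exp(-distance / (2 * sigma**2)))  # 1.0 = perfect agreement
    return attrs, probability
```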

A look at the internals of a primitive capsule. An example of agreement (a [Triangle] capsule shown a triangle).

We now have both the attributes and the probability from the agreement. Neat. Let’s call this
inverted symbol with all its internals (encoder, decoder, agreement function, …) a capsule. More
specifically a primitive capsule, as it represents the graphical primitives. These primitive
capsules follow the simple rule: If we can render it, we can de-render it…at least into
something that produces the same image.
An example of disagreement (a [Triangle] capsule shown a circle).

Semantic Capsules
For the next layers of capsules, things are a bit more complicated. These take in the primitives from
the primitive capsules and check if they are the correct parts for the object.

A Semantic Capsule ([House]) activating based on the output of the Primitive Capsules ([Triangle] and [Square]).

Unlike with the primitives, we don’t have a hand-crafted decoder here, so we can’t generate
training data for an encoder that takes the outputs of the primitive capsules and tells us
the attributes of an object… What if we pretend we know what the
encoder looks like? Seems strange, right… But consider our [House], which is made up of
a [Triangle] for the roof and a [Square] representing the ground floor. During a forward pass, we
know all attributes of the square and triangle. The size of the entire house can then just be taken as
the sum of the sizes of its parts. For the position and rotation of the house, we just take the mean of
both the square and the triangle. The same goes for all other attributes, such as color or age,
where we can just take the mean: an old ground floor (1.0) with a new roof (0.0) averages out to a
middle-aged house (0.5). This is the rough idea: we assume, at least at the beginning, that our
encoder is similar to a mean function.
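A minimal sketch of that assumed encoder, with sizes summed and everything else averaged; which attributes are summed versus averaged follows the example above, and the attribute names are illustrative:

```python
import numpy as np

def mean_encoder(part_attrs):
    """Assumed encoder of a semantic capsule: attributes of the parts in,
    attributes of the whole out. part_attrs holds one dict per part."""
    whole = {"size": sum(p["size"] for p in part_attrs)}  # sizes add up
    for key in ("position", "rotation", "old", "yellow"):
        whole[key] = np.mean([p[key] for p in part_attrs], axis=0)  # everything else: mean
    return whole

roof = {"size": 0.2, "position": (0.5, 0.6), "rotation": 0.0, "old": 0.0, "yellow": 1.0}
floor = {"size": 0.2, "position": (0.5, 0.4), "rotation": 0.0, "old": 1.0, "yellow": 1.0}
house = mean_encoder([roof, floor])  # size 0.4, old 0.5: a middle-aged house
```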

Obviously, with such a general mean function, any configuration of [Triangle] and [Square] would
make a valid [House]. We don’t want that.
Let’s again create an encoder-decoder pair with an agreement function. This time, we need to train
the decoder instead of the encoder, but we’ll train it on real houses. Now, each time a configuration
of squares and triangles is passed to the [House] capsule, it encodes it and then attempts to recreate
the house from those attributes. If the result somewhat agrees with the original, then it’s a house.
Otherwise it’s not.
A look at the internals of a semantic capsule.

Let’s call this type of capsule that checks part configurations, i.e., the agreement of semantics,
a semantic capsule.

Routing
Now, I’ve mentioned that the symbols can have multiple different rules on how they are produced
(A house with one or two floors, etc.). We need some way to do that with our capsules. What if we
allow each capsule to have multiple such encoder-decoder pairs and check which one fits the best
for the current image? Yeah, let’s do that! And let’s call each such possible pair a route. We’ll refer
to the entire process as routing-by-agreement.
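A rough sketch of routing-by-agreement, assuming an `agreement` helper that returns a probability as before: every route is tried, and the capsule keeps whichever one explains the input best.

```python
def route_by_agreement(inputs, routes, agreement):
    """Try every route (an encoder-decoder pair) and keep the one whose
    reconstruction agrees best with the input."""
    best_attrs, best_prob = None, 0.0
    for encoder, decoder in routes:
        attrs = encoder(inputs)
        reconstruction = decoder(attrs)
        prob = agreement(inputs, reconstruction)
        if prob > best_prob:
            best_attrs, best_prob = attrs, prob
    return best_attrs, best_prob  # attributes and activation probability
```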

Detailed look at the internals of a Capsule (Primitive and Semantic). Note the different routes highlighted in yellow.
(And here is why we call these things capsules: the idea of matching the inputs and finding the
correct route is based on the capsule architecture introduced by Sabour, Frosst and Hinton. The
capsules presented here, however, are a bit different, and we refer to them as neural-symbolic
capsules to not get things confused.)
Each of the presented Capsules is essentially a small container for neural networks.

Lifelong Meta-Learning
Next, you might have noticed that each of the capsules presented here is independent of every other
capsule. And you are right! Training one capsule has no effect on the others. We can add and move
around capsules as we please (checking that the attributes match of course). How about we let the
entire network grow with each new object it encounters!
First we need to find a way to determine when the network actually encounters something it does
not understand. As it is based on a grammar, this is quite easy. Every time the individual capsules
activate and detect an object, there is some “root” object, i.e., one object that represents the entire
image. In grammar terms, this is called an axiom. In our capsule network, this “root” does not
need to be at the highest layer. If the image just contains a [Roof] the [House] capsule won’t
activate, but [Roof] will still be the “root” that fully represents the image. We’ll refer to
this observed “root” as an axiom as well.
However, multiple such axioms can activate in a single image. Take a [House] and
a [Garage] which have both activated but do not share a common parent. We thus seem to have two
axioms, which we don’t allow. If this happens, it means we encountered a new object or scene to
which all these axioms are mere parts, such as [House] and [Garage] are parts of an [Estate].
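Detecting this situation is a simple check over the activated capsules. A sketch, under the assumption that we know each capsule’s possible parents:

```python
def find_axioms(activations, parents):
    """Return every activated capsule that has no activated parent.
    activations: set of capsule names; parents: name -> set of possible parents."""
    return [c for c in activations if not (parents.get(c, set()) & activations)]

# [House] and [Garage] both activated, neither has an activated parent:
axioms = find_axioms({"House", "Garage"}, {"House": set(), "Garage": set()})
if len(axioms) > 1:  # two axioms -> a shared parent capsule (e.g. [Estate]) is missing
    print("New parent capsule needed for:", sorted(axioms))
```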
Consider a different example and let’s again take our network that has learned how to
detect [House]. It knows how to deal with a ground floor. But what if there are more? What if there
are five additional [Floor] activations? Then the [House] capsule won’t activate. Sure, it can
generate some attributes for the ground floor plus one of the five floors, but its agreement function
will notice that the roof is way out of place. We must add a new capsule that acts as the new axiom
for these activated capsules, such as an [Apartment] capsule.

Slightly different example of capsules activating (blue), but there is a shared axiom missing (left). This is rectified by adding a
new capsule that acts as the parent to all previously dangling capsules (right).
Note, however, that we do not stop at distinct objects, but also describe scenes. [Office], [Singapore
Orchard Road], [Baseball Game], all are perfectly valid. Imagine a picture with 10 houses. We
would have 10 [House] activations. This, again, violates the “only one axiom” rule we have. So, we
create one called [Town Road Scene].
Of course, this decision requires some creativity. Coming up with names for new capsules or
attributes is still a human task. What the capsule network can do, however, is pose a question to
the human, whose answer then triggers meta-learning. Like the following:
Q: “These floors and roof look like a house, but what is it?” ([House] did not activate even
though multiple [Floor]s and [Roof] were present)
Now, this can trigger different responses by the human. Each indicating a different meta-learning
procedure.
A.1: “It’s a House.” (Meta-learning trains a new route for the existing [House] capsule)
A.2: “It’s an Apartment.” (Meta-learning trains a new [Apartment] capsule)
B.1: “It’s an old House.” (Meta-learning continues training the existing attribute “old” with new
data for the [House] capsule)
B.2: “It’s a rich House.” (Meta-learning trains a new attribute “rich” for the existing [House]
capsule)
If we collect enough of these decisions by a human, we can train a decision matrix for the capsule
network. Then, once it has learned enough responses, the network can make these decisions on its
own and refine its original question or even answer it itself!

An excerpt of a decision matrix. Each of those features says something about the current activations in the network, the details of
which are unimportant at this point. For all features that evaluate to “True”, the values on the right are added up and the column
with the highest value is equal to the decision (A.1 — B.2). Training this matrix simply means adding “1” to the column that was
chosen by the human for the rows that evaluate to “True”.
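Following the caption’s description, training and applying such a decision matrix could look like this; the number of features and what they test are placeholders, not the paper’s actual feature set:

```python
import numpy as np

DECISIONS = ["A.1", "A.2", "B.1", "B.2"]
N_FEATURES = 5  # each feature is a True/False test on the current activations
matrix = np.zeros((N_FEATURES, len(DECISIONS)))

def decide(feature_values):
    """Sum the rows of all features that evaluated to True; the column
    with the highest total is the decision."""
    scores = matrix[np.asarray(feature_values)].sum(axis=0)
    return DECISIONS[int(np.argmax(scores))]

def train(feature_values, human_choice):
    """Add 1 to the human-chosen column for every row that evaluated to True."""
    col = DECISIONS.index(human_choice)
    for row, active in enumerate(feature_values):
        if active:
            matrix[row, col] += 1
```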
But what do we train these newly added capsules, routes or attributes with? With the data that it
just received? Sure, but that’s not enough now, is it? How about we augment that data! This is
actually quite easy in our case. We have access to all the attributes after all. Moving the
apartment around (adding to the position attributes of the apartment and the produced
floor/roof symbols) doesn’t change the fact that it’s an apartment, and we can generate huge
amounts of data doing just that. Neither does rotating or resizing change the apartment. We can
do this to almost all attributes. If each floor has an old or yellow attribute, we can shift those all at
once and augment the training set. For an old apartment we would assume that all floors and the
roof are old, but it would still remain an apartment.
Obviously we will make mistakes with this rough augmentation, but it’s a start. Better than nothing
considering we only have one image to go by. Single-shot is difficult, even when we are cheating.
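A sketch of this attribute-space augmentation: apply the same random shifts to the whole detected tree and keep the label. The shift ranges and attribute names are made up for illustration:

```python
import random

def augment(parts, n_copies=100):
    """Create shifted copies of one example: the same offsets are applied to
    the object and all the symbols it produced, so the label stays the same."""
    copies = []
    for _ in range(n_copies):
        dx, dy = random.uniform(-0.2, 0.2), random.uniform(-0.2, 0.2)
        d_old = random.uniform(-0.3, 0.3)  # shift "old" for all parts at once
        copies.append([
            dict(p,
                 position=(p["position"][0] + dx, p["position"][1] + dy),
                 old=min(1.0, max(0.0, p["old"] + d_old)))
            for p in parts
        ])
    return copies  # every copy is still the same object, e.g. an [Apartment]
```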
Now, next time the network sees an apartment, it will… maybe understand it. But we can always
add more training data as the network encounters the same type of image, again using our
augmentation strategy. This isn’t too intensive computationally, as we isolated each capsule and
don’t need to retrain the rest of the network. And, slowly, it will understand… This entire meta-
learning process is very much like teaching a toddler what objects are. And as with toddlers, it is
subjective and depends very much on what the parent teaches it and in what order. Take, for
example, the following two capsule networks trained in different ways, but reaching the same
conclusion:
Two capsule networks meant to train for an asteroids-like environment, but both end up with a different network configuration.
The difference is subtle, but as time goes on, those two networks will diverge further and further
from each other.
We now have an entire network that can detect various objects, learn to detect new ones over time
and even give us attributes for them.
But we can also do some other neat things! We can
- generate a semantic network of the scene! I.e., the parse-tree of the underlying grammar.
- re-use the capsule network and simply expand on it! No need to retrain the old stuff, allowing for some federated learning.
- use the capsule network in reverse (as the original grammar) and generate images with it! It’s just an inverse graphics engine under the hood.
- use the attributes to segment the image! We know all the sizes and positions.
- generate basic descriptions of the image (see the sketch after this list)! After all, a capsule’s symbol is just a noun ([House], …), an attribute is just a preposition (position, rotation, …), an adjective (old, yellow, …) or a verb (explored in a later part), and their magnitude can be interpreted as an adverb (0.0 = not, 1.0 = very, …).
- do simple style transfers! Using multiple rules/routes, a [Door] can be drawn as a square (abstract) or as an actual door (real), and this automatically transfers to any object that has a [Door] as its part.
- do physics! Inverse-simulation anyone?
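To illustrate the description idea from the list above, here is a toy sketch; the adjective list and the adverb thresholds are invented, not taken from the paper:

```python
def describe(symbol, attributes):
    """Turn a capsule's symbol and adjective attributes into a basic phrase."""
    words = []
    for adjective in ("old", "yellow"):  # adjective attributes assumed to exist
        magnitude = attributes.get(adjective, 0.0)
        if magnitude > 0.75:
            words.append("very " + adjective)  # magnitude near 1.0 reads as "very"
        elif magnitude > 0.25:
            words.append(adjective)
    return "a " + " ".join(words + [symbol.lower()])

print(describe("House", {"old": 0.9, "yellow": 0.1}))  # "a very old house"
```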

Conclusion
I’ve left out a lot of the details found in the paper, such as how the network and the grammar deal
with occlusion, how exactly the meta-learning agent expands on the network, the “observation
table” to avoid using multiple copies of the same capsule and much more. But I hope I was able to
convey a general idea.
In the next part, I want to explore how physics works using this approach
( https://arxiv.org/abs/1905.09891) and how the capsule network evolves from an inverse-
graphics-engine to an inverse-simulation-engine, hopefully one day ending with an inverse-game-
engine.
And, yes. Most of the results so far are far from impressive and it remains to be shown if this can
even be extended to real images instead of the MSPaint-level of inputs it currently works on.
Kosiorek et al. developed a different method based on similar principles called “Stacked Capsule
Autoencoders” ( https://arxiv.org/abs/1906.06818), which are definitely worth checking out if this
sort of thing interests you!
My main hope is that this slightly different approach was both entertaining and perhaps gave you
some ideas you could explore by incorporating more old-school symbolic methods.
