The Holographic Principle: Why Deep Learning Works

Source: Intuition Machine

Carlos E. Perez

What I want to talk to you about today is the Holographic Principle and how it provides an explanation for Deep Learning. The Holographic Principle is a theory (see: Thin Sheet of Reality) that explains how quantum theory and gravity interact to construct the reality that we are in. The motivation for this theory comes from the paradox that Hawking created when he theorized that black holes would radiate energy. The fundamental principle that Hawking's theory violated was that information cannot be destroyed. As a consequence of this paradox, through several decades of research and experimentation, physicists have brought forth a unified theory of the universe that is based on information-theoretic principles. The entire universe is a projection of a hologram. It is entirely fascinating that the arrow of time and the existence of gravity are but mere manifestations of information entanglement!
Now, you may be mistaken to think that this Holographic Principle is just some fringe idea from physics. It appears at first read to be quite a wild idea! Apparently, though, the theory rests on very solid experimental and theoretical underpinnings. Let's just say that Stephen Hawking, who first remarked that it was 'rubbish', has finally agreed to its conclusions. So at this time, it should be relatively safe to start deriving some additional theories from this principle.
One surprising consequence of this theory is that the hologram is able to capture the dynamics of a universe that has on the order of d^N degrees of freedom (where d is the dimension and N is the number of particles). One would think that the hologram would have to be of equal size, but it is not. It is a surface area, and it scales only as N². This raises the question: how is a structure of order N² able to capture the dynamics of a system of order d^N?
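To get a feel for the gap between those two scales, here is a minimal back-of-the-envelope sketch in Python (the local dimension and particle counts are made-up illustrative values, not numbers from any of the cited papers):

```python
# Rough illustration of the counting argument: the number of "volume"
# degrees of freedom grows exponentially in the particle count N, while a
# boundary-style description grows only polynomially.

d = 2  # local dimension per particle (illustrative value)
for N in [10, 20, 40, 80]:
    bulk_dof = d ** N        # exponential scaling, d^N
    boundary_dof = N ** 2    # polynomial scaling, N^2
    print(f"N={N:3d}  d^N={bulk_dof:.3e}  N^2={boundary_dof}")
```

Even for these toy numbers the exponential side dwarfs the quadratic one almost immediately, which is what makes the holographic compression so striking.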
In the meantime, Deep Learning (DL) coincidentally has a similar mapping problem. Researchers don't know how it is possible for DL to perform so impressively well considering that the problem domain's search space has an exceedingly high dimension. So Max Tegmark and Henry Lin of Harvard have volunteered their own explanation, "Why does deep and cheap learning work so well?" In their paper they argue the following:
… although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can be approximated through “cheap learning” with exponentially fewer parameters than generic ones, because they have simplifying properties tracing back to the laws of physics. The exceptional simplicity of physics-based functions hinges on properties such as symmetry, locality, compositionality and polynomial log-probability, and we explore how these properties translate into exceptionally simple neural networks approximating both natural phenomena such as images and abstract representations thereof such as drawings.
The authors bring up several promising ideas, like the "no-flattening theorems", as well as the use of information theory and the renormalization group as explanations for their conjecture. I, however, was not sufficiently convinced by their argument. The argument assumes that all problem data follows 'natural laws', but as we all know, DL can be effective in unnatural domains: identifying cars, driving, creating music, and playing Go are obvious examples of clearly unnatural domains. To be fair, I think they were definitely on to something, and that something I discuss in more detail below.
In this article, I make a bold proposal with an argument that is somewhat analogous to what Tegmark and Lin proposed: Deep Learning works so well because of physics. However, the genesis of my idea is that DL works because it leverages the same computational mechanisms underlying the Holographic Principle. Specifically, the capability of representing an extremely high-dimensional space (i.e. d^N) with a paltry number of parameters, on the order of N².
The computational mechanism underpinning the Holographic Principle can be most easily depicted through the use of Tensor Networks (note: these are somewhat different from TensorFlow or the Neural Tensor Network). Tensor network notation is as follows:




[Figure: tensor network diagram notation. Source: http://inspirehep.net/record/1082123/]

The value of tensor networks in physics is that they drastically reduce the state space to a network that focuses only on the relevant physics. The primary motivation behind the use of Tensor Networks is to reduce computation. A tensor network is a way to perform computation in a high-dimensional space by decomposing a large tensor into smaller, more manageable parts. The computation can then be performed on a few small parts at a time. By optimizing each part, one effectively optimizes the full larger tensor.
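As a concrete, highly simplified illustration of that decomposition idea, the sketch below factors a small tensor into a chain of three-index cores using successive truncated SVDs, in the spirit of a matrix product state / tensor train. This is my own minimal NumPy example with made-up shapes, not the construction used in the cited physics papers:

```python
import numpy as np

def tensor_train(T, max_rank):
    """Decompose a tensor T of shape (d1, ..., dn) into a chain of
    three-index cores via successive truncated SVDs (tensor-train style)."""
    dims = T.shape
    cores = []
    rank_left = 1
    M = T.reshape(rank_left * dims[0], -1)
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        rank_right = min(max_rank, len(S))  # truncate the bond dimension
        cores.append(U[:, :rank_right].reshape(rank_left, dims[k], rank_right))
        M = (np.diag(S[:rank_right]) @ Vt[:rank_right]).reshape(
            rank_right * dims[k + 1], -1)
        rank_left = rank_right
    cores.append(M.reshape(rank_left, dims[-1], 1))
    return cores

# A random 2x2x2x2 tensor: 16 entries represented by four small cores,
# each of which can be manipulated (or optimized) separately.
T = np.random.rand(2, 2, 2, 2)
cores = tensor_train(T, max_rank=2)
print([c.shape for c in cores])
```

The point of the exercise is only that a big multi-index object can be handled as a product of small pieces; for larger tensors, truncating the bond dimension is exactly where the compression happens.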
In the context of the holographic principle, the MERA tensor network is used, and it is depicted as follows:




[Figure: MERA tensor network, built from disentanglers and isometries. Source: http://inspirehep.net/record/1082123/]

In the figure above, the circles depict "disentanglers" and the triangles "isometries". One can look at the nodes from the perspective of a mapping: the circles map matrices to other matrices, while the triangles take a matrix and map it to a vector. The key here is to realize that the 'compression' capability arises from the hierarchy and the entanglement. As a matter of fact, this network embodies the mutual information chain rule:




[Figure: the mutual information chain rule. Source: https://inspirehep.net/record/1372114/]

In other words, as you move from the bottom to the top of the network, the information entanglement increases.
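For reference, the chain rule for mutual information that the figure alludes to is the standard information-theoretic identity below (written in generic notation rather than the exact symbols of the cited paper):

```latex
I(X_1, X_2, \ldots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y \mid X_1, \ldots, X_{i-1})
```

Each term conditions on what has already been accounted for, which is one way to read the claim that entanglement accumulates as you move up the network.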
I've written earlier about the similarities of Deep Learning with 'Holographic Memories'; here, however, I'm going to go one step further. Deep Learning networks are also tensor networks. Deep Learning networks are not as uniform as a MERA network, but they exhibit similar entanglements. As information flows from input to output, in either a fully connected network or a convolutional network, it is entangled in a similar way.
The use of tensor networks in machine learning has been studied recently by several researchers. Miles Stoudenmire wrote a blog post, "Tensor Networks: Putting Quantum Wavefunctions into Machine Learning", where he describes his method applied to MNIST and CIFAR-10. He writes about one key idea behind this approach:
The key is dimensionality. Problems which are difficult to solve in low dimensional spaces become easier when “lifted” into a higher dimensional space. Think how much easier your day would be if you could move freely in the extra dimension we call time. Data points hopelessly intertwined in their native, low-dimensional form can become linearly separable when given the extra breathing room of more dimensions.
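To make the 'lifting' idea in that quote concrete, here is a toy example of my own (not Stoudenmire's construction): XOR-style points that no straight line can separate in two dimensions become linearly separable once a single product feature is appended as a third coordinate.

```python
import numpy as np

# XOR-style points: not linearly separable in the original 2D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Lift to 3D by appending the product feature x1 * x2.
X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the lifted space the plane x1 + x2 - 2*x1*x2 = 0.5 separates the classes.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
scores = X_lifted @ w + b
print(scores)                    # negative for class 0, positive for class 1
print((scores > 0).astype(int))  # recovers y exactly
```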
Amnon Shashua et al. have also done work in this space. Their latest paper (Oct 2016), "Tensorial Mixture Models", proposes a novel kind of convolutional network.
In conclusion, the Holographic Principle, although driven by quantum computation, reveals to us the existence of a universal computational mechanism that is capable of representing high-dimensional problems using a relatively low number of model parameters. My conjecture here is that this is the same mechanism that permits Deep Learning to perform surprisingly well.
Most explanations of Deep Learning revolve around the three "ilities" that I described here: expressibility, trainability, and generalization. There is definite consensus on expressibility, namely that a hierarchical network requires fewer parameters than a shallow network. The open questions, however, are those of trainability and generalization. The big difficulty in explaining these two is that they don't fit any conventional machine learning notion. Trainability should be impossible in a high-dimensional non-convex space, yet simple SGD seems to work exceedingly well. Generalization does not make any sense without a continuous manifold, yet GANs show quite impressive generalization:




[Figure: StackGAN two-stage text-to-image generation. Credit: https://arxiv.org/pdf/1612.03242v1.pdf]

The figure above shows StackGAN generating output images from text descriptions in two stages. StackGAN uses two generative networks, and it is difficult to comprehend how the second generator captures only image refinements. There are plenty of unexplained phenomena like this. The Holographic Principle provides a base camp for a plausible explanation.
The current mainstream intuition of why Deep Learning works so well is that there exists a very thin manifold in high-dimensional space that can represent the natural phenomena the network is trained on. Learning proceeds through the discovery of this 'thin manifold'. This intuition, however, breaks apart in light of recent experimental data (see: "Rethinking Generalization"). The authors of the "Rethinking Generalization" paper write:
Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
Neither the Tegmark argument nor the 'thin manifold' argument can possibly work with random data. This leads to the hypothesis that there must exist an entirely different mechanism that reduces the degrees of freedom (or problem dimension) so that computation is feasible. This compression mechanism can be found in the structure of the DL network, just as it exists in the MERA tensor network.
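The quoted observation is easy to reproduce in miniature. The sketch below (a toy setup of my own, using scikit-learn for brevity) fits a small fully connected network to completely random labels; with enough capacity and iterations it typically memorizes them, even though there is no structure in the data to generalize from:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # random inputs
y = rng.integers(0, 2, size=200)    # completely random labels

# A modest MLP can usually memorize 200 random points in 20 dimensions.
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=5000, random_state=0)
clf.fit(X, y)
print("training accuracy on random labels:", clf.score(X, y))
```

Whatever this demonstrates, it is not the discovery of a pre-existing thin manifold in the data, since the labels carry no structure at all.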
Conventional Machine Learning thinking holds that it is the intrinsic manifold structure of the data that needs to be discovered via optimization. In contrast, my conjecture claims that the data is less important; rather, it is the topology of the DL network that is able to capture the essence of the data. That is, even if the bottom layers have random initializations, the network will likely work well enough subject to a learned mapping at the top layer.
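A minimal sketch of that claim, in the spirit of random-features models (my own toy construction, not an experiment reported here): freeze a randomly initialized hidden layer and learn only a linear readout on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary problem: the label is a nonlinear function of the inputs.
X = rng.normal(size=(1000, 10))
y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(float)

# Fixed random "bottom layer": never trained.
W = rng.normal(size=(10, 256))
b = rng.normal(size=256)
H = np.tanh(X @ W + b)  # random nonlinear features

# Learn only the top linear mapping (ridge-regularized least squares).
lam = 1e-2
beta = np.linalg.solve(H.T @ H + lam * np.eye(256), H.T @ y)

pred = (H @ beta > 0.5).astype(float)
print("training accuracy with a frozen random layer:", (pred == y).mean())
```

The frozen random layer already provides a usable representation; all the 'learning' happens in the top mapping, which is the flavor of the conjecture above.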
In fact, I would make an even bigger leap: in our quest for unsupervised learning, we may have overlooked the fact that a neural network has already created its own representation of the data at the onset of random initialization. It is just our inability to interpret that representation that is problematic. A random representation that preserves invariances (i.e. locality, symmetry, etc.) may be just as good as any other representation. Yann LeCun's cake might already be present; it is just the icing and the cherry that need to explain what the cake represents.
Note to the reader: In 1991, psychologist Karl Pribram, together with physicist David Bohm, speculated about Holonomic Brain Theory. I don't know the concrete relationship between the brain and deep learning, so I can't make the same conclusion that they made in 1991.
References
Advances on Tensor Network Theory: Symmetries, Fermions, Entanglement, and Holography. https://arxiv.org/pdf/1407.6552v2.pdf

Machine Learning Algorithms


Predictive Analytics 101

by Ravi Kalakota

https://practicalanalytics.wordpress.com/predictive-analytics-101/

Data Mining

Source: Dr. Saed Sayad http://www.saedsayad.com/

An Introduction to Data Mining



Kenny Bastani: A Docker Image for Graph Analytics on Neo4j with A...: I've just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploy...

Named Entity Extraction

A Survey of named entity recognition and classification
Evaluation of Named Entity Extraction Systems
NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools
NERD: Evaluating Named Entity Recognition Tools in the Web of Data
NERD: an open source platform for extracting and disambiguating named entities in very diverse documents
NERD Ontology
Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Stanford Named Entity Recognizer (Conditional Random Field) Whitepaper
GATE (General Architecture for Text Engineering) ANNIE (A Nearly-New Information Extraction) System
Illinois Named Entity Tagger
Balie: Multilingual Information Extraction from Text with Machine Learning and Natural Language Techniques
Mallet: Machine Learning

Apache Nutch (Web Crawler); Bixo (Web Mining); Behemoth (Hadoop Document Analysis); Apache OpenNLP (Natural Language Processing); Apache Stanbol (Semantic Content Management); Apache Tika (Metadata and text extraction); Apache UIMA (Unstructured Information Management Architecture); Apache Mahout (Machine Learning); Apache Avro (Data Serialization); Apache SOLR/Lucene; Apache Clerezza (OSGi RESTful Web framework, Triplestore DB); Apache Jena (Semantic Web: RDF, Triplestore DB, OWL); Fedora (Flexible Extensible Digital Object Repository Architecture), Apache Ambari

Maui (Topic Indexing); Weka (Data Mining); LingPipe; FreeLing; OpenCalais; DBpediaSpotlight

Alchemy API; Evri API; Web ARChive (WARC) format

HBase; Bigtable: A Distributed Storage System for Structured Data; Apache Phoenix

Docker and DevOps


Docker Basics  (Tutorial)
Getting started with Docker
Docker User Guide
Dockerizing Applications
Docker Network Configuration
Working with Containers; Automatically Start Containers
Docker Run Reference
Launching Containers with Fleet; Fleet Configuration and API
Getting Started with Etcd; Etcd Configuration
Getting started with systemd
Working with Docker Images
Google Compute Engine: Container Images

Microservices in a Nutshell

The following is an excerpt from an article that originally appeared on Martin Fowler's website.

"Microservices" - yet another new term on the crowded streets of software architecture. Although our natural inclination is to pass such things by with a contemptuous glance, this bit of terminology describes a style of software systems that we are finding more and more appealing. We've seen many projects use this style in the last few years, and results so far have been positive, so much so that for many of our colleagues this is becoming the default style for building enterprise applications. Sadly, however, there's not much information that outlines what the microservice style is and how to do it.
In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.
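As a bare-bones illustration of 'a small service running in its own process and communicating over a lightweight HTTP API', here is a generic Python sketch (the service name, port, and data are illustrative placeholders, not part of the article):

```python
# A minimal, self-contained "microservice": one process, one small HTTP API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

INVENTORY = {"sku-1": 12, "sku-2": 0}  # this service owns only its own data

class InventoryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/inventory":
            body = json.dumps(INVENTORY).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Other services would call this over HTTP, e.g. GET http://localhost:8080/inventory
    HTTPServer(("", 8080), InventoryHandler).serve_forever()
```

Each such service would be deployed and scaled independently, which is the contrast with the monolith described next.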

To start explaining the microservice style it's useful to compare it to the monolithic style: a monolithic application built as a single unit. Enterprise Applications are often built in three main parts: a client-side user interface (consisting of HTML pages and javascript running in a browser on the user's machine), a database (consisting of many tables inserted into a common, and usually relational, database management system), and a server-side application. The server-side application will handle HTTP requests, execute domain logic, retrieve and update data from the database, and select and populate HTML views to be sent to the browser. This server-side application is a monolith - a single logical executable. Any changes to the system involve building and deploying a new version of the server-side application.

Such a monolithic server is a natural way to approach building such a system. All your logic for handling a request runs in a single process, allowing you to use the basic features of your language to divide up the application into classes, functions, and namespaces. With some care, you can run and test the application on a developer's laptop, and use a deployment pipeline to ensure that changes are properly tested and deployed into production. You can horizontally scale the monolith by running many instances behind a load-balancer.

Monolithic applications can be successful, but increasingly people are feeling frustrations with them - especially as more applications are being deployed to the cloud. Change cycles are tied together - a change made to a small part of the application requires the entire monolith to be rebuilt and deployed. Over time it's often hard to keep a good modular structure, making it harder to keep changes that ought to only affect one module within that module. Scaling requires scaling of the entire application rather than just the parts of it that require greater resources.




These frustrations have led to the microservice architectural style: building applications as suites of services. As well as the fact that services are independently deployable and scalable, each service also provides a firm module boundary, even allowing for different services to be written in different programming languages. They can also be managed by different teams.

We do not claim that the microservice style is novel or innovative; its roots go back at least to the design principles of Unix. But we do think that not enough people consider a microservice architecture and that many software developments would be better off if they used it.

For more information: 
James and Martin’s article goes on to define what a microservice architecture is by laying out 9 common characteristics, discussing its relationship with Service-Oriented Architecture, and considering whether this style is the future of enterprise software. Read it here: martinfowler.com/articles/microservices.html.

James Lewis is a Principal Consultant at ThoughtWorks and member of the Technology Advisory Board. James' interest in building applications out of small collaborating services stems from a background in integrating enterprise systems at scale. He's built a number of systems using microservices and has been an active participant in the growing community for a couple of years.

Martin Fowler is an author, speaker, and general loud-mouth on software development. He's long been puzzled by the problem of how to componentize software systems, having heard more vague claims than he's happy with. He hopes that microservices will live up to the early promise its advocates have found.

Pattern: Microservices Architecture
The Scale Cube
SRP: The Single Responsibility Principle (.pdf)
Decomposing Applications for deployability and scalability



Building microservices with Spring Boot: Part 1, Part 2, Part 3 (Deploying Spring Boot-based microservices with Docker)

A Quick Introduction to CoreOS
An Introduction to CoreOS System Components
CoreOS, Kubernetes, Fleet, Etcd
CoreOS Continued: etcd
Running Kubernetes on CoreOS Part 1, Part 2
CoreOS Continued: Fleet and Docker
Launching Containers with fleet
Deploying a NodeJS Application using Docker
Deploying Docker Containers on CoreOS using Fleet
Running CoreOS on Vagrant
Running CoreOS on Google Compute Engine
Running CoreOS on EC2

Toolkit: Spray; Akka; Scala; Clojure; Spring; Dropwizard (Jetty (Web Server), Jersey (RESTful), Jackson (JSON), JDBI (SQL), Logback, Yammer metrics, Guava (Core libraries), Hibernate Validator); NodeJS; Play; Python; GitHub

Apache Hue
Elasticsearch