Daniel's Deep Study Notes, Deep Learning Quick Guide

This article is reposted from: Reading technology, compiled by Zouxy

Deep learning is a family of learning algorithms and an important branch of artificial intelligence. In just a few years, moving rapidly from research to practical application, it has overturned algorithm design in fields such as speech recognition, image classification, and text understanding, and has gradually established a new end-to-end paradigm: start from the training data, pass it through an end-to-end model, and directly output the final result. So how deep is deep learning, and how much has it actually learned? This article walks you through the methods and processes behind this seemingly lofty technique.

I. Overview
II. Background
III. The Human Brain's Visual Mechanism
IV. About Features
4.1. Granularity of Feature Representation
4.2. Primary (Shallow) Feature Representation
4.3. Structural Feature Representation
4.4. How Many Features Do We Need?
V. The Basic Idea of Deep Learning
VI. Shallow Learning and Deep Learning
VII. Deep Learning and Neural Networks
VIII. The Deep Learning Training Process
8.1. Traditional Neural Network Training Methods
8.2. The Deep Learning Training Process
IX. Common Deep Learning Models and Methods
9.1. AutoEncoder
9.2. Sparse Coding
9.3. Restricted Boltzmann Machine (RBM)
9.4. Deep Belief Networks
9.5. Convolutional Neural Networks
X. Summary and Outlook

| I. Overview

Artificial Intelligence is one of humankind's finest dreams, just like immortality and interplanetary travel. Although computer technology has made great progress, so far no computer can generate a consciousness of "self." True, with the help of humans and a large amount of ready-made data, a computer can perform very powerfully, but without those two it cannot even tell a cat from a dog.

Turing (whom we all know: a founder of both computing and artificial intelligence, famous for the Turing machine and the Turing test, respectively) proposed the idea of the Turing test in his 1950 paper: if you converse with something on the other side of a wall and cannot tell whether it is a person or a computer, the machine passes. This undoubtedly set high expectations for computers, and for artificial intelligence in particular. But half a century passed, and progress in artificial intelligence remained far from the Turing test standard. This not only disappointed people who had waited for years; it also led some to regard artificial intelligence as hype and the related fields as "pseudoscience."

However, since 2006, the field of machine learning has made breakthrough progress, and the Turing test no longer seems quite so out of reach. The technical means are not only the ability of cloud computing to process big data in parallel, but also algorithms, and the algorithm in question is Deep Learning. With the help of Deep Learning, humans have finally found a way to tackle the age-old problem of handling "abstract concepts."


In June 2012, the New York Times reported on the Google Brain project, which attracted wide public attention. The project was led by Andrew Ng, a renowned Stanford University professor of machine learning, and Jeff Dean, a world-leading expert in large-scale computer systems. Using a parallel computing platform of 16,000 CPU cores, they trained a machine learning model called a "deep neural network" (DNN, Deep Neural Network) with a total of 1 billion nodes internally, and it achieved great success in fields such as speech recognition and image recognition. (This network naturally cannot be compared with the human neural network. The human brain has more than 15 billion neurons, and the number of interconnections, that is, synapses, is like the number of grains of sand in the galaxy. It has been estimated that if the axons and dendrites of all the nerve cells in one person's brain were joined end to end and pulled into a straight line, the line could reach from the Earth to the Moon and back again.)

Andrew Ng, one of the project leaders, said: "We didn't define the boundaries ourselves as we usually do; instead we fed a large amount of data directly into the algorithm, let the data speak for itself, and the system automatically learned from the data." The other leader, Jeff Dean, said: "During training we never told the machine, 'This is a cat.' The system essentially invented or grasped the concept of 'cat' on its own."

In November 2012, Microsoft publicly demonstrated a fully automated simultaneous interpretation system at an event in Tianjin, China. The lecturer spoke in English while the computer in the background automatically performed speech recognition, English-to-Chinese machine translation, and Chinese speech synthesis in one pass, and the result was very smooth. According to reports, the key technology behind it was also the DNN, that is, Deep Learning (DL).

In January 2013, at Baidu's annual meeting, founder and CEO Robin Li announced with great fanfare the establishment of the Baidu Research Institute, whose first institute was the Institute of Deep Learning (IDL).


Why are Internet companies with big data rushing to invest heavily in research and development of deep learning technology? Deep learning certainly sounds very impressive. So what is deep learning? Why do we need it? Where did it come from? What can it do? What difficulties does it currently face? Even brief answers to these questions have to be taken slowly. Let's first look at the background of machine learning (the core of artificial intelligence).

| II. Background

Machine learning is a discipline that studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving their own performance. Can machines have the ability to learn as humans do? In 1959, Samuel in the United States designed a checkers-playing program that could learn, improving its play through continued games. Four years later, the program defeated its designer. Three years after that, it defeated a US champion who had been undefeated for eight years. The program showed people what machine learning could do and raised many thought-provoking social and philosophical questions. (Heh, the mainstream track of artificial intelligence has not advanced much, while the philosophy and ethics around it have developed quickly: what the future holds, machines becoming more like people and people more like machines, machines turning against humanity with the ATM firing the first shot, and so on. Human imagination is endless.)

Although machine learning has developed for decades, there are still many problems that are not well solved:


Examples include image recognition, speech recognition, natural language understanding, weather prediction, gene expression, and content recommendation. At present, the way we approach these problems with machine learning looks like this (taking visual perception as an example):

First, data is acquired through a sensor (for example, a CMOS sensor). It then goes through preprocessing, feature extraction, and feature selection, and finally through inference, prediction, or recognition. The last stage is the machine learning part proper; the vast majority of work has been done there, and there are many papers and much research on it.

The middle three stages can be summed up as feature representation. A good feature representation plays a key role in the accuracy of the final algorithm, and most of the system's computation and testing effort is spent on this part. In practice, however, this part is usually done by hand, relying on manual feature extraction.
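As a rough illustration of the staged pipeline described above, here is a minimal sketch in Python. Every function name, the two toy features, and the threshold are hypothetical and chosen only to make the stages visible; they are not taken from any real system.

```python
# Toy stand-ins for each stage; all names, features, and thresholds are hypothetical.

def preprocess(raw):
    """Scale raw 8-bit sensor readings into the range [0, 1]."""
    return [value / 255.0 for value in raw]

def extract_features(signal):
    """Hand-crafted features: mean intensity and mean absolute neighbor difference."""
    mean = sum(signal) / len(signal)
    edge = sum(abs(a - b) for a, b in zip(signal, signal[1:])) / (len(signal) - 1)
    return {"mean": mean, "edge": edge}

def select_features(features, keep=("edge",)):
    """Keep only the features the human designer judged useful."""
    return [features[name] for name in keep]

def classify(selected, threshold=0.2):
    """The 'machine learning' stage, reduced here to a fixed threshold."""
    return "textured" if selected[0] > threshold else "flat"

def pipeline(raw):
    return classify(select_features(extract_features(preprocess(raw))))

flat = [100] * 8          # constant signal, no edges
striped = [0, 255] * 4    # alternating signal, many edges
print(pipeline(flat), pipeline(striped))
```

Note that the feature extraction and selection stages are written entirely by hand here, which is exactly the manual effort that the rest of the article argues should be learned instead.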

So far, many excellent features have appeared (a good feature should be invariant to size, scale, rotation, and so on, and should be discriminative). For example, the emergence of SIFT was a landmark work in the study of local image feature descriptors. Because SIFT is invariant to scale and rotation and is robust to a certain degree of viewpoint and illumination change, and because it is highly discriminative, it made it possible to solve many problems. But it is not a cure-all.

However, selecting features by hand is laborious and heuristic (it requires professional knowledge). Whether good features are found depends largely on experience and luck, and tuning them takes a great deal of time. Since manual feature selection works so poorly, can features be learned automatically? The answer is yes! Deep Learning does exactly this. Its alias, Unsupervised Feature Learning, makes the point by name: "unsupervised" means that people do not take part in the feature selection process.

How does it learn? How does it know which features are good and which are not? We said that machine learning studies how computers can simulate or realize human learning behavior. So how does our human visual system work? How can we pick out that one other person from the vast sea of mortal beings (because you exist deep in my mind, in my dreams, in my heart, in my song...)? The human brain is so remarkable; can we take it as a reference and simulate it? (Anything said to relate to the characteristics of the human brain tends to be counted a good algorithm, though I do not know whether the connection is imposed after the fact to make the work seem sacred and elegant.) In recent decades, developments in cognitive neuroscience, biology, and related disciplines have made our mysterious and magical brain less unfamiliar to us, and have also driven the development of artificial intelligence.

| III. The Human Brain's Visual Mechanism

The 1981 Nobel Prize in Medicine was awarded to David Hubel (an American neurobiologist born in Canada), Torsten Wiesel, and Roger Sperry. The main contribution of the first two was the discovery of how the visual system processes information: the visual cortex is hierarchical.

Let's see what they did. In 1958, at Johns Hopkins University, David Hubel and Torsten Wiesel studied the correspondence between regions of the pupil and neurons in the cerebral cortex. They opened a 3 mm hole in the back of a cat's skull and inserted electrodes through it to measure the activity of neurons.

Then, in front of the kitten's eyes, they showed objects of various shapes and brightness, and for each object they also varied its position and angle. In this way they hoped to expose the kitten's pupils to stimuli of different types and strengths.

The purpose of the experiment was to test a conjecture: that there is a correspondence between different visual neurons in the posterior cortex and the stimuli received by the pupil, so that once the pupil receives a certain stimulus, a certain group of neurons in the posterior cortex becomes active. After many days of tedious experiments, and at the expense of several poor kittens, David Hubel and Torsten Wiesel discovered a type of neuron they called the orientation-selective cell. When the pupil detects the edge of an object in front of the eye, and that edge points in a certain direction, this kind of neuron becomes active.

This discovery inspired further thinking about the nervous system. The working process of the nervous system, from nerves to the central system to the brain, may be a process of iterative, continuous abstraction. There are two keywords here: abstraction and iteration. Starting from the raw signal, the system performs low-level abstraction and gradually iterates toward high-level abstraction. Human logical thinking often uses highly abstract concepts.

For example, it starts with the intake of the raw signal (the pupil takes in pixels), then performs preliminary processing (certain cells in the cerebral cortex detect edges and orientations), then abstracts (the brain determines that the object in front of the eyes is round), and then abstracts further (the brain determines that the object is a balloon).

This physiological discovery led, forty years later, to breakthrough progress in artificial intelligence.

In general, information processing in the human visual system is hierarchical: the low-level V1 area extracts edge features, the V2 area extracts shapes or parts of the target, and higher levels capture the whole target and its behavior. In other words, high-level features are combinations of low-level features. From low level to high level, feature representations become more and more abstract and express semantics or intent more and more directly. The higher the level of abstraction, the fewer possible guesses there are and the easier classification becomes. For example, the correspondence between word sets and sentences is many-to-one, the correspondence between sentences and semantics is many-to-one, and the correspondence between semantics and intent is also many-to-one. This is a hierarchical system.

Attentive readers will have noticed the keyword: hierarchy. Does the "deep" in Deep Learning refer to how many layers there are, that is, how deep the network is? Exactly. So how does Deep Learning borrow from this process? After all, it is the computer that has to do the processing, so the question we face is how to model this process.

Because what we want to learn is a representation of features, we need a deeper understanding of features, and of hierarchical features in particular. So before discussing Deep Learning, we need to say a few words about features. (Heh, having come across such a good explanation of features, it would be a pity not to include it, so it is squeezed in here.)

| IV. About Features

Features are the raw material of a machine learning system, and their impact on the final model is beyond doubt. If the data is well expressed as features, a linear model can usually reach satisfactory accuracy. So what do we need to consider about features?

4.1. Granularity of Feature Representation

At what granularity of feature representation can a learning algorithm be effective? For an image, pixel-level features carry essentially no value. Take a picture of a motorcycle, for example: no information can be obtained at the pixel level, and motorcycles cannot be distinguished from non-motorcycles. If the features are structural (that is, meaningful), such as whether the object has handlebars and whether it has wheels, it becomes easy to distinguish motorcycles from non-motorcycles, and the learning algorithm can do its job.

4.2. Primary (Shallow) Feature Representation

Since pixel-level feature representation does not work, what kind of representation is useful?

Around 1995, two scholars, Bruno Olshausen and David Field, were working at Cornell University. They tried to study vision with a two-pronged approach, combining physiology and computation. They collected many black-and-white landscape photos and extracted 400 small fragments from them, each 16x16 pixels in size. Label these 400 fragments S[i], i = 0, ..., 399. Next, they randomly extracted another fragment from the photos, also 16x16 pixels, and labeled it T.

The question they asked was how to select a set of fragments S[k] from the 400 and superimpose them to synthesize a new fragment that is as similar as possible to the randomly chosen target fragment T, while keeping the number of S[k] used as small as possible. In mathematical terms:

Sum_k (a[k]S[k]) --> T, where a[k] is the weight coefficient when the fragment S[k] is superimposed.

To solve this problem, Bruno Olshausen and David Field invented an algorithm: sparse coding.

Sparse coding is a repeated iterative process with two steps per iteration:

1) Select a set of S[k] and adjust a[k] so that Sum_k(a[k]S[k]) is closest to T.

2) Fix a[k]. Among the 400 fragments, select other, more suitable fragments S'[k] to replace the original S[k], so that Sum_k (a[k]S'[k]) is closest to T.
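The two alternating steps can be sketched in plain Python on a toy problem. This is an illustrative reading of the procedure as described here, not the published algorithm of Olshausen and Field: the pool size (30 fragments of 16 values), the fixed number of selected fragments (3, standing in for the "as few as possible" requirement), the coordinate-descent fit, and all function names are assumptions.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def reconstruct(S, idx, a):
    """Return Sum_k a[k] * S[idx[k]] as a flat list of pixel values."""
    out = [0.0] * len(S[0])
    for weight, i in zip(a, idx):
        for p, value in enumerate(S[i]):
            out[p] += weight * value
    return out

def error(S, idx, a, T):
    """Squared distance between the reconstruction and the target T."""
    residual = [t - o for t, o in zip(T, reconstruct(S, idx, a))]
    return dot(residual, residual)

def fit_weights(S, idx, a, T, sweeps=20):
    """Step 1: keep the fragments fixed and adjust a[k] (coordinate descent)."""
    a = list(a)
    for _ in range(sweeps):
        for j, i in enumerate(idx):
            a[j] = 0.0  # residual left by all the other slots
            residual = [t - o for t, o in zip(T, reconstruct(S, idx, a))]
            norm = dot(S[i], S[i])
            a[j] = dot(residual, S[i]) / norm if norm > 0 else 0.0
    return a

def swap_fragments(S, idx, a, T):
    """Step 2: keep a[k] fixed and swap in the best fragment for each slot."""
    idx = list(idx)
    for j in range(len(idx)):
        idx[j] = min(range(len(S)),
                     key=lambda i: error(S, idx[:j] + [i] + idx[j + 1:], a, T))
    return idx

rng = random.Random(0)
S = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(30)]  # toy fragment pool
T = [0.8 * x + 0.3 * y + 0.5 * z for x, y, z in zip(S[2], S[7], S[11])]

idx = rng.sample(range(len(S)), 3)   # start from three random fragments
a = [0.0] * 3
initial_error = error(S, idx, a, T)
for _ in range(5):
    a = fit_weights(S, idx, a, T)
    idx = swap_fragments(S, idx, a, T)
a = fit_weights(S, idx, a, T)
final_error = error(S, idx, a, T)
print(initial_error, final_error)
```

Because each step can only keep or lower the squared error, the reconstruction never gets worse from one iteration to the next, although nothing guarantees that this toy version finds the three fragments that were actually used to build T.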

After several iterations, the best combination of S[k] was selected. Surprisingly, the selected S[k] were basically edges of the different objects in the photos; these line segments were similar in shape and differed only in orientation.

The algorithmic results of Bruno Olshausen and David Field coincide with the physiological findings of David Hubel and Torsten Wiesel!

In other words, complex images are often composed of a few basic structures. For example, an image patch can be represented as a linear combination of 64 orthogonal edges (which can be understood as orthogonal basic structures). A sample x can be reconstructed from just three of the 64 edges with weights 0.8, 0.3, and 0.5; the other basic edges contribute nothing, so their weights are 0.
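The sparse reconstruction just described can be written out in a few lines. In this sketch the 64 "edges" are replaced by a trivial orthonormal basis (one-hot vectors), and the positions of the three active edges (3, 17, 42) are arbitrary; both are assumptions for illustration, since the text only gives the weights.

```python
n = 64

# Toy orthonormal basis: basis[k] is a one-hot vector standing in for edge k.
basis = [[1.0 if p == k else 0.0 for p in range(n)] for k in range(n)]

# Sparse code: only three of the 64 weights are nonzero.
code = [0.0] * n
for k, weight in {3: 0.8, 17: 0.3, 42: 0.5}.items():  # indices are hypothetical
    code[k] = weight

# Reconstruct x as the weighted sum of basis vectors.
x = [sum(code[k] * basis[k][p] for k in range(n)) for p in range(n)]

# With an orthonormal basis, projecting x back onto each edge recovers the code.
recovered = [sum(x[p] * basis[k][p] for p in range(n)) for k in range(n)]

active = [k for k, weight in enumerate(recovered) if weight != 0.0]
print(active)
```

With real edge-like bases the projection step would not be this simple unless the bases are orthogonal, which is the assumption the text makes.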

In addition, the experts found that this regularity exists not only in images but also in sound. From unlabeled audio they discovered 20 basic sound structures, and the remaining sounds can be synthesized from these 20 basic structures.

4.3. Structural Feature Representation

Small image patches can be composed of basic edges, but how do we represent more structured, more complex, conceptual images? This requires higher-level feature representations, such as V2 and V4. To V1, the input is at the pixel level; to V2, the output of V1 plays the role of the pixel level. The progression is hierarchical: high-level representations are formed by combining lower-level ones. In technical terms, these are bases. The bases extracted by V1 are edges; the V2 layer then combines the V1 bases, yielding bases one level higher. Each layer's bases are combinations of the bases of the layer below, and so on upward... (Hence one expert joked that Deep Learning is just "building bases," and because that sounds crude, it goes by the nicer names Deep Learning or Unsupervised Feature Learning.)

Intuitively, the idea is to find small patches that make sense, combine them to obtain the features of the next layer up, and learn features recursively upward in this way.

When training on different objects, the resulting edge bases are very similar, but the object parts and models are completely different (which makes it much easier to tell a car from a face).

For text, what does a document (doc) mean? When we describe something, what is the appropriate unit of representation? Individual words? I think not: a word is the pixel level. It should at least be the term. In other words, every doc is made up of terms. But is that enough to express concepts? Probably not; we need to go one step further, to the topic level, and with topics it becomes reasonable to move up to the doc. The sizes of the levels differ greatly, however: the concept a doc expresses -> topics (thousands to tens of thousands) -> terms (on the order of 100,000) -> words (on the order of millions).

When a person reads a doc, what the eyes see are words. The brain automatically segments these words into terms, organizes them by concept using prior learning to obtain topics, and then carries out higher-level learning.

4.4. How Many Features Do We Need?

We know that there needs to be a hierarchy of feature construction, from shallow to deep, but how many features should there be in each layer?

The more features there are, the more reference information is available and the more accuracy can improve. But more features also mean more complex computation and a larger search space, and the data available for training becomes sparse along each feature, which causes all kinds of problems. So more features are not necessarily better.
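A quick back-of-the-envelope calculation makes the sparsity problem concrete. The sketch below assumes, purely for illustration, that every feature is binary and that training samples are spread evenly over all feature combinations; the function name and the sample count of one million are hypothetical.

```python
def samples_per_cell(num_samples, num_features, bins=2):
    """Average number of training samples per feature combination,
    assuming each feature takes `bins` distinct values."""
    return num_samples / bins ** num_features

for d in (5, 10, 20, 30):
    print(d, samples_per_cell(1_000_000, d))
```

With 10 binary features, a million samples still give hundreds of examples per combination; with 30 features there are far more combinations than samples, so most of the space is never observed during training.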


Well, at this point we can finally talk about Deep Learning. Above we discussed why Deep Learning exists (to let the machine learn good features automatically and do away with manual selection, taking the layered human visual system as a reference), and we concluded that Deep Learning needs multiple layers to obtain more abstract feature representations. So how many layers are appropriate? What architecture should be used to model them? And how is unsupervised training carried out?

| V. The Basic Idea of Deep Learning

Suppose we have a system S with n layers (S1, ..., Sn), whose input is I and whose output is O, pictured as: I => S1 => S2 => ... => Sn => O. If the output O equals the input I, that is, if the input I passes through the system without any loss of information, then the input I passes through every layer Si without any loss of information, which means that at every layer Si the result is another representation of the original information (the input I). (Heh, the experts say this is impossible. Information theory has a saying that "information is lost layer by layer" (the data processing inequality): suppose processing a yields b, and processing b yields c; then it can be shown that the mutual information between a and c does not exceed the mutual information between a and b. This shows that processing cannot add information, and most processing loses some. Of course, if what is lost is useless information, so much the better.) Now back to our theme, Deep Learning. We want to learn features automatically. Suppose we have a pile of inputs I (such as a pile of images or text), and suppose we design a system S with n layers. If we adjust the parameters of the system so that its output is still the input I, then we automatically obtain a series of hierarchical features of the input I, namely S1, ..., Sn.
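The inequality mentioned in the aside (processing a into b and then b into c cannot make c more informative about a than b is) can be checked numerically on a small example. The sketch below assumes a chain of two noisy binary channels with flip probabilities 0.1 and 0.2; these numbers, and the helper names, are arbitrary choices for illustration.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, from a joint distribution given as joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def flip_channel(flip):
    """Conditional distribution P(out | in) of a channel that flips a bit with probability `flip`."""
    return [[1 - flip, flip], [flip, 1 - flip]]

p_a = [0.5, 0.5]                 # distribution of the original signal a
a_to_b = flip_channel(0.1)       # first processing step:  a -> b
b_to_c = flip_channel(0.2)       # second processing step: b -> c

# Joint distribution of (a, b), then of (a, c) by summing over the intermediate b.
joint_ab = [[p_a[a] * a_to_b[a][b] for b in (0, 1)] for a in (0, 1)]
joint_ac = [[sum(joint_ab[a][b] * b_to_c[b][c] for b in (0, 1)) for c in (0, 1)]
            for a in (0, 1)]

i_ab = mutual_information(joint_ab)
i_ac = mutual_information(joint_ac)
print(i_ab, i_ac)
```

The second value never exceeds the first, whatever flip probabilities are chosen, which is the sense in which information can only be preserved or lost as it passes through successive layers.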

The idea of deep learning is to stack multiple layers, so that the output of one layer serves as the input to the next. In this way the input information can be expressed hierarchically.

In addition, the discussion above assumed that the output is strictly equal to the input. That requirement is too strict; we can relax it slightly, for example by only requiring that the difference between input and output be as small as possible. This relaxation leads to another class of Deep Learning methods. That is the basic idea of Deep Learning.
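One way to write the strict and the relaxed requirement side by side is shown below. The symbols I, O, and S_1, ..., S_n come from the text; the parameter vector theta, the sample index j, and the choice of squared error as the measure of "difference" are assumptions added for illustration.

```latex
% Output of the n-layer system for input I
O = S_n\bigl(\cdots S_2\bigl(S_1(I)\bigr)\bigr)

% Strict requirement: no information loss
O = I

% Relaxed requirement: make the difference as small as possible
% over all training inputs I^{(j)}, by adjusting the parameters \theta
\min_{\theta} \; \sum_{j} \bigl\lVert I^{(j)} - O^{(j)} \bigr\rVert^{2}
```

Other measures of difference could be substituted for the squared norm without changing the idea.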

| VI. Shallow Learning and Deep Learning

Shallow learning is the first wave of machine learning.

In the late 1980s, the invention of the back-propagation algorithm (also called the BP algorithm) for artificial neural networks brought hope to machine learning and set off a wave of machine learning based on statistical models, a wave that continues to this day. People found that with the BP algorithm, an artificial neural network model can learn statistical regularities from a large number of training samples and thereby make predictions about unknown events. This statistics-based approach proved superior in many respects to earlier systems based on hand-written rules. The artificial neural network of that era, although also called a multi-layer perceptron, was in fact a shallow model with only one layer of hidden nodes.

In the 1990s, a variety of shallow machine learning models were proposed one after another, such as Support Vector Machines (SVM), Boosting, and maximum entropy methods (such as LR, Logistic Regression). Structurally, these models can basically be viewed as having one layer of hidden nodes (such as SVM and Boosting) or no hidden nodes at all (such as LR). They achieved great success in both theoretical analysis and application. By contrast, because theoretical analysis was difficult and training required a great deal of experience and tricks, shallow artificial neural networks were relatively quiet during this period.

Deep learning is the second wave of machine learning.

In 2006, Professor Geoffrey Hinton of the University of Toronto, Canada, and his student Ruslan Salakhutdinov published an article in Science that set off the wave of deep learning in academia and industry. The article makes two main points: 1) an artificial neural network with multiple hidden layers has excellent feature learning ability, and the features it learns characterize the data more fundamentally, which benefits visualization or classification; 2) the difficulty of training deep neural networks can be effectively overcome by layer-wise pre-training, which in this article is implemented through unsupervised learning.

Currently, most learning methods for classification, regression, and the like are shallow-structure algorithms. Their limitation is that, with finite samples and computational units, their ability to represent complex functions is limited, and their generalization on complex classification problems is constrained. Deep learning, by learning a deep nonlinear network structure, can approximate complex functions, represent input data in a distributed way, and shows a strong ability to learn the essential characteristics of a data set from a small number of samples. (The advantage of multiple layers is that complex functions can be expressed with fewer parameters.)

The essence of deep learning is to learn more useful features by building a machine learning model with many hidden layers and training it on massive data, ultimately improving the accuracy of classification or prediction. The "deep model" is thus the means, and "feature learning" is the goal. Deep learning differs from traditional shallow learning in that: 1) it emphasizes the depth of the model structure, usually with 5, 6, or even 10 layers of hidden nodes; 2) it explicitly highlights the importance of feature learning: through layer-by-layer feature transformation, the representation of a sample in the original space is transformed into a new feature space, making classification or prediction easier. Compared with constructing features by hand-written rules, learning features from big data is better able to capture the rich internal information of the data.

| VII. Deep Learning and Neural Networks

Deep learning is a new field in machine learning research. Its motivation is to build neural networks that simulate the human brain for analysis and learning, imitating the mechanisms by which the human brain interprets data such as images, sounds, and text. As presented here, deep learning is a kind of unsupervised feature learning.

The concept of deep learning originates from research on artificial neural networks. A multilayer perceptron with multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features into more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of the data.

Deep learning is itself a branch of machine learning, and can simply be understood as a development of neural networks. Some twenty or thirty years ago, neural networks were a particularly hot direction in the ML field, but they later slowly faded out, for reasons including the following:

1) They overfit easily, their parameters are hard to tune, and many tricks are needed;

2) Training is slow, and with relatively few layers (three or fewer) the results are no better than those of other methods.

So for roughly 20 years in between, neural networks received little attention, and the field was basically the world of SVM and boosting algorithms. But one devoted old gentleman, Hinton, persevered, and eventually (together with others such as Bengio and Yann LeCun) worked out a practical deep learning framework.

Deep learning and traditional neural networks have similarities as well as many differences.

What the two share is a similar hierarchical structure: the system is a multi-layer network consisting of an input layer, hidden layers (possibly several), and an output layer. Only nodes in adjacent layers are connected; nodes within the same layer, and nodes in non-adjacent layers, are not connected to each other. Each layer can be viewed as a logistic regression model. This layered structure is relatively close to the structure of the human brain.
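The layered structure just described, where each layer behaves like a set of logistic regression units fed only by the layer directly below, can be sketched as a forward pass in plain Python. The layer sizes and weight values below are made up for illustration.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def logistic_layer(inputs, weights, biases):
    """Each output node is a logistic regression over all nodes of the previous layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Feed x through adjacent layers only: no same-layer or cross-layer connections."""
    activations = [x]
    for weights, biases in layers:
        activations.append(logistic_layer(activations[-1], weights, biases))
    return activations

# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]], [0.0, 0.1]),
    ([[1.2, -0.7]], [0.05]),
]
acts = forward([1.0, 0.0, -1.0], layers)
print(acts[-1])
```

The list `acts` holds the input followed by the output of every layer, which is the hierarchy of representations the article refers to.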


To overcome the problems in neural network training, DL adopts a training mechanism very different from that of traditional neural networks. Traditional neural networks use back propagation: put simply, an iterative algorithm trains the entire network, with initial values set randomly, the output of the current network computed, and the parameters of the preceding layers adjusted according to the difference between the current output and the label, until convergence (the whole procedure is gradient descent). Deep learning as a whole uses a layer-wise training mechanism. The reason is that if back propagation is used on a deep network (more than 7 layers), the residual propagated back to the front-most layers becomes too small, a phenomenon known as gradient diffusion. We discuss this issue next.
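The shrinking residual can be seen with a very small calculation. The sketch below assumes a chain with one sigmoid unit per layer and a weight of 1.0 on every connection (both simplifications chosen for illustration) and uses the chain rule to compute how strongly the output responds to the input.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def input_gradient(depth, x=0.5, weight=1.0):
    """d(output)/d(input) for a chain of `depth` sigmoid units, one per layer."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: derivative of sigmoid(w * a) with respect to a.
        grad *= weight * activation * (1.0 - activation)
    return grad

for depth in (1, 2, 4, 8):
    print(depth, input_gradient(depth))
```

Since the derivative of the sigmoid is never larger than 0.25, each extra layer multiplies the signal by at most a quarter in this setup, so after eight layers almost nothing reaches the front of the network.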

| VIII. The Deep Learning Training Process

8.1. Why can't traditional neural network training methods be used in deep neural networks?

The BP algorithm is the classic algorithm for training multi-layer networks, but in practice it is already far from satisfactory for networks with only a few layers. The local minima that are ubiquitous in the non-convex objective (cost) function of a deep structure (one involving multiple layers of nonlinear processing units) are the main source of training difficulty.

Problems with the BP algorithm:

(1) The gradient becomes increasingly sparse: from the top layer downward, the error correction signal gets smaller and smaller;

(2) Convergence to local minima: especially when starting far from the optimal region (random initialization can cause this);

(3) In general, training can only use labeled data: but most data is unlabeled, and the brain can learn from unlabeled data.

8.2. The Deep Learning Training Process

If all layers are trained at the same time, the time complexity is too high; if one layer is trained at a time, the deviation is passed on layer by layer. This runs into the opposite problem from the supervised learning above: severe underfitting (because a deep network has too many neurons and parameters).

In 2006, Hinton proposed an effective method for building multi-layer neural networks on unsupervised data. Put simply, it has two steps: first, train one layer of the network at a time; second, fine-tune so that the high-level representation r generated upward from the original representation x, and the x' generated downward from that high-level representation r, are as consistent as possible. The method is:

1) First build single layers of neurons one at a time, so that each step trains a single-layer network.

2) After all layers have been trained, Hinton uses the wake-sleep algorithm for fine-tuning.

Make the weights between all layers except the topmost bidirectional, so that the top layer remains a single-layer neural network while the other layers become a graphical model. The upward weights are used for "recognition" and the downward weights for "generation." Then use the wake-sleep algorithm to adjust all the weights so that recognition and generation agree, that is, so that the top-level representation that is produced can reconstruct the lower-level nodes as accurately as possible. For example, if a node in the top layer represents a face, then images of all faces should activate this node, and the image generated downward from it should look like a generic face. The wake-sleep algorithm has two phases, wake and sleep.

1) Wake phase: The cognitive process generates an abstract representation of each layer (node ​​state) through external features and upward weights (cognitive weights), and uses gradient descent to modify the downlink weight between layers (generate weights). That is, "If the reality is different from what I have imagined, changing my weight makes my imagination something like this."

2) sleep stage: the generation process, through the top-level representation (concept learned when awake) and down the weight, to generate the underlying state, while modifying the upward weight between layers. That is, "if the dream scene is not a corresponding concept in my mind, changing my cognitive weight makes this scene seem to me the concept."
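As a rough illustration of the two phases, here is a minimal NumPy sketch for a network with one visible and one hidden (top) layer. It is not taken from the original article: the function name `wake_sleep_step`, the toy data, the learning rate, and the omission of visible biases are all assumptions made to keep the example short. The updates follow the usual delta-rule form of wake-sleep.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    """Draw binary states from Bernoulli probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v, R, G, top_bias, lr, rng):
    """One wake-sleep update for a two-layer binary network (hypothetical helper).

    R: recognition (upward) weights, shape (n_hidden, n_visible)
    G: generative (downward) weights, shape (n_visible, n_hidden)
    top_bias: generative bias of the hidden (top) layer
    """
    # Wake phase: recognize with R, then nudge G so that the hidden
    # state would regenerate the data that was actually observed.
    h = sample(sigmoid(R @ v), rng)
    v_pred = sigmoid(G @ h)
    G += lr * np.outer(v - v_pred, h)
    top_bias += lr * (h - sigmoid(top_bias))

    # Sleep phase: "dream" from the top down with G, then nudge R so
    # that the dreamed data is recognized as the state that caused it.
    h_dream = sample(sigmoid(top_bias), rng)
    v_dream = sample(sigmoid(G @ h_dream), rng)
    h_pred = sigmoid(R @ v_dream)
    R += lr * np.outer(h_dream - h_pred, v_dream)
    return R, G, top_bias

# Toy usage (assumed data): two binary patterns presented alternately.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
R = 0.01 * rng.standard_normal((n_hidden, n_visible))
G = 0.01 * rng.standard_normal((n_visible, n_hidden))
top_bias = np.zeros(n_hidden)
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
for step in range(200):
    v = data[step % len(data)]
    R, G, top_bias = wake_sleep_step(v, R, G, top_bias, lr=0.1, rng=rng)
```

Note that each phase only changes one set of weights: the wake phase trains the generative weights on real data, and the sleep phase trains the recognition weights on generated data.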

The deep learning training process is as follows:

1) Bottom-up unsupervised learning (that is, train layer by layer from the bottom toward the top):

Use unlabeled data (labeled data also works) to train the parameters of each layer in turn. This step can be seen as an unsupervised training process, and it is where the method differs most from a traditional neural network (it can be viewed as a feature learning process).

Specifically, first train the first layer with unlabeled data and learn its parameters (this layer can be seen as the hidden layer of a three-layer neural network that minimizes the difference between output and input). Because of the limit on model capacity and the sparsity constraint, the resulting model can learn the structure of the data itself and so obtain features with more representational power than the raw input. After layer n-1 has been learned, its output is used as the input of layer n, which is then trained; in this way the parameters of every layer are obtained.

2) Top-down supervised learning (that is, train with labeled data, propagating the error downward from the top to fine-tune the network):

This step further fine-tunes the parameters of the whole multi-layer model, starting from the per-layer parameters obtained in step 1; it is a supervised training process. Step 1 plays the role of the random initialization in an ordinary neural network, except that in DL the initial values are not random: they are obtained by learning the structure of the input data, so they start closer to the global optimum and better results can be achieved. The effectiveness of deep learning is therefore due in large part to the feature learning of step 1.

| IX. Common Models and Methods of Deep Learning

9.1. AutoEncoder

Perhaps the simplest Deep Learning method exploits a property of artificial neural networks (ANNs): an ANN is itself a hierarchical system. Given a neural network, suppose we require its output to equal its input and then train it to adjust its parameters, obtaining the weights of each layer. We then naturally get several different representations of the input I (one per layer), and these representations are features. An AutoEncoder is a neural network that reproduces its input signal as closely as possible. To do so, it must capture the most important factors that represent the input data, much as PCA finds the principal components that represent the original information.

The specific process is briefly described as follows:

1) Given unlabeled data, learn features using unsupervised learning:


In the neural networks discussed earlier (the first figure), each input sample is labeled, i.e. (input, target), so we adjust the parameters of the preceding layers according to the difference between the current output and the target (label) until convergence. But now we only have unlabeled data (the figure on the right). How can the error be obtained?

As shown above, we feed the input into an encoder and obtain a code, which is a representation of the input. How do we know that this code really represents the input? We add a decoder, which outputs a signal. If that output is very close to the original input (ideally identical), we have good reason to believe the code is reliable. So we adjust the parameters of the encoder and decoder to minimize the reconstruction error, and at that point we have the first representation of the input signal, namely the code. Because the data is unlabeled, the error comes directly from comparing the reconstruction with the original input.
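The encoder/decoder idea can be sketched in a few lines of NumPy. This is an illustrative sketch, not the article's implementation: the sigmoid units, the squared-error loss, the 8-3-8 layout, the toy one-hot data, and the name `train_autoencoder` are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(X, n_code, lr=0.5, epochs=500, seed=0):
    """Fit a one-hidden-layer autoencoder by batch gradient descent.

    Returns the encoder (W1, b1), the decoder (W2, b2) and the
    mean squared reconstruction error recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_in = X.shape
    W1 = 0.1 * rng.standard_normal((n_in, n_code))
    b1 = np.zeros(n_code)
    W2 = 0.1 * rng.standard_normal((n_code, n_in))
    b2 = np.zeros(n_in)
    errors = []
    for _ in range(epochs):
        code = sigmoid(X @ W1 + b1)       # encoder: input -> code
        recon = sigmoid(code @ W2 + b2)   # decoder: code -> reconstruction
        diff = recon - X                  # no label needed: compare with the input
        errors.append(np.mean(diff ** 2))
        # Backpropagate 0.5 * squared error (averaged over samples)
        # through both sigmoid layers.
        d_recon = diff * recon * (1 - recon) / n_samples
        d_code = (d_recon @ W2.T) * code * (1 - code)
        W2 -= lr * code.T @ d_recon
        b2 -= lr * d_recon.sum(axis=0)
        W1 -= lr * X.T @ d_code
        b1 -= lr * d_code.sum(axis=0)
    return (W1, b1), (W2, b2), errors

# Toy usage (assumed data): compress 8 one-hot inputs into a 3-unit code.
X = np.eye(8)
(W1, b1), (W2, b2), errors = train_autoencoder(X, n_code=3)
code = sigmoid(X @ W1 + b1)   # the learned representation of each input
```

The only training signal is the difference between `recon` and `X`, which is exactly the "compare the reconstruction with the original input" error described above.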

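2) Generate features with the encoder, then train the next layer, and so on layer by layer:

Above we obtained the first-layer code. The minimal reconstruction error lets us believe that this code is a good representation of the original input signal or, stretching a point, that it is the same as the original signal (a different expression of the same thing). The second layer is trained no differently from the first: we take the code output by the first layer as the input signal of the second layer and again minimize the reconstruction error, which gives the parameters of the second layer and the code of the second layer's input, i.e. a second representation of the original input. The remaining layers are handled in the same way (when training a layer, the parameters of the earlier layers are fixed, and their decoders are no longer needed).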
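The layer-by-layer procedure can be sketched as a loop that trains one autoencoder, discards its decoder, and passes the codes on as input to the next layer. The sketch below assumes tied encoder/decoder weights, sigmoid units, and illustrative layer sizes; `train_layer` and `pretrain_stack` are hypothetical names, not from the original article.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_layer(X, n_code, lr=0.5, epochs=300, seed=0):
    """Train one tied-weight autoencoder layer; return only its encoder (W, b)."""
    rng = np.random.default_rng(seed)
    n_samples, n_in = X.shape
    W = 0.1 * rng.standard_normal((n_in, n_code))
    b = np.zeros(n_code)   # encoder bias
    c = np.zeros(n_in)     # decoder bias, discarded after training
    for _ in range(epochs):
        code = sigmoid(X @ W + b)
        recon = sigmoid(code @ W.T + c)
        d_recon = (recon - X) * recon * (1 - recon) / n_samples
        d_code = (d_recon @ W) * code * (1 - code)
        # W is shared, so its gradient has an encoder part and a decoder part.
        W -= lr * (X.T @ d_code + d_recon.T @ code)
        b -= lr * d_code.sum(axis=0)
        c -= lr * d_recon.sum(axis=0)
    return W, b

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pretraining: each layer is trained on the codes
    produced by the already trained, now frozen, layers below it."""
    encoders = []
    layer_input = X
    for i, n_code in enumerate(layer_sizes):
        W, b = train_layer(layer_input, n_code, seed=i)
        encoders.append((W, b))
        # The code of this layer becomes the input signal of the next one.
        layer_input = sigmoid(layer_input @ W + b)
    return encoders, layer_input

# Toy usage (assumed data and sizes): an 8 -> 5 -> 3 stack.
X = np.eye(8)
encoders, top_code = pretrain_stack(X, layer_sizes=[5, 3])
```

Earlier layers are never revisited inside `pretrain_stack`, which mirrors the statement that their parameters stay fixed while a later layer is trained.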

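3) Supervised fine-tuning:

With the method above we can obtain many layers. How many layers are needed (that is, how deep the network should be; there is currently no principled way to evaluate this) has to be found by experiment. Each layer gives a different representation of the original input. We expect, of course, that the more abstract the representation the better, as in the human visual system.

At this point the AutoEncoder cannot yet classify data, because it has not learned how to associate an input with a class. It has only learned how to reconstruct, or reproduce, its input. Put differently, it has only learned a feature that represents the input well, one that captures the original input signal as fully as possible. To perform classification, we can add a classifier (for example logistic regression or an SVM) on top of the topmost encoding layer of the AutoEncoder and then train it with the standard supervised training method for multi-layer neural networks (gradient descent).

In other words, we now feed the feature code of the last layer into the final classifier and fine-tune with labeled samples through supervised learning. There are two ways to do this. One adjusts only the classifier (the black part):

The other fine-tunes the whole system with labeled samples (if there is enough data, this is the best option: end-to-end learning):

Once supervised training is complete, the network can be used for classification. The top layer of the neural network can serve as a linear classifier, and we can then replace it with a better-performing classifier. Studies have found that adding these automatically learned features to the original features can greatly improve accuracy, in some classification problems even outperforming the current best classification algorithms!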

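There are several variants of the AutoEncoder. Two are briefly introduced here:

Sparse AutoEncoder:

We can also add further constraints to obtain new Deep Learning methods. For example, adding an L1 regularity constraint to the AutoEncoder (L1 mainly forces most of the nodes in each layer to be 0, with only a few nonzero, which is where the name "sparse" comes from) gives the Sparse AutoEncoder method.

As shown in the figure above, this simply constrains each code to be as sparse as possible, because sparse representations are often more effective than other representations (the human brain seems to work this way too: a given input stimulates only some neurons, while most of the others are inhibited).

Denoising AutoEncoders:

A denoising autoencoder (DA) builds on the autoencoder by adding noise to the training data, so the autoencoder must learn to remove this noise and recover the true, uncorrupted input. This forces the encoder to learn a more robust representation of the input signal, which is also why it generalizes better than an ordinary autoencoder. A DA can be trained with gradient descent.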
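The key detail is that the network encodes the corrupted input but is scored against the clean one. The following NumPy sketch shows one such training step. The masking noise (randomly zeroing components), the tied weights, and the names `corrupt` and `denoising_step` are assumptions chosen for illustration; the article does not specify the noise type.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corrupt(X, drop_prob, rng):
    """Masking noise: set each input component to 0 with probability drop_prob."""
    keep = rng.random(X.shape) >= drop_prob
    return X * keep

def denoising_step(X, W, b, c, lr, drop_prob, rng):
    """One gradient step of a tied-weight denoising autoencoder (hypothetical helper).

    W, b, c are updated in place; the clean-input reconstruction error is returned.
    """
    X_noisy = corrupt(X, drop_prob, rng)
    code = sigmoid(X_noisy @ W + b)        # encode the corrupted input ...
    recon = sigmoid(code @ W.T + c)
    diff = recon - X                       # ... but compare with the clean input
    d_recon = diff * recon * (1 - recon) / len(X)
    d_code = (d_recon @ W) * code * (1 - code)
    W -= lr * (X_noisy.T @ d_code + d_recon.T @ code)
    b -= lr * d_code.sum(axis=0)
    c -= lr * d_recon.sum(axis=0)
    return np.mean(diff ** 2)

# Toy usage (assumed data): two binary patterns, 30% of components dropped.
rng = np.random.default_rng(0)
X = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
W = 0.1 * rng.standard_normal((6, 4))
b = np.zeros(4)
c = np.zeros(6)
losses = [denoising_step(X, W, b, c, lr=0.5, drop_prob=0.3, rng=rng)
          for _ in range(300)]
```

Because a fresh corruption is drawn at every step, the encoder cannot simply copy its input and has to rely on structure shared across components.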

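9.2. Sparse Coding

If we relax the constraint that the output must equal the input and use the concept of a basis from linear algebra, i.e. O = a1Φ1 + a2Φ2 + … + anΦn, where the Φi are basis vectors and the ai are coefficients, we obtain the following optimization problem:

Min |I – O|, where I is the input and O is the output.

Solving this optimization problem gives the coefficients ai and the basis vectors Φi; these coefficients and basis vectors are another, approximate representation of the input.

They can therefore be used to represent the input I, and this process is also learned automatically. If we add an L1 regularity constraint to the expression above, we get:

Min |I – O| + u(|a1| + |a2| + … + |an|)

This method is called Sparse Coding. Put simply, it represents a signal as a linear combination of a set of basis vectors and requires that only a few of them be needed to represent the signal. "Sparsity" is defined as having only a few nonzero elements, or only a few elements far from zero. Requiring the coefficients ai to be sparse means that, for a set of input vectors, we want as few coefficients as possible to be far from zero. There is a reason for representing input data with sparse components: most sensory data, such as natural images, can be expressed as a superposition of a small number of basic elements, which in images can be surfaces or lines. This also strengthens the analogy with the primary visual cortex (the human brain has a huge number of neurons, but for a given image or edge only a few are excited and the rest are inhibited).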

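Sparse coding is an unsupervised learning method that looks for a set of "overcomplete" basis vectors to represent sample data more efficiently. Techniques such as principal component analysis (PCA) make it easy to find a "complete" set of basis vectors, but here we want an "overcomplete" set to represent the input vectors (that is, the number of basis vectors is larger than the dimension of the input vectors). The advantage of an overcomplete basis is that it can capture the structures and patterns hidden in the input data more effectively. With an overcomplete basis, however, the coefficients ai are no longer uniquely determined by the input vector. Sparse coding therefore adds a further criterion, "sparsity", to resolve the degeneracy caused by overcompleteness.

For example, suppose the lowest layer of image feature extraction is to produce edge detectors. The job is then to randomly pick small patches from natural images and use them to generate a "basis" that can describe them, namely the 8×8 = 64 basis elements shown on the right. Given a test patch, we can then express it as a linear combination of the basis elements according to the formula above, and the sparse matrix is a. In the figure below, a has 64 dimensions of which only 3 are nonzero, hence "sparse".

You may wonder why the bottom layer acts as an edge detector, and what the layers above are. A brief explanation: it is an edge detector because edges in different orientations are enough to describe a whole image, so edges in different orientations are naturally the basis of the image. The next layer up is made of combinations of the basis elements from the layer below, and the layer above that is made of combinations from that layer in turn (just as described in Part IV above).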

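Sparse coding has two parts:

1) Training phase: given a series of sample images [x1, x2, …], we need to learn a set of basis vectors [Φ1, Φ2, …], that is, a dictionary.

Sparse coding can be viewed as a variant of the k-means algorithm, and its training process is similar (the idea of the EM algorithm: if the objective function to optimize contains two variables, such as L(W, B), we can first fix W and adjust B to minimize L, then fix B and adjust W to minimize L, alternating iteratively and pushing L toward its minimum).

Training is an iterative process: as described above, we alternately update a and Φ to minimize the objective function below.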
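The objective function was shown as a figure in the original and is not reproduced in this text. A common way to write it, consistent with the L1-regularized expression in this section but assuming a squared reconstruction error and m training samples x_1, …, x_m (both assumptions), is:

```latex
\min_{a,\,\Phi}\;
\sum_{k=1}^{m} \Big\| x_k - \sum_{j=1}^{n} a_{k,j}\,\Phi_j \Big\|_2^2
\;+\; u \sum_{k=1}^{m} \sum_{j=1}^{n} \lvert a_{k,j} \rvert
```

Here a_{k,j} is the coefficient of basis vector Φ_j for sample x_k and u is the sparsity weight. In practice a bound on the norm of each Φ_j is usually added, since otherwise the penalty can be made arbitrarily small by scaling the basis up and the coefficients down.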

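Each iteration has two steps:

a) Fix the dictionary Φ[k] and adjust a[k] to minimize the objective function above (that is, solve a LASSO problem).

b) Then fix a[k] and adjust Φ[k] to minimize the objective function above (that is, solve a convex QP problem).

Iterate until convergence. This yields a set of basis vectors, i.e. a dictionary, that represents this series of x well.

2) Coding phase: given a new image x, use the dictionary obtained above and solve a LASSO problem to get the sparse vector a. This sparse vector is a sparse representation of the input vector x.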
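The two phases can be sketched in NumPy as follows. This is a simplified sketch under stated assumptions, not the article's algorithm: the LASSO step is solved with ISTA (iterative soft-thresholding), the dictionary step uses plain least squares followed by row normalization instead of a constrained QP solver, `lam` plays the role of the weight u, and all function names and the random toy data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|_1 (shrinks every entry toward zero by t)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, D, lam, n_iter=200):
    """Step a / coding phase: solve min_A 0.5*||X - A D||^2 + lam*|A|_1 by ISTA.

    X: (n_samples, n_features), D: (n_atoms, n_features);
    returns the codes A with shape (n_samples, n_atoms).
    """
    step = 1.0 / (np.linalg.norm(D @ D.T, 2) + 1e-12)   # 1 / Lipschitz constant
    A = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(n_iter):
        grad = (A @ D - X) @ D.T
        A = soft_threshold(A - step * grad, step * lam)
    return A

def update_dictionary(X, A, eps=1e-8):
    """Step b: least-squares fit of D with the codes fixed, rows rescaled to unit norm."""
    D = np.linalg.lstsq(A, X, rcond=None)[0]
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D / np.maximum(norms, eps)

def learn_dictionary(X, n_atoms, lam=0.1, n_outer=20, seed=0):
    """Training phase: alternate between the coding step and the dictionary step."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n_atoms, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_outer):
        A = sparse_code(X, D, lam)     # fix the dictionary, solve for the codes
        D = update_dictionary(X, A)    # fix the codes, solve for the dictionary
    return D

# Toy usage (assumed random data): 12 atoms for 8-dimensional inputs (overcomplete).
rng = np.random.default_rng(1)
X_train = rng.standard_normal((50, 8))
D = learn_dictionary(X_train, n_atoms=12)
x_new = rng.standard_normal((1, 8))
a_new = sparse_code(x_new, D, lam=0.1)   # coding phase for a new input
```

The coding phase reuses `sparse_code` unchanged, which reflects the point that encoding a new x is the same LASSO problem as step a with the dictionary held fixed.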


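9.3. Restricted Boltzmann Machine (RBM)

Suppose we have a bipartite graph in which there are no links between nodes of the same layer. One layer is the visible layer, i.e. the input data layer (v), and the other is the hidden layer (h). If we assume that all nodes are random binary variables (taking only the values 0 or 1) and that the joint probability distribution p(v,h) follows a Boltzmann distribution, we call this model a Restricted Boltzmann Machine (RBM).

Now let us see why it is a Deep Learning method. First, because the model is a bipartite graph, all hidden nodes are conditionally independent given v (since there are no connections between them), i.e. p(h|v) = p(h1|v)…p(hn|v). Likewise, given the hidden layer h, all visible nodes are conditionally independent. Since all v and h follow a Boltzmann distribution, given an input v we can obtain the hidden layer h through p(h|v), and from h we can obtain a visible layer again through p(v|h). We adjust the parameters so that the visible layer v1 obtained from the hidden layer matches the original visible layer v. When it does, the hidden layer is another representation of the visible layer, so it can serve as a feature of the visible-layer input data. That is why it is a Deep Learning method.

How is it trained? That is, how are the weights between the visible and hidden nodes determined? We need some mathematical analysis, in other words a model.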

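The energy of a joint configuration can be written as:

The joint probability distribution of a configuration is determined by the Boltzmann distribution (and the energy of that configuration):

Because the hidden nodes are conditionally independent (there are no connections between them), we have:

From this we can fairly easily obtain (by factorizing the expression above) the probability that the j-th hidden node is 1 or 0 given the visible layer v:

Likewise, given the hidden layer h, the probability that the i-th visible node is 1 or 0 is also easy to obtain:

Given an independent and identically distributed sample set D = {v(1), v(2), …, v(N)}, we need to learn the parameters θ = {W, a, b}.

We maximize the following log-likelihood function (maximum likelihood estimation: for a given probabilistic model, we choose the parameters that make the observed samples most probable):

That is, by taking the derivative of the log-likelihood function, we can obtain the parameters W at which L is maximal.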
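The formulas for this derivation were figures in the original and are missing from this text. For reference, the standard textbook forms for a binary RBM are given below. The notation is an assumption: W_{ij} connects visible unit i to hidden unit j, a holds the visible biases and b the hidden biases (the original may assign a and b the other way round), and σ is the logistic sigmoid.

```latex
\begin{aligned}
E(v,h;\theta) &= -\sum_{i}\sum_{j} W_{ij}\, v_i h_j \;-\; \sum_{i} a_i v_i \;-\; \sum_{j} b_j h_j \\
P(v,h) &= \frac{1}{Z}\, e^{-E(v,h;\theta)}, \qquad Z = \sum_{v,h} e^{-E(v,h;\theta)} \\
P(h \mid v) &= \prod_{j} P(h_j \mid v) \\
P(h_j = 1 \mid v) &= \sigma\Big(\sum_{i} W_{ij} v_i + b_j\Big), \qquad
P(h_j = 0 \mid v) = 1 - P(h_j = 1 \mid v) \\
P(v_i = 1 \mid h) &= \sigma\Big(\sum_{j} W_{ij} h_j + a_i\Big), \qquad
P(v_i = 0 \mid h) = 1 - P(v_i = 1 \mid h) \\
L(\theta) &= \sum_{n=1}^{N} \log P\big(v^{(n)}\big), \qquad P(v) = \sum_{h} P(v,h) \\
\frac{\partial \log P(v)}{\partial W_{ij}} &= \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}
\end{aligned}
```

The last line is the usual form of the gradient: the difference between the correlation of v_i and h_j when v is clamped to the data and the same correlation under the model's own distribution.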

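If we increase the number of hidden layers, we get a Deep Boltzmann Machine (DBM). If we use a Bayesian belief network (a directed graphical model, still with no links between nodes within a layer) in the part near the visible layer and a Restricted Boltzmann Machine in the part farthest from the visible layer, we get a Deep Belief Net (DBN).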

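9.4. Deep Belief Networks

DBNs are probabilistic generative models. In contrast to the discriminative models of traditional neural networks, a generative model establishes a joint distribution between observations and labels and evaluates both P(Observation|Label) and P(Label|Observation), whereas a discriminative model evaluates only the latter, P(Label|Observation). DBNs address the following problems that arise when the traditional BP algorithm is applied to deep neural networks:

(1) a labeled sample set must be provided for training;

(2) learning is slow;

(3) poor parameter choices cause learning to converge to a local optimum.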

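DBNs are composed of multiple layers of Restricted Boltzmann Machines; a typical network of this type is shown in Figure 3. These networks are "restricted" to one visible layer and one hidden layer, with connections between the layers but none between units within a layer. The hidden units are trained to capture the higher-order correlations in the data exhibited at the visible layer.

First, setting aside the top two layers, which form an associative memory, the connections of a DBN are guided by top-down generative weights. RBMs act as building blocks, and compared with traditional deeply layered sigmoid belief networks they make the connection weights easier to learn.

Initially, the weights of the generative model are obtained by pre-training with an unsupervised greedy layer-wise method. Hinton showed this method to be effective; it relies on the training procedure he called contrastive divergence.

During this training phase, a vector v is presented at the visible layer and its values are passed to the hidden layer. In the reverse direction, the visible-layer inputs are then sampled stochastically in an attempt to reconstruct the original input signal. Finally, these new visible activations are passed forward again to reconstruct the hidden activations, giving h. In other words, the visible vector is first mapped to the hidden units; the visible units are then reconstructed from the hidden units; and these new visible units are mapped to the hidden units again to obtain new hidden units. These backward and forward steps are the familiar Gibbs sampling, and the difference between the correlations of the hidden activations and the visible inputs serves as the main basis for the weight update.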
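The up-down-up pass described here corresponds to a single step of contrastive divergence (CD-1). A minimal NumPy sketch is shown below. The variable names, the use of hidden probabilities rather than samples in the update, the learning rate, and the toy data are assumptions; `a` denotes visible biases and `b` hidden biases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr, rng):
    """One CD-1 step on a batch V of binary visible vectors (updates W, a, b in place).

    W: (n_visible, n_hidden); a: visible biases; b: hidden biases.
    Returns the mean squared reconstruction error for monitoring.
    """
    # Up: map the visible vectors to the hidden units and sample them.
    ph0 = sigmoid(V @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down: reconstruct the visible units from the sampled hidden state.
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Up again: hidden probabilities for the reconstruction.
    ph1 = sigmoid(v1 @ W + b)
    # Update from the difference between data-driven and
    # reconstruction-driven visible/hidden correlations.
    n = V.shape[0]
    W += lr * (V.T @ ph0 - v1.T @ ph1) / n
    a += lr * (V - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return np.mean((V - pv1) ** 2)

# Toy usage (assumed data): two binary patterns repeated in a batch.
rng = np.random.default_rng(0)
V = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
W = 0.01 * rng.standard_normal((6, 2))
a = np.zeros(6)
b = np.zeros(2)
errors = [cd1_update(V, W, a, b, lr=0.1, rng=rng) for _ in range(500)]
```

The reconstruction error returned here is only a convenient progress indicator; it is not the quantity that contrastive divergence optimizes.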

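Training time is reduced significantly, because a single step is enough to approximate maximum likelihood learning. Each layer added to the network improves the log probability of the training data, which we can understand as getting closer to the true representation of the energy. This meaningful extension, together with the use of unlabeled data, is a decisive factor in any deep learning application.

In the top two layers the weights are tied together, so that the output of the lower layers provides a reference cue or association to the top layer, which then links it to its memory contents. What we care about most in the end is discriminative performance, for example in classification tasks.

After pre-training, a DBN can use labeled data and the BP algorithm to tune its discriminative performance. Here a set of labels is attached to the top layer (extending the associative memory), and a classification boundary for the network is obtained through the learned bottom-up recognition weights. The performance is better than that of a network trained with BP alone. Intuitively, this is because BP in a DBN only needs to perform a local search of the weight parameter space, so training is faster and convergence takes less time than in a feed-forward neural network.

The flexibility of DBNs makes them easy to extend. One extension is Convolutional Deep Belief Networks (CDBNs). DBNs do not take the 2-D structure of images into account, because the input is simply an image matrix flattened into a 1-D vector. CDBNs address this: they exploit the spatial relationship of neighboring pixels and, through a model called the convolutional RBM, achieve translation invariance in the generative model and scale easily to high-dimensional images. DBNs also do not explicitly handle learning temporal relationships between observed variables, although there is research in this direction, such as stacked temporal RBMs and, generalizing these, sequence learning with the so-called temporal convolution machines. Such sequence learning opens an exciting future research direction for speech signal processing.

Current research related to DBNs includes stacked autoencoders, which replace the RBMs in a traditional DBN with autoencoders. This allows deep multi-layer neural network architectures to be trained by the same rules, but without the strict requirements on the parameterization of the layers. Unlike DBNs, autoencoders use a discriminative model, so it is difficult to sample the input space with this structure, which makes it harder for the network to capture its internal representation. Denoising autoencoders, however, avoid this problem well and perform better than traditional DBNs. They achieve good generalization by adding random corruption during training and stacking. Training a single denoising autoencoder is the same as training an RBM as a generative model.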

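| X. Summary and Outlook

1) Deep learning summary

Deep learning is about algorithms that automatically learn multi-layer (complex) representations of the latent (hidden) distribution of the data to be modeled. In other words, deep learning algorithms automatically extract the low-level or high-level features needed for classification. A high-level feature is one that depends hierarchically on other features. In machine vision, for example, a deep learning algorithm learns a low-level representation from the raw image, such as edge detectors or wavelet filters, then builds further representations on top of these, such as linear or nonlinear combinations of them, and repeats the process to arrive at a high-level representation.

Deep learning can obtain features that represent data better. Because the model has many layers and parameters, its capacity is sufficient to represent large-scale data, so for problems such as images and speech, where the features are not obvious (they must be hand-designed and many lack an intuitive physical meaning), it can achieve better results on large-scale training data. In addition, from the standpoint of features and classifiers in pattern recognition, the deep learning framework combines the feature and the classifier into a single framework and learns the features from data, which reduces the huge effort of hand-designing features (currently where engineers in industry spend most of their effort). So not only can the results be better, it is also more convenient to use in many ways. It is a framework well worth attention, and everyone working in ML should get to know it.

Of course, deep learning itself is not perfect, nor is it a cure-all for every ML problem, and it should not be inflated into something that can do anything.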

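2) The future of deep learning

Deep learning still has a great deal of work to be done. Current attention is on borrowing methods from the wider field of machine learning that can be used in deep learning, especially from dimensionality reduction. For example, one line of work is sparse coding, which uses compressed sensing theory to reduce the dimension of high-dimensional data so that a vector with very few elements can accurately represent the original high-dimensional signal. Another example is semi-supervised manifold learning, which measures the similarity of training samples and projects that similarity of the high-dimensional data into a low-dimensional space. Another encouraging direction is evolutionary programming approaches, which can perform conceptually adaptive learning and change the core architecture while minimizing engineering effort.

Deep learning still has many core problems to solve:

(1) For a given framework, up to how many input dimensions can it perform well (for images, possibly millions of dimensions)?

(2) Which architectures are effective for capturing short-term or long-term temporal dependencies?

(3) How can information from multiple sensory modalities be fused in a given deep learning architecture?

(4) What sound mechanisms can enhance a given deep learning architecture to improve its robustness and its invariance to distortion and missing data?

(5) On the modeling side, are there other deep model learning algorithms that are more effective and theoretically grounded?

Exploring new feature extraction models is worth in-depth study. Effective parallel training algorithms are another direction worth investigating. Current stochastic gradient optimization algorithms based on mini-batches are hard to train in parallel across multiple computers. The usual approach is to use graphics processing units (GPUs) to accelerate learning, but a single-machine GPU is not suitable for large-scale data recognition or datasets for similar tasks. In terms of extending deep learning applications, how to make full and sensible use of deep learning to enhance the performance of traditional learning algorithms remains a focus of research in many fields.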

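This article is reposted from 阅面科技 (Reading technology), an artificial intelligence platform focused on deep learning and embedded vision. To reprint it, please contact the original author.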
