# Long Short-Term Memory Networks
## Overview
Long Short-Term Memory (LSTM) is a kind of recurrent neural network (RNN).
In an RNN, the output from the previous step is fed as input to the current step.
LSTM was designed to tackle the long-term dependency problem of RNNs: an RNN predicts well from recent information but cannot recall information it saw many steps earlier.
As the gap between the relevant information and the point where it is needed grows, the performance of an RNN degrades.
An LSTM can retain information for long periods of time, which makes it useful for processing, predicting, and classifying time series data.
## Structure Of LSTM
LSTM has a chain structure that contains four neural networks and different memory blocks called _cells_.
Information is retained by the cells and the memory manipulations are done by the gates. There are three gates:
1. Forget Gate: removes information that is no longer useful from the cell state.
Two inputs, x_t (the input at time t) and h_{t-1} (the previous cell output), are fed to the gate and multiplied by weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1 for each element of the cell state.
If the output for a particular element is close to 0, that piece of information is forgotten; if it is close to 1, the information is retained for future use.
2. Input Gate: adds useful information to the cell state.
First, the information is regulated using the `sigmoid` function, which filters the values to be remembered from the inputs h_{t-1} and x_t, similar to the forget gate.
Next, a vector is created using the `tanh` function, which gives outputs from -1 to +1 and contains all the candidate values computed from h_{t-1} and x_t.
Finally, the values of the vector and the regulated values are multiplied to obtain the useful information.
3. Output Gate: extracts useful information from the current cell state to be presented as an output.
First, a vector is generated by applying the `tanh` function to the cell state.
Next, the information is regulated using the `sigmoid` function, which filters the values to be output using the inputs h_{t-1} and x_t.
Finally, the values of the vector and the regulated values are multiplied and sent both as the output and as the input to the next cell.
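
The three gates described above can be summarized as equations. This is a sketch of the commonly used formulation: the weight matrices W_f, W_i, W_C, W_o, the biases b_f, b_i, b_C, b_o, the symbol \tilde{C}_t for the candidate vector, and o_t for the output gate are names assumed here and do not appear in the text above. [h_{t-1}, x_t] denotes concatenation, \sigma the sigmoid function, and \odot point-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{cell output}
\end{aligned}
```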
## Understanding LSTM Networks
An RNN works on the present input by taking into consideration the previous output (feedback), which it stores in its memory for a short period of time (short-term memory).
However, there are problems with RNNs:
- RNNs fail to store information for a long period of time. Sometimes a reference to information seen long ago is required to predict the current output, but RNNs are unable to handle such “long-term dependencies”.
- There is no fine control over which part of the context needs to be carried forward and how much of the past needs to be ‘forgotten’.
- Other issues with RNNs are exploding and vanishing gradients, which occur while the network is trained through _backpropagation_.
Thus, LSTM networks are an extension of recurrent neural networks (RNNs) designed to handle situations where RNNs fail.
LSTM is designed to mitigate the vanishing gradient problem while leaving the training model unaltered.
LSTMs can bridge long time lags in certain problems, and they can also handle noise, distributed representations, and continuous values.
With LSTMs, there is no need to fix a finite number of states in advance, as is required in the hidden Markov model (HMM).
LSTMs work well over a large range of parameters such as learning rates and input and output biases, so there is little need for fine adjustments.
The complexity of updating each weight at each time step is O(1), which is similar to that of Back Propagation Through Time (BPTT).
### Exploding and Vanishing Gradients
During training, the main goal is to minimize the loss (in terms of error or cost) observed in the output when training data is sent through the network.
We calculate the gradient (the derivative of the loss with respect to a particular set of weights), adjust the weights accordingly, and repeat this process until we reach a set of weights for which the loss is minimal. The gradients are computed by backpropagation. However, sometimes the gradient is almost zero.
Since the gradient of a layer depends on certain components in the successive layers, if some of these components are small (less than 1), the resulting gradient will be even smaller. This is called the scaling effect.
When this gradient is multiplied by the learning rate, which is itself a small value (typically between 0.001 and 0.1), the result is smaller still. Thus, the alteration in weights is quite small, producing almost the same output as before. This is the vanishing gradient problem.
Similarly, if the gradients are large due to large values of the components, the weights get updated to values beyond the optimum. This is called the exploding gradient problem.
To avoid the scaling effect, the neural network unit was rebuilt in such a way that the scaling factor is fixed to one. The cell was then enriched by several gating units and called LSTM.
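
The scaling effect is easy to see numerically. The following is a minimal sketch that assumes the gradient is scaled by the same factor at every step, which is a simplification; the function name `backpropagated_scale`, the step count, and the learning rate are hypothetical values chosen for illustration.

```python
def backpropagated_scale(factor: float, steps: int) -> float:
    """Return the product of `steps` identical per-step scaling factors."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


steps = 50            # assumed number of steps the gradient travels back through
learning_rate = 0.01  # assumed learning rate

vanishing = backpropagated_scale(0.9, steps)  # factor below 1: gradient shrinks
exploding = backpropagated_scale(1.1, steps)  # factor above 1: gradient grows
constant = backpropagated_scale(1.0, steps)   # factor fixed to one: gradient preserved

for name, scale in [("0.9", vanishing), ("1.1", exploding), ("1.0", constant)]:
    # The weight update is proportional to learning_rate * gradient.
    print(f"factor {name}: gradient scale {scale:.6g}, update size {learning_rate * scale:.6g}")
```

With a factor of exactly one the scale stays constant regardless of the number of steps, which is the property the rebuilt unit relies on.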
### Architecture
The basic difference between the architectures of RNN and LSTM is that the **hidden layer** of LSTM is a gated unit or gated cell, which consists of four layers that interact with one another to produce the output of the cell along with the cell state. These two things (cell output and cell state) are then passed on to the next hidden layer.
Unlike RNNs, which only have a single neural net layer of `tanh`, LSTMs are composed of three logistic sigmoid gates and one tanh layer. The gates have been introduced to limit the information that is passed through the cell. They determine which part of the information will be needed by the next cell and which part is to be discarded. Each gate's output lies in the range 0-1, where ‘0’ means ‘reject all’ and ‘1’ means ‘include all’.
### Hidden layers of LSTM
Each LSTM cell has three inputs, h_{t-1}, C_{t-1}, and x_t, and two outputs, h_t and C_t.
For a given time t, h_t is the hidden state, C_t is the cell state or memory, and x_t is the current input.
The first sigmoid layer takes two inputs, h_{t-1} and x_t, where h_{t-1} is the hidden state of the previous cell. This layer is known as the forget gate, since its output f_t selects how much of the previous cell state is kept. The output is a vector of numbers in [0, 1], which is multiplied (point-wise) with the previous cell state C_{t-1}.
### Conventional LSTM
The second sigmoid layer is the input gate, which takes the two inputs h_{t-1} and x_t and decides, through its output i_t, what new information is to be added to the cell.
The tanh layer creates a vector \tilde{C}_t of new candidate values.
Together, these two layers determine the information to be stored in the cell state.
Their point-wise multiplication (i_t * \tilde{C}_t) gives the amount of information to be added to the cell state. The result is then added to the output of the forget gate multiplied by the previous cell state (f_t * C_{t-1}) to produce the current cell state C_t.
Next, the output of the cell is calculated using a sigmoid and a tanh layer.
The sigmoid layer decides which part of the cell state will be present in the output, and the tanh layer squashes the cell state into the range [-1, 1].
The results of the two layers undergo point-wise multiplication to produce the output h_t of the cell.
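
The forward pass described in this section can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `lstm_step`, the single stacked weight matrix `W`, the row order (forget, input, candidate, output), and the sizes used in the demo are all choices made here for brevity.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of a conventional LSTM cell.

    W has shape (4 * hidden, hidden + inputs) and b has shape (4 * hidden,).
    Assumed row layout: forget gate, input gate, candidate values, output gate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b

    f_t = sigmoid(z[:hidden])                     # forget gate
    i_t = sigmoid(z[hidden:2 * hidden])           # input gate
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])   # candidate values
    o_t = sigmoid(z[3 * hidden:])                 # output gate

    c_t = f_t * c_prev + i_t * c_tilde            # new cell state
    h_t = o_t * np.tanh(c_t)                      # new hidden state / cell output
    return h_t, c_t


# Demo: run a random sequence of 5 time steps through the cell.
rng = np.random.default_rng(0)
inputs, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Stacking the four layers into one matrix means a single matrix product per step; keeping four separate matrices would be equivalent.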
### Variations
With the increasing popularity of LSTMs, various alterations have been tried on the conventional LSTM architecture to simplify the internal design of cells to make them work in a more efficient way and to reduce the computational complexity.
- Gers and Schmidhuber introduced peephole connections, which allow the gate layers to see the cell state at every instant.
- Some LSTMs use a coupled input and forget gate instead of two separate gates, so that the decisions on what to forget and what to add are made together.
- Another variation is the Gated Recurrent Unit (GRU), which reduces design complexity by reducing the number of gates: it merges the cell state and hidden state, and combines the forget and input gates into a single update gate.
### GRU vs LSTM
Gated Recurrent Units (GRUs) are recurrent units that also use a gating mechanism to adaptively capture dependencies over different time scales.
A GRU has an update gate and a reset gate.
The update gate selects what knowledge is carried forward, while the reset gate, which sits between two successive recurrent units, decides how much information needs to be forgotten.
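
For comparison with the LSTM cell, a single GRU step can be sketched as follows. This uses one common formulation; some references swap the roles of `z_t` and `1 - z_t`, so treat the convention here as an assumption. The function name `gru_step` and the parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One forward step of a GRU; each W has shape (hidden, hidden + inputs)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)  # update gate: how much new state to take in
    r_t = sigmoid(W_r @ hx + b_r)  # reset gate: how much old state feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    # Blend the old state and the candidate; there is no separate cell state.
    return (1.0 - z_t) * h_prev + z_t * h_tilde


# Demo with small random parameters.
rng = np.random.default_rng(0)
inputs, hidden = 2, 3
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden)
h = np.zeros(hidden)
for x_t in rng.normal(size=(4, inputs)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h)
```

Compared with the LSTM step, the GRU carries a single state vector and uses three weight blocks instead of four.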
### Applications
LSTM models need to be trained on a training dataset before they are used in real-world applications.
Some of the most demanding applications are discussed below:
- Language modelling or text generation, which involves predicting the next words when a sequence of words is fed as input. Language models can operate at the character level, n-gram level, sentence level, or even paragraph level.
- Image processing, which involves analyzing a picture and summarizing the result in a sentence. This requires a dataset comprising a good number of pictures with their corresponding descriptive captions.
A model that has already been trained is used to predict features of images present in the dataset (photo data). The dataset is then processed in such a way that only the words that are most suggestive are retained (text data). Using these two types of data, we try to fit the model.
The purpose of the model is to generate a descriptive sentence for the picture one word at a time by taking input words that were predicted previously by the model and also the image.
- Speech and Handwriting Recognition
- Music generation, which is quite similar to text generation: LSTMs predict musical notes instead of text by analyzing a combination of notes fed as input.
- Language translation, which involves mapping a sequence in one language to a sequence in another language. Similar to image processing, a dataset containing phrases and their translations is first cleaned, and only a part of it is used to train the model.
An encoder-decoder LSTM model is used, which first converts the input sequence to its vector representation (encoding) and then outputs the translated version.
### Drawbacks
LSTMs also have a few drawbacks:
- LSTMs are designed to solve the problem of vanishing gradients, but they fail to remove it completely. The problem lies in the fact that the data still has to move from cell to cell for evaluation. In addition, the cell has become more complex with the additional features (such as forget gates).
- LSTMs require a lot of resources and time to train before they are ready for real-world applications. They need high memory bandwidth because of the linear layers present in each cell, which the system usually fails to provide. Thus, hardware-wise, LSTMs are quite inefficient.
- With the rise of data mining, developers are looking for a model that can remember past information for a longer time than LSTMs. The inspiration for such a model is the human habit of dividing a given piece of information into small parts for easy remembrance.
- LSTMs are sensitive to random weight initialization and, in this respect, behave much like a feed-forward neural net; they prefer small initial weights.
- LSTMs are prone to overfitting and it is difficult to apply the dropout algorithm to curb this issue. Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network.
-----
## The 5 Step Life-Cycle for LSTM Models in Keras
The article [2] discusses the 5 steps in the LSTM model life-cycle in Keras:
1. Define Network
2. Compile Network
3. Fit Network
4. Evaluate Network
5. Make Predictions
## Derivation of Backpropagation Through Time
LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that is well suited to prediction and classification. The source article derives the backpropagation through time algorithm and finds the gradient for all the weights at a particular time step.
As the name suggests, backpropagation through time is similar to backpropagation in a deep neural network, but because of the time dependency in RNNs and LSTMs, the chain rule has to be applied across time steps.
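
The chain rule with time dependency can be written out for a generic recurrent network. This is a sketch and not the source article's derivation: the total loss L, the per-step losses L_t, the sequence length T, the shared weights W, and the notation \partial^{+} for the immediate partial derivative (treating h_{k-1} as a constant) are assumptions introduced here. The second line follows from the cell state update C_t = f_t * C_{t-1} + i_t * \tilde{C}_t and covers only the direct path through the cell state, ignoring the dependence of the gates on h_{t-1}.

```latex
\begin{aligned}
\frac{\partial L}{\partial W}
  &= \sum_{t=1}^{T} \sum_{k=1}^{t}
     \frac{\partial L_t}{\partial h_t}
     \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
     \frac{\partial^{+} h_k}{\partial W},
  \qquad L = \sum_{t=1}^{T} L_t \\
\left. \frac{\partial C_t}{\partial C_{t-1}} \right|_{\text{direct}}
  &= \operatorname{diag}(f_t)
\end{aligned}
```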

-----
## LSTM Autoencoders
An **LSTM Autoencoder** is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM architecture.
The encoder part of the model can be used to encode or compress sequence data that in turn may be used in data visualizations or as a feature vector input to a supervised learning model.
- Autoencoders are a type of self-supervised learning model that can learn a compressed representation of input data.
- LSTM Autoencoders can learn a compressed representation of sequence data and have been used on video, text, audio, and time series sequence data.
The source post is divided into six sections:
1. What Are Autoencoders?
2. A Problem with Sequences
3. Encoder-Decoder LSTM Models
4. What Is an LSTM Autoencoder?
5. Early Application of LSTM Autoencoder
6. How to Create LSTM Autoencoders in Keras
The encoder-decoder model provides a pattern for using recurrent neural networks to address challenging sequence-to-sequence prediction problems such as machine translation.
Encoder-decoder models can be developed in the Keras Python deep learning library. An example of a neural machine translation system built with this model is described on the Keras blog, with sample code distributed with the Keras project.
## Encoder-Decoder with Attention
The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems where the length of the input sequences may differ from the length of the output sequences.
It is composed of two sub-models:
- Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
- Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.
A problem with the architecture is that performance is poor on long input or output sequences. The reason is believed to be the fixed-size internal representation used by the encoder.
_Attention_ is an extension to the architecture that addresses this limitation. It provides a richer context from the encoder to the decoder, together with a learning mechanism through which the decoder can learn where to pay attention in the richer encoding when predicting each time step of the output sequence.
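
The idea of letting the decoder weight the encoder outputs can be sketched with dot-product scoring. This is only one of several attention variants and is chosen here as an assumption for brevity; the function names `softmax` and `attend` and the array sizes are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def attend(decoder_state, encoder_states):
    """Build a context vector for one decoder step.

    encoder_states has shape (T, d): one vector per input time step.
    decoder_state has shape (d,): the current decoder hidden state.
    """
    scores = encoder_states @ decoder_state   # (T,) similarity per input step
    weights = softmax(scores)                 # (T,) attention weights, summing to 1
    context = weights @ encoder_states        # (d,) weighted sum of encoder states
    return context, weights


# Demo: 6 encoder time steps with 4-dimensional states.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))
decoder_state = rng.normal(size=4)
context, weights = attend(decoder_state, encoder_states)
print(np.round(weights, 3), context.shape)
```

Instead of one fixed-length context vector for the whole sequence, the decoder receives a different `context` at every output step, recomputed from all encoder states.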
## References
- [1] [Long Short-Term Memory Networks (LSTMs)](https://machinelearningmastery.com/start-here/#lstm)
- [2] [The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras](https://machinelearningmastery.com/5-step-life-cycle-long-short-term-memory-models-keras/)
- [Introduction to Long Short Term Memory](https://www.geeksforgeeks.org/deep-learning-introduction-to-long-short-term-memory/)
- [Understanding of LSTM Networks](https://www.geeksforgeeks.org/understanding-of-lstm-networks/)
- [LSTM – Derivation of Back propagation through time](https://www.geeksforgeeks.org/lstm-derivation-of-back-propagation-through-time/)
- [A Gentle Introduction to LSTM Autoencoders](https://machinelearningmastery.com/lstm-autoencoders/)
- [How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras](https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/)
- [Implementation Patterns for the Encoder-Decoder RNN Architecture with Attention](https://machinelearningmastery.com/implementation-patterns-encoder-decoder-rnn-architecture-attention/)
- [LSTM for time series prediction](https://towardsdatascience.com/lstm-for-time-series-prediction-de8aeb26f2ca)
- [How to Develop LSTM Models for Time Series Forecasting](https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/)
- [Multivariate input LSTM in PyTorch](https://stackoverflow.com/questions/56858924/multivariate-input-lstm-in-pytorch)
- [PyTorch for Deep Learning — LSTM for Sequence Data](https://medium.com/analytics-vidhya/pytorch-for-deep-learning-lstm-for-sequence-data-d0708fdf5717)