According to Emergen Research, the use of deep learning has grown significantly over the last decade, driven by the development of cloud-based technology and the application of deep learning to big data. Deep learning is expected to become a $93 billion market by 2028.
But what is deep learning precisely, and how does it work?
Deep learning is a subtype of machine learning that uses neural networks to learn from data and make predictions. It has demonstrated outstanding performance in a variety of tasks, including text, time series, and computer vision. Its success is largely due to the availability of massive amounts of data and computing power. For many of these tasks, deep learning is also considered superior to traditional machine learning algorithms, for a variety of reasons.
Neural networks and functions in deep learning
A neural network is an interconnected network of neurons, where each neuron is a limited function approximator; in this sense, neural networks are seen as universal function approximators. A function, as you may recall from high school algebra, is a mapping from an input space to an output space. The basic sin(x) function, for example, maps from angle space (-180° to 180°, or 0° to 360°) to the real interval [-1, 1].
Let’s have a look at why neural networks are regarded as universal function approximators. Each neuron learns a limited function: f(.) = g(W·X), where X is the input vector, W is the weight vector, and g(.) is a non-linear transformation. W·X can be represented as a line in high-dimensional space (a hyperplane), and g(.) can be any non-linear differentiable function such as sigmoid, tanh, or ReLU (the ones commonly used in the deep learning community).
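As a minimal sketch (the weights, inputs, and bias here are illustrative placeholders, not values from the article), a single neuron computing g(W·X) with a sigmoid non-linearity looks like this:

```python
import numpy as np

def sigmoid(z):
    # The non-linear transformation g(.)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(X, W, b=0.0):
    # A single neuron: f(X) = g(W·X + b), a limited function of its inputs
    return sigmoid(np.dot(W, X) + b)

X = np.array([0.5, -1.2, 3.0])   # input vector
W = np.array([0.4, 0.1, -0.7])   # weight vector the neuron would learn
print(neuron(X, W))              # a single scalar activation
```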
Finding the best weight vector W is the essence of learning in neural networks. Take y = mx + c, for example, which has two weights: m and c. Given the distribution of points in 2D space, we identify the values of m and c that meet a certain criterion: the gap between the predicted y and the actual y is as small as possible across all data points.
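To make the y = mx + c example concrete, here is a rough sketch (synthetic data and a hand-picked learning rate, purely for illustration) of finding m and c by gradient descent on the squared gap between predicted and actual y:

```python
import numpy as np

# Synthetic 2D points roughly following y = 2x + 1 (illustrative data)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

m, c = 0.0, 0.0   # the two weights to learn
lr = 0.1          # learning rate, hand-picked for this sketch
for _ in range(500):
    error = (m * x + c) - y
    # Gradients of the mean squared error with respect to m and c
    m -= lr * 2 * np.mean(error * x)
    c -= lr * 2 * np.mean(error)

print(m, c)  # should end up close to 2 and 1
```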
The effect of layers
We stack numerous such neurons in a “layer”, where each neuron receives the identical set of inputs but learns its own distinct weights W. Each layer therefore produces a set of learned function values, termed hidden layer values: [f1, f2, …, fn]. These values are combined again in the next layer, h(f1, f2, …, fn), and so on, so that each layer is composed of functions from the previous layer (for example, h(f(g(x)))). It has been demonstrated that any non-linear complex function may be learned using this composition.
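A rough numpy sketch of this stacking (layer sizes and weight values are arbitrary placeholders) shows how one layer’s outputs [f1, …, fn] become the next layer’s inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer(X, W, b):
    # Every neuron in the layer sees the same input X but has its own row of weights in W
    return relu(W @ X + b)

rng = np.random.default_rng(0)
X = rng.normal(size=4)                           # input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # first hidden layer: 8 neurons
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # next layer combines the 8 hidden values

hidden = layer(X, W1, b1)       # [f1, f2, ..., f8]
output = layer(hidden, W2, b2)  # h(f1, ..., f8): a composition of functions
print(output)
```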
A neural network with several hidden layers (typically more than two) is called a deep neural network, and learning with such networks is deep learning. In essence, deep learning is a complicated composition of functions from layer to layer, with the goal of finding the function that maps the input to the output. If the input is an image of a lion and the output is a classification indicating that the image belongs to the class “lion”, deep learning is learning a function that maps image vectors to classes. Similarly, if the input is a word sequence and the output is whether the sentence is positive, neutral, or negative, deep learning is learning a map from the input text to those output classes.
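As an illustration of such a mapping, here is a minimal sketch (using PyTorch purely as an example framework; the layer sizes and class count are placeholders) of a deep network that maps a flattened image vector to class scores:

```python
import torch
import torch.nn as nn

# A stack of layers: each Linear + ReLU is one function composed with the next.
model = nn.Sequential(
    nn.Linear(3 * 64 * 64, 256),  # input: a 64x64 RGB image flattened to a vector
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 10),            # output: scores for 10 classes (e.g., "lion" could be one)
)

image_vector = torch.randn(1, 3 * 64 * 64)  # stand-in for a real image
class_scores = model(image_vector)
print(class_scores.argmax(dim=1))           # predicted class index
```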
Interpolation with deep learning
One biological interpretation is that humans process images of the world hierarchically, bit by bit, from low-level elements like edges and contours to high-level features like objects and scenes. Function composition in neural networks is similar, with each composed function learning progressively more complicated characteristics of an image. The most common neural network design for images is the Convolutional Neural Network (CNN), which learns those features in a hierarchical form and then classifies them into distinct classes using a fully connected neural network.
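A minimal CNN sketch (again in PyTorch for illustration; the filter counts and image size are placeholder choices, not a prescribed architecture) makes the hierarchy explicit: convolutional layers learn the features, a fully connected head classifies them.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features: edges, contours
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level features: parts, textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),                  # fully connected classifier over 10 classes
)

x = torch.randn(1, 3, 64, 64)  # a dummy 64x64 RGB image
print(cnn(x).shape)            # torch.Size([1, 10])
```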
Using high school math once more: given a set of 2D data points, we try to fit, through interpolation, a curve that approximately represents the function defining those data points. The more complex the function we fit (in interpolation, determined by the polynomial degree, for example), the better it fits the data, but the less it generalizes to a new data point. This is where deep learning runs into what is known as the overfitting problem: fitting the data as closely as possible while sacrificing generalization. Almost all deep learning architectures have to deal with this critical issue in order to learn a generic function that performs equally well on unseen input.
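A quick sketch of this trade-off (synthetic data; the degrees and sample sizes are arbitrary) compares polynomial fits of increasing degree on a handful of noisy points:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = sample(15)   # the points we fit
x_test, y_test = sample(200)    # unseen points used to check generalization

for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    fit_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    gen_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The high-degree polynomial fits the 15 training points almost perfectly,
    # but its error on unseen points grows: overfitting.
    print(degree, round(fit_err, 3), round(gen_err, 3))
```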
“Deep Learning is not as stunning as you think because it is just interpolation coming from glorified curve fitting,” wrote Yann LeCun (developer of the convolutional neural network and ACM Turing Award winner) on Twitter, referring to a paper. “However, there is no such thing as interpolation in high dimensions. Everything at high dimensions is extrapolation.” So, as part of function learning, deep learning does nothing but interpolation or, in some situations, extrapolation. And that is all there is to it.
The learning aspect
So, how do we learn such a difficult function? It all depends on the problem at hand, and the neural network architecture is chosen accordingly. For image classification, CNNs are the tool of choice. For time-dependent or sequential predictions (such as text), we use RNNs or transformers, and for dynamic environments (like driving a car) we use reinforcement learning. Beyond the architecture, learning entails dealing with a variety of challenges:
- Regularization is used to ensure that the model learns a general function rather than merely fitting the training data.
- The loss function is chosen based on the task at hand; roughly speaking, the loss function is an error function between what we want (actual value) and what we currently have (current prediction).
- Gradient descent is the procedure for converging to an optimal function; choosing the learning rate is difficult because when we are far from the optimum we want to move toward it faster, and when we are close we want to move slower to make sure we converge to the optimal, global minimum.
- A large number of hidden layers gives rise to the vanishing gradient problem; architectural adjustments such as skip connections and an appropriate non-linear activation function help solve it (a minimal skip-connection sketch follows this list).
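The sketch below shows one way a skip connection can be written (PyTorch used for illustration; the block structure and sizes are assumptions, not a specific published architecture): the input is added back to the block’s output, giving gradients a short path through the network.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.fc2(self.act(self.fc1(x)))
        return self.act(out + x)   # skip connection: x is added to the transformed output

x = torch.randn(2, 32)
print(ResidualBlock(32)(x).shape)  # torch.Size([2, 32])
```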
Challenges in computing
Now that we know that deep learning is nothing more than learning a complicated function, it presents a new set of computational challenges:
a) A significant amount of data is required to learn a complex function.
b) We need fast computation environments to process enormous amounts of data.
c) Such settings necessitate infrastructure that can support them.
Learning the weights of a neural network (also called the parameters of a DL model), which can number in the millions or billions, requires vector (or tensor) multiplications, and parallel processing with CPUs alone is insufficient. GPUs are particularly handy here because they can perform parallel vector multiplications quickly. Depending on the deep learning architecture, data size, and task at hand, we sometimes need one GPU and sometimes many. A data scientist must make this decision based on published literature or by assessing performance on a single GPU.
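As a small illustration (PyTorch assumed; the matrix sizes are arbitrary), the same tensor multiplication can be run on a GPU when one is available:

```python
import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

cpu_result = a @ b                         # matrix multiplication on the CPU
gpu_result = a.to(device) @ b.to(device)   # the same multiplication on the GPU when present
print(device, gpu_result.shape)
```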
A deep learning network may learn any mapping from one vector space to another vector space given the right neural network architecture (number of layers, number of neurons, non-linear function, etc.) and enough data. Deep learning is a strong tool for any machine learning endeavor because of this.