The Power of Rectified Linear Unit (ReLU) Activation Function in Multilayer Perceptrons (MLPs)

Ashutosh
4 min read · May 31, 2023


Introduction:

The activation function plays a crucial role in the functioning of neural networks. Among the various activation functions, the Rectified Linear Unit (ReLU) has gained significant popularity in recent years. In this blog, we will delve into the mathematics behind ReLU, explore its benefits, and provide a code implementation of ReLU activation in a Multilayer Perceptron (MLP) model.

Understanding ReLU Activation:

ReLU is a piecewise linear activation function that introduces non-linearity into the neural network. It is defined as follows:

f(x) = max(0, x)

The ReLU function takes an input x and returns the input value if it is positive and 0 otherwise. This simple yet effective non-linearity has several advantages over other activation functions.
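As a quick illustration, here is a minimal NumPy sketch of the same function (the sample values are arbitrary):

import numpy as np

def relu(x):
    # Element-wise max(0, x): negative entries become 0, positive entries pass through
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negative and zero entries map to 0; positive entries are unchanged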

Benefits of ReLU Activation:

  1. Simplicity: ReLU is a simple function to compute, requiring only a comparison and maximum operation. This simplicity contributes to faster training and inference times.
  2. Non-linearity: ReLU introduces non-linearity, allowing neural networks to learn complex patterns and make nonlinear transformations of input data. This is essential for handling real-world datasets with intricate relationships.
  3. Sparse Activation: ReLU promotes sparse activation by zeroing out negative inputs. Sparse activation reduces the number of active neurons and helps mitigate overfitting by encouraging the network to focus on essential features.
  4. Avoiding the Vanishing Gradient: ReLU helps address the vanishing gradient problem, which occurs when gradients become extremely small during backpropagation. ReLU's gradient is 1 for positive inputs and 0 for negative ones, so the error signal passes through active neurons undiminished (see the short sketch after this list).
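To make the sparsity and gradient claims concrete, here is a minimal NumPy sketch (relu and relu_derivative are illustrative helper names, not library functions):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # ReLU's gradient: 1 where the input is positive, 0 elsewhere
    return (x > 0).astype(float)

x = np.random.randn(1000)             # random pre-activations
print(np.mean(relu(x) == 0))          # fraction of zeroed (inactive) units, roughly 0.5 for this input
print(np.unique(relu_derivative(x)))  # only two gradient values: 0 and 1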

Mathematical Formulation of ReLU in an MLP:

To see where ReLU enters the computation, consider a simple MLP with a single hidden layer. The pre-activation of the hidden layer can be expressed as:

h = Wx + b,

where W is the weight matrix, x is the input vector, and b is the bias vector.

Applying ReLU activation to the hidden layer:

f(h) = max(0, h),

where the maximum is applied element-wise: every negative entry of h is replaced with zero, while positive entries are kept intact.
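A minimal NumPy sketch of this hidden-layer computation, with arbitrary shapes and random values for illustration:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix: 4 hidden units, 3 input features
x = rng.standard_normal(3)        # input vector
b = np.zeros(4)                   # bias vector

h = W @ x + b                     # pre-activation of the hidden layer
f_h = np.maximum(0, h)            # ReLU: negative entries replaced with zero
print(h)
print(f_h)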

Code Implementation of ReLU Activation:

Now, let's implement the ReLU activation function in an MLP using Python and the TensorFlow framework:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a sample binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the MLP model with ReLU activation in the hidden layer
model = Sequential()
model.add(Dense(64, input_dim=20))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
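Once training finishes, the model can be checked on the held-out split. A short follow-up sketch, continuing from the code above (the 0.5 threshold is the usual choice for a sigmoid output):

# Evaluate on the test set and inspect a few predictions
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy:.3f}")

probs = model.predict(X_test[:5])            # sigmoid outputs in [0, 1]
print((probs > 0.5).astype(int).ravel())     # thresholded class predictions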

Weight Update Expression:

During the training process, the weights of the neural network are updated using an optimization algorithm, such as gradient descent. The weight update equation for a given layer can be derived using the chain rule and the derivative of the ReLU activation function.

Assuming a single hidden layer in the MLP, the weight update for the layer connecting the input to the hidden layer (the layer whose output passes through ReLU) can be represented as:

delta_w = learning_rate * (delta * relu_derivative(h)) * input,

where:

  • delta_w represents the change in weights.
  • learning_rate determines the step size of the weight update.
  • delta represents the error signal backpropagated from the output layer.
  • relu_derivative(h) represents the derivative of the ReLU activation function evaluated at the hidden layer's pre-activation h = Wx + b (1 for positive entries, 0 otherwise).
  • input represents the input vector x fed to the hidden layer.

Note that the weight update expression varies depending on the specific architecture and the optimization algorithm used. The expression provided above is a simplified representation for a single layer.

Implementing the weight update expression in the code example would involve incorporating the appropriate optimization algorithm, such as gradient descent, and applying the weight update equation during the backpropagation step.
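For concreteness, here is a minimal NumPy sketch of one manual gradient-descent step for a single-hidden-layer network with a ReLU hidden layer and a sigmoid output trained with binary cross-entropy. All names, shapes, and the learning rate are illustrative assumptions; in the Keras example above, the Adam optimizer performs these updates automatically:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
x = rng.standard_normal((20, 1))            # one input sample (20 features)
y = np.array([[1.0]])                       # its binary label
W1 = rng.standard_normal((64, 20)) * 0.1    # input -> hidden weights
b1 = np.zeros((64, 1))
W2 = rng.standard_normal((1, 64)) * 0.1     # hidden -> output weights
b2 = np.zeros((1, 1))
learning_rate = 0.01

# Forward pass
h = W1 @ x + b1                             # hidden pre-activation
a = relu(h)                                 # hidden output
y_hat = sigmoid(W2 @ a + b2)                # predicted probability

# Backward pass (binary cross-entropy + sigmoid gives this simple output error)
delta_out = y_hat - y                                     # error at the output layer
delta_hidden = (W2.T @ delta_out) * relu_derivative(h)    # error backpropagated through ReLU

# Gradient-descent updates; the W1 update mirrors the delta_w expression above
W2 -= learning_rate * (delta_out @ a.T)
b2 -= learning_rate * delta_out
W1 -= learning_rate * (delta_hidden @ x.T)
b1 -= learning_rate * delta_hidden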


Written by Ashutosh
M.Tech in Control Systems from IIEST, Shibpur. Data Scientist @EBIW.