While looking at job postings for a machine learning engineer, I noticed that one of the requirements was knowledge of deep learning and PyTorch. For the traditional data scientist, deep learning is typically not part of the repertoire; it usually belongs to the study of computer science and artificial intelligence. However, given the rise of AI and the demand for these skills, it may behoove the data scientist to know a bit about deep learning.
I found two references that may be a good introduction to the topic:
Mueller, J. P., & Massaron, L. (2024). Python for data science for dummies (3rd ed.). John Wiley & Sons.
Julian, D. (2018). Deep learning with PyTorch quick start guide: Learn to train and deploy neural network models in Python. Packt Publishing Ltd.
For this post, let’s focus on Chapter 19 of Mueller and Massaron (2024), “Playing with Neural Networks.”
Neural networks (NNs) developed as an attempt to reverse engineer how the brain processes signals, borrowing terms such as axons and neurons, but they are effectively a sophisticated form of linear regression. NNs are effective for complex problems such as image and sound recognition and machine translation. Deep learning (DL) lies behind Siri and other digital assistants as well as ChatGPT. DL typically requires specialized hardware such as GPUs and specialized frameworks such as Keras, TensorFlow, and PyTorch.
The core building block of an NN is the neuron, or unit. Many neurons arranged in an interconnected structure make up the layers of a neural network, with each neuron linking to the inputs and outputs of other neurons. NNs can only process numeric (continuous) information rather than categorical variables, but this can be resolved by converting categorical values to binary ones (similar to dummy coding in regression), as sketched below.
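A minimal sketch of that conversion using pandas (the column name and values here are hypothetical, not from the book):

```python
import pandas as pd

# Hypothetical DataFrame with a categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Convert the categorical variable into binary (0/1) indicator columns,
# analogous to dummy coding in regression
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)  # one 0/1 column per category: color_blue, color_green, color_red
```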
In biology, neurons receive signals but don’t always release one; a neuron fires an answer only when it receives enough stimuli, and otherwise it remains silent. In NNs, a neuron receives weighted values, sums them, and applies an activation function to the result, which transforms it in a possibly nonlinear way. For example, the activation function can release a zero value unless the input reaches a certain threshold, or it can dampen or enhance a value by nonlinearly rescaling it. Each neuron in the network receives inputs from the previous neurons, weights them, sums them all, and transforms the result with an activation function. After activation, the computed output becomes input for other neurons or the prediction of the network. The weights of the NN are similar to the coefficients of a linear regression, and the network learns their values by repeated passes (iterations or epochs) over the dataset.
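To make the weighted-sum-then-activate idea concrete, here is a small NumPy sketch of a single neuron; the inputs, weights, and bias are made-up values, and ReLU stands in as an example of an activation that releases zero below a threshold:

```python
import numpy as np

def relu(z):
    # Activation: releases zero for negative input, passes positive values through
    return np.maximum(0.0, z)

# Made-up inputs from previous neurons and their learned weights
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2

# Weighted sum of the inputs plus a bias term
z = np.dot(inputs, weights) + bias

# The activation function transforms the sum, possibly nonlinearly
output = relu(z)
print(output)  # becomes input to other neurons or the network's prediction
```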
The book provides code examples using Keras from TensorFlow. The example uses the handwritten digits dataset as an instance of multiclass classification. NNs are sensitive to the scale and distribution of the data, so it’s good practice to normalize the variables by creating z-scores, i.e., scaling each to mean 0 and standard deviation 1. The target or outcome requires each terminal neuron to make a prediction, which is a numeric value or probability. For classification, one approach involves using one-hot encoding (similar to dummy coding in regression) and assigning a separate neuron with sigmoid activation to predict the probability for each class. The class with the highest probability is the winner.
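A hedged sketch of that preparation step, assuming scikit-learn’s digits dataset stands in for the handwritten digits and using StandardScaler for z-scores and a Keras utility for one-hot encoding (the book’s exact code may differ):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

# Handwritten digits: 8x8 pixel images flattened to 64 features, 10 classes
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features to mean 0 and standard deviation 1 (z-scores)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# One-hot encode the class labels so each terminal neuron predicts one class
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
```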
To construct the architecture and train the network, initialize an empty network and add layers of neurons progressively, starting from the top where the data is input to the bottom where results are obtained. The example has two hidden layers of 64 and 32 neurons activated by the ReLU function (Rectified Linear Unit), defined mathematically as f(x) = max(0, x), which means it outputs only non-negative values. The activation function enables the network to learn nonlinear patterns, and each layer is followed by a dropout layer that serves as a regularization technique to prevent overfitting. The network concludes with a layer containing the probabilities for the classes, from which the winning class is determined. This final layer uses the softmax activation function, which was defined previously in Transformers, BERT, and GPT (Chapter 4). The code iterates over the data 50 times (epochs) and processes the data in batches of 32 examples each. During training, the code reports progress updates as well as evaluation metrics computed on the test data. The training loss and validation loss can be plotted over the epochs. Training loss is the error between the predictions and the actual outcomes in the training data; in linear regression this is similar to the residual, y − ŷ. Validation loss is computed in the same way, but on the testing data.
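Continuing the preparation sketch above (it reuses X_train, y_train, X_test, and y_test), here is one plausible Keras version of the architecture the chapter describes: two hidden layers of 64 and 32 ReLU neurons, each followed by dropout, a 10-neuron softmax output, 50 epochs, and batches of 32. The dropout rate, optimizer, and other details are my assumptions, not the book’s exact code:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

# Initialize an empty network and add layers from input to output
model = Sequential([
    Input(shape=(64,)),                   # 64 pixel features per digit image
    Dense(64, activation="relu"),         # first hidden layer
    Dropout(0.3),                         # regularization to prevent overfitting
    Dense(32, activation="relu"),         # second hidden layer
    Dropout(0.3),
    Dense(10, activation="softmax"),      # one probability per digit class
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Iterate over the data 50 times in batches of 32, tracking loss on the test data
history = model.fit(X_train, y_train,
                    epochs=50, batch_size=32,
                    validation_data=(X_test, y_test),
                    verbose=1)

# Plot training and validation loss over the epochs
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```

If the two curves diverge, with validation loss rising while training loss keeps falling, that is the overfitting the dropout layers are meant to counteract.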