Building a Logistic Regression Model with Gradient Descent from Scratch
The inspiration for this project came from Andrew Ng’s “Machine Learning” course on Coursera.org. The projects for that course were written in the Octave programming language, which offers many functions similar to those in Python’s Numpy library. With that in mind, I decided to practice what I learned by building a Logistic Regression model from scratch in Python. I chose a simple gradient descent method for training the model weights and then tested my code with the famous “Iris” dataset that comes packaged with Python’s Scikit Learn library.
Skills Demonstrated
- Python Libraries: Numpy, Pandas, Seaborn, Scikit Learn
- Logistic Regression coded with vectorized array functions
- Gradient Descent, also with vectorized array functions
The Data
The first use of the Iris dataset is credited to Sir R.A. Fisher, and the data has frequently been used in demonstrating pattern recognition and classification in machine learning. For this project, I elected to use the version of the data that comes packaged with the Scikit Learn Python library. The target variable is the iris species, with three possible values. It is a small dataset with 150 rows and four features. The observations are divided evenly into 50 rows of each species, and there are no missing values.
Data Features:
- Sepal Length (cm)
- Sepal Width (cm)
- Petal Length (cm)
- Petal Width (cm)
Target Variable: Species
- setosa
- versicolor
- virginica
Visualizing the Data
The scatter plots below, color-coded by iris species, show the relationships among the variables. In particular, petal length and petal width appear to have a positive linear relationship across the three species. We also see that setosa (green dots) has distinctly different petal and sepal measurements from the other two species, so, at a glance, one would expect the model to perform consistently well on setosa. On the other hand, the measurements for versicolor and virginica (coral and blue dots) overlap in the plots, so one might expect the model to be somewhat less accurate in classifying those two species.
Figure 1: Pair plots comparing the relationships among data features
Python Code for Loading the Data and Creating the Pair Plots
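In outline, the loading and plotting code looks something like the sketch below (the variable names here are illustrative, not a verbatim copy of the project code):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load the Iris data that ships with Scikit Learn
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Preview the first five rows (Figure 2)
print(df.head())

# Pair plots of every feature combination, color-coded by species (Figure 1)
sns.pairplot(df, hue="species")
```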
Figure 2: The first five rows of the Iris dataframe
The Model: Logistic Regression with Gradient Descent
Logistic regression uses the sigmoid function to map values into the range between 0 and 1, which makes it useful for True/False classifications. In this model form, the inputs are an array of X values and a corresponding array of trained weights (also known as coefficients). Equation 1 illustrates the general form of the model in mathematical symbols.
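$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{1}$$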
When choosing among multiple classes of the target variable, y, a different set of model weights is trained for each class. After training, probability predictions are made using the weights for all classes and each row of X data. Final predictions are determined by finding the maximum class probability prediction for each row.
z can also be written in its expanded form, using Numpy indexing from zero:
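$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$$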
The variables in the above equation are:
- X = the input values (measurements of parts of the iris, in this case)
- y = the classification of each iris’s species
- θ = theta, representing an array of weights for each classification
Calculation Notes:
The leftmost (subscript-zero) term is the bias term, and the value of x0 in this term is always “1.” Written out term by term, the equation for z could omit x0 entirely. However, adding a column of ones to the 2D array/matrix X makes the implementation cleaner in Numpy. Note that some texts begin indexing at 1; I chose to index the equations from 0 to be consistent with the Python code later in this article.
I kept the superscript T (meaning transpose) in the equation for purposes of textbook notation. However, the intention is for each individual x value to be multiplied by its corresponding θ coefficient for all rows of data. During Numpy implementation, I found it simpler to reverse the order of the terms and instead do matrix multiplication of X times θ (not transposed).
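A small illustration of that reordering (the toy shapes here are arbitrary):

```python
import numpy as np

# Toy example: 5 rows, 4 features, plus a leading column of ones for the bias
X = np.hstack([np.ones((5, 1)), np.random.rand(5, 4)])
theta = np.zeros(5)  # one weight per feature, plus the bias weight

# Textbook form: z = theta^T x, computed one row at a time
z_rowwise = np.array([theta.T @ x for x in X])

# Equivalent vectorized form: X times theta, all rows at once
z_vectorized = X @ theta
```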
Also, while θ is referred to here as the model “weights,” it functions much like the coefficients used in algebra. Notation differs across texts, but the general idea is that X represents the collected rows of x values, and the θ values are optimized to produce the smallest amount of error when the X values are input into the trained model.
The functions I coded for the Logistic Regression model are:
- Sigmoid function
- Cost function (log loss, computed from the sigmoid outputs)
- Gradient computation function (derivative of the cost)
- Gradient descent function to solve for theta
- Model training function for a binary y array
- Model training function for y with multiple classes
- Predict probabilities for all classes
Additional function used: Numpy’s argmax function to find the column index of the class with the highest probability for each row
Python Code for Training and Testing the Model
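Below is a condensed sketch of those functions. The function names, the regularization details, and the exact update rule are my own reconstruction of the approach described above (following the regularized form taught in Ng’s course), so treat it as illustrative rather than the verbatim project code:

```python
import numpy as np

def sigmoid(z):
    """Map raw scores z to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y, lambda_):
    """Regularized log-loss cost; the bias weight theta[0] is not regularized."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg = (lambda_ / (2 * m)) * np.sum(theta[1:] ** 2)
    return cost + reg

def compute_gradient(theta, X, y, lambda_):
    """Gradient of the regularized cost with respect to theta."""
    m = len(y)
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lambda_ / m) * theta[1:]  # skip the bias term
    return grad

def gradient_descent(X, y, alpha, lambda_, num_iters):
    """Solve for theta by taking num_iters steps down the gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta -= alpha * compute_gradient(theta, X, y, lambda_)
    return theta

def train_one_vs_all(X, y, alpha, lambda_, num_iters):
    """Train one binary classifier per class; returns one row of thetas per class."""
    classes = np.unique(y)
    thetas = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        y_binary = (y == c).astype(float)  # 1 for this class, 0 otherwise
        thetas[i] = gradient_descent(X, y_binary, alpha, lambda_, num_iters)
    return thetas

def predict(X, thetas):
    """Probability for every class, then argmax picks the winning class per row."""
    probs = sigmoid(X @ thetas.T)  # shape (rows, classes)
    return np.argmax(probs, axis=1)
```

Note that X is assumed to already include the leading column of ones described earlier, so theta[0] acts as the bias weight.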
Results
To test my code, I trained three gradient descent models, plus a fourth using Scikit Learn’s logistic regression classifier. To keep training results consistent, I used the same lambda, alpha, and number of iterations for all three gradient descent models. I also split the data into 67% training rows and the remaining 33% test rows. If the dataset had more observations, I would also have liked to split some rows off into a validation set. However, given that the entire dataset had only 150 observations, it seemed impractical to split it further.
Inputs for all three gradient descent models:
- lambda = 0.9
- number of iterations = 1500
- alpha = 0.01
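Assuming the sketch functions above and the iris object loaded earlier, the training and evaluation calls might look like this (the random_state value is an arbitrary choice of mine):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Feature matrix with the leading column of ones, as described earlier
X = np.hstack([np.ones((len(iris.data), 1)), iris.data])

# Hold out 33% of the 150 rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.33, random_state=42)

# Same hyperparameters for all three gradient descent models
thetas = train_one_vs_all(X_train, y_train, alpha=0.01, lambda_=0.9, num_iters=1500)

# Fraction of test rows classified correctly
accuracy = np.mean(predict(X_test, thetas) == y_test)
print(accuracy)
```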
Model 1: Gradient Descent with All Features
92% Accuracy
Model 2: Gradient Descent with Only ‘Sepal Width’ and ‘Petal Width’
82% Accuracy
Model 3: Scikit Learn’s Logistic Regression with All Features, Default Parameters
100% Accuracy
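For reference, the Scikit Learn model amounts to something like the following (Scikit Learn fits its own intercept, so the raw four-feature matrix is used rather than the version with a ones column):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

clf = LogisticRegression()        # default parameters
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test rows
```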
Model 4: Gradient Descent with Polynomial Terms, All Features
100% Accuracy
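As described in the conclusion below, the polynomial terms were created by squaring the original measurements. One way to build that feature matrix, assuming simple column stacking:

```python
import numpy as np

# Square each original measurement and append as extra features,
# then add the leading column of ones for the bias term
X_poly = np.hstack([np.ones((len(iris.data), 1)),  # bias column
                    iris.data,                     # original four features
                    iris.data ** 2])               # squared (polynomial) terms
```

This X_poly can then be split and passed to the same train_one_vs_all call shown earlier.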
Conclusion
All three gradient descent models classified the species setosa correctly. This result is consistent with what we saw in the plots during the data visualization step, where setosa was represented by clusters of green dots with noticeably different measurements from the other two species. The least accurate model (82%) was gradient descent trained with only two features, “Sepal Width” and “Petal Width.” Compare this to Scikit Learn’s logistic regression model, which performed best at 100% accuracy using the four original features. Gradient descent also predicted the iris species with 100% accuracy when additional features were engineered by squaring the original measurements. However, that extra step seems unnecessary when the Scikit Learn model produces the same accuracy from the unengineered features, so I would select the Scikit Learn model for its ease of use.
Accuracy of the Results
Accuracy of 100% would usually be reason to suspect that something is amiss. Perhaps some variation of the target variable had been used as a feature to train the model. However, in this case, the dataset is small and very simple. Furthermore, the plots showed that at least one of the species could be neatly separated based on sepal and petal measurements, with no overlap into the measurements of the other species. So, it seems more likely that the data in this case is simply very consistent. For these reasons, I concluded that the models are good, given the limited number of observations.
References
Ng, Andrew. Machine Learning [MOOC]. Coursera. www.coursera.org/learn/machine-learning
Fisher, R. A. (1988). Iris [Data set]. Scikit Learn Python library. scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html