This article covers the machine learning algorithm called Linear Regression. After data has been analysed and prepared, it has to be processed with an ML algorithm. Every dataset has features and target values: if we think of the model as a function, the features are the parameters passed to it and the targets are the value(s) it returns.
The study below processes a dataset with the Linear Regression algorithm: we find the slope, the y-intercept and the predicted y values. We also compute the R2 value, which measures how much of the variation in the actual values is explained by the predicted values. As a rough rule of thumb, if R2 falls in the range 0.5 – 1.0, the equation can be considered a good fit.
Let's visualize this on a graph with X and y axes. Here X is the independent variable and y is the dependent variable. If the y values increase as the X values increase, the line has a positive slope:
y^ = b0 + b1X
Here b1 is the slope, b0 is the y-intercept and y^ is the predicted value. We write x as a capital X because it is treated as a matrix (it can hold several feature columns), while y is a vector.
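A line can also have a negative slope: if the y values decrease as the X values increase, the line slopes downward. The equation keeps the same form, with the slope taking a negative value. Writing the sign out explicitly, with b1 standing for the size of the slope, gives:
y^ = b0 – b1X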
Let's understand this with a simple dataset, first step by step in Excel and then by seeing how Python's machine learning (Linear Regression) algorithm predicts the values.
We will take a sample dataset with X (features) = [1, 2, 3, 4, 5] and y (targets) = [2, 4, 5, 4, 5].
Step 1: Calculate the mean of the X and y variables.
Here x̄ (note the bar on top) is the mean of x and ȳ is the mean of y. For this dataset, x̄ = 3 and ȳ = 4.
Step 2: Subtract the mean of X from each X value, and do the same for y, to get the distance of every value from its mean.
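Steps 1 and 2 can also be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the Excel sheet itself; the variable names (x_mean, x_dev and so on) are our own.

```python
# Sample data from the article: X holds the features, y the targets.
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Step 1: the mean of each variable.
x_mean = sum(X) / len(X)
y_mean = sum(y) / len(y)

# Step 2: the distance of every value from its mean.
x_dev = [xi - x_mean for xi in X]
y_dev = [yi - y_mean for yi in y]

print(x_mean, y_mean)  # 3.0 4.0
print(x_dev)           # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(y_dev)           # [-2.0, 0.0, 1.0, 0.0, 1.0]
```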
Step 3: Now calculate the slope, b1. The formula is:
b1 = Σ (x – x̄)(y – ȳ) / Σ (x – x̄)²
Here Σ (sigma) stands for "the sum of". For this dataset, b1 = 6 / 10 = 0.6.
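To see where the slope comes from, here is a minimal sketch of the same formula in plain Python, assuming the sample values used in this article. The names numerator and denominator are ours, chosen to mirror the two sums in the formula.

```python
# Sample data from the article.
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_mean = sum(X) / len(X)
y_mean = sum(y) / len(y)

# Top of the formula: sum of (x - x_mean) * (y - y_mean).
numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(X, y))

# Bottom of the formula: sum of (x - x_mean) squared.
denominator = sum((xi - x_mean) ** 2 for xi in X)

b1 = numerator / denominator
print(numerator, denominator, b1)  # 6.0 10.0 0.6
```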
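Step 4: Now find the y-intercept, b0. The fitted line always passes through the point (x̄, ȳ), so we substitute the means into ȳ = b0 + b1 * x̄ and solve for b0:
4 = b0 + 0.6 * 3 (ȳ = 4, b1 = 0.6, x̄ = 3)
4 = b0 + 1.8
4 – 1.8 = b0 + 1.8 – 1.8
b0 = 2.2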
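The same rearrangement can be checked in Python. This is a small sketch that takes the means and the slope from the previous steps as given; the result is rounded because decimal fractions such as 0.6 are not stored exactly as floating-point numbers.

```python
# Values taken from the earlier steps of the article.
x_mean = 3
y_mean = 4
b1 = 0.6  # slope from Step 3

# Rearranging y_mean = b0 + b1 * x_mean to isolate b0.
b0 = y_mean - b1 * x_mean

print(round(b0, 2))  # 2.2
```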
Now we have the slope and the y-intercept, so we can calculate the predicted values y^. For the first value, x = 1:
y^ = 2.2 + 0.6 * 1
y^ = 2.8
With slope = 0.6 and y-intercept = 2.2, the prediction for the first value, x = 1, is 2.8. The remaining values (2, 3, 4 and 5) are calculated the same way.
So y = [2, 4, 5, 4, 5] are the actual values and y^ = [2.8, 3.4, 4.0, 4.6, 5.2] are the predicted values.
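Applying the equation to every x value is a one-line loop in Python. The sketch below assumes the slope and intercept found above and rounds each result to two decimals to hide floating-point noise.

```python
# Slope and intercept from Steps 3 and 4.
b0, b1 = 2.2, 0.6
X = [1, 2, 3, 4, 5]

# Predicted value for each x: y^ = b0 + b1 * x.
y_pred = [round(b0 + b1 * xi, 2) for xi in X]

print(y_pred)  # [2.8, 3.4, 4.0, 4.6, 5.2]
```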
Step 5: Now find the R2 value to check how closely the predicted values match the actual values. The formula is:
R2 = Σ (y^ – ȳ)² / Σ (y – ȳ)²
R2 = 3.6 / 6 = 0.6
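Here is a minimal Python sketch of the same R2 calculation, assuming the actual and predicted values listed above. The names explained and total are ours: explained is the top sum of the formula and total is the bottom sum.

```python
# Actual and predicted values from the article.
y = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]

y_mean = sum(y) / len(y)

# Top of the formula: how far the predictions move away from the mean.
explained = sum((yp - y_mean) ** 2 for yp in y_pred)

# Bottom of the formula: how far the actual values move away from the mean.
total = sum((yi - y_mean) ** 2 for yi in y)

r2 = explained / total
print(round(explained, 2), total, round(r2, 2))  # 3.6 6.0 0.6
```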
This completes Linear Regression using Excel.
Now let's check how Python's machine learning algorithm (Linear Regression) works with the same dataset. We will be using a Jupyter notebook.
Step 1: Load the Python libraries that are needed.
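A typical set of imports for this walkthrough might look like the following. The exact list is an assumption on our part, based on the libraries used in the later steps (pandas, matplotlib and sklearn), plus numpy for array handling.

```python
# Array handling and tabular data.
import numpy as np
import pandas as pd

# Plotting.
import matplotlib.pyplot as plt

# The regression model and the R2 metric from scikit-learn.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
```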
Step 2: Read the dataset and display its first 3 rows.
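The sketch below builds the dataset directly in memory so that it runs on its own. In a real notebook the data would more likely be read from a file (for example with pd.read_csv); the file name is not known here, and the column names "X" and "y" are our assumption.

```python
import pandas as pd

# Sample data from the article, built in memory instead of read from a file.
df = pd.DataFrame({
    "X": [1, 2, 3, 4, 5],  # features
    "y": [2, 4, 5, 4, 5],  # targets
})

# Display the first 3 rows.
print(df.head(3))
```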
Step 3: Plot the actual values on a scatter plot to see how they are distributed on the graph.
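A scatter plot of the actual values could be drawn roughly like this. The labels, title and colour are illustrative choices, and the column names "X" and "y" are assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Sample data from the article.
df = pd.DataFrame({"X": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# One point per (X, y) pair.
fig, ax = plt.subplots()
ax.scatter(df["X"], df["y"], color="black")
ax.set_xlabel("X (feature)")
ax.set_ylabel("y (target)")
ax.set_title("Actual values")
```

In a Jupyter notebook the figure renders when the cell finishes; in a plain script, call plt.show() at the end.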
Step 4: Fit the data to a Linear Regression model and find the slope, the y-intercept and the predicted values.
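One way to do this with scikit-learn is sketched below. Note that LinearRegression expects X as a 2-D array (rows are samples, columns are features), which is why the values are reshaped into a single column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data from the article; X must be 2-D for scikit-learn.
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Fit the model to the data.
model = LinearRegression()
model.fit(X, y)

# coef_ holds one slope per feature; intercept_ is the y-intercept.
slope = float(model.coef_[0])
intercept = float(model.intercept_)

# Predicted values for the same X.
y_pred = model.predict(X)

print(round(slope, 2), round(intercept, 2))
print(np.round(y_pred, 2))
```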
We get the same values for the slope and the y-intercept as in Excel (0.6 and 2.2). Both the manual (Excel) and the ML (Python) approaches predict the same values.
Let's put the predicted values into a pandas DataFrame and join both tables.
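A minimal sketch of this step, assuming the predictions from the model above and a column name "y_pred" that we made up for illustration:

```python
import pandas as pd

# Actual values from the article.
actual = pd.DataFrame({"X": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# Predicted values in their own table.
predicted = pd.DataFrame({"y_pred": [2.8, 3.4, 4.0, 4.6, 5.2]})

# join() lines the two tables up on their row index.
result = actual.join(predicted)
print(result)
```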
Let's plot this using Python's matplotlib library. The red line shows the predicted values and the black line shows the actual values.
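The plot could be produced along these lines. The marker style, labels and legend are our own choices; only the red and black colours come from the description above.

```python
import matplotlib.pyplot as plt

# Actual and predicted values from the article.
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]

fig, ax = plt.subplots()

# Black line: actual values. Red line: predicted values.
ax.plot(X, y, color="black", marker="o", label="Actual")
ax.plot(X, y_pred, color="red", label="Predicted")

ax.set_xlabel("X")
ax.set_ylabel("y")
ax.legend()
```

In a Jupyter notebook the figure renders when the cell finishes; in a plain script, call plt.show() at the end.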
Now we will calculate the R2 value using the sklearn library.
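With scikit-learn this is a single call to r2_score, which takes the actual values first and the predicted values second. The sketch below assumes the values from the earlier steps.

```python
from sklearn.metrics import r2_score

# Actual and predicted values from the article.
y = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]

# r2_score(y_true, y_pred) returns the R2 value.
r2 = r2_score(y, y_pred)
print(round(r2, 2))  # 0.6
```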
Here, too, the R2 value is 0.6. This indicates a reasonably good fit: the model explains 60% of the variation in the actual values. If the R2 value were 1.0, the actual and predicted values would be exactly equal.