Combining K-Nearest Neighbours and Linear Regression on Orange Tree Growth
Hi!
Well, this year I’ve been studying statistical learning during my free time — not only because I think it’s useful, but also because I feel powerful when I can predict something. Finding patterns feels like casting special spells in a game 🧙️✨ (it’s exciting… but you can’t do it all the time).
Most of the time, I’m building queries, doing descriptive statistics, creating data visualizations, and so on. These things can be complementary, depending on the scenarios you’re working with and the problems you’re trying to solve.
This year, I decided to join the statistical learning online course from Stanford University, and it’s been quite fun! I know it’s a classic (especially for analysts who use R), but I hadn’t had the chance to really focus on it until now.
The very first algorithms introduced in the course are KNN and linear regression. Both are pretty interesting, and, dare I say, complementary too.
If you think about it, KNN is one of the simplest modeling concepts out there. Imagine asking someone with no data science background to guess the temperature on a specific day of a certain month — and then giving them some past weather data. Chances are, they’d look at similar dates from previous years, take the average of those temperatures, and voilà — we’ve got a prediction!
KNN works in a similar way, right? It takes the closest data points to the one we want to predict and averages them.
Of course, while it’s a powerful technique, it becomes less appropriate when we have multiple predictors. The issue here is that we’re increasing the number of dimensions in our data, and in higher dimensions, the K closest neighbors might not be that close anymore. (This is known as the curse of dimensionality).
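To get a feel for this, here’s a small simulation sketch of my own (nearest_dist is just a helper name I made up): it draws the same number of random points in more and more dimensions and checks how far away each point’s nearest neighbour ends up on average.

# toy sketch: nearest-neighbour distances grow as we add dimensions
set.seed( 1 )
nearest_dist <- function( p, n = 100 ) {
  x <- matrix( runif( n * p ), ncol = p )  # n random points in p dimensions
  d <- as.matrix( dist( x ) )
  diag( d ) <- Inf                         # ignore each point's distance to itself
  mean( apply( d, 1, min ) )               # average distance to the nearest neighbour
}
sapply( c( 1, 2, 5, 10, 50 ), nearest_dist )
# with the same 100 points, the "closest" neighbours keep drifting farther away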
Besides that, there’s also a problem near the edges of the data. Imagine we want to predict the temperature of a day that’s going to be very hot — let’s say 30°C — in a city that rarely sees such temperatures. If we rely on previous years for this prediction, we might be near the edge of the available data, and the average would likely pull our prediction down.
Check the intuition below:
Here, I basically generated 10 random temperature values with a mean of 20 and standard deviation of 5. Let’s say this is the information we use to build the average; since the value we want to predict (30) sits above all of them, the nearest neighbours are naturally the most extreme values (here I’m using the top 3):
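In R, that little experiment looks roughly like this (the seed is arbitrary, so the exact numbers may differ from the ones I first drew):

set.seed( 123 )                           # arbitrary seed, just for reproducibility
temps <- rnorm( 10, mean = 20, sd = 5 )   # 10 "past" temperatures

target <- 30
k <- 3
# the k values closest to 30 are simply the k highest ones in this sample
neighbours <- temps[ order( abs( temps - target ) )[ 1:k ] ]
knn_pred <- mean( neighbours )

c( prediction = knn_pred, highest_observed = max( temps ), target = target )
# averaging pulls the prediction below even the largest value we observed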
But, as you can see, our prediction using KNN with K = 3 is not even close to 30.
It turns out that our average is making the forecast worse in edge cases — using the highest value from our data, for example, would actually reduce the error. So, we tend to have bigger mistakes at the borders of our data when using this algorithm.
That’s where Ordinary Least Squares can help us: a linear fit follows the overall trend and can extrapolate beyond the observed range, so it doesn’t get dragged back toward the average in edge cases.
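To make that concrete, here’s another toy sketch with made-up numbers: ten past years with a roughly linear warming trend, and we want a prediction for year 11, just past the edge of the data.

set.seed( 42 )
year <- 1:10
temp <- 18 + 1.2 * year + rnorm( 10, 0, 1 )   # roughly linear trend plus noise

# KNN with k = 3: the three "closest" past years to year 11 are years 8-10
knn_pred <- mean( temp[ order( abs( year - 11 ) )[ 1:3 ] ] )

# OLS extrapolates the fitted trend instead of averaging past values
ols_pred <- predict( lm( temp ~ year ), newdata = data.frame( year = 11 ) )

c( knn = knn_pred, ols = ols_pred )
# the OLS prediction keeps following the trend, while the KNN one is dragged
# back toward the average of the last few observed years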
Fortunately, there are packages that implement this idea of combining the two. Here, I’m going to do that using the qeML package.
I chose a small dataset from R that contains records of the growth of orange trees 🍊:
# variable ranges and classes in the built-in Orange dataset
data.frame( sapply( Orange, range ) ); sapply( Orange, class )
  Tree  age circumference
1    1  118            30
2    5 1582           214
$Tree
[1] "ordered" "factor"
$age
[1] "numeric"
$circumference
[1] "numeric"
So, there are 5 distinct trees, each measured at 7 different ages, while circumference is a continuous variable. Let’s plot it before jumping into modeling.
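A minimal base-R version of the plot could be something like this (the axis labels come from the dataset’s documentation: age in days, circumference in mm):

# scatterplot of age vs. circumference, one colour per tree
plot( Orange$age, Orange$circumference,
      col = as.integer( Orange$Tree ), pch = 19,
      xlab = 'Age (days)', ylab = 'Circumference (mm)',
      main = 'Growth of orange trees' )
legend( 'topleft', legend = levels( Orange$Tree ),
        col = seq_along( levels( Orange$Tree ) ), pch = 19, title = 'Tree' )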
Apparently there is a linear relationship going on here, which might favor the OLS fit. Let’s check it out!
The goal here is to predict the circumference of orange trees based on the variables age and Tree. I’ll try this using KNN, Linear Regression, and finally, a hybrid model that combines both. To evaluate model performance, I’ll use repeated holdout validation — calculating the MAPE (Mean Absolute Prediction Error) for each run. This helps reduce the impact of randomness in a single train/test split.
Since this is a small dataset, I’ll set k = 3 for the KNN-based models.
Now, let’s compare the models:
library( qeML )  # needed for qeCompare() and the qe* model functions

# compare KNN, OLS and the hybrid across repeated holdout splits
qeCompare( data.frame( Orange ), y = 'circumference',
           c( 'qeKNN', 'qeLin', 'qeLinKNN' ),
           opts = list( qeKNN = list( k = 3 ), qeLinKNN = list( k = 3 ) ),
           nReps = 10 )
     qeFtn   meanAcc
1    qeKNN 30.966667
2    qeLin 11.058602
3 qeLinKNN  8.986057
The function above runs each model across 10 different holdout sets, calculates the MAPE for each, and returns the average. Based on the results, we can say the hybrid model delivers the best performance.
As expected from the plot, the data appears to follow a linear pattern — so linear regression is definitely a solid choice. However, the hybrid model still managed to outperform the others, showing that combining techniques can be powerful even in seemingly linear scenarios.
That’s it! I just wanted to explore a few things I’ve been studying — and it was a real pleasure to write this one.