Predicting Missing Values

Analising forest inventory data and predicting NAs through least squares regression model

Gabriel de Freitas Pereira true
04-18-2021

Introduction

   

  This data was provided by my dendrometry teacher and it is the result of a sistematic collection of data that was carried out between 2017 and 2018 at Três Lagoas, which is a city from Mato Grosso do Sul state. The area has an extent of 1226 acres of Eucalyptus grandis W. Hill ex Maiden x Eucalyptus urophylla S. T. Blake clonal culture with mean age of 2.88 years.

Exploring the data

   

  Here we can take a look on the first 6 rows from the data frame which has 3313 obs. of 6 variables:

area(ha) spacing genetic material age DBH heigth
42.43 3,45X2,40 35 2.839151 12.16 16.98621
42.43 3,45X2,40 35 2.839151 11.52 16.61542
42.43 3,45X2,40 35 2.839151 12.00 16.89602
42.43 3,45X2,40 35 2.839151 11.87 16.82587
42.43 3,45X2,40 35 2.839151 11.33 16.49516
42.43 3,45X2,40 35 2.839151 12.45 17.13653

   

Checking whether we have got NAs or not:

sum(is.na(fustes_6$DBH))
[1] 134

   

  As we can see there are 134 NAs from DBH values that probably represent dead trees, that can be replanted in the future for next rotation. But just to be sure about it, let’s take a look on the age! Because those NAs values also could be trees too much young for DBH.

min(fustes_6$age)
[1] 2.811773

   

  The minimum age turns out the idea of NAs are representing dead trees. Therefore if we would like to get a second rotation forestry would be good to understand the behavior of trees under similar conditions to do future economic analysis. So what would we expect from these trees if they were alive?  

  We can make an approximation through least squares regression model based on our data and variables to find out their diameter at breast height! And for that we need to pre-process the data…    

Data pre-processing

   

Checking whether we have got outliers on DBH or not:

  As we can see, there are outliers on the data that we want to remove because it’s not good for least squares regression model. The result bellow show us the lower extreme, first quartile, median, third quartile and the upper extreme respectively.

boxplot.stats(fustes_6$DBH)$stats
[1]  8.820 11.585 12.510 13.430 16.180

   

Automating the process to get the extreme values:

upper_whisker <- boxplot.stats(fustes_6$DBH)$stats[5]
lower_whisker <- boxplot.stats(fustes_6$DBH)$stats[1]

   

Of course after the upper extreme we’ve got the outliers and before the first whisker as well. And both we are going to remove this way:

outlier_filter_upper <- fustes_6$DBH < upper_whisker 
outlier_filter_lower <- fustes_6$DBH > lower_whisker 

#filtering outliers: 
outlier_filter <- fustes_6[outlier_filter_lower & outlier_filter_upper,]

   

Now we need to check the class from the variables and if necessary make some changes…

str(fustes_6)
'data.frame':   3313 obs. of  6 variables:
 $ area(ha)        : num  42.4 42.4 42.4 42.4 42.4 ...
 $ spacing         : chr  "3,45X2,40" "3,45X2,40" "3,45X2,40" "3,45X2,40" ...
 $ genetic material: int  35 35 35 35 35 35 35 35 35 35 ...
 $ age             : num  2.84 2.84 2.84 2.84 2.84 ...
 $ DBH             : num  12.2 11.5 12 11.9 11.3 ...
 $ heigth          : num  17 16.6 16.9 16.8 16.5 ...

   

Categorical casting variables that we will use to predict:

fustes_6$`genetic material` <- as.factor(fustes_6$`genetic material`)

   

Building the model

   

Our formula has the following format: \(\hat{y}=\hat{\beta_0}+\hat{\beta_1}X_1+\hat{\beta_2}X_2\)

dbh_equation <- "DBH ~ `genetic material` + age"

   

Plotting these variables:

   

Regression model:

#removing NAs to make the model:
fustes_model <- na.omit(fustes_6)

dbh_model <- lm(
  formula = dbh_equation, 
  data = fustes_model[outlier_filter_lower & outlier_filter_upper,]
)

summary(dbh_model)

Call:
lm(formula = dbh_equation, data = fustes_model[outlier_filter_lower & 
    outlier_filter_upper, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6626 -0.7964  0.1341  1.0160  6.8762 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3.87366    2.40738   1.609 0.107708    
`genetic material`12 -0.52911    0.11485  -4.607 4.26e-06 ***
`genetic material`19  0.12014    0.17866   0.672 0.501359    
`genetic material`35 -0.20092    0.08401  -2.392 0.016838 *  
`genetic material`60 -0.26753    0.09896  -2.703 0.006905 ** 
age                   3.01675    0.82678   3.649 0.000268 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.619 on 2912 degrees of freedom
  (264 observations deleted due to missingness)
Multiple R-squared:  0.02809,   Adjusted R-squared:  0.02643 
F-statistic: 16.84 on 5 and 2912 DF,  p-value: < 2.2e-16

   

Therefore, as we could see above the age and genetic material are going to be useful variables on the predictions of DBH through this model!

   

Now we are going to isolate the NA values (that we want to predict) and all the columns that we are using to get DBH values:

dbh_rows <- fustes_6[is.na(fustes_6$DBH), c("genetic material","age")]

   

The last step is to predict and replace those missing values:

dbh_predictions <- predict(dbh_model, newdata = dbh_rows)
dbh_predictions
      22       80      122      163      175      181      187 
12.23775 12.17994 11.93434 11.93434 11.93434 12.72776 12.72776 
     209      219      324      328      357      378      421 
12.72776 12.72776 11.87653 11.87653 12.08301 12.08301 12.78557 
     448      455      564      576      588      601      602 
12.78557 12.78557 12.60387 12.55431 12.55431 12.55431 12.55431 
     608      627      628      686      694      695      718 
12.55431 12.44424 12.44424 12.62038 12.62038 12.62038 12.62038 
     776      793      831      859      866      869      879 
12.19592 12.16342 12.16342 12.33633 12.33633 12.33633 12.33633 
     886      887      889      891      905      962      990 
12.33633 12.33633 12.33633 12.33633 12.39415 12.61663 12.61663 
    1009     1027     1036     1045     1057     1060     1124 
12.54605 12.54605 12.54605 12.54605 12.54605 12.54605 12.19646 
    1211     1212     1281     1291     1292     1293     1352 
12.56257 12.56257 12.32861 12.79383 12.79383 12.79383 12.42772 
    1427     1471     1475     1479     1480     1481     1490 
12.76079 12.71950 12.71950 12.71950 12.71950 12.71950 12.71950 
    1491     1496     1498     1501     1503     1504     1509 
12.71950 12.71950 12.71950 12.71950 12.71950 12.71950 12.71950 
    1563     1624     1626     1627     1628     1629     1630 
12.61769 12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 
    1631     1636     1642     1643     1646     1647     1648 
12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 
    1649     1650     1651     1660     1662     1663     1664 
12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 
    1665     1668     1669     1670     1671     1676     1677 
12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 12.79383 
    1678     1679     1711     1748     1777     1781     1782 
12.79383 12.79383 11.84349 11.85175 11.85175 11.85175 11.85175 
    1788     1791     1794     1873     1886     1923     1924 
11.85175 11.85175 11.85175 12.21297 12.21297 12.22123 12.22123 
    1942     1943     1967     1973     1982     2003     2006 
12.22123 12.22123 12.22123 12.73602 12.73602 12.73602 12.73602 
    2007     2042     2043     2090     2095     2139     2145 
12.73602 12.60387 12.60387 12.22070 12.22070 12.71950 12.71950 
    2200     2201     2208     2226     2400     2626     2799 
12.18820 12.18820 12.18820 12.18820 12.55431 12.25427 12.71124 
    2826 
12.71124 

Filling the data with predicted values:

fustes_6[is.na(fustes_6$DBH), "DBH"] <- dbh_predictions