Multiple Linear Regressions

Quick analysis of pine trees

Gabriel de Freitas Pereira
2023-07-23
 

Multiple LMs via a functional approach

 

Quick analysis

 

  Well, the idea here is to create multiple linear models in a straightforward analysis to understand the relative changes in height across different ages and seed sources of Pinus taeda, commonly known as loblolly pine, which is one of several pines native to the Southeastern United States. In order to do that, I am going to apply a few statistical concepts to analyze the p-values and the estimates generated by the models, using the dataset called Loblolly.
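  Here is an overview of the data. The original post does not show how the summary table was produced, but the layout matches the output of the skimr package, so presumably it came from a call like this:

library( skimr )  # assumed; skim() produces the summary layout shown below

# Compact summary of the Loblolly dataset: 14 seed lots,
# tree height (ft) and age (yr)
skim( Loblolly )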

Table 1: Data summary

Name                     Loblolly
Number of rows           84
Number of columns        3
_______________________
Column type frequency:
  factor                 1
  numeric                2
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Seed                   0              1  TRUE           14  329: 6, 327: 6, 325: 6, 307: 6

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd    p0    p25   p50    p75  p100  hist
height                 0              1  32.36  20.67  3.46  10.47  34.0  51.36  64.1  ▇▂▃▅▆
age                    0              1  13.00   7.90  3.00   5.00  12.5  20.00  25.0  ▇▃▃▃▃

Observation: I am aware that in a real-world scenario we would need a larger sample size in order to get reliable linear regression models, but here I am focused on studying the approach rather than checking all the assumptions required by the models.

 

Models

 

  I would like to go through a few things I am considering here before jumping into the modeling. It is important to mention that the p-value is a statistical measure that indicates the probability of obtaining the observed results, or more extreme ones, when the null hypothesis is true. In other words, it is a measure that helps us assess whether the available evidence suggests that an observed difference or effect is statistically significant or just a random result.

  When we perform multiple statistical comparisons on the same dataset, we increase the probability of obtaining at least one significant result (i.e., a p-value smaller than the chosen significance level) by chance alone, even if there is no true effect present. This phenomenon is known as the “multiple testing problem” or “multiple comparisons problem”, and it can be addressed through p-value adjustment.
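  To see this inflation concretely, here is a toy simulation of my own (not part of the original analysis): 20 one-sample t-tests on pure noise, repeated many times, counting how often at least one test comes out “significant”:

set.seed( 2023 )

# For each replication, run 20 t-tests on N(0, 1) noise, where the
# null hypothesis is true, and check whether any p-value < 0.05
false_positive <- replicate( 5000, {
  p_values <- replicate( 20, t.test( rnorm( 30 ) )$p.value )
  any( p_values < 0.05 )
} )

mean( false_positive )  # lands close to the analytic value of ~0.64 below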

  To illustrate the importance of p-value adjustment, consider an example where you conduct 20 independent hypothesis tests, each with a significance level of 0.05. The probability of obtaining at least one false positive (a falsely significant result) is 1 - (1 - 0.05)^20 ≈ 0.64, or approximately 64%. This means that if you do not adjust the p-values, you run a relatively high risk of erroneously concluding that there are significant effects when none exist.
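  In R this is a one-liner:

# Family-wise error rate for 20 independent tests at alpha = 0.05
1 - ( 1 - 0.05 )^20
[1] 0.6415141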

  Therefore, to address this problem, there are several adjustment techniques. The classic one is the Bonferroni correction, in which each p-value is multiplied by the number of comparisons. In this analysis I rely on R’s p.adjust(), whose default is the Holm method, a step-down refinement of Bonferroni that controls the same family-wise error rate while being uniformly more powerful. Either way, the goal is to control the overall significance level, reducing the risk of false positives and making multiple comparisons more reliable.
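  Just to make the adjustment concrete, here is a small example with made-up p-values (purely for illustration, not from the analysis below), comparing the two methods:

# Toy p-values, purely for illustration
p_raw <- c( 0.001, 0.008, 0.039, 0.041 )

p.adjust( p_raw, method = "bonferroni" )  # each p multiplied by 4 (capped at 1)
[1] 0.004 0.032 0.156 0.164

p.adjust( p_raw, method = "holm" )        # step-down multipliers 4, 3, 2, 1
[1] 0.004 0.024 0.078 0.078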

  Besides that, I am not interested in the intercepts generated by the models, which represent the estimated height when age is zero (which makes no sense in this case), only in the slopes, which represent the estimated change in height associated with a one-unit increase in age. These coefficients provide crucial information about the relationship between height and age across the different Seeds. Thus, by analyzing the slopes, we can understand how the height of the trees responds to changes in age.
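  As a quick sketch of what each fitted model returns (seed 301 chosen arbitrarily for this illustration), tidy() gives one row per coefficient, and it is the "age" row that matters here:

library( broom )

# Coefficients of a single per-seed model: an "(Intercept)" row
# (height at age zero, ignored here) and an "age" row (the slope)
fit_301 <- lm( height ~ age, data = subset( Loblolly, Seed == "301" ) )
tidy( fit_301 )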

Finally, in order to accomplish everything mentioned above, the following code was written:

library( tidyverse )  # dplyr, tidyr, purrr, ggplot2
library( broom )      # tidy() for model coefficients

# Fit one linear model per Seed: nest height and age into a
# list-column, then map lm() over each nested data frame
tidy_lm <- Loblolly %>%
  nest( data = c( height, age ) ) %>%
  mutate( model = map( data, ~ lm( height ~ age, data = .x ) ) )

# Extract the coefficients, keep only the age slopes, and adjust
# the p-values (Holm is p.adjust()'s default, made explicit here)
slopes <- tidy_lm %>% 
  mutate( coefs = map( model, tidy ) ) %>% 
  unnest( coefs ) %>% 
  filter( term == "age" ) %>% 
  mutate( p.value = p.adjust( p.value, method = "holm" ) )

slopes
# A tibble: 14 × 8
   Seed  data       model  term  estimate std.error statistic  p.value
   <ord> <list>     <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
 1 301   <nfnGrpdD> <lm>   age       2.61     0.175      14.9 0.000691
 2 303   <nfnGrpdD> <lm>   age       2.71     0.164      16.5 0.000691
 3 305   <nfnGrpdD> <lm>   age       2.75     0.192      14.3 0.000691
 4 307   <nfnGrpdD> <lm>   age       2.57     0.136      18.9 0.000560
 5 309   <nfnGrpdD> <lm>   age       2.67     0.142      18.8 0.000560
 6 311   <nfnGrpdD> <lm>   age       2.60     0.142      18.3 0.000560
 7 315   <nfnGrpdD> <lm>   age       2.58     0.155      16.6 0.000691
 8 319   <nfnGrpdD> <lm>   age       2.60     0.161      16.2 0.000691
 9 321   <nfnGrpdD> <lm>   age       2.60     0.114      22.7 0.000310
10 323   <nfnGrpdD> <lm>   age       2.65     0.182      14.5 0.000691
11 325   <nfnGrpdD> <lm>   age       2.49     0.172      14.5 0.000691
12 327   <nfnGrpdD> <lm>   age       2.42     0.146      16.6 0.000691
13 329   <nfnGrpdD> <lm>   age       2.43     0.150      16.2 0.000691
14 331   <nfnGrpdD> <lm>   age       2.57     0.134      19.2 0.000560

 

Results

# Plot each adjusted slope against its adjusted p-value, one facet
# per Seed; the dashed line marks a slope of zero (no growth)
slopes %>% 
  ggplot( aes( estimate, p.value, label = Seed ) ) +
  geom_vline( xintercept = 0, lty = 2, linewidth = 0.9, alpha = 0.7, color = "gray30" ) +
  geom_point( aes( color = Seed ), alpha = 0.8, size = 2.5, show.legend = FALSE ) +
  facet_wrap( ~Seed ) +
  labs( x = "estimates", title = "Increase in height per age" ) +
  theme_bw( )

Conclusion

 

  The plot shows the estimated increase in height per unit of age (the slopes) for each level of the Seed variable in the Loblolly dataset. Each facet represents one seed lot, and each point is that seed’s slope plotted against its adjusted p-value. The dashed gray line at x = 0 marks the reference where height would neither increase nor decrease with age. All estimates fall between roughly 2.4 and 2.8 height units per year, well away from zero, and every adjusted p-value is below 0.001, indicating a clearly positive growth trend for every seed.
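  Since the slopes table already carries standard errors, one natural extension (my sketch, not part of the original post) is to plot approximate 95% confidence intervals around each estimate:

# Approximate 95% CI for each slope: estimate ± 1.96 * std.error
slopes %>%
  ggplot( aes( estimate, reorder( Seed, estimate ) ) ) +
  geom_vline( xintercept = 0, lty = 2, color = "gray30" ) +
  geom_pointrange( aes( xmin = estimate - 1.96 * std.error,
                        xmax = estimate + 1.96 * std.error,
                        color = Seed ),
                   show.legend = FALSE ) +
  labs( x = "increase in height per year", y = "Seed" ) +
  theme_bw( )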

  In conclusion, it is important to mention that this approach of using statistical models to estimate changes in many subgroups at once is very useful in a variety of situations and can be applied to obtain reliable insights across the data.