Multiple LM by functional approach

Quick analysis

The idea here is to fit multiple linear models in one straightforward analysis, in order to understand how height changes with age for different seeds of Pinus taeda, commonly known as loblolly pine, one of several pines native to the Southeastern United States. To do that, I am going to apply a few statistical concepts to analyze the p-values and the estimates generated by the models, using the dataset called Loblolly. Here is an overview of it:

Table 1: Data summary
Name	Loblolly
Number of rows	84
Number of columns	3
_______________________
Column type frequency:
factor	1
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Seed	0	1	TRUE	14	329: 6, 327: 6, 325: 6, 307: 6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
height	0	1	32.36	20.67	3.46	10.47	34.0	51.36	64.1	▇▂▃▅▆
age	0	1	13.00	7.90	3.00	5.00	12.5	20.00	25.0	▇▃▃▃▃

Observation: I am aware that in a real-world scenario we would need a bigger sample size to get reliable linear regression models (here each seed has only 6 observations), but I am focused on studying the approach rather than on checking all the assumptions required by the model.

Models

I would like to go through a few things that I am considering here before jumping to the modeling. It is important to mention that the p-value is a statistical measure that indicates the probability of obtaining the observed results, or even more extreme results, when the null hypothesis is true. In other words, it helps us assess whether the available evidence suggests that an observed difference or effect is statistically significant or could plausibly be just a random result.

When we perform multiple statistical comparisons on the same dataset, we increase the probability of obtaining at least one significant result (i.e., a p-value smaller than the chosen significance level) by chance alone, even if there is no true effect present. This phenomenon is known as the “multiple testing problem” or “multiple comparisons problem”, and it can be addressed through p-value adjustment.

To illustrate the importance of p-value adjustment, consider an example where you conduct 20 independent hypothesis tests, each with a significance level of 0.05. If every null hypothesis is true, the probability of obtaining at least one false positive (a falsely significant result) is 1 - (1 - 0.05)^20 ≈ 0.64, or about 64%. This means that if you do not adjust the p-values, you run a relatively high risk of erroneously concluding that there are significant effects when they do not exist.
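The arithmetic above is easy to check directly. The snippet below is a minimal sketch that assumes independent tests with every null hypothesis true; the variable names are illustrative, and the value of 14 tests is included only because this analysis fits one model per seed.

```r
# Familywise error rate (FWER): probability of at least one false positive
# among m independent tests at level alpha, assuming all nulls are true.
alpha <- 0.05
m <- c(14, 20)  # 14 = number of seeds here, 20 = the example in the text

fwer <- 1 - (1 - alpha)^m
round(fwer, 2)  # the second value is the roughly 0.64 quoted in the text
```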

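Therefore, to address this problem, there are several adjustment techniques. The simplest is the Bonferroni correction, in which the p-values are multiplied by the number of comparisons. In this analysis I am going to call p.adjust( ) with its default method, Holm's step-down refinement of Bonferroni, which multiplies the smallest p-value by the number of comparisons, the next smallest by one less, and so on. Both methods control the overall significance level to reduce the risk of false positives and make multiple comparisons more reliable.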
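To see how an adjustment changes p-values in practice, here is a small sketch using base R's p.adjust( ). The raw p-values are hypothetical, chosen only for illustration. It compares the plain Bonferroni correction with the Holm method, which is what p.adjust( ) applies when no method is given.

```r
# Hypothetical raw p-values, chosen only for illustration.
p_raw <- c(0.001, 0.01, 0.02, 0.04)

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: step-down variant; the smallest p-value is multiplied by 4, the next
# by 3, and so on, while keeping the adjusted values non-decreasing.
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(p_raw, p_bonf, p_holm)
```

Holm never produces a larger adjusted p-value than Bonferroni, so it is at least as powerful while still controlling the familywise error rate.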

Besides that, I am not interested in the intercepts, which represent the estimated height when the age is zero (an extrapolation outside the observed ages of 3 to 25 that makes little sense in this case). I am only interested in the slopes, which represent the estimated change in height associated with a one-unit increase in age. These coefficients provide crucial information about the relationship between height and age across different seeds. Thus, by analyzing the slopes, we can understand how the height of trees responds to changes in age.
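As a quick illustration of the two coefficients, the sketch below fits a single model for one seed using base R only. The object names (loblolly, seed_301, fit, coefs) are illustrative and not part of the original analysis.

```r
# Loblolly ships with base R (package 'datasets'); as.data.frame() is used
# here so the object is handled as a plain data frame.
loblolly <- as.data.frame(Loblolly)
seed_301 <- loblolly[loblolly$Seed == "301", ]

fit <- lm(height ~ age, data = seed_301)
coefs <- coef(fit)

coefs[["(Intercept)"]]  # predicted height at age 0, outside the observed ages
coefs[["age"]]          # estimated change in height per one-unit increase in age
```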

Finally, to accomplish everything mentioned above, this code was written:

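library( tidyverse )  # dplyr, tidyr, purrr and ggplot2
library( broom )      # tidy( ) for model coefficients

# One row per Seed, with that seed's observations nested in a list-column
tidy_lm <- Loblolly %>%
  nest( data = c( height, age ) ) %>%
  mutate( model = map( data, ~ lm( height ~ age, data = .x ) ) )

# Keep only the slope (term "age") and adjust its p-values across the 14 seeds
slopes <- tidy_lm %>% 
  mutate( coefs = map( model, tidy ) ) %>% 
  unnest( coefs ) %>% 
  filter( term == "age" ) %>% 
  mutate( p.value = p.adjust( p.value, method = "holm" ) )  # "holm" is the p.adjust( ) default

slopes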

# A tibble: 14 × 8
   Seed  data       model  term  estimate std.error statistic  p.value
   <ord> <list>     <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
 1 301   <nfnGrpdD> <lm>   age       2.61     0.175      14.9 0.000691
 2 303   <nfnGrpdD> <lm>   age       2.71     0.164      16.5 0.000691
 3 305   <nfnGrpdD> <lm>   age       2.75     0.192      14.3 0.000691
 4 307   <nfnGrpdD> <lm>   age       2.57     0.136      18.9 0.000560
 5 309   <nfnGrpdD> <lm>   age       2.67     0.142      18.8 0.000560
 6 311   <nfnGrpdD> <lm>   age       2.60     0.142      18.3 0.000560
 7 315   <nfnGrpdD> <lm>   age       2.58     0.155      16.6 0.000691
 8 319   <nfnGrpdD> <lm>   age       2.60     0.161      16.2 0.000691
 9 321   <nfnGrpdD> <lm>   age       2.60     0.114      22.7 0.000310
10 323   <nfnGrpdD> <lm>   age       2.65     0.182      14.5 0.000691
11 325   <nfnGrpdD> <lm>   age       2.49     0.172      14.5 0.000691
12 327   <nfnGrpdD> <lm>   age       2.42     0.146      16.6 0.000691
13 329   <nfnGrpdD> <lm>   age       2.43     0.150      16.2 0.000691
14 331   <nfnGrpdD> <lm>   age       2.57     0.134      19.2 0.000560
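For readers without the tidyverse installed, the same split, fit and combine idea can be sketched in base R. This is an alternative illustration rather than the code used for the table above; the names by_seed, fits, slope_tab and base_slopes are made up for the example.

```r
loblolly <- as.data.frame(Loblolly)

# Split into one data frame per seed, then fit the same model to each piece.
by_seed <- split(loblolly, loblolly$Seed)
fits <- lapply(by_seed, function(d) lm(height ~ age, data = d))

# Pull the slope and its raw p-value from each fit (one row per seed).
slope_tab <- t(sapply(fits, function(f) {
  summary(f)$coefficients["age", c("Estimate", "Pr(>|t|)")]
}))

# Adjust the p-values across all seeds with the Holm method.
base_slopes <- data.frame(
  Seed     = rownames(slope_tab),
  estimate = slope_tab[, "Estimate"],
  p.value  = p.adjust(slope_tab[, "Pr(>|t|)"], method = "holm")
)
base_slopes
```

The rows follow the order of the Seed factor levels, which is not necessarily the numeric order of the seed labels.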

Results

slopes %>% 
  ggplot( aes( estimate, p.value, label = Seed ) ) +
  geom_vline( xintercept = 0, lty = 2, linewidth = 0.9, alpha = 0.7, color = "gray30" ) +
  geom_point( aes( color = Seed ), alpha = 0.8, size = 2.5, show.legend = FALSE ) +
  facet_wrap( ~Seed ) +
  labs( x = "estimates", title = "Increase in height per age" ) +
  theme_bw( )

Conclusion

The plot shows, for each level of the Seed variable in the Loblolly dataset, the estimated slope (the increase in height per unit of age) on the x-axis against its adjusted p-value on the y-axis. Each facet represents one seed, and the point in it marks that seed's slope. The dashed gray line at x = 0 is the reference where height neither increases nor decreases with age.

Every point lies to the right of the line because, as expected, height increases over the years. The further to the right a point is, the larger the increase per unit of age, which gives a quick comparison between seeds. Seed 305 has the largest estimate (2.75), while seed 321 stands out for the lowest adjusted p-value (0.000310), which reflects its small standard error (0.114) rather than the fastest growth, since its estimate (2.60) sits near the middle of the range.

The p-value measures the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis (no change in height with age) is true. Lower p-values on the y-axis of the plot indicate stronger evidence against the null hypothesis, and therefore more confidence that a real relationship between height and age exists, which in this case is obvious.

In conclusion, it is important to mention that this approach of using statistical models to estimate changes in many subgroups at once is useful in many different situations and can be applied to get quick, comparable insights across the data.
