Quick analysis of pine trees
The idea here is to fit multiple linear models in a straightforward analysis to understand how height changes with age across different seeds of Pinus taeda, commonly known as loblolly pine, which is one of several pines native to the Southeastern United States. To do that, I am going to apply a few statistical concepts to analyze the p-values and the estimates generated by the models, using the dataset called Loblolly. Here is an overview of it:
Name | Loblolly |
Number of rows | 84 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Seed | 0 | 1 | TRUE | 14 | 329: 6, 327: 6, 325: 6, 307: 6 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 0 | 1 | 32.36 | 20.67 | 3.46 | 10.47 | 34.0 | 51.36 | 64.1 | ▇▂▃▅▆ |
age | 0 | 1 | 13.00 | 7.90 | 3.00 | 5.00 | 12.5 | 20.00 | 25.0 | ▇▃▃▃▃ |
Observation: I am aware that in a real world scenario we would need a bigger sample size in order to get reliable linear regression models, but here I am focused on studying the approach instead of checking all the assumptions required by the model.
Before jumping into the modeling, I would like to go through some things that I am considering here. It is important to mention that the p-value is a statistical measure that indicates the probability of obtaining the observed results, or even more extreme results, when the null hypothesis is true. In other words, it helps us assess whether the available evidence suggests that an observed difference or effect is statistically significant or could simply be the result of chance.
When we perform multiple statistical comparisons on the same dataset, we increase the probability of obtaining at least one significant result (i.e., a p-value smaller than the chosen significance level) by chance alone, even if there is no true effect present. This phenomenon is known as the “multiple testing problem” or “multiple comparisons problem”, and it can be addressed through p-value adjustment.
To illustrate the importance of p-value adjustment, consider an example where you conduct 20 independent hypothesis tests, each with a significance level of 0.05. If every null hypothesis is true, the probability of obtaining at least one false positive (false significant result) is 1 - (1 - 0.05)^20 ≈ 0.64, or approximately 64%. This means that if you do not adjust the p-values, you run a relatively high risk of erroneously concluding that there are significant effects when they do not exist.
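The arithmetic above can be checked with a few lines of R. This is only a sketch: the helper name `fwer` is made up for illustration, and the formula assumes the tests are independent and that every null hypothesis is true.

```r
# Probability of at least one false positive among m independent tests,
# each run at significance level alpha, with no adjustment applied.
# `fwer` is a hypothetical helper name, not a function from any package.
fwer <- function( m, alpha = 0.05 ) 1 - ( 1 - alpha )^m

fwer( 20 )                             # about 0.64, the figure quoted above
fwer( 14 )                             # 14 tests, one per Seed in Loblolly
round( fwer( c( 1, 5, 10, 20 ) ), 3 )  # how the risk grows with the number of tests
```

Even with the 14 tests used in this analysis, the unadjusted risk of at least one false positive is roughly one in two.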
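To address this problem, there are several adjustment techniques. The best known is the Bonferroni correction, in which the p-values are multiplied by the number of comparisons. In this analysis I use p.adjust( ) with its default method, Holm's, a step-down refinement of Bonferroni that multiplies the smallest p-value by the number of comparisons, the next smallest by one less, and so on. Both methods control the overall significance level to reduce the risk of false positives and make multiple comparisons more reliable.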
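To make the adjustment concrete, here is a small sketch using made-up p-values (they are not taken from the Loblolly models). It compares the Bonferroni correction with Holm's method, which is what `p.adjust( )` applies when no `method` argument is given.

```r
# Made-up p-values, purely for illustration.
p_raw <- c( 0.01, 0.02, 0.03, 0.04 )

# Bonferroni: multiply every p-value by the number of tests (capped at 1).
p_bonf <- p.adjust( p_raw, method = "bonferroni" )  # 0.04 0.08 0.12 0.16

# Holm: multiply the smallest by m, the next by m - 1, and so on,
# then make sure adjusted values never decrease along the sorted order.
p_holm <- p.adjust( p_raw, method = "holm" )        # 0.04 0.06 0.06 0.06

# With no method argument, p.adjust( ) uses "holm".
identical( p.adjust( p_raw ), p_holm )              # TRUE
```

Holm's adjusted values are never larger than Bonferroni's, so it is at least as powerful while still controlling the family-wise error rate.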
Besides that, I am not interested in the intercepts generated, which represent the estimated height when the age is zero (this makes no sense in this case), only in the slopes, which represent the estimated change in height associated with a one-unit increase in age. These coefficients provide crucial information about the relationship between height and age across different seeds. Thus, by analyzing the slopes, we can understand how the height of trees responds to changes in age.
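As a minimal sketch of the two coefficients, the model can be fitted for a single seed with base R alone. Seed "301" is chosen arbitrarily here, and the object names `one_seed`, `fit` and `coefs` are only illustrative.

```r
# Loblolly ships with base R (datasets package); keep the six rows of one Seed.
one_seed <- subset( as.data.frame( Loblolly ), Seed == "301" )
fit <- lm( height ~ age, data = one_seed )

coefs <- coef( fit )
coefs[[ "(Intercept)" ]]  # predicted height at age 0, an extrapolation outside the data
coefs[[ "age" ]]          # estimated change in height per one-unit increase in age
```

The same two numbers are produced for every seed in the full analysis; only the `age` term is kept.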
Finally, to accomplish everything mentioned above, this code was written:
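library( tidyverse )  # dplyr, tidyr, purrr, ggplot2
library( broom )      # tidy( ) for model coefficients

# Fit one linear model per Seed
tidy_lm <- Loblolly %>%
  nest( data = c( height, age ) ) %>%
  mutate( model = map( data, ~ lm( height ~ age, data = .x ) ) )

# Keep only the slope (age) term and adjust its p-values across the 14 models
slopes <- tidy_lm %>%
  mutate( coefs = map( model, tidy ) ) %>%
  unnest( coefs ) %>%
  filter( term == "age" ) %>%
  mutate( p.value = p.adjust( p.value, method = "holm" ) )  # "holm" is the p.adjust( ) default

slopes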
# A tibble: 14 × 8
Seed data model term estimate std.error statistic p.value
<ord> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 301 <nfnGrpdD> <lm> age 2.61 0.175 14.9 0.000691
2 303 <nfnGrpdD> <lm> age 2.71 0.164 16.5 0.000691
3 305 <nfnGrpdD> <lm> age 2.75 0.192 14.3 0.000691
4 307 <nfnGrpdD> <lm> age 2.57 0.136 18.9 0.000560
5 309 <nfnGrpdD> <lm> age 2.67 0.142 18.8 0.000560
6 311 <nfnGrpdD> <lm> age 2.60 0.142 18.3 0.000560
7 315 <nfnGrpdD> <lm> age 2.58 0.155 16.6 0.000691
8 319 <nfnGrpdD> <lm> age 2.60 0.161 16.2 0.000691
9 321 <nfnGrpdD> <lm> age 2.60 0.114 22.7 0.000310
10 323 <nfnGrpdD> <lm> age 2.65 0.182 14.5 0.000691
11 325 <nfnGrpdD> <lm> age 2.49 0.172 14.5 0.000691
12 327 <nfnGrpdD> <lm> age 2.42 0.146 16.6 0.000691
13 329 <nfnGrpdD> <lm> age 2.43 0.150 16.2 0.000691
14 331 <nfnGrpdD> <lm> age 2.57 0.134 19.2 0.000560
slopes %>%
  ggplot( aes( estimate, p.value, label = Seed ) ) +
  geom_vline( xintercept = 0, lty = 2, linewidth = 0.9, alpha = 0.7, color = "gray30" ) +
  geom_point( aes( color = Seed ), alpha = 0.8, size = 2.5, show.legend = FALSE ) +
  facet_wrap( ~Seed ) +
  labs( x = "estimates", title = "Increase in height per age" ) +
  theme_bw( )
The plot shows the estimated increase in height per unit of age (the slopes) against the adjusted p-value for each level of the Seed variable in the Loblolly dataset. Each facet represents a specific level of the Seed variable, and the point in it represents the slope for that seed. The dashed gray line at x = 0 is the reference line where there is no increase or decrease in height with age.
Every point is on the right side of the dashed line because, as expected, height increases over the years. The interesting part of this analysis is that the further to the right a point is, the larger the increase per unit of age, which enables a quick comparison between seeds. In summary, we can highlight seed 321, which has the lowest adjusted p-value (with a slope of 2.60, close to the middle of the range), and seed 305, which has the largest slope (2.75).
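To compare the seeds numerically rather than by eye, the slopes can also be ranked. The following is a base-R sketch that refits the per-seed models independently of the tidyverse pipeline; the names `by_seed`, `slope_by_seed` and `ranked` are only illustrative.

```r
# One data frame per Seed, then one slope (age coefficient) per data frame.
by_seed <- split( as.data.frame( Loblolly ), Loblolly$Seed )
slope_by_seed <- sapply( by_seed, function( d ) coef( lm( height ~ age, data = d ) )[[ "age" ]] )

# Sort from the steepest to the flattest growth.
ranked <- sort( slope_by_seed, decreasing = TRUE )

round( ranked, 2 )
range( ranked )  # spread of the slopes across the 14 seeds
```

The ranking makes it easier to see how narrow the range of slopes is across seeds.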
The p-value measures the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis (no relationship or no change with time) is true. Lower p-values on the y-axis of the plot indicate stronger evidence against the null hypothesis, suggesting more confidence in the existence of a real relationship between height and age, which in this case is very obvious.
In conclusion, it is important to mention that this approach of using statistical models to estimate changes in many subgroups at once is very useful in different situations and can be applied to get reliable insights across the data.