Time Series Homework: Chapter 5 Lesson 3

Please_put_your_name_here

Data

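pacman::p_load(tidyverse, lubridate, tsibble, fable)  # assumed setup: packages used by the code below
c02 <- rio::import("https://byuistats.github.io/timeseries/data/co2_mm_mlo.csv")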

Questions

Question 1 - Context and Measurement (5 points)

The first part of any time series analysis is context. You cannot properly analyze data without knowing what the data is measuring. Without context, the most simple features of data can be obscure and inscrutable. This homework assignment will center around the series below.

Please research the time series. In the spaces below, give the data collection process, unit of analysis, and meaning of each observation for the series.

a) Atmospheric Carbon Dioxide

NOAA

Answer

Data Collection Process:

The data in this series were collected by the National Oceanic and Atmospheric Administration (NOAA) at the Mauna Loa Observatory in Hawaii. The monthly values are averages built up from hourly observations. Prior to April 2019, measurements were obtained using infrared absorption techniques; since then, Cavity Ring-Down Spectroscopy has been employed. The final CO₂ concentration values are determined from a standardized calculation that uses five calibration gases with known CO₂ levels. The dataset spans from 1958 through 2024.

Unit of Analysis: The unit of analysis for the series is the monthly average concentration of carbon dioxide in dry air, expressed as the number of CO₂ molecules per one million molecules of air (ppm).

Meaning of Each Observation: Each observation represents the concentration of carbon dioxide (CO₂) in the atmosphere, measured in parts per million (ppm), indicating the amount of CO₂ relative to other atmospheric gases. CO₂ is a major greenhouse gas and one of the primary drivers of climate change. Tracking its concentration over time provides valuable insight into seasonal patterns and long-term trends related to human activity, fossil fuel combustion, and land-use changes.

Question 2 - Seasonal Pattern Exploration (20 points)

a) Plot the Atmospheric Carbon Dioxide series.
Answer
co2_ts <- c02 |>
  mutate(date = mdy(date))|>
  as_tsibble(index = date)

ggplot(co2_ts, aes(x = date, y = monthly_avg))+
  geom_line()+
  labs(x = "Year",
       y = "Average monthly CO2 observed (in ppm)") +
  theme_bw()

b) Create a box plot of the seasonal variation in co2 atmospheric measurements.
Answer
ggplot(co2_ts, aes(x = as.factor(month), y = monthly_avg))+
  geom_boxplot()+
  theme_bw()+
  labs(
    x = "Month",
    y = "Average monthly CO2 observed (in ppm)"
  )

c) Please explain three likely factors that drive co2 seasonal patterns.
Answer

One possible factor is travel activity, particularly vacations to Hawaii. The box plot shows higher carbon dioxide concentrations in May and June, months that coincide with the end of the school year in the United States, when travel tends to spike. Similar increases appear in November and December, aligning with the Thanksgiving and Christmas breaks. This suggests that tourism may contribute to localized CO₂ increases, although the observatory is sited to sample well-mixed background air, so any such effect is probably small.

A second factor is ocean-atmosphere CO₂ exchange. During warmer months, surface ocean waters release more CO₂ into the atmosphere because gases are less soluble at higher temperatures.

A third factor, and likely the dominant one, is plant photosynthesis. In colder months such as January and February, Northern Hemisphere vegetation goes dormant or decays, which lowers CO₂ uptake and lets atmospheric CO₂ build up; during the growing season, photosynthesis draws it back down.

Question 3 - Model Selection: Additive Harmonic Seasonal Variables (50 points)

a) Using the atmospheric co2 time series, please estimate a linear model with a linear trend and harmonic seasonal variables. Include three models, one with the complete set of harmonic variables, and two with reduced harmonic components. Please use the Time Series Notebook Ch5 Lesson 3 result table format when presenting your results that include a column that identifies the variable as significant.
Answer
co2_harmonic <- c02 |>
  mutate(TIME = 1:n()) |>
  mutate(
    cos1 = cos(2 * pi * 1 * TIME/12),
    cos2 = cos(2 * pi * 2 * TIME/12),
    cos3 = cos(2 * pi * 3 * TIME/12),
    cos4 = cos(2 * pi * 4 * TIME/12),
    cos5 = cos(2 * pi * 5 * TIME/12),
    cos6 = cos(2 * pi * 6 * TIME/12),
    sin1 = sin(2 * pi * 1 * TIME/12),
    sin2 = sin(2 * pi * 2 * TIME/12),
    sin3 = sin(2 * pi * 3 * TIME/12),
    sin4 = sin(2 * pi * 4 * TIME/12),
    sin5 = sin(2 * pi * 5 * TIME/12),
    sin6 = sin(2 * pi * 6 * TIME/12)) |>
  as_tsibble(index = TIME) |>
  mutate(zTIME = (TIME - mean(TIME)) / sd(TIME))
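
The data preparation above creates sin6, yet the model formulas that follow leave it out. A likely reason (the source does not state it) is that, for an integer monthly index, sin(2 * pi * 6 * TIME / 12) equals sin(pi * TIME), which is zero at every observation, so its coefficient cannot be estimated. The short check below is only a sketch on a hypothetical 24-month index, not the homework data:

```r
# Hypothetical monthly index; any run of integers behaves the same way.
TIME <- 1:24

sin6 <- sin(2 * pi * 6 * TIME / 12)  # same as sin(pi * TIME)
cos6 <- cos(2 * pi * 6 * TIME / 12)  # same as cos(pi * TIME)

# sin6 is zero up to floating-point error, so it carries no information.
max(abs(sin6))

# cos6 alternates between -1 and 1, so it can stay in the model.
round(cos6[1:4])
```

This is why a complete harmonic set for monthly data has eleven seasonal terms (sin1 to sin5 and cos1 to cos6) rather than twelve.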


full_linear_lm <- co2_harmonic |>
  model(full_linear = TSLM(monthly_avg ~ zTIME +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ))

full_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) |>
  pander::pander()
.model       term         estimate   std.error  statistic  p.value    sig
full_linear  (Intercept)  359.4      0.1528     2352       0          TRUE
full_linear  zTIME        30.95      0.1529     202.4      0          TRUE
full_linear  sin1         2.345      0.2161     10.85      1.279e-25  TRUE
full_linear  cos1         -1.612     0.2161     -7.459     2.351e-13  TRUE
full_linear  sin2         -0.006434  0.2159     -0.0298    0.9762     FALSE
full_linear  cos2         0.8039     0.2162     3.718      0.0002152  TRUE
full_linear  sin3         0.007169   0.2161     0.03318    0.9735     FALSE
full_linear  cos3         0.04291    0.2161     0.1986     0.8426     FALSE
full_linear  sin4         -0.02205   0.2159     -0.1021    0.9187     FALSE
full_linear  cos4         -0.09631   0.2162     -0.4455    0.6561     FALSE
full_linear  sin5         -0.03203   0.2161     -0.1482    0.8822     FALSE
full_linear  cos5         -0.0244    0.2161     -0.1129    0.9101     FALSE
full_linear  cos6         0.004822   0.1528     0.03156    0.9748     FALSE
reduced_linear_1 <- co2_harmonic |>
  model(reduced_linear_1 = TSLM(monthly_avg ~ zTIME +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 ))

reduced_linear_1 |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 10 × 7
   .model           term         estimate std.error statistic  p.value sig  
   <chr>            <chr>           <dbl>     <dbl>     <dbl>    <dbl> <lgl>
 1 reduced_linear_1 (Intercept) 359.          0.152 2357.     0        TRUE 
 2 reduced_linear_1 zTIME        30.9         0.153  203.     0        TRUE 
 3 reduced_linear_1 sin1          2.34        0.216   10.9    1.03e-25 TRUE 
 4 reduced_linear_1 cos1         -1.61        0.216   -7.47   2.12e-13 TRUE 
 5 reduced_linear_1 sin2         -0.00639     0.216   -0.0296 9.76e- 1 FALSE
 6 reduced_linear_1 cos2          0.804       0.216    3.73   2.09e- 4 TRUE 
 7 reduced_linear_1 sin3          0.00717     0.216    0.0332 9.73e- 1 FALSE
 8 reduced_linear_1 cos3          0.0429      0.216    0.199  8.43e- 1 FALSE
 9 reduced_linear_1 sin4         -0.0221      0.216   -0.103  9.18e- 1 FALSE
10 reduced_linear_1 cos4         -0.0963      0.216   -0.446  6.55e- 1 FALSE
reduced_linear_2 <- co2_harmonic |>
  model(reduced_linear_2 = TSLM(monthly_avg ~ zTIME +
    sin1 + cos1 + cos2  ))

reduced_linear_2 |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 5 × 7
  .model           term        estimate std.error statistic  p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>    <dbl> <lgl>
1 reduced_linear_2 (Intercept)  359.        0.152   2364.   0        TRUE 
2 reduced_linear_2 zTIME         30.9       0.152    203.   0        TRUE 
3 reduced_linear_2 sin1           2.34      0.215     10.9  7.29e-26 TRUE 
4 reduced_linear_2 cos1          -1.61      0.215     -7.50 1.79e-13 TRUE 
5 reduced_linear_2 cos2           0.804     0.215      3.74 2.00e- 4 TRUE 
b) Using the atmospheric co2 time series, please estimate a linear model with a quadratic trend and harmonic seasonal variables. Include three models, one with the complete set of harmonic variables, and two with reduced harmonic components. Please use the Time Series Notebook Ch5 Lesson 3 result table format when presenting your results that include a column that identifies the variable as significant.
Answer
full_quadratic_lm <- co2_harmonic |>
  model(full_quadratic = TSLM(monthly_avg ~ zTIME + I(zTIME^2) +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ))

full_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 14 × 7
   .model         term         estimate std.error statistic   p.value sig  
   <chr>          <chr>           <dbl>     <dbl>     <dbl>     <dbl> <lgl>
 1 full_quadratic (Intercept) 355.         0.0426  8324.    0         TRUE 
 2 full_quadratic zTIME        30.9        0.0284  1088.    0         TRUE 
 3 full_quadratic I(zTIME^2)    4.66       0.0318   147.    0         TRUE 
 4 full_quadratic sin1          2.31       0.0402    57.5   1.27e-280 TRUE 
 5 full_quadratic cos1         -1.64       0.0402   -40.9   3.30e-195 TRUE 
 6 full_quadratic sin2         -0.0474     0.0402    -1.18  2.38e-  1 FALSE
 7 full_quadratic cos2          0.804      0.0402    20.0   6.34e- 72 TRUE 
 8 full_quadratic sin3         -0.0164     0.0402    -0.408 6.83e-  1 FALSE
 9 full_quadratic cos3          0.0665     0.0402     1.65  9.85e-  2 FALSE
10 full_quadratic sin4         -0.0221     0.0402    -0.549 5.83e-  1 FALSE
11 full_quadratic cos4         -0.0728     0.0402    -1.81  7.07e-  2 FALSE
12 full_quadratic sin5         -0.0234     0.0402    -0.583 5.60e-  1 FALSE
13 full_quadratic cos5         -0.0158     0.0402    -0.393 6.95e-  1 FALSE
14 full_quadratic cos6          0.00482    0.0284     0.170 8.65e-  1 FALSE
reduced_quadratic_1 <- co2_harmonic |>
  model(reduced_quadratic_1 = TSLM(monthly_avg ~ zTIME + I(zTIME^2) +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 ))

reduced_quadratic_1 |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 9 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_1 (Intercept) 355.        0.0426  8329.    0         TRUE 
2 reduced_quadratic_1 zTIME        30.9       0.0284  1089.    0         TRUE 
3 reduced_quadratic_1 I(zTIME^2)    4.66      0.0318   147.    0         TRUE 
4 reduced_quadratic_1 sin1          2.31      0.0402    57.6   9.90e-282 TRUE 
5 reduced_quadratic_1 cos1         -1.64      0.0402   -41.0   7.75e-196 TRUE 
6 reduced_quadratic_1 sin2         -0.0472    0.0401    -1.18  2.40e-  1 FALSE
7 reduced_quadratic_1 cos2          0.804     0.0402    20.0   4.59e- 72 TRUE 
8 reduced_quadratic_1 sin3         -0.0164    0.0402    -0.407 6.84e-  1 FALSE
9 reduced_quadratic_1 cos3          0.0663    0.0402     1.65  9.92e-  2 FALSE
reduced_quadratic_2 <- co2_harmonic |>
  model(reduced_quadratic_2 = TSLM(monthly_avg ~ zTIME + I(zTIME^2) +
    sin1 + cos1+ cos2))

reduced_quadratic_2 |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 6 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_2 (Intercept)  355.       0.0426    8322.  0         TRUE 
2 reduced_quadratic_2 zTIME         30.9      0.0284    1088.  0         TRUE 
3 reduced_quadratic_2 I(zTIME^2)     4.66     0.0318     146.  0         TRUE 
4 reduced_quadratic_2 sin1           2.31     0.0402      57.5 5.01e-282 TRUE 
5 reduced_quadratic_2 cos1          -1.64     0.0402     -40.9 5.45e-196 TRUE 
6 reduced_quadratic_2 cos2           0.804    0.0402      20.0 4.99e- 72 TRUE 
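
The quadratic models above use the standardized index zTIME rather than the raw TIME counter. One practical benefit, offered here as a plausible motivation rather than something the source states, is that a raw index and its square are almost perfectly correlated, whereas a centered, evenly spaced index is essentially uncorrelated with its square. A minimal sketch on a hypothetical 100-point index:

```r
# Hypothetical evenly spaced time index (not the homework data).
TIME <- 1:100

# Same standardization as in the data preparation step.
zTIME <- (TIME - mean(TIME)) / sd(TIME)

# Raw index versus its square: correlation close to 1.
cor(TIME, TIME^2)

# Standardized index versus its square: correlation essentially 0,
# because the centered values are symmetric around zero.
cor(zTIME, zTIME^2)
```

With less overlap between the linear and squared terms, their coefficients are easier to estimate separately and their standard errors are easier to interpret.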
c) Using the atmospheric co2 time series, please estimate a linear model with an exponential trend and harmonic seasonal variables. Include three models, one with the complete set of harmonic variables, and two with reduced harmonic components. Please use the Time Series Notebook Ch5 Lesson 3 result table format when presenting your results that include a column that identifies the variable as significant.
Answer
full_exponential_lm <- co2_harmonic |>
  model(full_exponential = TSLM(monthly_avg ~ exp(zTIME) + 
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ))

full_exponential_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 13 × 7
   .model           term          estimate std.error statistic     p.value sig  
   <chr>            <chr>            <dbl>     <dbl>     <dbl>       <dbl> <lgl>
 1 full_exponential (Intercept) 327.           0.455 718.      0           TRUE 
 2 full_exponential exp(zTIME)   20.7          0.212  97.5     0           TRUE 
 3 full_exponential sin1          2.26         0.435   5.20    0.000000257 TRUE 
 4 full_exponential cos1         -1.73         0.435  -3.97    0.0000795   TRUE 
 5 full_exponential sin2         -0.128        0.435  -0.294   0.769       FALSE
 6 full_exponential cos2          0.797        0.436   1.83    0.0679      FALSE
 7 full_exponential sin3         -0.0625       0.435  -0.143   0.886       FALSE
 8 full_exponential cos3          0.113        0.435   0.259   0.796       FALSE
 9 full_exponential sin4         -0.0175       0.435  -0.0402  0.968       FALSE
10 full_exponential cos4         -0.0266       0.436  -0.0611  0.951       FALSE
11 full_exponential sin5         -0.00160      0.435  -0.00368 0.997       FALSE
12 full_exponential cos5         -0.00384      0.435  -0.00883 0.993       FALSE
13 full_exponential cos6          0.000939     0.308   0.00305 0.998       FALSE
reduced_exponential_1 <- co2_harmonic |>
  model(reduced_exponential_1 = TSLM(monthly_avg ~ exp(zTIME)  + 
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 ))
reduced_exponential_1 |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 9 × 7
  .model                term        estimate std.error statistic   p.value sig  
  <chr>                 <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_exponential_1 (Intercept) 327.         0.454  720.       0       TRUE 
2 reduced_exponential_1 exp(zTIME)   20.7        0.212   97.8      0       TRUE 
3 reduced_exponential_1 sin1          2.26       0.434    5.21     2.39e-7 TRUE 
4 reduced_exponential_1 cos1         -1.73       0.434   -3.98     7.62e-5 TRUE 
5 reduced_exponential_1 sin2         -0.128      0.434   -0.294    7.68e-1 FALSE
6 reduced_exponential_1 cos2          0.797      0.435    1.83     6.72e-2 FALSE
7 reduced_exponential_1 sin3         -0.0624     0.434   -0.144    8.86e-1 FALSE
8 reduced_exponential_1 cos3          0.113      0.434    0.260    7.95e-1 FALSE
9 reduced_exponential_1 sin4         -0.0175     0.434   -0.0403   9.68e-1 FALSE
reduced_exponential_2 <- co2_harmonic |>
  model(reduced_exponential_2 = TSLM(monthly_avg ~ exp(zTIME) + 
    sin1 + cos1 + cos2 ))
reduced_exponential_2 |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 5 × 7
  .model                term        estimate std.error statistic   p.value sig  
  <chr>                 <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_exponential_2 (Intercept)  327.        0.453    721.     0       TRUE 
2 reduced_exponential_2 exp(zTIME)    20.7       0.211     98.0    0       TRUE 
3 reduced_exponential_2 sin1           2.26      0.433      5.22   2.25e-7 TRUE 
4 reduced_exponential_2 cos1          -1.73      0.433     -3.99   7.27e-5 TRUE 
5 reduced_exponential_2 cos2           0.797     0.434      1.84   6.65e-2 FALSE
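
The models above enter the exponential trend as the regressor exp(zTIME). Another common specification, which this answer does not use, regresses the log of the series on time, so that y = exp(b0 + b1 * t) becomes linear in its parameters. The sketch below illustrates that alternative on simulated data; the series, the seed, and every number in it are hypothetical.

```r
set.seed(1)

# Simulated series with a known exponential trend (hypothetical values).
t <- 1:120
y <- 300 * exp(0.002 * t + rnorm(120, sd = 0.001))

# Taking logs turns the exponential trend into a straight line.
fit <- lm(log(y) ~ t)
b <- coef(fit)

exp(b[["(Intercept)"]])  # estimated level at t = 0 (simulated with 300)
b[["t"]]                 # estimated growth rate per step (simulated with 0.002)
```

Note that information criteria from a model fitted to log(y) are not directly comparable with those from models fitted to y, so a comparison table like the one in part d) would need a common response scale.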
d) Please use AIC, AICc, and BIC to help you argue for the best model to fit the atmospheric co2 data. Please include a table similar to the one found in the Model Comparison section of Time Series Notebook Ch5 Lesson 3. Make sure you take into account the discussion on the dangers of only using algorithms for model selection found on the Time Series notebook section on model selection.
Answer
model_combined <- co2_harmonic |>
  model(
    full_exponential = TSLM(monthly_avg ~ exp(zTIME) + 
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ),
    full_quadratic = TSLM(monthly_avg ~ zTIME + I(zTIME^2) +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ),
    full_linear = TSLM(monthly_avg ~ zTIME +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ),
    reduced_exponential_1  = TSLM(monthly_avg ~ exp(zTIME)  + 
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 ),
    reduced_exponential_2  = TSLM(monthly_avg ~ exp(zTIME)  + 
    sin1 + cos1 + cos2 ),
    reduced_quadratic_1  = TSLM(monthly_avg ~ zTIME + I(zTIME^2) +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 ),
    reduced_quadratic_2  = TSLM(monthly_avg ~ zTIME + I(zTIME^2) +
    sin1 + cos1+ cos2),
    reduced_linear_1  = TSLM(monthly_avg ~ zTIME +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 ),
    reduced_linear_2  = TSLM(monthly_avg ~ zTIME +
    sin1 + cos1 + cos2  )
  )

glance(model_combined) |>
  select(.model, AIC, AICc, BIC)
# A tibble: 9 × 4
  .model                  AIC  AICc   BIC
  <chr>                 <dbl> <dbl> <dbl>
1 full_exponential      3382. 3383. 3447.
2 full_quadratic        -344. -343. -274.
3 full_linear           2286. 2287. 2351.
4 reduced_exponential_1 3374. 3374. 3421.
5 reduced_exponential_2 3366. 3366. 3394.
6 reduced_quadratic_1   -350. -349. -303.
7 reduced_quadratic_2   -351. -351. -319.
8 reduced_linear_1      2280. 2281. 2331.
9 reduced_linear_2      2270. 2271. 2298.

While AIC, AICc, and BIC are useful tools for model selection, they should not be the only factors guiding the choice. In the table above, the reduced_quadratic_2 model has the lowest values for all three criteria, suggesting it has the best balance of fit and simplicity among the candidates. It is also the smallest of the quadratic models, and every term in it is significant at the 0.05 level. However, model interpretability and domain knowledge also matter: a model that is slightly worse on the information criteria may still be preferable if it is easier to explain or aligns more closely with known physical processes. For example, if the reduced_quadratic_1 model were more interpretable and only marginally worse in AIC, AICc, and BIC, it could be a justifiable choice. It is also worth being cautious about overfitting, especially when models become increasingly complex with diminishing returns in performance.
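
To make the comparison easier to read, the criteria can be expressed as differences from the best model. The sketch below re-enters the rounded values printed in the table above by hand (an assumption made for illustration; in practice the differences could be computed from the glance() output directly).

```r
# Rounded values copied from the model comparison table above.
ic <- data.frame(
  model = c("full_exponential", "full_quadratic", "full_linear",
            "reduced_exponential_1", "reduced_exponential_2",
            "reduced_quadratic_1", "reduced_quadratic_2",
            "reduced_linear_1", "reduced_linear_2"),
  AIC  = c(3382, -344, 2286, 3374, 3366, -350, -351, 2280, 2270),
  AICc = c(3383, -343, 2287, 3374, 3366, -349, -351, 2281, 2271),
  BIC  = c(3447, -274, 2351, 3421, 3394, -303, -319, 2331, 2298)
)

# Difference from the lowest (best) value of each criterion.
ic$dAIC  <- ic$AIC  - min(ic$AIC)
ic$dAICc <- ic$AICc - min(ic$AICc)
ic$dBIC  <- ic$BIC  - min(ic$BIC)

# Models ordered from best to worst by AIC.
ic[order(ic$AIC), c("model", "dAIC", "dAICc", "dBIC")]
```

On these rounded figures the three quadratic models differ by at most 7 AIC units, BIC separates them more clearly because it penalizes the extra harmonic terms more heavily, and the best linear model trails by more than 2,600 units. The choice of trend matters far more here than the choice of harmonics.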

Rubric

Question 1: Context and Measurement

Mastery (5): The student thoroughly researches the data collection process, unit of analysis, and meaning of each observation for the requested time series. Clear and comprehensive explanations are provided.
Incomplete (0): The student does not adequately research or provide information on the data collection process, unit of analysis, and meaning of each observation for the specified series.

Question 2a: Time series plot

Mastery (5): Students plot the Atmospheric Carbon Dioxide series, ensuring high-quality visualization with clear labels and titles.
Incomplete (0): Submissions have low-quality visualizations or unclear labeling.

Question 2b: Box Plot

Mastery (5): Students create a box plot of the seasonal variation in CO2 atmospheric measurements, providing clear interpretation and labeling.
Incomplete (0): Submissions have low-quality visualizations or unclear labeling.

Question 2c: Seasonal Patterns

Mastery (10): Students provide a clear and accurate explanation of three likely factors that drive CO2 seasonal patterns, demonstrating an understanding of the underlying time series and relevant environmental and ecological processes.
Incomplete (0): Students provide incomplete or inaccurate explanations of factors that drive CO2 seasonal patterns or fail to demonstrate an understanding of the data generation process and relevant environmental and ecological processes.

Question 3a: Linear Trend Harmonic Seasonal Variables

Mastery (10): Students accurately estimate three linear models with a linear trend and harmonic seasonal variables, providing clear presentation of results including a column identifying significant variables.
Incomplete (0): Students fail to estimate one or more of the models requested. The presentation of the results is unclear or incomplete, or fails to identify significant variables appropriately.

Question 3b: Quadratic Trend Harmonic Seasonal Variables

Mastery (10): Students accurately estimate three linear models with a quadratic trend and harmonic seasonal variables, providing clear presentation of results including a column identifying significant variables.
Incomplete (0): Students fail to estimate one or more of the models requested. The presentation of the results is unclear or incomplete, or fails to identify significant variables appropriately.

Question 3c: Exponential Trend Harmonic Seasonal Variables

Mastery (10): Students accurately estimate three linear models with an exponential trend and harmonic seasonal variables, providing clear presentation of results including a column identifying significant variables.
Incomplete (0): Students fail to estimate one or more of the models requested. The presentation of the results is unclear or incomplete, or fails to identify significant variables appropriately.

Question 3d: Model Selection

Mastery (20): Students effectively use AIC, AICc, and BIC to compare and evaluate models, presenting results in a clear table format similar to the one found in the Model Comparison section of the Time Series Notebook Ch5 Lesson 3. Their discussion of the nuances of model selection shows that they understand the importance of considering the context and data generating process that is part of model specification.
Incomplete (0): Students struggle to effectively use AIC, AICc, and BIC to compare and evaluate models, resulting in unclear or incomplete presentation of results or failure to address the dangers of relying solely on algorithms for model selection.




Total Points 75