7.3 Data and Preparation
7.3.1 Available Data
The riversCentralAsia
Package provides available data of the gauging and meteorological stations in the Chirchik River Basin.
Before starting any type of modeling, it is important to get a good understanding of the data that we are dealing with and whether there exist problems with the raw data that need to be addressed prior to modeling. Problems usually include data gaps and outliers as data record that one obtains are usually ever complete nor clean of errors.
The steps performed here are thus required steps for any type of successful modeling and should be performed with great care. We concentrate our efforts here on discharge records and data from meteorological stations in the Chirchik River Basin. The techniques shown here for decadal (10-days) data naturally extend to monthly data and other basins.
7.3.2 Gap Filling Discharge Data
In the following, we will work with decadal discharge data from the two main tributaries, i.e. the Chatkal and (Gauge 16279) Pskem rivers (Gauge 16290) and the data of the inflow to the Charvak reservoir (Gauge 16924). The goal is to analyze the data and prepare for modeling. First, let us load the relevant discharge data.
<- ChirchikRiverBasin # load data
data <- data %>% filter(code == '16279' | code == '16290' | code == '16924') # Note for the new name of the object, we choose to add periodicity (_dec_) and data type (_tbl for tibble/dataframe) to the data name. This just helps to stay organized and is good practice in R programming.
q_dec_tbl q_dec_tbl
## # A tibble: 9,072 x 14
## date data norm units type code station river basin resolution
## <date> <dbl> <dbl> <chr> <fct> <chr> <chr> <chr> <chr> <fct>
## 1 1932-01-10 48.8 38.8 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 2 1932-01-20 48.4 37.5 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 3 1932-01-31 42.4 36.6 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 4 1932-02-10 43.7 36.4 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 5 1932-02-20 44.2 36.3 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 6 1932-02-29 47.7 36.9 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 7 1932-03-10 54.1 39.4 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 8 1932-03-20 63.2 47.6 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 9 1932-03-31 103 60.5 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## 10 1932-04-10 103 86.4 m3s Q 16279 Khudaydod Chatkal Chirch… dec
## # … with 9,062 more rows, and 4 more variables: lon_UTM42 <dbl>,
## # lat_UTM42 <dbl>, altitude_masl <dbl>, basinSize_sqkm <dbl>
You can get more information about the available data by typing ?ChirchikRiverBasin
.
It is advisable to check at this stage for missing data in time series and to fill gaps where present. As can be seen in Figure 7.1 , close inspection of the time series indeed reveals some missing data in the 1940ies.
%>% plot_time_series(date,data,
q_dec_tbl .facet_vars = code,
.smooth = FALSE,
.interactive = TRUE,
.x_lab = "year",
.y_lab = "m^3/s",
.title = ""
)
Figure 7.1 and the following Figures are interactive, so you can zoom in to regions of interest.
Missing data are also confirmed by the warning that the function timetk::plot_time_series()
throws (suppressed here). Statistics of the missing data can be easily obtained. As the Table below shows, we can do this analysis for each discharge station separately.
%>% group_by(code) %>%
q_dec_tbl summarize(n.na = sum(is.na(data)), na.perc = n.na/n()*100)
## # A tibble: 3 x 3
## code n.na na.perc
## <chr> <int> <dbl>
## 1 16279 15 0.496
## 2 16290 39 1.29
## 3 16924 42 1.39
Summarizing the number of observation with missing data reveals 15 data points for station 16279 (0.5 % of total record length) and 39 for station 16290 (1.3 % of total record length). As there are only very few gaps in the existing time series, we use a simple method to fill these. Wherever there is a gap, we fill in the corresponding decadal norm as stored in the norm column in the object q_dec_tbl
. The visualization of the results confirms that our simple gap filling approach is indeed satisfactory (Figure 7.2).
<- q_dec_tbl
q_dec_filled_tbl
$data[is.na(q_dec_filled_tbl$data)] =
q_dec_filled_tbl$norm[is.na(q_dec_filled_tbl$data)] # Gap filling step
q_dec_filled_tbl
%>% plot_time_series(date, data,
q_dec_filled_tbl .facet_vars = code,
.smooth = FALSE,
.interactive = TRUE,
.x_lab = "year",
.y_lab = "m^3/s",
.title = ""
)
A note of caution here. This simple gap filling technique reduces variance in the time series. It should only be used when the percentage of missing data is low. As will be discussed in the next Section 7.3.3 below, better techniques have to be utilized when there exist substantial gaps and in the case of less regular data.
Finally, we discard the norm data which we used for gap filling of the missing discharge data and convert the data to wide format (see the Table below) to add to it meteorological data in the next Section.
<- q_dec_filled_tbl %>% # again we use the name convention of objects as introduced above
q_dec_filled_wide_tbl mutate(code = paste0('Q',code %>% as.character())) %>% # Since we convert everything to long form, we want to keep information as compact as possible. Hence, we paste the type identifier (Q for discharge here) in from of the 5 digit station code.
::select(date,data,code) %>% # ... and then ditch all the remainder information
dplyrpivot_wider(names_from = "code",values_from = "data") # in order to pivot to the long format, we need to make a small detour via the wide format.
<- q_dec_filled_wide_tbl %>% pivot_longer(-date) # and then pivot back
q_dec_filled_long_tbl q_dec_filled_wide_tbl
## # A tibble: 3,024 x 4
## date Q16279 Q16290 Q16924
## <date> <dbl> <dbl> <dbl>
## 1 1932-01-10 48.8 38.3 87.1
## 2 1932-01-20 48.4 37.7 86.1
## 3 1932-01-31 42.4 36.2 78.6
## 4 1932-02-10 43.7 35.6 79.3
## 5 1932-02-20 44.2 35 79.2
## 6 1932-02-29 47.7 37.1 84.8
## 7 1932-03-10 54.1 43.1 97.2
## 8 1932-03-20 63.2 47 110
## 9 1932-03-31 103 72.1 175
## 10 1932-04-10 103 73.2 176
## # … with 3,014 more rows
As a result, we now have a complete record of decadal discharge data for the two main tributaries of the Chirchik river and the inflow time series to Charvak Reservoir from the beginning of 1932 until and including 2015, i.e. 84 years. The same type of preparatory analysis will now be carried out for the meteorological data.
7.3.3 Gap Filling Meteorological Data
Here, we use precipitation and temperature data from Pskem (38462), Chatkal (38471) and Charvak Reservoir (38464) Meteorological Stations (see Chapter ?? for more information on these stations). We also have data from Oygaing station (Station Code 38339) but the record only starts in 1962 and the time resolution is monthly. Therefore, we do not take this station into account here for the time being.
We start with precipitation and plot the available data.
<- data %>% filter(type=="P" & code!="38339")
p_dec_tbl %>% plot_time_series(date,data,
p_dec_tbl .facet_vars = code,
.interactive = TRUE,
.smooth = FALSE,
.title = "",
.y_lab = "mm/decade",
.x_lab = "year"
)
The precipitation data from these 3 stations shows some significant data gaps. The Chatkal Meteorological Station that is located in Kyrgyzstan apparently did not work in the post-transition years and continuous measurements were only resumed there in 1998.
Let us see what happens if we were to use the same simple gap filling technique that we introduced above for discharge.
<- p_dec_tbl
p_dec_filled_tbl $data[is.na(p_dec_filled_tbl$data)] = p_dec_filled_tbl$norm[is.na(p_dec_filled_tbl$data)]
p_dec_filled_tbl%>% plot_time_series(date,data,
p_dec_filled_tbl .facet_vars = code,
.interactive = TRUE,
.smooth = FALSE,
.title = "",
.y_lab = "mm/decade",
.x_lab = "year"
)
Closely inspect the significant data gap in the 1990ies at Station 38741 (tip: play around and zoom into the time series in the 1990ies in Figure 7.3 and comparing it with the resulting gap-filled timeseries in Figure ??. We see that our technique of gap filling with long-term norms is not suitable for this type of data and the significant gap size. The effect of variance reduction is also clearly visible.
Hence, we resort to a more powerful gap filling technique that uses a (regression) model to impute the missing values from existing ones at the neighboring stations, i.e. Stations 38462 and 38464. To do so, we utilize an R package that is tightly integrated in the tidyverse
18.
library(simputation)
# First, we bring the data into the suitable format.
<- p_dec_tbl %>%
p_dec_wide_tbl mutate(code = paste0('P',code %>% as.character())) %>%
::select(date,data,code) %>%
dplyrpivot_wider(names_from = "code",values_from = "data")
# Second, we impute missing values.
<- p_dec_wide_tbl %>%
p_dec_filled_wide_tbl impute_rlm(P38471 ~ P38462 + P38464) %>% # Imputing precipitation at station 38471 using a robust linear regression model
impute_rlm(P38462 ~ P38471 + P38464) %>% # Imputing precipitation at station 38462 using a robust linear regression model
impute_rlm(P38464 ~ P38462 + P38471) # Imputing precipitation at station 38464 using a robust linear regression model
<- p_dec_filled_wide_tbl %>% pivot_longer(c('P38462','P38464','P38471'))
p_dec_filled_long_tbl
%>% plot_time_series(date,value,
p_dec_filled_long_tbl.facet_vars = name,
.interactive = TRUE,
.smooth = FALSE,
.title = '',
.y_lab = "mm/decade",
.x_lab = "year"
)
As you can see, we use simple linear regression models to impute missing value in the target time series using observations from the neighboring stations.
Through simple visual inspection, it becomes clear that this type of regression model for gap filling is better suited than the previous approach chosen. Let us check whether we could successfully fill all gaps with this robust linear regression approach.
%>% group_by(name) %>% summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100) p_dec_filled_long_tbl
## # A tibble: 3 x 3
## name n.na n.na.perc
## <chr> <int> <dbl>
## 1 P38462 12 0.402
## 2 P38464 12 0.402
## 3 P38471 3 0.100
It turns out that we still have very few gaps to deal with. We can see them by simply visualizing the wide tibble. The problem persisted at times when two or more values were missing across the available stations at the same time and where thus the linear regression could not be carried out.
%>% head(10) p_dec_filled_wide_tbl
## # A tibble: 10 x 4
## date P38462 P38464 P38471
## <date> <dbl> <dbl> <dbl>
## 1 1933-01-10 NA NA 2
## 2 1933-01-20 NA NA 10
## 3 1933-01-31 NA NA 5
## 4 1933-02-10 NA NA 33
## 5 1933-02-20 NA NA 8
## 6 1933-02-28 NA NA 10
## 7 1933-03-10 NA NA 31
## 8 1933-03-20 NA NA 50
## 9 1933-03-31 NA NA 6
## 10 1933-04-10 23 21.3 13
%>% tail() p_dec_filled_wide_tbl
## # A tibble: 6 x 4
## date P38462 P38464 P38471
## <date> <dbl> <dbl> <dbl>
## 1 2015-11-10 72 81 19
## 2 2015-11-20 122 76 43
## 3 2015-11-30 7 2 3
## 4 2015-12-10 NA NA NA
## 5 2015-12-20 NA NA NA
## 6 2015-12-31 NA NA NA
We can solve the issues related to the missing values at the start of the observation record by using the same technique as above and by only regressing P38462 and P38464 on P38471.
<- p_dec_filled_wide_tbl %>%
p_dec_filled_wide_tbl impute_rlm(P38462 ~ P38471) %>% # Imputing precipitation at station 38462 using a robust linear regression model
impute_rlm(P38464 ~ P38471) # Imputing precipitation at station 38464 using a robust linear regression model
%>% head(10) p_dec_filled_wide_tbl
## # A tibble: 10 x 4
## date P38462 P38464 P38471
## <date> <dbl> <dbl> <dbl>
## 1 1933-01-10 5.60 5.08 2
## 2 1933-01-20 18.3 16.7 10
## 3 1933-01-31 10.4 9.46 5
## 4 1933-02-10 54.9 50.3 33
## 5 1933-02-20 15.2 13.8 8
## 6 1933-02-28 18.3 16.7 10
## 7 1933-03-10 51.8 47.3 31
## 8 1933-03-20 82.0 75.0 50
## 9 1933-03-31 12.0 10.9 6
## 10 1933-04-10 23 21.3 13
Converse to this, the complete set of observations is missing for December 2015. We will thus remove these non-observations from our tibble.
<- p_dec_filled_wide_tbl %>% na.omit()
p_dec_filled_wide_tbl %>% tail() p_dec_filled_wide_tbl
## # A tibble: 6 x 4
## date P38462 P38464 P38471
## <date> <dbl> <dbl> <dbl>
## 1 2015-10-10 5 1 0
## 2 2015-10-20 89 108 58
## 3 2015-10-31 34 40 12
## 4 2015-11-10 72 81 19
## 5 2015-11-20 122 76 43
## 6 2015-11-30 7 2 3
<- p_dec_filled_wide_tbl %>% pivot_longer(-date) p_dec_filled_long_tbl
Inspecting the temperature data, we see similar data issues as in the precipitation data set.
<- data %>% filter(type=="T")
t_dec_tbl %>% plot_time_series(date,data,
t_dec_tbl .facet_vars = code,
.interactive = TRUE,
.smooth = FALSE,
.title = '',
.y_lab = "deg. Celsius",
.x_lab = "year"
)
# First, we bring the data into the suitable format.
<- t_dec_tbl %>%
t_dec_wide_tbl mutate(code = paste0('T',code %>% as.character())) %>%
::select(date,data,code) %>%
dplyrpivot_wider(names_from = "code",values_from = "data")
# Second, we impute missing values.
<- t_dec_wide_tbl %>%
t_dec_filled_wide_tbl impute_rlm(T38471 ~ T38462) %>% # Imputing precipitation at station 38471 using a robust linear regression model
impute_rlm(T38462 ~ T38471) # Imputing precipitation at station 38462 using a robust linear regression model
<- t_dec_filled_wide_tbl %>%
t_dec_filled_long_tbl pivot_longer(c('T38462','T38471'))
%>%
t_dec_filled_long_tblplot_time_series(date,value,
.facet_vars = name,
.interactive = TRUE,
.smooth = FALSE,
.title = '',
.y_lab = "deg. Celsius",
.x_lab = "year"
)
There are some irregularities in the temperature time series of Chatkal Meteorological Station in the first decade of the 20th century (tip: zoom in to see these more clearly). Note that these were not introduced by the gap filling technique that we used but are most likely wrong temperature readings. We will return to these in the outlier analysis below in Section 7.3.4.
%>%
t_dec_filled_long_tbl group_by(name) %>%
summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 2 x 3
## name n.na n.na.perc
## <chr> <int> <dbl>
## 1 T38462 3 0.100
## 2 T38471 3 0.100
To see where the missing value are, we find them easily again by looking at the head and tail of the tibble.
%>% head() t_dec_filled_wide_tbl
## # A tibble: 6 x 3
## date T38462 T38471
## <date> <dbl> <dbl>
## 1 1933-01-10 -6.9 -16.6
## 2 1933-01-20 -6.1 -15.5
## 3 1933-01-31 -6.3 -15.6
## 4 1933-02-10 -2 -8.6
## 5 1933-02-20 -3.3 -12.5
## 6 1933-02-28 -0.1 -8.5
%>% tail() t_dec_filled_wide_tbl
## # A tibble: 6 x 3
## date T38462 T38471
## <date> <dbl> <dbl>
## 1 2015-11-10 2.4 -2.5
## 2 2015-11-20 2 -2.2
## 3 2015-11-30 4.6 -3.7
## 4 2015-12-10 NA NA
## 5 2015-12-20 NA NA
## 6 2015-12-31 NA NA
Finally, we remove the non observations again as above with the function na.omit
.
<- t_dec_filled_wide_tbl %>% na.omit()
t_dec_filled_wide_tbl <- t_dec_filled_wide_tbl %>% pivot_longer(-date) t_dec_filled_long_tbl
To deal with the missing values at the end of the observational record, we could also have used any other technique. Using the norm values however would have artificially reduced the variance in both cases as explained above. Furthermore and at least in the case of temperature, it is also questionable to what extent a norm calculated over the last 84 years is still representative given global warming. We will look in this important and interesting topic in the next section.
7.3.4 Anomalies and Outliers
We use the function timetk::plot_anomaly_diagnostics
to investigate anomalies in the time series. For discharge, we first log-transform the raw data with the following transformation to reduce the variance of the original data.
\[ \hat{q}(t) = log(q(t) + 1) \]
where \(\hat{q}(t)\) denotes the transformed discharge. Prior to the log transformation, 1 is added so as to avoid cases where discharge would be 0 and the logarithmic transform thus undefined. The transformation can easily be done with the log1p()
function in R. Backtransformation via the function expm1()
simply involves taking the exponent and subtracting 1 from the result. Figure ?? shows the result.
The exceptionally wet year 19169 shows up as anomalous in the Chatkal River Basin and at the downstream Charvak Reservoir inflow gauge.
, ?? and ?? show anomalies diagnostics of the available data.
%>%
q_dec_filled_long_tbl plot_anomaly_diagnostics(date,
%>% log1p(),
value .facet_vars = name,
.frequency = 36,
.interactive = TRUE,
.title = "")
The investigation of precipitation anomalies shows a succession of regular anomalous wet events over time. It is interesting to see that the winter 1968/69 regularly anomalous at all three stations (Figure 7.6, zoom in to investigate).
%>%
p_dec_filled_long_tbl plot_anomaly_diagnostics(date,
value,.facet_vars = name,
.interactive = TRUE,
.title = "")
While intuitively, we would have expected an eceptionally mild winter in 1968/69 due to the precipitation excess, the corresponding anomaly does not show up in the temperature record (Figure 7.7).
%>%
t_dec_filled_long_tbl plot_anomaly_diagnostics(date,value,
.facet_vars = name,
.interactive = TRUE,
.title = "")
Apart from the identification of extremal periods since as the 1969 discharge year in the Chatkal river basin, the diagnostics of anomalies also helps to identify likely erroneous data records. In Figure @ref(anomalies_T) for example, when we zoom into the data of the series T38471 in the first decade of the 21st century, problems in relation to positive anomalies during the winter are visible in 4 instances. One explanation would be that in at least some instances, the data are erroneously recorded as positive values when in fact they were negative (see dates ‘2002-01-31,’ ‘2005-01-10’ and ‘2007-02-28,’ Chatkal Station 38471).
7.3.5 Putting it all together
Finally, we are now in the position to assemble all data that we will use for empirical modeling. The data is stored in long and wide form and used accordingly where required. For example, in Section @ref{TimeSeriesReg}, we are working with the wide data format to investigate model features in linear regression. Note that we also add a column with a decade identifier. Its use will become apparent in the Section 7.5 below.
# Final concatenation
<- right_join(q_dec_filled_wide_tbl,p_dec_filled_wide_tbl,by='date')
data_wide_tbl <- right_join(data_wide_tbl,t_dec_filled_wide_tbl,by='date')
data_wide_tbl # Add period identifiers (decades in this case)
<- data_wide_tbl$date %>% first()
s <- data_wide_tbl$date %>% last()
e <- decadeMaker(s,e,'end')
decs <- decs %>% rename(per=dec)
decs <- data_wide_tbl %>% left_join(decs,by='date')
data_wide_tbl # Creating long form
<- data_wide_tbl %>%
data_long_tbl pivot_longer(-date)
# Cross checking completeness of record
%>%
data_long_tbl group_by(name) %>%
summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 9 x 3
## name n.na n.na.perc
## <chr> <int> <dbl>
## 1 P38462 0 0
## 2 P38464 0 0
## 3 P38471 0 0
## 4 per 0 0
## 5 Q16279 0 0
## 6 Q16290 0 0
## 7 Q16924 0 0
## 8 T38462 0 0
## 9 T38471 0 0
## Temp storage of data (remove later)
# fPath <- '/Users/tobiassiegfried/Dropbox (hydrosolutions)/1_HSOL_PROJECTS/PROJECTS/SDC/DKU_WRM_COURSE_CA/Course Materials/Handbook/Applied_Hydrological_Modeling_Bookdown/temp/'
# saveRDS(data_wide_tbl,file=paste(fPath,'data_wide_tbl',sep=""))
# saveRDS(data_long_tbl,file=paste(fPath,'data_long_tbl',sep=""))
A consistent data record from 1933 until and including November 2015 is now prepared^Please note that by using left_join
above, we have cut off discharge data from the year 1932 since we do not have meteorological data there.^. Let us analyze these data now.
Please note that if you do not have the required package installed locally, you should install it prior to its use with the following command
install.packages('simputation')
↩︎