7.3 Data and Preparation

7.3.1 Available Data

The riversCentralAsia Package provides available data of the gauging and meteorological stations in the Chirchik River Basin.

Before starting any type of modeling, it is important to get a good understanding of the data that we are dealing with and whether there exist problems with the raw data that need to be addressed prior to modeling. Problems usually include data gaps and outliers as data record that one obtains are usually ever complete nor clean of errors.

The steps performed here are thus required steps for any type of successful modeling and should be performed with great care. We concentrate our efforts here on discharge records and data from meteorological stations in the Chirchik River Basin. The techniques shown here for decadal (10-days) data naturally extend to monthly data and other basins.

7.3.2 Gap Filling Discharge Data

In the following, we will work with decadal discharge data from the two main tributaries, i.e. the Chatkal and (Gauge 16279) Pskem rivers (Gauge 16290) and the data of the inflow to the Charvak reservoir (Gauge 16924). The goal is to analyze the data and prepare for modeling. First, let us load the relevant discharge data.

data <- ChirchikRiverBasin # load data
q_dec_tbl <- data %>% filter(code == '16279' | code == '16290' | code == '16924') # Note for the new name of the object, we choose to add periodicity (_dec_) and data type (_tbl for tibble/dataframe) to the data name. This just helps to stay organized and is good practice in R programming.
q_dec_tbl
## # A tibble: 9,072 x 14
##    date        data  norm units type  code  station   river   basin   resolution
##    <date>     <dbl> <dbl> <chr> <fct> <chr> <chr>     <chr>   <chr>   <fct>     
##  1 1932-01-10  48.8  38.8 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  2 1932-01-20  48.4  37.5 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  3 1932-01-31  42.4  36.6 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  4 1932-02-10  43.7  36.4 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  5 1932-02-20  44.2  36.3 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  6 1932-02-29  47.7  36.9 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  7 1932-03-10  54.1  39.4 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  8 1932-03-20  63.2  47.6 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
##  9 1932-03-31 103    60.5 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
## 10 1932-04-10 103    86.4 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
## # … with 9,062 more rows, and 4 more variables: lon_UTM42 <dbl>,
## #   lat_UTM42 <dbl>, altitude_masl <dbl>, basinSize_sqkm <dbl>

You can get more information about the available data by typing ?ChirchikRiverBasin.

It is advisable to check at this stage for missing data in time series and to fill gaps where present. As can be seen in Figure 7.1 , close inspection of the time series indeed reveals some missing data in the 1940ies.

q_dec_tbl %>% plot_time_series(date,data,
                               .facet_vars  = code,
                               .smooth      = FALSE,
                               .interactive = TRUE,
                               .x_lab       = "year",
                               .y_lab       = "m^3/s",
                               .title       = ""
                               )

Figure 7.1: Discharge data of selected gauges in the upstream zone of runoff formation in the Chirchik River Basin. Data Source: Uzbek Hydrometeorological Service.

Figure 7.1 and the following Figures are interactive, so you can zoom in to regions of interest.

Missing data are also confirmed by the warning that the function timetk::plot_time_series() throws (suppressed here). Statistics of the missing data can be easily obtained. As the Table below shows, we can do this analysis for each discharge station separately.

q_dec_tbl %>% group_by(code) %>% 
  summarize(n.na = sum(is.na(data)), na.perc = n.na/n()*100)
## # A tibble: 3 x 3
##   code   n.na na.perc
##   <chr> <int>   <dbl>
## 1 16279    15   0.496
## 2 16290    39   1.29 
## 3 16924    42   1.39

Summarizing the number of observation with missing data reveals 15 data points for station 16279 (0.5 % of total record length) and 39 for station 16290 (1.3 % of total record length). As there are only very few gaps in the existing time series, we use a simple method to fill these. Wherever there is a gap, we fill in the corresponding decadal norm as stored in the norm column in the object q_dec_tbl. The visualization of the results confirms that our simple gap filling approach is indeed satisfactory (Figure 7.2).

q_dec_filled_tbl <- q_dec_tbl

q_dec_filled_tbl$data[is.na(q_dec_filled_tbl$data)] = 
  q_dec_filled_tbl$norm[is.na(q_dec_filled_tbl$data)] # Gap filling step

q_dec_filled_tbl %>% plot_time_series(date, data, 
                                      .facet_vars  = code, 
                                      .smooth      = FALSE,
                                      .interactive = TRUE,
                                      .x_lab       = "year",
                                      .y_lab       = "m^3/s",
                                      .title       = ""
                                      )

Figure 7.2: Gap filled Pskem and Chatkal river discharges.

A note of caution here. This simple gap filling technique reduces variance in the time series. It should only be used when the percentage of missing data is low. As will be discussed in the next Section 7.3.3 below, better techniques have to be utilized when there exist substantial gaps and in the case of less regular data.

Finally, we discard the norm data which we used for gap filling of the missing discharge data and convert the data to wide format (see the Table below) to add to it meteorological data in the next Section.

q_dec_filled_wide_tbl <- q_dec_filled_tbl %>% # again we use the name convention of objects as introduced above
  mutate(code = paste0('Q',code %>% as.character())) %>% # Since we convert everything to long form, we want to keep information as compact as possible. Hence, we paste the type identifier (Q for discharge here) in from of the 5 digit station code.
  dplyr::select(date,data,code) %>% # ... and then ditch all the remainder information
  pivot_wider(names_from = "code",values_from = "data") # in order to pivot to the long format, we need to make a small detour via the wide format.

q_dec_filled_long_tbl <- q_dec_filled_wide_tbl %>% pivot_longer(-date) # and then pivot back
q_dec_filled_wide_tbl
## # A tibble: 3,024 x 4
##    date       Q16279 Q16290 Q16924
##    <date>      <dbl>  <dbl>  <dbl>
##  1 1932-01-10   48.8   38.3   87.1
##  2 1932-01-20   48.4   37.7   86.1
##  3 1932-01-31   42.4   36.2   78.6
##  4 1932-02-10   43.7   35.6   79.3
##  5 1932-02-20   44.2   35     79.2
##  6 1932-02-29   47.7   37.1   84.8
##  7 1932-03-10   54.1   43.1   97.2
##  8 1932-03-20   63.2   47    110  
##  9 1932-03-31  103     72.1  175  
## 10 1932-04-10  103     73.2  176  
## # … with 3,014 more rows

As a result, we now have a complete record of decadal discharge data for the two main tributaries of the Chirchik river and the inflow time series to Charvak Reservoir from the beginning of 1932 until and including 2015, i.e. 84 years. The same type of preparatory analysis will now be carried out for the meteorological data.

7.3.3 Gap Filling Meteorological Data

Here, we use precipitation and temperature data from Pskem (38462), Chatkal (38471) and Charvak Reservoir (38464) Meteorological Stations (see Chapter ?? for more information on these stations). We also have data from Oygaing station (Station Code 38339) but the record only starts in 1962 and the time resolution is monthly. Therefore, we do not take this station into account here for the time being.

We start with precipitation and plot the available data.

p_dec_tbl <- data %>% filter(type=="P" & code!="38339") 
p_dec_tbl %>% plot_time_series(date,data,
                               .facet_vars  = code,
                               .interactive = TRUE,
                               .smooth      = FALSE,
                               .title       = "",
                               .y_lab       = "mm/decade",
                               .x_lab       = "year"
                               )

Figure 7.3: Raw decadal precipitation data from Pskem (38462), Charvak Reservoir (38471) and Chatkal Meteo Station (38471).

The precipitation data from these 3 stations shows some significant data gaps. The Chatkal Meteorological Station that is located in Kyrgyzstan apparently did not work in the post-transition years and continuous measurements were only resumed there in 1998.

Let us see what happens if we were to use the same simple gap filling technique that we introduced above for discharge.

p_dec_filled_tbl <- p_dec_tbl
p_dec_filled_tbl$data[is.na(p_dec_filled_tbl$data)] = p_dec_filled_tbl$norm[is.na(p_dec_filled_tbl$data)]
p_dec_filled_tbl %>% plot_time_series(date,data,
                                      .facet_vars  = code,
                                      .interactive = TRUE,
                                      .smooth      = FALSE,
                                      .title       = "",
                                      .y_lab       = "mm/decade",
                                      .x_lab       = "year"
                                      )

Figure 7.4: Precipitation Data gap-filled with norms. The filled values from 1990 - 2000 in the case of the Station 38471 indicate that the norm-filling technique is not good.

Closely inspect the significant data gap in the 1990ies at Station 38741 (tip: play around and zoom into the time series in the 1990ies in Figure 7.3 and comparing it with the resulting gap-filled timeseries in Figure ??. We see that our technique of gap filling with long-term norms is not suitable for this type of data and the significant gap size. The effect of variance reduction is also clearly visible.

Hence, we resort to a more powerful gap filling technique that uses a (regression) model to impute the missing values from existing ones at the neighboring stations, i.e. Stations 38462 and 38464. To do so, we utilize an R package that is tightly integrated in the tidyverse18.

library(simputation)
# First, we bring the data into the suitable format. 
p_dec_wide_tbl <- p_dec_tbl %>% 
  mutate(code = paste0('P',code %>% as.character())) %>% 
  dplyr::select(date,data,code) %>% 
  pivot_wider(names_from = "code",values_from = "data")

# Second, we impute missing values.
p_dec_filled_wide_tbl <- p_dec_wide_tbl  %>% 
  impute_rlm(P38471 ~ P38462 + P38464) %>% # Imputing precipitation at station 38471 using a robust linear regression model
  impute_rlm(P38462 ~ P38471 + P38464) %>% # Imputing precipitation at station 38462 using a robust linear regression model
  impute_rlm(P38464 ~ P38462 + P38471) # Imputing precipitation at station 38464 using a robust linear regression model

p_dec_filled_long_tbl <- p_dec_filled_wide_tbl %>% pivot_longer(c('P38462','P38464','P38471')) 

p_dec_filled_long_tbl%>% plot_time_series(date,value,
                                          .facet_vars  = name,
                                          .interactive = TRUE,
                                          .smooth      = FALSE,
                                          .title       = '',
                                          .y_lab       = "mm/decade",
                                          .x_lab       = "year"
                                          )

(#fig:EMrawData_P_rlm)Precipitation Data gap filled with a robust linear regression modeling approach

As you can see, we use simple linear regression models to impute missing value in the target time series using observations from the neighboring stations.

Through simple visual inspection, it becomes clear that this type of regression model for gap filling is better suited than the previous approach chosen. Let us check whether we could successfully fill all gaps with this robust linear regression approach.

p_dec_filled_long_tbl %>% group_by(name) %>% summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 3 x 3
##   name    n.na n.na.perc
##   <chr>  <int>     <dbl>
## 1 P38462    12     0.402
## 2 P38464    12     0.402
## 3 P38471     3     0.100

It turns out that we still have very few gaps to deal with. We can see them by simply visualizing the wide tibble. The problem persisted at times when two or more values were missing across the available stations at the same time and where thus the linear regression could not be carried out.

p_dec_filled_wide_tbl %>% head(10)
## # A tibble: 10 x 4
##    date       P38462 P38464 P38471
##    <date>      <dbl>  <dbl>  <dbl>
##  1 1933-01-10     NA   NA        2
##  2 1933-01-20     NA   NA       10
##  3 1933-01-31     NA   NA        5
##  4 1933-02-10     NA   NA       33
##  5 1933-02-20     NA   NA        8
##  6 1933-02-28     NA   NA       10
##  7 1933-03-10     NA   NA       31
##  8 1933-03-20     NA   NA       50
##  9 1933-03-31     NA   NA        6
## 10 1933-04-10     23   21.3     13
p_dec_filled_wide_tbl %>% tail()
## # A tibble: 6 x 4
##   date       P38462 P38464 P38471
##   <date>      <dbl>  <dbl>  <dbl>
## 1 2015-11-10     72     81     19
## 2 2015-11-20    122     76     43
## 3 2015-11-30      7      2      3
## 4 2015-12-10     NA     NA     NA
## 5 2015-12-20     NA     NA     NA
## 6 2015-12-31     NA     NA     NA

We can solve the issues related to the missing values at the start of the observation record by using the same technique as above and by only regressing P38462 and P38464 on P38471.

p_dec_filled_wide_tbl <- p_dec_filled_wide_tbl  %>% 
  impute_rlm(P38462 ~ P38471) %>% # Imputing precipitation at station 38462 using a robust linear regression model
  impute_rlm(P38464 ~ P38471) # Imputing precipitation at station 38464 using a robust linear regression model
p_dec_filled_wide_tbl %>% head(10)
## # A tibble: 10 x 4
##    date       P38462 P38464 P38471
##    <date>      <dbl>  <dbl>  <dbl>
##  1 1933-01-10   5.60   5.08      2
##  2 1933-01-20  18.3   16.7      10
##  3 1933-01-31  10.4    9.46      5
##  4 1933-02-10  54.9   50.3      33
##  5 1933-02-20  15.2   13.8       8
##  6 1933-02-28  18.3   16.7      10
##  7 1933-03-10  51.8   47.3      31
##  8 1933-03-20  82.0   75.0      50
##  9 1933-03-31  12.0   10.9       6
## 10 1933-04-10  23     21.3      13

Converse to this, the complete set of observations is missing for December 2015. We will thus remove these non-observations from our tibble.

p_dec_filled_wide_tbl <- p_dec_filled_wide_tbl %>% na.omit()
p_dec_filled_wide_tbl %>% tail()
## # A tibble: 6 x 4
##   date       P38462 P38464 P38471
##   <date>      <dbl>  <dbl>  <dbl>
## 1 2015-10-10      5      1      0
## 2 2015-10-20     89    108     58
## 3 2015-10-31     34     40     12
## 4 2015-11-10     72     81     19
## 5 2015-11-20    122     76     43
## 6 2015-11-30      7      2      3
p_dec_filled_long_tbl <-  p_dec_filled_wide_tbl %>% pivot_longer(-date)

Inspecting the temperature data, we see similar data issues as in the precipitation data set.

t_dec_tbl <- data %>% filter(type=="T") 
t_dec_tbl %>% plot_time_series(date,data,
                               .facet_vars  = code,
                               .interactive = TRUE,
                               .smooth      = FALSE,
                               .title       = '',
                               .y_lab       = "deg. Celsius",
                               .x_lab       = "year"
                               )

(#fig:EMrawData_T)Raw temperature data from the meteorological stations Pskem (38462) and Chatkal (38471)

# First, we bring the data into the suitable format. 
t_dec_wide_tbl <- t_dec_tbl %>% 
  mutate(code = paste0('T',code %>% as.character())) %>% 
  dplyr::select(date,data,code) %>% 
  pivot_wider(names_from = "code",values_from = "data")

# Second, we impute missing values.
t_dec_filled_wide_tbl <- t_dec_wide_tbl  %>% 
  impute_rlm(T38471 ~ T38462) %>% # Imputing precipitation at station 38471 using a robust linear regression model
  impute_rlm(T38462 ~ T38471) # Imputing precipitation at station 38462 using a robust linear regression model

t_dec_filled_long_tbl <- t_dec_filled_wide_tbl %>% 
  pivot_longer(c('T38462','T38471')) 

t_dec_filled_long_tbl%>% 
  plot_time_series(date,value,
                   .facet_vars  = name,
                   .interactive = TRUE,
                   .smooth      = FALSE,
                   .title       = '',
                   .y_lab       = "deg. Celsius",
                   .x_lab       = "year"
                   )

(#fig:EMrawData_T_rlm)Temperature data gap filled with robust linear regression modeling.

There are some irregularities in the temperature time series of Chatkal Meteorological Station in the first decade of the 20th century (tip: zoom in to see these more clearly). Note that these were not introduced by the gap filling technique that we used but are most likely wrong temperature readings. We will return to these in the outlier analysis below in Section 7.3.4.

t_dec_filled_long_tbl %>% 
  group_by(name) %>% 
  summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 2 x 3
##   name    n.na n.na.perc
##   <chr>  <int>     <dbl>
## 1 T38462     3     0.100
## 2 T38471     3     0.100

To see where the missing value are, we find them easily again by looking at the head and tail of the tibble.

t_dec_filled_wide_tbl %>% head()
## # A tibble: 6 x 3
##   date       T38462 T38471
##   <date>      <dbl>  <dbl>
## 1 1933-01-10   -6.9  -16.6
## 2 1933-01-20   -6.1  -15.5
## 3 1933-01-31   -6.3  -15.6
## 4 1933-02-10   -2     -8.6
## 5 1933-02-20   -3.3  -12.5
## 6 1933-02-28   -0.1   -8.5
t_dec_filled_wide_tbl %>% tail()
## # A tibble: 6 x 3
##   date       T38462 T38471
##   <date>      <dbl>  <dbl>
## 1 2015-11-10    2.4   -2.5
## 2 2015-11-20    2     -2.2
## 3 2015-11-30    4.6   -3.7
## 4 2015-12-10   NA     NA  
## 5 2015-12-20   NA     NA  
## 6 2015-12-31   NA     NA

Finally, we remove the non observations again as above with the function na.omit.

t_dec_filled_wide_tbl <- t_dec_filled_wide_tbl %>% na.omit()
t_dec_filled_long_tbl <- t_dec_filled_wide_tbl %>% pivot_longer(-date)

To deal with the missing values at the end of the observational record, we could also have used any other technique. Using the norm values however would have artificially reduced the variance in both cases as explained above. Furthermore and at least in the case of temperature, it is also questionable to what extent a norm calculated over the last 84 years is still representative given global warming. We will look in this important and interesting topic in the next section.

7.3.4 Anomalies and Outliers

We use the function timetk::plot_anomaly_diagnostics to investigate anomalies in the time series. For discharge, we first log-transform the raw data with the following transformation to reduce the variance of the original data.

\[ \hat{q}(t) = log(q(t) + 1) \]

where \(\hat{q}(t)\) denotes the transformed discharge. Prior to the log transformation, 1 is added so as to avoid cases where discharge would be 0 and the logarithmic transform thus undefined. The transformation can easily be done with the log1p() function in R. Backtransformation via the function expm1() simply involves taking the exponent and subtracting 1 from the result. Figure ?? shows the result.

The exceptionally wet year 19169 shows up as anomalous in the Chatkal River Basin and at the downstream Charvak Reservoir inflow gauge.

, ?? and ?? show anomalies diagnostics of the available data.

q_dec_filled_long_tbl %>% 
  plot_anomaly_diagnostics(date,
                           value %>% log1p(),
                           .facet_vars  = name,
                           .frequency = 36,
                           .interactive = TRUE,
                           .title = "")

Figure 7.5: Anomaly diagnostics of discharge data. The transparent grey band shows the width of the normal range. The highly anomalous wet year of 1969 is clearly visible in the discharge record of the Chatkal river basin (Station 16279).

The investigation of precipitation anomalies shows a succession of regular anomalous wet events over time. It is interesting to see that the winter 1968/69 regularly anomalous at all three stations (Figure 7.6, zoom in to investigate).

p_dec_filled_long_tbl %>% 
  plot_anomaly_diagnostics(date,
                           value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")

Figure 7.6: Anomaly diagnostics of precipitation data.

While intuitively, we would have expected an eceptionally mild winter in 1968/69 due to the precipitation excess, the corresponding anomaly does not show up in the temperature record (Figure 7.7).

t_dec_filled_long_tbl %>%  
  plot_anomaly_diagnostics(date,value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")

Figure 7.7: Anomaly diagnostics of temperature data.

Apart from the identification of extremal periods since as the 1969 discharge year in the Chatkal river basin, the diagnostics of anomalies also helps to identify likely erroneous data records. In Figure @ref(anomalies_T) for example, when we zoom into the data of the series T38471 in the first decade of the 21st century, problems in relation to positive anomalies during the winter are visible in 4 instances. One explanation would be that in at least some instances, the data are erroneously recorded as positive values when in fact they were negative (see dates ‘2002-01-31,’ ‘2005-01-10’ and ‘2007-02-28,’ Chatkal Station 38471).

7.3.5 Putting it all together

Finally, we are now in the position to assemble all data that we will use for empirical modeling. The data is stored in long and wide form and used accordingly where required. For example, in Section @ref{TimeSeriesReg}, we are working with the wide data format to investigate model features in linear regression. Note that we also add a column with a decade identifier. Its use will become apparent in the Section 7.5 below.

# Final concatenation
data_wide_tbl <- right_join(q_dec_filled_wide_tbl,p_dec_filled_wide_tbl,by='date')
data_wide_tbl <- right_join(data_wide_tbl,t_dec_filled_wide_tbl,by='date')
# Add period identifiers (decades in this case)
s <- data_wide_tbl$date %>% first()
e <- data_wide_tbl$date %>% last()
decs <- decadeMaker(s,e,'end')
decs <- decs %>% rename(per=dec)
data_wide_tbl <- data_wide_tbl %>% left_join(decs,by='date')
# Creating long form
data_long_tbl <- data_wide_tbl %>% 
  pivot_longer(-date)
# Cross checking completeness of record
data_long_tbl %>% 
  group_by(name) %>% 
  summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 9 x 3
##   name    n.na n.na.perc
##   <chr>  <int>     <dbl>
## 1 P38462     0         0
## 2 P38464     0         0
## 3 P38471     0         0
## 4 per        0         0
## 5 Q16279     0         0
## 6 Q16290     0         0
## 7 Q16924     0         0
## 8 T38462     0         0
## 9 T38471     0         0
## Temp storage of data (remove later)
# fPath <- '/Users/tobiassiegfried/Dropbox (hydrosolutions)/1_HSOL_PROJECTS/PROJECTS/SDC/DKU_WRM_COURSE_CA/Course Materials/Handbook/Applied_Hydrological_Modeling_Bookdown/temp/'
# saveRDS(data_wide_tbl,file=paste(fPath,'data_wide_tbl',sep=""))
# saveRDS(data_long_tbl,file=paste(fPath,'data_long_tbl',sep=""))

A consistent data record from 1933 until and including November 2015 is now prepared^Please note that by using left_join above, we have cut off discharge data from the year 1932 since we do not have meteorological data there.^. Let us analyze these data now.


  1. Please note that if you do not have the required package installed locally, you should install it prior to its use with the following command install.packages('simputation')↩︎