Missing data
To explain how the package handles missing data given different options, it is easiest to work by example.
Let's say we have the following time series and NA represents a missing value:
\(t\) | \(a_t\) |
---|---|
1.0 | 11 |
2.5 | 12 |
3.0 | NA |
4.5 | 14 |
5.0 | 15 |
6.0 | 16 |
Let's also fix \(E = 2\), \(\tau = 1\) and \(p = 1\) for these examples.
Here we have one obviously missing value for \(a\) at time 3. However, there are also some hidden missing values.
By default, the package will assume that your data was measured at a regular time interval and will insert missing values as necessary to create a regular grid.
For example, the above time series will be treated as if it were sampled every half time unit. So, when creating the \(E=2\) manifold, it will treat the data as if it were:
\(t\) | \(a_t\) |
---|---|
1.0 | 11 |
1.5 | NA |
2.0 | NA |
2.5 | 12 |
3.0 | NA |
3.5 | NA |
4.0 | NA |
4.5 | 14 |
5.0 | 15 |
5.5 | NA |
6.0 | 16 |
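The padding step can be sketched in a few lines of plain Python. This is only an illustration, not the package's implementation: the helper name `pad_to_regular_grid` is hypothetical, and the half-unit spacing is passed in by hand rather than inferred from the data.

```python
import math

NA = math.nan  # stand-in for a missing value


def pad_to_regular_grid(times, values, step):
    """Place observations on a grid with spacing `step`, filling gaps with NA."""
    # Use integer grid indices so that no floating-point error accumulates.
    idx = [round((t - times[0]) / step) for t in times]
    padded = [NA] * (idx[-1] + 1)
    for i, v in zip(idx, values):
        padded[i] = v
    grid_times = [times[0] + k * step for k in range(len(padded))]
    return grid_times, padded


times = [1.0, 2.5, 3.0, 4.5, 5.0, 6.0]
values = [11, 12, NA, 14, 15, 16]

# step=0.5 is supplied manually here (an assumption of this sketch).
grid_times, padded = pad_to_regular_grid(times, values, step=0.5)
for t, a in zip(grid_times, padded):
    print(t, "NA" if math.isnan(a) else a)
```

The six supplied rows become eleven grid rows, seven of which are missing.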
The manifold of \(a\) and its projections \(b\) will have missing values in them:
We can see that the original missing value, combined with some slightly irregular sampling, created a reconstructed manifold that is mostly missing values!
By default, the points which contain missing values will not be added to the library or prediction sets.
For example, if we let the library and prediction sets be as big as possible then we will have:
Here we see that the library set is totally empty! This is because for a point to be in the library (with default options) it must be fully observed and the corresponding \(b\) projection must also be observed. Similarly, the prediction set is almost empty because (with default options) every point in it must be fully observed.
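The default rule can be checked with a small sketch. It makes several assumptions that the text does not spell out: each manifold point is taken to be \((a_t, a_{t-\tau})\) with the lag counted in grid steps, the projection is \(b_t = a_{t+p}\), and values that fall outside the series are treated as missing. None of the names below come from the package.

```python
import math

NA = math.nan
E, TAU, P = 2, 1, 1

# The series after padding to the half-unit grid (t = 1.0, 1.5, ..., 6.0).
grid_times = [1.0 + 0.5 * k for k in range(11)]
a = [11, NA, NA, 12, NA, NA, NA, 14, 15, NA, 16]


def at(series, i):
    """Value at index i; out-of-range indices count as missing (assumption)."""
    return series[i] if 0 <= i < len(series) else NA


def observed(x):
    return not math.isnan(x)


library, prediction = [], []
for i, t in enumerate(grid_times):
    point = [at(a, i - j * TAU) for j in range(E)]  # (a_t, a_{t-tau})
    target = at(a, i + P)                           # b_t = a_{t+p}
    if all(observed(x) for x in point):             # fully observed point
        prediction.append(t)
        if observed(target):                        # library also needs b_t
            library.append(t)

print("library:", library)
print("prediction:", prediction)
```

Under these assumptions the library comes out empty, and only the point at \(t = 5.0\) is fully observed, so it alone reaches the prediction set.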
The allowmissing flag
If we set the allowmissing option, then a point is included in the manifold even with some missing values.
The only caveats to this rule are:
- points which are totally missing will always be discarded,
- we can't have missing targets for points in the library set.
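The inclusion rule and its two caveats can be sketched as follows. As before, this is a hedged illustration rather than the package's code: the point layout \((a_t, a_{t-\tau})\), the projection \(b_t = a_{t+p}\), the treatment of out-of-range values as missing, and the helper `split_sets` are all assumptions of the sketch.

```python
import math

NA = math.nan
E, TAU, P = 2, 1, 1

# The series after padding to the half-unit grid (t = 1.0, 1.5, ..., 6.0).
grid_times = [1.0 + 0.5 * k for k in range(11)]
a = [11, NA, NA, 12, NA, NA, NA, 14, 15, NA, 16]


def at(series, i):
    """Value at index i; out-of-range indices count as missing (assumption)."""
    return series[i] if 0 <= i < len(series) else NA


def observed(x):
    return not math.isnan(x)


def split_sets(allow_missing):
    """Return the largest library and prediction sets as lists of times."""
    library, prediction = [], []
    for i, t in enumerate(grid_times):
        point = [at(a, i - j * TAU) for j in range(E)]
        target = at(a, i + P)
        n_obs = sum(observed(x) for x in point)
        # Totally missing points are always dropped; partially missing
        # points survive only when allow_missing is set.
        keep = n_obs > 0 if allow_missing else n_obs == E
        if not keep:
            continue
        prediction.append(t)
        if observed(target):  # a library point must have an observed target
            library.append(t)
    return library, prediction


library, prediction = split_sets(allow_missing=True)
print("library:", library)
print("prediction:", prediction)
```

Calling `split_sets(allow_missing=False)` reproduces the default behaviour, which makes it easy to compare the two settings side by side.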
The largest possible library and prediction sets with allowmissing in this example would be:
This discussion implicitly assumes that algorithm is set to the simplex algorithm.
When the S-map algorithm is chosen, then we cannot let missing values into the library set \(\mathscr{L}\).
This may change in a future implementation of the S-map algorithm.
The dt flag
When we add dt, we tell the package to remove missing observations and also to add the time between the observations into the manifold.
So, in this example, instead of the observed time series being:
\(t\) | \(a_t\) |
---|---|
1.0 | 11 |
2.5 | 12 |
3.0 | NA |
4.5 | 14 |
5.0 | 15 |
6.0 | 16 |
the dt option acts as if the supplied data were:
\(t\) | \(a_t\) | \(\mathrm{d}t\) |
---|---|---|
1.0 | 11 | 1.5 |
2.5 | 12 | 2.0 |
4.5 | 14 | 0.5 |
5.0 | 15 | 1.0 |
6.0 | 16 | NA |
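The table above can be reproduced with a short sketch: drop the rows with a missing value, then record the gap to the next remaining observation. The variable names are illustrative only, and this is not the package's own code.

```python
import math

NA = math.nan

times = [1.0, 2.5, 3.0, 4.5, 5.0, 6.0]
values = [11, 12, NA, 14, 15, 16]

# Drop the rows whose value is missing.
kept = [(t, a) for t, a in zip(times, values) if not math.isnan(a)]

# dt is the gap to the next remaining observation; the last row has none.
rows = []
for k, (t, a) in enumerate(kept):
    dt = kept[k + 1][0] - t if k + 1 < len(kept) else NA
    rows.append((t, a, dt))

for t, a, dt in rows:
    print(t, a, "NA" if math.isnan(dt) else dt)
```

Note that removing the observation at \(t = 3.0\) is what makes the second gap 2.0 rather than 0.5.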
The resulting manifold and projections are:
The largest possible library and prediction sets with dt in this example would be:
Both allowmissing and dt flags
If we set both flags, we tell the package to allow missing observations and to also add the time between the observations into the manifold.
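One plausible reading of this combination, sketched below, is that every supplied row is kept (including the one with a missing value) and the time gap is measured between consecutive rows. This is an assumption made for illustration; the text does not confirm exactly how the package computes the gaps in this case.

```python
import math

NA = math.nan

times = [1.0, 2.5, 3.0, 4.5, 5.0, 6.0]
values = [11, 12, NA, 14, 15, 16]

# Keep every row, missing or not, and attach the gap to the next row.
# (Assumed behaviour; the last row has no following observation.)
rows = []
for k, (t, a) in enumerate(zip(times, values)):
    dt = times[k + 1] - t if k + 1 < len(times) else NA
    rows.append((t, a, dt))

for t, a, dt in rows:
    print(t,
          "NA" if math.isnan(a) else a,
          "NA" if math.isnan(dt) else dt)
```

Compared with dt alone, the row at \(t = 3.0\) stays in the data, so the gaps around it are 0.5 and 1.5 instead of a single gap of 2.0.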
So our original time series
\(t\) | \(a_t\) |
---|---|
1.0 | 11 |
2.5 | 12 |
3.0 | NA |
4.5 | 14 |
5.0 | 15 |
6.0 | 16 |
will generate the manifold
and the largest possible library and prediction sets would be