In this section, we will explain the proposed ASISO method in detail. ASISO introduces two algorithms: K-Space and K-Match. To clearly illustrate the proposed method, we will first provide an overview of ASISO and then introduce the specific algorithms involved.
3.1. Overview
The ASISO is mainly based on the linear interpolation to increase the size and improve the quality of the original dataset. The idea is to divide the original feature space into several subspaces with an equal number of samples, and then perform linear interpolation for the samples in the adjacent subspaces. This method requires two hyperparameters (
k and
) in advance. Interpretation of parameter
k is the number of samples existing in each feature subspace, while
is the number of equidistant nodes interpolated per unit distance in the linear interpolation of the samples. This proposed method is illustrated in
Figure 1.
The data set
has been given and is assumed to be contaminated with unknown noise, where
and
. Assuming
where
is the actual value of
,
is the actual value of
and
is a continuous function, represents the relationship in reality between
and
y. Consider the model:
where
is the noise in
,
is the noise in
, and
represents the error term. The expression (1) can be rewritten as:
Let , we have for , and call the sample minimum point.
Given the hyperparameter
k, we provide an unsupervised clustering method called K-Space. As it is shown in
Figure 1(b), the space can be partitioned into
subspaces, each containing
k samples. i.e.,
,
and
. The datasets corresponding to different subspaces
,
. For two adjacent subspaces, since
is a continuous function, we assume that it can be approximated as a linear function
, and then (2) can be transformed into:
where
is the linear fitting error term. When the distance between two adjacent subspaces approaches zero and the measurements of subspaces tend to zero, obtain
. Next, we will perform sample interpolation between adjacent subspaces.
We need to calculate the centers of each cluster, as follow:
To make
, we need to ensure that the interpolation is performed between clusters that are close in distance as much as possible. Among
, we define
whose cluster center has the minimum distance to the sample minimum point
, and define
whose cluster center has the minimum distance to the center of
and
.
where
is the center of
. Perform interpolation in
sequentially according to the order of
d values, and interpolate only between adjacent subspaces (i.e., interpolate between
and
, between
and
, and so on).
When performing linear interpolation between adjacent subspaces, we should pair the k samples from the first subspace with an equal number of samples from the second subspace. The interpolation rules between adjacent subspaces are as follows:
1. Linear interpolation can only be performed between two samples belonging to different adjacent subspace sets.
2. Interpolation must be performed for each sample.
3. Participation of each sample point is restricted to a single interpolation instance.
The number of matching schemes is
. As it is shown in
Figure 1(c), we provide a matching method called K-Match. Suppose
, this method can select a good-performing matching scheme
from
.
Assuming
and
y are continuous variables. Given another hyperparameter
, the number of samples inserted using linear interpolation method between
and
is
. Taking
and
as example,
is the set of inserted samples, the linear interpolation formula is defined as:
After ASISO processing, the original dataset will be optimized. The main steps of the ASISO algorithm are summarized in Algorithm 1.
Algorithm 1: ASISO |
Input: Data set ; hyperparameters k and
Output: Optimized data set
|
The assumptions of ASISO are as follows:
1. is a continuous function.
2. the linear fitting error .
3. and y are continuous variables.
3.2. K-Space
The implementation of ASISO requires an unsupervised clustering method to partition the feature space into multiple subspaces, each containing k samples. Based on this, we propose the K-Space clustering method. The clustering method has the following performance:
1.Each subspace contains an equal number of samples, i.e., ;
2.Each sample belongs to only one subset, i.e., .
Maintaining continuity and similarity between adjacent subspaces is essential for synthesizing data via multiple linear interpolations in ASISO. Our objective is to minimize the linear fitting error , which helps to satisfy ASISO assumption 2 as much as possible.
To determine the sample set
for subspace
, it is necessary to determine the first sample
in
.
where
,
is the cluster center of
,
. We define
and determine
as follows:
where
. Obtain
and update
.
The main steps of the K-Space algorithm are summarized in Algorithm 2.
Algorithm 2: K-Space |
Input: Data set ; hyperparameter k
Output:
|
3.3. K-Match
We can calculate the total error of the matching scheme to measure the quality of the scheme, for the sake of simplicity, let
, as follow:
where
is the linear expression passing through the points
and
.
Theorem 1. Let and be two adjacent subspaces, the datasets corresponding to different subspaces are , and . Consider the model , let . For , suppose that , then .
Proof of Theorem 1. Since
, and according to (3), the model can be transformed into:
where
is a linear function. According to (11), it follows that:
When
, let
be the intersection point between
and
. We can simplify
using basic geometric area calculations, and according to Law of Iterated Expectations (LIE):
where
. Since
, and
, it follows that
.
□
If our approach is to randomly select a matching scheme, the validity of this method can be proved by Theorem 1. However, randomly selecting a matching scheme does not guarantee the uniqueness of the results, and it also does not guarantee that we will necessarily select the good-performing matching scheme. We found that for and , if , there will be a better interpolation effect.
Theorem 2. Let . Suppose that , then we have .
Proof of Theorem 2. Since
, based on the proof of theorem 1, we have:
□
According to Theorem 2, we can match samples with opposite signs of to achieve a good data synthesis effect. Therefore, the core idea of K-match is to make a judgment on the positive or negative sign of for each sample, and then interpolate the samples with opposite signs as much as possible.
In K-Match, we need to choose an appropriate linear regression method to fit the dataset
based on the performance of the noise. For example, Lasso regression, Locally Weighted Linear Regression (LWLR) [
23], and other methods can be used [
24,
25]. In our experiments, we use OLS or SVR method to fit and obtain
. Specially, the kernel function is Linear in SVR. According to (3), and suppose the linear fitting error
, for dataset
, we have:
Then, we sort the samples in dataset
in ascending order according to the value of
, and obtain
; we sort the samples in dataset
in descending order and obtain
. As it is shown in
Figure 1(d), combine the sorted datasets
and
into the matching scheme
.
Algorithm 3: K-Match |
Input: Subset
Output: Matching scheme
1 Fitting the dataset and obtain
2 Obtain using (10)
3 Sorting the samples in and according to the value of , obtain and
4 Combine and into
|
3.4. Supplements
The proposed method can effectively expand the size of dataset and adjust the dataset structure, reducing the proportion of samples that deviate significantly from the actual distribution, thereby improving the model generalization. Refer to
Figure 2.
The supplements about ASISO are given below:
1. The choice of the hyperparameter k is crucial as different datasets require different values of k. Conversely, hyperparameter tends to exhibit better performance as its value increases, which will be illustrated in the following experimental results.
2. It is necessary to normalize the data if there is a significant difference in the dimensional scale between the features of the data. It avoids the issue of generating an excessive number of samples.
3. In most cases,
is not an integer, and for the excess samples, we usually have two solutions of handling them. The first one is to use the LOF algorithm [
26] to filter out the excess samples that will not participate in the ASISO, as it is shown in
Figure 2(c). Another solution is to treat the excess samples as a dataset
of a subspace,
.When interpolating between
and other subspaces
, choose an appropriate linear regression method to fit the dataset
, and obtain
. Then, use the same method to sort
and
. Only
interpolations are performed, with each sample in
being interpolated, while for
, only
samples are interpolated. Moreover, interpolate the samples with opposite signs of
as much as possible, as it is shown in
Figure 3.