1. Introduction
A recently proposed two-sample test allows comparison of the means of multivariate samples with unknown distributions [5]. It uses the distances between the elements of the samples and the centroid of both samples, together with the distances between the elements of the samples and their own centroids. These distances are treated as random variables, and the test compares the distributions of these variables using the two-sample Kolmogorov-Smirnov test.
In this note, we propose a one-sample version of this test and illustrate its application to simulated data and to real-world datasets: the women’s nutrition data [2] and the perspiration data [4].
In the discussion below we use the same notation and terms as in the two-sample test in the paper [5].
2. Problem Formulation
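Let $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m)$ be a $d$-dimensional sample such that each observation $\mathbf{x}_i$ is a random vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, $i = 1, 2, \ldots, m$. The sample $\mathbf{x}$ is represented by the random matrix
$$\mathbf{x} = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{md} \end{pmatrix}.$$
The problem is formulated as follows: given a multivariate sample, it is required to check whether the mean of the population from which the sample was drawn is equal to a given value.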
For a univariate sample drawn from a normally distributed population, the problem is solved by the one-sample $t$-test, and for a multivariate sample with normal distribution it is solved by the one-sample Hotelling $T^2$-test [3]. For data with non-normal but close to normal distributions, an extended one-sample Hotelling $T^2$-test was suggested [1]. However, for arbitrarily distributed data the problem has not yet been solved.
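For reference, the classical one-sample Hotelling statistic mentioned above is usually written as follows. The symbols used here ($m$ for the number of observations, $d$ for the dimension, $\bar{\mathbf{x}}$ for the sample mean, $\mathbf{S}$ for the sample covariance matrix, and $\boldsymbol{\mu}$ for the hypothesized means vector) are notation assumed for this illustration.

```latex
T^2 = m\,(\bar{\mathbf{x}} - \boldsymbol{\mu})^{\top}\,\mathbf{S}^{-1}\,(\bar{\mathbf{x}} - \boldsymbol{\mu}),
\qquad
\frac{m - d}{d\,(m - 1)}\,T^2 \sim F_{d,\,m-d}
\quad \text{under } H_0 \text{ for normally distributed data.}
```

The reference to the $F$ distribution is what ties this test to the normality assumption.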
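Denote the vector of the means by $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_d)$ and the expectation of the sample $\mathbf{x}$ by $E(\mathbf{x})$. Then, according to the formulated problem, the null and alternative hypotheses are
$$H_0: E(\mathbf{x}) = \boldsymbol{\mu}, \qquad H_1: E(\mathbf{x}) \neq \boldsymbol{\mu}.$$
Below, we also assume that the distribution of the sample is continuous and that the expectation is finite.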
3. Suggested Solution
Similar to the two-sample test [5], the proposed test reduces the multivariate data to univariate arrays and then considers these arrays as realizations of certain random variables.
Given the $d$-dimensional random sample $\mathbf{x}$ drawn from the population $X$ with distribution $F$ and finite expectation $E(\mathbf{x})$, two vectors are created: the vector $a$ of the distances between the observations $\mathbf{x}_i$ and the expectation $E(\mathbf{x})$ (estimated in the algorithm below by the sample mean), and the vector $b$ of the distances between the observations $\mathbf{x}_i$ and the means vector $\boldsymbol{\mu}$.
If the expectation $E(\mathbf{x})$ and the vector $\boldsymbol{\mu}$ are equal, then the vectors $a$ and $b$ are statistically equivalent, and vice versa. Hence, to check the hypothesis $H_0$ it is enough to check whether the vectors $a$ and $b$ are statistically equivalent.
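Written out explicitly, the two distance vectors can be sketched as below. This assumes the Euclidean norm and the sample mean $\bar{\mathbf{x}}$ as the estimate of the expectation; the index $i$ and the number of observations $m$ are notation assumed for this illustration.

```latex
\bar{\mathbf{x}} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{x}_i,
\qquad
a_i = \lVert \mathbf{x}_i - \bar{\mathbf{x}} \rVert_2,
\qquad
b_i = \lVert \mathbf{x}_i - \boldsymbol{\mu} \rVert_2,
\qquad i = 1, 2, \ldots, m.
```

When $\boldsymbol{\mu}$ coincides with $\bar{\mathbf{x}}$, the two vectors are identical. The further $\boldsymbol{\mu}$ lies from the bulk of the sample, the more the values $b_i$ shift away from the values $a_i$.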
The algorithm of the test is outlined as follows.
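Algorithm: one-sample test of the difference between the means of a multivariate dataset and a given means vector.
Input: $d$-dimensional sample $\mathbf{x}$ (a random matrix) and means vector $\boldsymbol{\mu}$ (a $d$-dimensional array).
Output: conclusion about the difference between the expectation $E(\mathbf{x})$ and $\boldsymbol{\mu}$.
1. Compute the mean $\bar{\mathbf{x}}$ ($d$-dimensional vector) of the sample $\mathbf{x}$.
2. Compute the distances between the observations of the sample $\mathbf{x}$ and its mean $\bar{\mathbf{x}}$ and combine them into vector $a$.
3. Compute the distances between the observations of the sample $\mathbf{x}$ and the means vector $\boldsymbol{\mu}$ and combine them into vector $b$.
4. Compare the distributions $F_a$ and $F_b$ of the vectors $a$ and $b$.
5. If the equality of $F_a$ and $F_b$ is accepted, then accept $H_0$, else accept $H_1$, end if.
6. Return the accepted hypothesis.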
Similar to the two-sample test [5], in the calculations we use Euclidean distances, and for comparison of the vectors $a$ and $b$ we apply the two-sample Kolmogorov-Smirnov test.
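The procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' MATLAB implementation. The function names, the default `alpha=0.05`, and the asymptotic Kolmogorov formula for the p-value are assumptions made for this sketch, and only the standard library is used.

```python
import math
import random


def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk through the pooled values and track the largest gap
    # between the two empirical distribution functions.
    while i < na and j < nb:
        t = min(a[i], b[j])
        while i < na and a[i] <= t:
            i += 1
        while j < nb and b[j] <= t:
            j += 1
        d = max(d, abs(i / na - j / nb))
    # Asymptotic Kolmogorov distribution (an approximation for finite samples).
    lam = math.sqrt(na * nb / (na + nb)) * d
    if lam < 0.2:
        return d, 1.0  # the series below is numerically equal to 1 here
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


def one_sample_distance_test(sample, mu, alpha=0.05):
    """Compare distances to the sample mean with distances to the vector mu."""
    m, dim = len(sample), len(mu)
    # Step 1: mean (centroid) of the sample.
    centroid = [sum(row[k] for row in sample) / m for k in range(dim)]
    # Steps 2-3: Euclidean distances to the centroid and to mu.
    a = [math.dist(row, centroid) for row in sample]
    b = [math.dist(row, mu) for row in sample]
    # Steps 4-5: two-sample KS comparison of the distance vectors.
    stat, p = ks_two_sample(a, b)
    return {"statistic": stat, "p_value": p, "reject_h0": p < alpha}


# Illustrative data only: 200 bivariate standard normal observations.
rng = random.Random(0)
sample = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
print(one_sample_distance_test(sample, [0.0, 0.0]))
print(one_sample_distance_test(sample, [3.0, 3.0]))
```

A library routine for the two-sample Kolmogorov-Smirnov test could replace `ks_two_sample`; the hand-written version is used here only to keep the sketch free of external dependencies.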
4. Verification of the Method
The suggested method was verified using real-world and simulated data. The algorithm was implemented in MATLAB®. Significance level in the two-sample Kolmogorov-Smirnov tests is .
4.1. Trials on the Simulated Data
In trials on the simulated data, we compared the performance of the one-sample Hotelling $T^2$-test [3] with the performance of the proposed test. The implementation of the Hotelling $T^2$-test was downloaded from the MATLAB Central File Exchange [6].
For verification, we compared samples drawn from a normally distributed population with several values of the standard deviation $\sigma$ against several means vectors $\boldsymbol{\mu}$. Results of the trials are summarized in Table 1.
The obtained results show that for normally distributed samples the suggested test recognizes the differences between the samples as correctly as the Hotelling $T^2$-test, but, as expected, it is less sensitive.
Thus, if it is known that the samples were drawn from populations with normal distributions, then the Hotelling $T^2$-test is preferable, but if the distributions of the populations are not normal or are unknown, then the suggested test can be used.
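A simulation of this kind can be sketched as follows. The sample size, the number of trials, the standard deviations, the means vectors, and `alpha=0.05` are hypothetical values chosen for illustration, not the settings of the reported trials. The test is re-implemented compactly so that the snippet stands alone, and the decision uses the asymptotic Kolmogorov-Smirnov critical value instead of a p-value.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, t) / len(a) - bisect_right(b, t) / len(b))
        for t in a + b
    )


def rejects_h0(sample, mu, alpha):
    """Distance-based one-sample test; returns True when H0 is rejected."""
    m, dim = len(sample), len(mu)
    centroid = [sum(row[k] for row in sample) / m for k in range(dim)]
    a = [math.dist(row, centroid) for row in sample]
    b = [math.dist(row, mu) for row in sample]
    # Asymptotic KS critical value for two samples of size m each.
    critical = math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt(2.0 / m)
    return ks_statistic(a, b) > critical


def rejection_rate(true_mean, mu, sigma, m=100, trials=200, alpha=0.05, seed=0):
    """Share of simulated normal samples for which H0: E(x) = mu is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        sample = [[rng.gauss(c, sigma) for c in true_mean] for _ in range(m)]
        rejected += rejects_h0(sample, mu, alpha)
    return rejected / trials


# Hypothetical grid of standard deviations and tested means vectors.
for sigma in (1.0, 2.0):
    for mu in ((0.0, 0.0), (0.5, 0.5), (2.0, 2.0)):
        rate = rejection_rate((0.0, 0.0), mu, sigma)
        print(f"sigma={sigma}, mu={mu}: rejection rate {rate:.2f}")
```

Reporting a rejection rate over many repetitions, rather than a single decision, makes the sensitivity of the test to the shift and to the standard deviation directly visible.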
4.2. Trials on the Real-World Data
In addition, the algorithm was applied to two known datasets. The first is the women’s nutrition dataset [2], which contains
records. Five nutritional components were measured: calcium, iron, protein, vitamin A, and vitamin C (
).
The question of interest was whether women meet the federal nutritional intake guidelines. To answer this question, we test the null hypothesis
The suggested test correctly rejected the null hypothesis $H_0$ at the significance level $\alpha$ with a $p$-value close to zero. The same result was reported for the one-sample Hotelling $T^2$-test [2].
The second dataset contains perspiration data from healthy females and includes records with parameters: sweat rate, sodium content, and potassium content [4, pp. 214-215].
Null hypothesis
was checked.
As a result, at the significance level $\alpha$ the one-sample Hotelling $T^2$-test accepted the null hypothesis with $p$-value 0.065, and the suggested test accepted the null hypothesis with $p$-value close to .
5. Conclusion
The proposed test compares the mean of a multivariate sample with unknown distribution against a given means vector. In the reported trials it correctly identified statistical equivalence or difference between them.
Since the test uses the Kolmogorov-Smirnov statistic, it is not limited to a particular type of data distribution and is applicable to data with a continuous distribution and finite expectation.
The method was verified on simulated and real-world data and resulted in correct decisions.
Funding
This research has not received any grant or funding.
Data Availability Statement
The data have been obtained from open access repositories; the links appear in the references.
Competing interests
The authors declare no competing interests.
References
1. Bulut H. A robust Hotelling test statistic for one sample case in high dimensional data. Communications in Statistics - Theory and Methods, 2021, 52(13), 4590–4604.
2. Inferences Regarding Multivariate Population Mean. In the course notes Applied Multivariate Statistical Analysis. Eberly College of Science, Pennsylvania State University. The women’s nutrition dataset was downloaded from the page https://online.stat.psu.edu/stat505/lesson/7/7.1/7.1.4 (accessed 7 September 2024).
3. Hotelling H. The generalization of Student’s ratio. Annals of Mathematical Statistics, 1931, 2(3), 360–378.
4. Johnson R.A., Wichern D.W. Applied Multivariate Statistical Analysis. 6th ed. Pearson Education: Upper Saddle River, NJ, 2007.
5. Novoselsky A., Kagan E. A distance based two-sample test of means difference for multivariate datasets. Statistical Papers, 2024, 1–14.
6. Trujillo-Ortiz A. HotellingT2, 2024. MATLAB Central File Exchange, https://www.mathworks.com/matlabcentral/fileexchange/2844-hotellingt2 (accessed 7 September 2024).
Table 1. Results of the one-sample Hotelling $T^2$-test and the suggested test for a bivariate normally distributed sample with different standard deviations.
| Hotelling test | Suggested test |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).