Submitted: 01 March 2024
Posted: 01 March 2024
Abstract
Keywords:
1. Introduction
- 1) When the time series of different units are written or stored next to one another, one or more values from a unit A may be erroneously inserted in the space allotted to a contiguous unit B, and vice versa the corresponding values from B are inserted in the space of A. We call this situation inversion of values between units. This type of error is often not detectable by general error detection techniques. Moreover, even if the problem is detected, for example because a value $v_i$ is too high or too low for unit A, general-purpose imputation techniques will probably try to reconstruct the correct value from computations involving unit A only, ignoring that the correct values are already stored in the database, in the space of record B. Several problems may arise if this type of error is not fully recognized.
- 2) Data contain one or more large “jumps” in the values of the time series corresponding to one unit. For example, given a unit A, imagine that the values of one of its variables are 100, 120, 280, 130, 120, 150. The third value is far from the others, so we may suspect a problem. However, if we discover that this variable has high volatility, the situation may be normal after all. We call this situation anomalous jump. In this case, we need to identify a threshold above which a jump should be considered erroneous. This is a very delicate issue, and standard error detection techniques are often insufficient for it.
- 3) A time series is composed of values produced by a data provider (for example an agent or an organization) at every given interval of time (for example, every year). In this case, it may happen that the data provider computes a value $v_t$ for a given time $t$ and later discovers that $v_t$ was incorrect, either because some units should have been added to $v_t$ but were not considered, so $v_t$ should actually be increased by $\delta_t$, or because some units counted in $v_t$ actually belong to the next time interval, so $v_t$ should be decreased by $\delta_t$. If it is too late to modify $v_t$, the data provider often tries to compensate for the error by modifying the next value produced, $v_{t+1}$, providing $v_{t+1} + \delta_t$ in the first case and $v_{t+1} - \delta_t$ in the second. We call this situation recalculation operated by the data provider. Clearly, this type of problem is hardly detectable by general error detection techniques, and again several problems may arise if this type of error is not fully recognized.
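The recalculation mechanism described in the last item can be illustrated with a minimal sketch. The function name `report_with_recalculation` and the "true" series are hypothetical and serve only to show the pattern: the error at time $t$ and its compensation at $t+1$ leave the total over the two periods unchanged, while producing two large consecutive differences of opposite sign.

```python
def report_with_recalculation(true_values, t, delta):
    """Simulate the compensation applied by a data provider (illustrative only).

    delta > 0: the value at time t misses delta units, which are added at t+1.
    delta < 0: the value at time t contains |delta| extra units, removed at t+1.
    """
    reported = list(true_values)
    reported[t] -= delta      # value published with the error
    reported[t + 1] += delta  # next value modified to compensate
    return reported


# Hypothetical correct series for one unit (not taken from the data)
true_values = [163, 167, 157, 158, 160]
reported = report_with_recalculation(true_values, t=2, delta=77)

print(reported)                            # [163, 167, 80, 235, 160]
print(sum(reported) == sum(true_values))   # True: the total is preserved
```

In the reported series the dip and the peak are adjacent and of equal size, which distinguishes this pattern from an isolated anomalous jump.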
2. Materials and Methods
- Inversion of values between units;
- Anomalous jump;
- Recalculation operated by the data provider.
2.1. Inversion Problem
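- H1.a. Denote by $i$ the index of the generic unit (a row in the dataset), with $i = 1, \dots, m = |U|$. Unit $i$ has values of a variable (or attribute) $v$ over several time instants $t = 1, \dots, n = |S|$. Define $\Delta v_i(t,t+1)$ as the difference (delta) between the two values assumed by unit $i$ at two consecutive time instants $t$, $t+1$ for variable $v$. Following the sign convention of the example tables, where $\Delta(1,2) = v_1 - v_2$, that is:
$$\Delta v_i(t,t+1) = v_{it} - v_{i,t+1}$$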
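- H1.b. Compute for each unit $i$ the value $DV_i$, defined as the modulus of the product between the sum of the positive deltas and the sum of the negative deltas:
$$DV_i = \left| \Big( \sum_{t:\, \Delta v_i(t,t+1) > 0} \Delta v_i(t,t+1) \Big) \cdot \Big( \sum_{t:\, \Delta v_i(t,t+1) < 0} \Delta v_i(t,t+1) \Big) \right|$$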
- H1.c. Compute the value $DM_i$ for each unit $i$ as the ratio between $DV_i$ and the arithmetic mean of all the $DV$ values in the entire dataset considered:
$$DM_i = \frac{DV_i}{\frac{1}{m} \sum_{j=1}^{m} DV_j}$$
- H1.d. The numerical values of the above $DM_i$ may still vary greatly. To avoid numerical instability, we compress their scale by taking the cube root, obtaining values called $RQ_i$ that represent the compressed normalized intrinsic variability of the unit:
$$RQ_i = \sqrt[3]{DM_i}$$
- H1.e. Compute the value $GM_i$ as the geometric mean of the moduli of all the deltas of unit $i$. This value represents an evaluation of the size of the unit. If some of the deltas are zero, they can again be replaced with 1 to avoid the whole product collapsing to zero when this is not acceptable.
$$GM_i = \Big( \prod_{t=1}^{n-1} |\Delta v_i(t,t+1)| \Big)^{1/(n-1)}$$
- H1.f. Now, to compute a reasonable upper limit on the delta values that unit $i$ could attain, we multiply the compressed normalized intrinsic variability by the measure of the size of the unit, obtaining the following threshold $T_i$:
$$T_i = GM_i \cdot RQ_i$$
- H1.g. Finally, to recognize an inversion of a value between two consecutive units A and B, condition H1 requires four conditions to hold at the same time: unit A has two consecutive deltas larger (in modulus) than the threshold $T_A$ and with opposite signs (w.l.o.g., the first is positive and the second is negative), and unit B has, for the same time instants, two consecutive deltas larger (in modulus) than the threshold $T_B$ but with signs reversed with respect to A (the first is negative and the second is positive). In practice, condition H1 is given by the following boolean expression:
$$
\begin{aligned}
H1(A,B)_t:\ \big\{ & \left[ (\Delta v_A(t-1,t) > 0 \wedge \Delta v_A(t,t+1) < 0) \wedge (\Delta v_B(t-1,t) < 0 \wedge \Delta v_B(t,t+1) > 0) \right] \ \vee \\
& \left[ (\Delta v_A(t-1,t) < 0 \wedge \Delta v_A(t,t+1) > 0) \wedge (\Delta v_B(t-1,t) > 0 \wedge \Delta v_B(t,t+1) < 0) \right] \big\} \ \wedge \\
& \left( |\Delta v_A(t-1,t)| > T_A \wedge |\Delta v_A(t,t+1)| > T_A \wedge |\Delta v_B(t-1,t)| > T_B \wedge |\Delta v_B(t,t+1)| > T_B \right)
\end{aligned}
$$
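Steps H1.a to H1.g can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the deltas follow the sign convention of the example tables ($v_t - v_{t+1}$), zero deltas are replaced with 1 in the geometric mean, and the "dataset" consists only of the two rows labelled Unit 1 and Unit 2 in the example tables, so the mean of the DV values is taken over two units.

```python
import math


def deltas(values):
    """H1.a: consecutive differences v(t) - v(t+1)."""
    return [a - b for a, b in zip(values, values[1:])]


def dv(d):
    """H1.b: modulus of (sum of positive deltas) * (sum of negative deltas)."""
    pos = sum(x for x in d if x > 0)
    neg = sum(x for x in d if x < 0)
    return abs(pos * neg)


def geometric_mean_abs(d):
    """H1.e: geometric mean of the moduli of the deltas (zeros replaced by 1)."""
    mags = [abs(x) if x != 0 else 1 for x in d]
    return math.exp(sum(math.log(x) for x in mags) / len(mags))


def thresholds(series):
    """H1.c, H1.d, H1.f: threshold T_i = GM_i * cube root of DM_i for every unit."""
    all_d = [deltas(v) for v in series]
    dvs = [dv(d) for d in all_d]
    mean_dv = sum(dvs) / len(dvs)
    result = []
    for d, x in zip(all_d, dvs):
        # Assumption: if every DV is zero, RQ is set to 0 to avoid dividing by zero
        rq = (x / mean_dv) ** (1 / 3) if mean_dv > 0 else 0.0
        result.append(geometric_mean_abs(d) * rq)
    return result


def h1(d_a, d_b, t_a, t_b, k):
    """H1.g evaluated on the delta pair (k, k+1), i.e. around value index k+1."""
    a1, a2, b1, b2 = d_a[k], d_a[k + 1], d_b[k], d_b[k + 1]
    opposite = (a1 > 0 and a2 < 0 and b1 < 0 and b2 > 0) or (
        a1 < 0 and a2 > 0 and b1 > 0 and b2 < 0
    )
    large = abs(a1) > t_a and abs(a2) > t_a and abs(b1) > t_b and abs(b2) > t_b
    return opposite and large


unit_1 = [125, 18, 130, 120, 130]
unit_2 = [21, 150, 30, 25, 20]
t_1, t_2 = thresholds([unit_1, unit_2])
d_1, d_2 = deltas(unit_1), deltas(unit_2)
flags = [h1(d_1, d_2, t_1, t_2, k) for k in range(len(d_1) - 1)]
print(flags)  # [True, False, False]
```

Only the first pair of deltas is flagged, which corresponds to the second value of the two series (18 and 150).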
- H2.a. For each unit $i$, we define $I_{it}$ as the distance of the value $v_{it}$ at time $t$ from the mean value of $v$ over time, computed without the value at time $t$:
$$I_{it} = v_{it} - \frac{\sum_{k \in S \setminus \{t\}} v_{ik}}{n-1}$$
- H2.b. We now define $N_{it}$ as the distance of the value $v_{it}$ at time $t$ from the mean value of $v$ over time without the value at time $t$, but this time taking the values of the subsequent unit $i+1$ (the one with which the values could have been exchanged):
$$N_{it} = v_{it} - \frac{\sum_{k \in S \setminus \{t\}} v_{i+1,k}}{n-1}$$
- H2.c. Finally, we define $F_{it}$ as the minimum between the moduli of the two values above:
$$F_{it} = \min(|I_{it}|, |N_{it}|)$$
In practice, we are comparing the distance between the value $v_{it}$ and all the other values of unit $i$ with the distance between $v_{it}$ and all the other values of unit $i+1$. If $v_{it}$ is closer to the values of unit $i+1$, that is, if the minimum is $|N_{it}|$, then an inversion is probable.
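The quantities of steps H2.a to H2.c can be computed as in the following sketch, again using the rows labelled Unit 1 and Unit 2 in the example tables. The function names and the returned tuple are assumptions made for illustration; time indices are 0-based.

```python
def mean_without(values, t):
    """Mean of a series over time, leaving out the value at index t."""
    others = values[:t] + values[t + 1:]
    return sum(others) / len(others)


def h2(unit, next_unit, t):
    """Return (I, N, F, closer_to_next) for the value of `unit` at index t."""
    i_val = unit[t] - mean_without(unit, t)       # H2.a: distance from own unit
    n_val = unit[t] - mean_without(next_unit, t)  # H2.b: distance from unit i+1
    f_val = min(abs(i_val), abs(n_val))           # H2.c
    return i_val, n_val, f_val, abs(n_val) < abs(i_val)


unit_1 = [125, 18, 130, 120, 130]
unit_2 = [21, 150, 30, 25, 20]

print(h2(unit_1, unit_2, 1))  # (-108.25, -6.0, 6.0, True)
print(h2(unit_1, unit_2, 0))  # (25.5, 68.75, 25.5, False)
```

For the second value of Unit 1 (18) the minimum is $|N_{it}|$, so the value is closer to the other values of Unit 2 and an inversion is probable; for the first value (125) the minimum is $|I_{it}|$.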
2.2. Anomalous Jump Problem
- Calculate the value $LGM_i$ as the natural logarithm of the $GM_i$ value presented in Section 2.1, that is, $LGM_i = \ln(GM_i)$. This logarithm represents a compressed measure of the size of the unit.
- Compute $VI_i$ as the ceiling (the smallest integer not lower than its argument) of the value $LGM_i$ plus a constant $c$, which represents another element of customization of the procedure. The constant can be determined either by a priori reasoning or derived from the data itself.
$$VI_i = \lceil LGM_i + c \rceil$$
- Compute $GMT_i$ as the sum of $GM_i$ and $T_i$, that is, $GMT_i = GM_i + T_i$. In practice, we are summing size and threshold for unit $i$, obtaining a version of the threshold adjusted by the size of the unit.
- Finally, identify the threshold with tolerance $TT_i$ as the larger of the two size-derived values described above. This is used as an upper bound on the reasonable jumps observed in the values of the unit.
$$TT_i = \max(VI_i, GMT_i)$$
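The four steps above can be sketched as follows, using the row labelled Unit 3 in the example tables. Several choices here are assumptions made only for illustration: the function names, the value $c = 1$, the use of $T_i = GM_i$ (which is what the formulas of Section 2.1 give when the dataset contains this single unit, since then $DM_i = 1$), and the rule that a delta is flagged when its modulus exceeds $TT_i$.

```python
import math


def deltas(values):
    """Consecutive differences v(t) - v(t+1), as in the example tables."""
    return [a - b for a, b in zip(values, values[1:])]


def geometric_mean_abs(d):
    """GM_i: geometric mean of the moduli of the deltas (zeros replaced by 1)."""
    mags = [abs(x) if x != 0 else 1 for x in d]
    return math.exp(sum(math.log(x) for x in mags) / len(mags))


def tolerance_threshold(gm, t, c):
    """TT_i = max(ceil(ln(GM_i) + c), GM_i + T_i)."""
    vi = math.ceil(math.log(gm) + c)  # VI_i, from the compressed size LGM_i
    gmt = gm + t                      # GMT_i, size plus threshold
    return max(vi, gmt)


unit_3 = [200, 220, 500, 210, 230]
d_3 = deltas(unit_3)              # [-20, -280, 290, -20]
gm_3 = geometric_mean_abs(d_3)    # about 75.5
t_3 = gm_3                        # assumption: single-unit dataset, so RQ = 1
tt_3 = tolerance_threshold(gm_3, t_3, c=1)

# Assumed flagging rule: a delta is anomalous if its modulus exceeds TT_i
jumps = [k for k, x in enumerate(d_3) if abs(x) > tt_3]
print(jumps)  # [1, 2]
```

The two flagged deltas are those entering and leaving the third value (500), while the deltas of size 20 remain below the tolerance.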
2.3. Recalculation Problem
2.4. Data
- i) the number of academic staff whose primary assignment is instruction, research or public service,
- ii) staff who hold an academic rank, like professor, assistant professor, lecturer or an equivalent title,
- iii) staff with other titles (like dean, head of department, etc.) if their principal activity is instruction or research, and
- iv) PhD students employed for teaching assistance or research.
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Unit | v1 | v2 | v3 | v4 | v5 | Δ (1,2) | Δ (2,3) | Δ (3,4) | Δ (4,5) |
|---|---|---|---|---|---|---|---|---|---|
| Unit 1 | 125 | 18 | 130 | 120 | 130 | 107 | -112 | 10 | -10 |
| Unit 2 | 21 | 150 | 30 | 25 | 20 | -129 | 120 | 5 | 5 |

| Unit | v1 | v2 | v3 | v4 | v5 | Δ (1,2) | Δ (2,3) | Δ (3,4) | Δ (4,5) |
|---|---|---|---|---|---|---|---|---|---|
| Unit 3 | 200 | 220 | 500 | 210 | 230 | -20 | -280 | 290 | -20 |

| Unit | v1 | v2 | v3 | v4 | v5 | Δ (1,2) | Δ (2,3) | Δ (3,4) | Δ (4,5) |
|---|---|---|---|---|---|---|---|---|---|
| Unit 4 | 163 | 167 | 80 | 235 | 160 | -4 | 87 | -155 | 75 |

| Country | HEIs available in ETER |
|---|---|
| Italy | 219 |
| Germany | 424 |
| Spain | 84 |
| France | 417 |
| Poland | 314 |
| Portugal | 129 |

| Country | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Italy | 115 | 115 | 115 | 114 | 114 | 114 | 114 | 114 | 114 | 114 | 1143 |
| Germany | 365 | 378 | 383 | 385 | 385 | 383 | 400 | 400 | 396 | 399 | 3874 |
| Spain | 77 | 80 | 80 | 81 | 81 | 80 | 82 | 83 | 83 | 84 | 811 |
| France | 131 | 132 | 130 | 129 | 126 | 0 | 123 | 123 | 119 | 111 | 1124 |
| Poland | 0 | 0 | 0 | 0 | 0 | 0 | 247 | 243 | 241 | 237 | 968 |
| Portugal | 113 | 106 | 94 | 91 | 90 | 95 | 90 | 90 | 89 | 92 | 950 |

| Country | # of H1 flags | # of H2 flags | # of inversion flags | # of jump flags | # of recalculation flags |
|---|---|---|---|---|---|
| Italy | 159 (0.14) | 287 (0.25) | 40 (0.03) | 396 (0.35) | 58 (0.05) |
| Germany | 314 (0.08) | 398 (0.10) | 34 (0.01) | 1059 (0.27) | 32 (0.01) |
| Spain | 24 (0.03) | 81 (0.10) | 4 (0.005) | 249 (0.31) | 21 (0.03) |
| France | 18 (0.02) | 20 (0.02) | 1 (0.00) | 160 (0.14) | 5 (0.004) |
| Poland | 79 (0.08) | 71 (0.07) | 12 (0.01) | 9 (0.01) | 18 (0.02) |
| Portugal | 50 (0.05) | 131 (0.14) | 7 (0.01) | 236 (0.25) | 32 (0.03) |

| Country | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Italy | 5 | 1 | 0 | 0 | 2 | 3 | 4 | 4 | 5 | 16 | 40 |
| Germany | 9 | 1 | 1 | 0 | 2 | 2 | 1 | 3 | 2 | 13 | 34 |
| Spain | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| France | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Poland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 2 | 1 | 12 |
| Portugal | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 1 | 7 |

| Country | Δ2011-2012 | Δ2012-2013 | Δ2013-2014 | Δ2014-2015 | Δ2015-2016 | Δ2016-2017 | Δ2017-2018 | Δ2018-2019 | Δ2019-2020 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Italy | 42 | 52 | 53 | 42 | 39 | 40 | 38 | 42 | 48 | 396 |
| Germany | 132 | 144 | 106 | 111 | 115 | 110 | 112 | 101 | 128 | 1059 |
| Spain | 26 | 21 | 15 | 58 | 31 | 30 | 19 | 26 | 23 | 249 |
| France | 102 | 5 | 6 | 16 | 0 | 0 | 12 | 10 | 9 | 160 |
| Poland | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
| Portugal | 31 | 28 | 25 | 30 | 34 | 17 | 20 | 32 | 19 | 236 |

| Country | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Italy | N.A. | 5 | 9 | 13 | 6 | 5 | 5 | 9 | 6 | N.A. | 58 |
| Germany | N.A. | 0 | 2 | 0 | 6 | 8 | 3 | 6 | 7 | N.A. | 32 |
| Spain | N.A. | 3 | 0 | 3 | 4 | 4 | 2 | 2 | 3 | N.A. | 21 |
| France | N.A. | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | N.A. | 5 |
| Poland | N.A. | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | N.A. | 18 |
| Portugal | N.A. | 0 | 2 | 0 | 6 | 8 | 3 | 6 | 7 | N.A. | 32 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
