2.1. Inversion Problem
To identify an inversion problem between two units A and B, we evaluate two types of conditions, which we call H1 and H2. The first type (H1) assesses, for each possible pair of units A and B, whether their values may have been systematically exchanged at one or more time instants. To this end, we evaluate the differences (called Δ) between each pair of temporally consecutive values of the same variable. In more detail, the generic condition H1 is evaluated by executing the following steps.
- H1.a.
Denote by $i$ the index of the generic unit (a row in the dataset), with $i = 1, \dots, m$ and $m = |U|$. Unit $i$ has values of a variable (or attribute) $v$ over several time instants $t = 1, \dots, n$, with $n = |S|$. Define now $\Delta v_i(t,t+1)$ as the difference (delta) between the two values assumed by unit $i$ at two consecutive time instants $t$, $t+1$ for variable $v$, that is:
$$\Delta v_i(t,t+1) = v_i^{t} - v_i^{t+1}$$
These deltas are computed for each period of the dataset and for each unit (and for each variable, if the dataset contains more than one). For the last period $n$, the delta $\Delta v_i(n,n+1)$ is not computable, so each unit has $n-1$ deltas. The generic value $\Delta v_i(t,t+1)$ can be negative or positive. We denote by $P$ the set of indices $t$ for which $\Delta v_i(t,t+1)$ is positive, and by $N$ the set of indices for which it is negative.
- H1.b.
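Compute for each unit $i$ the value $DV_i$, defined as the modulus of the product between the sum of the positive deltas and the sum of the negative deltas:
$$DV_i = \left| \sum_{t \in P} \Delta v_i(t,t+1) \cdot \sum_{t \in N} \Delta v_i(t,t+1) \right|$$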
This is, to some extent, a measure of the intrinsic variability of unit $i$: in practical cases, some units change their values more than others. If either $\sum_{t \in P} \Delta v_i(t,t+1)$ or $\sum_{t \in N} \Delta v_i(t,t+1)$ is equal to zero, its value is changed to 1, so that the product does not collapse to zero when the intrinsic variability of a unit must be nonzero. Note that this is one of the customizable aspects, depending on the practical case under study.
- H1.c.
Compute the value $DM_i$ for each unit $i$ as the ratio between $DV_i$ and the arithmetic mean of all the $DV$ values in the dataset considered:
$$DM_i = \frac{DV_i}{\frac{1}{m} \sum_{j=1}^{m} DV_j}$$
This value represents a normalization of the above measure of intrinsic variability. The normalization should be conducted over a homogeneous set of units to which unit $i$ belongs; depending on the context, such a homogeneous set must be identified. For example, in the case presented in Section 3, data from different national contexts (i.e., different countries) are strongly heterogeneous. For this reason, $DV_i$ is divided by the mean of the $DV$ values over the country to which the unit belongs.
- H1.d.
The numerical values of the above $DM_i$ may still vary greatly. To avoid numerical instability, we compress their scale by taking the cube root, obtaining the values $RQ_i = \sqrt[3]{DM_i}$, which represent the compressed normalized intrinsic variability of the unit.
- H1.e.
Compute the value $GM_i$ as the geometric mean of the absolute values of all the deltas of unit $i$, i.e., $GM_i = \left( \prod_{t=1}^{n-1} |\Delta v_i(t,t+1)| \right)^{1/(n-1)}$. This value represents an evaluation of the size of the unit. If some of the deltas are zero, they can again be replaced with 1 to prevent the product from collapsing to zero when this is not acceptable.
- H1.f.
Now, to compute a reasonable upper limit on the delta values that unit $i$ could attain, we multiply the compressed normalized intrinsic variability by the measure of the size of the unit, obtaining the following threshold $T_i$:
$$T_i = RQ_i \cdot GM_i$$
- H1.g.
Finally, to recognize an inversion of a value between two consecutive units $A$ and $B$ by computing H1, four conditions must hold at the same time: unit $A$ has two consecutive deltas larger (in modulus) than the threshold $T_A$ and with opposite signs (w.l.o.g., the first is positive and the second is negative), and unit $B$, for the same time instants, also has two consecutive deltas larger (in modulus) than the threshold $T_B$, but with signs reversed with respect to $A$ (the first is negative and the second is positive). In practice, condition H1 is given by the following boolean expression:
$$H1(A,B)_t = \big[\Delta v_A(t-1,t) > T_A\big] \wedge \big[\Delta v_A(t,t+1) < -T_A\big] \wedge \big[\Delta v_B(t-1,t) < -T_B\big] \wedge \big[\Delta v_B(t,t+1) > T_B\big]$$
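The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not a reference implementation: the function names (`deltas`, `dv`, `gm`, `thresholds`, `h1`) and the dictionary layout are introduced here for convenience, all units passed to `thresholds` are assumed to form one homogeneous group for the normalization, both sign orientations are accepted in `h1`, and instant t is assumed to use the deltas (t-1,t) and (t,t+1), as in Example 1. The input values are those of units 1 and 2 in Example 1.

```python
import math

def deltas(values):
    """H1.a: delta between consecutive instants, v[t] - v[t+1]."""
    return [a - b for a, b in zip(values, values[1:])]

def dv(ds):
    """H1.b: |sum of positive deltas * sum of negative deltas|."""
    pos = sum(d for d in ds if d > 0) or 1  # a zero sum is replaced by 1
    neg = sum(d for d in ds if d < 0) or 1
    return abs(pos * neg)

def gm(ds):
    """H1.e: geometric mean of |delta|, with zero deltas replaced by 1."""
    mags = [abs(d) or 1 for d in ds]
    return math.prod(mags) ** (1 / len(mags))

def thresholds(units):
    """H1.c, H1.d, H1.f: T_i = cbrt(DV_i / mean(DV)) * GM_i for each unit."""
    ds = {name: deltas(vals) for name, vals in units.items()}
    dvs = {name: dv(d) for name, d in ds.items()}
    mean_dv = sum(dvs.values()) / len(dvs)  # assumes one homogeneous group
    return {n: (dvs[n] / mean_dv) ** (1 / 3) * gm(ds[n]) for n in units}

def h1(d_a, d_b, t_a, t_b, t):
    """H1.g at 1-based instant t, with 2 <= t <= n-1."""
    a1, a2 = d_a[t - 2], d_a[t - 1]  # delta(t-1,t) and delta(t,t+1) of A
    b1, b2 = d_b[t - 2], d_b[t - 1]  # the same two deltas of B
    large = min(abs(a1), abs(a2)) > t_a and min(abs(b1), abs(b2)) > t_b
    # A's deltas have opposite signs and B's deltas mirror them.
    mirrored = a1 * a2 < 0 and b1 * b2 < 0 and a1 * b1 < 0
    return large and mirrored

# Values of unit 1 and unit 2 as used in Example 1.
units = {"unit1": [125, 18, 130, 120, 130], "unit2": [21, 150, 30, 25, 20]}
T = thresholds(units)
d1, d2 = deltas(units["unit1"]), deltas(units["unit2"])
print(T, h1(d1, d2, T["unit1"], T["unit2"], t=2))
```

With these data the sketch yields thresholds of about 32.17 and 25.59 and evaluates H1 as true at t=2, in line with Example 1. A per-country normalization, as used in Section 3, would only require calling `thresholds` separately on each group of units.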
If $H1(A,B)_t$ is true, then, to have a probable swap problem, the corresponding condition $H2(A,B)_t$ must also be true. The generic condition H2 is evaluated by the following steps.
- H2.a.
For each unit $i$, we define $I_i^t$ as the distance of the value $v_i^t$ at time $t$ from the mean value of $v$ over time, computed without the value at time $t$:
$$I_i^t = v_i^t - \frac{\sum_{s=1}^{n} v_i^s - v_i^t}{n-1}$$
- H2.b.
We now define $N_i^t$ as the distance of the value $v_i^t$ at time $t$ from the mean value of $v$ over time without the value at time $t$, but this time taking the values of the subsequent unit $i+1$ (the one with which the values could have been exchanged):
$$N_i^t = v_i^t - \frac{\sum_{s=1}^{n} v_{i+1}^s - v_{i+1}^t}{n-1}$$
- H2.c.
Finally, we define $F_i^t$ as the minimum between the moduli of the two above values:
$$F_i^t = \min\left(|I_i^t|, |N_i^t|\right)$$
In practice, we are comparing the distance between value $v_i^t$ and all the other values of unit $i$ with the distance between $v_i^t$ and all the other values of unit $i+1$. If $v_i^t$ is closer to the values of unit $i+1$, that is, if the minimum is $|N_i^t|$, then an inversion is probable.
Hence, condition H2 for units $A$ and $B$ is evaluated as follows:
$$H2(A,B)_t = \big[F_A^t \neq |I_A^t|\big]$$
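Condition H2 can be sketched in Python as follows. This is a minimal illustration under assumptions: the names `leave_one_out_distance` and `h2` are introduced here, time instants are 1-based, and H2 is taken to hold when the minimum is not attained by the distance from the unit's own values, as in Example 1. The input values are those of units 1 and 2 in Example 1.

```python
def leave_one_out_distance(value, series, t):
    """value minus the mean of `series` without its entry at 1-based instant t."""
    others = sum(series) - series[t - 1]
    return value - others / (len(series) - 1)

def h2(values_a, values_b, t):
    """True when A's value at instant t is closer to B's other values than to A's."""
    v = values_a[t - 1]
    i_at = leave_one_out_distance(v, values_a, t)  # I: distance from A's own values
    n_at = leave_one_out_distance(v, values_b, t)  # N: distance from B's values
    f_at = min(abs(i_at), abs(n_at))               # F: the smaller modulus
    return f_at != abs(i_at)

# Values of unit 1 and unit 2 as used in Example 1.
unit1 = [125, 18, 130, 120, 130]
unit2 = [21, 150, 30, 25, 20]
print(leave_one_out_distance(18, unit1, 2),
      leave_one_out_distance(18, unit2, 2),
      h2(unit1, unit2, 2))
```

In this form the test `f_at != abs(i_at)` is equivalent to `abs(n_at) < abs(i_at)`, so a tie between the two distances does not raise the condition.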
Conditions H1 and H2 are computed and checked for every pair of units $A$ and $B$ and every time instant $t$. If $H1(A,B)_t$ and $H2(A,B)_t$ are both true, a possible swapping error flag is raised for units $A$ and $B$ at time instant $t$; otherwise, no flag is raised. Note that this error may affect more than one time instant for the same two units.
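The scan over pairs of units and time instants can be sketched as follows. This is only an illustration: `find_swaps` is a hypothetical helper, the predicates `h1` and `h2` are supplied by the caller with an assumed signature `(values_a, values_b, t)`, and iterating over all unordered pairs is an assumption (restricting the scan to adjacent rows would only change the iterator). The stub predicates in the usage lines are placeholders, not the actual conditions.

```python
from itertools import combinations

def find_swaps(units, h1, h2):
    """Return (name_a, name_b, t) for every pair and instant where H1 and H2 hold.

    `units` maps a unit name to its list of values; `h1` and `h2` are
    caller-supplied predicates taking (values_a, values_b, t), t being 1-based.
    """
    flags = []
    for (name_a, a), (name_b, b) in combinations(units.items(), 2):
        n = min(len(a), len(b))
        # H1 needs a delta on each side of t, so t runs from 2 to n-1.
        for t in range(2, n):
            if h1(a, b, t) and h2(a, b, t):
                flags.append((name_a, name_b, t))
    return flags

# Placeholder predicates, used only to show the call pattern.
demo_units = {"u1": [1, 2, 3, 4, 5], "u2": [5, 4, 3, 2, 1]}
print(find_swaps(demo_units, lambda a, b, t: t == 2, lambda a, b, t: True))
```

Because flags are collected per time instant, the same pair of units can appear several times in the result, which matches the remark that an error may affect more than one instant.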
Example 1. We provide an example of the check for the inversion problem for two units (called unit 1 and unit 2) on a variable $v$ of a longitudinal dataset with five time instants ($n=5$). The data of the units are shown in Table 1. We first compute the deltas for each unit (see Table 1). For instance, unit 1 has $v_1^2 = 18$ and $v_1^3 = 130$, hence $\Delta v_1(2,3) = 18 - 130 = -112$. After this, $DV$ is equal to $|(-112-10)(107+10)| = 14274$ for unit 1 and $|(-129)(120+5+5)| = 16770$ for unit 2. Subsequently, the geometric mean $GM$ is 33.09 for unit 1 and 24.94 for unit 2; $DM$ is 0.92 for unit 1 and 1.08 for unit 2, and $RQ$ is 0.97 for unit 1 and 1.03 for unit 2. Consequently, the threshold $T$ is 32.17 for unit 1 and 25.59 for unit 2.
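The intermediate arithmetic behind these figures can be written out as follows, using only the values quoted above and assuming that the mean of $DV$ is taken over the two units of the example. The last digit may differ slightly depending on where rounding is applied.

```latex
\begin{align*}
GM_1 &= (107 \cdot 112 \cdot 10 \cdot 10)^{1/4} \approx 33.09, &
GM_2 &= (129 \cdot 120 \cdot 5 \cdot 5)^{1/4} \approx 24.94, \\
DM_1 &= \frac{14274}{(14274 + 16770)/2} = \frac{14274}{15522} \approx 0.92, &
DM_2 &= \frac{16770}{15522} \approx 1.08, \\
RQ_1 &= \sqrt[3]{DM_1} \approx 0.97, &
RQ_2 &= \sqrt[3]{DM_2} \approx 1.03, \\
T_1 &= RQ_1 \cdot GM_1 \approx 32.17, &
T_2 &= RQ_2 \cdot GM_2 \approx 25.59.
\end{align*}
```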
We now evaluate $H1(1,2)_t$. For unit 1, $\Delta v_1(1,2) > 0$ and $\Delta v_1(2,3) < 0$, while for unit 2, $\Delta v_2(1,2) < 0$ and $\Delta v_2(2,3) > 0$, so the sign requirements of condition H1 are satisfied. Additionally, all four deltas exceed, in modulus, the respective thresholds $T$. Therefore, $H1(1,2)_2$ is true.
To evaluate $H2(1,2)_2$, we compute $I_1^2$ and $N_1^2$ for unit 1 at time 2.
We have $I_1^2 = 18 - (125+18+130+120+130-18)/4 = -108.25$.
Similarly, $N_1^2 = 18 - (21+150+30+25+20-150)/4 = -6$.
Since $-6$ has the smaller modulus, $F_1^2 = 6$; thus $F_1^2 \neq |I_1^2|$ and $H2(1,2)_2$ is true. As both conditions are true, a probable inversion error flag is reported for $t=2$.
2.2. Anomalous Jump Problem
To identify anomalous jumps, we now compute for each unit $i$ a ‘threshold with tolerance’ $TT_i$, larger than the threshold used before, obtained as follows. After the computation of the threshold $T_i$ described in Section 2.1, we execute the following steps.
Now, an anomalous jump flag is raised for a unit $i$ between time instants $t$ and $t+1$ for variable $v$ if the modulus of $\Delta v_i(t,t+1)$ is greater than the threshold $TT_i$.
Example 2. We provide an example of the anomalous jump problem. Consider a unit (called unit 3) with variable $v$ of a longitudinal dataset with five time instants ($n=5$). The data and the deltas of the unit are shown in Table 2. We compute the threshold $T = 101.66$, following the procedure of the previous example. Then, we find $LGM = 4.32$, $VI = 13$ and $GMT = 177.15$. By considering $c = 8$ and the mean of deltas = 10, the resulting threshold with tolerance $TT$ is 177.15.
As $|\Delta v_3(2,3)| = 280 > 177.15$ and $|\Delta v_3(3,4)| = 290 > 177.15$, we report an anomalous jump flag for the period $t=2,3$ and for the period $t=3,4$. The data manager will have to check the values at $t=2$, $t=3$ and $t=4$ to understand the reasons for these anomalous jumps.
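The flagging rule itself can be sketched in a few lines of Python. This is a minimal illustration: the function name `anomalous_jumps` is introduced here, and the threshold with tolerance is passed in as a given number rather than computed. The moduli 280 and 290 and the threshold 177.15 come from Example 2, while the signs of those two deltas and the two remaining deltas are hypothetical placeholders, since they are not quoted in the text.

```python
def anomalous_jumps(deltas, tt):
    """Return the 1-based periods (t, t+1) whose |delta| exceeds the threshold TT."""
    return [(t, t + 1) for t, d in enumerate(deltas, start=1) if abs(d) > tt]

# Deltas for periods (1,2), (2,3), (3,4), (4,5); only the moduli 280 and 290
# are taken from Example 2, the signs and the values 5 are hypothetical.
deltas_unit3 = [5, 280, -290, 5]
print(anomalous_jumps(deltas_unit3, tt=177.15))
```

The result lists the periods (2,3) and (3,4), the two periods flagged in Example 2.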
2.4. Data
The European Tertiary Education Register (ETER) [23] is a key initiative for understanding the higher education landscape in Europe, developed after the successful AQUAMETH project [24,25]. This database provides a reference list of Higher Education Institutions (HEIs) and institutional data on their activities and achievements, including students, graduates, staff and finances. It thus complements national and regional education statistics provided by EUROSTAT [26].
As of March 2024, ETER includes 41 European countries and provides data from 2011 to 2020, with a total of over 3,500 HEIs. ETER collects a wide range of data on HEIs, including: institutional characteristics (type, size, specialization), student information (enrolment, graduates, mobility), staff (lecturers, researchers, administrative staff), finances (income, expenditure, investment) and research and development activities. ETER complies extensively with statistical regulations and manuals, in particular the UOE Manual on Data Collection on Formal Education and the OECD Frascati Manual on Research and Experimental Development Statistics. This ensures the comparability of data with other international sources.
Collaboration with a network of experts and data providers in all participating countries ensures that information is collected from reliable and consistent sources. Established methodologies are used to define variables and indicators, enabling the re-use of collected data for statistical purposes and comparability with other sources. Data undergo rigorous quality control and validation to identify and correct errors or inconsistencies, as described in [27]. However, as described in Section 3, the proposed techniques were able to locate several cases of the specific longitudinal data problems described above.
ETER provides comprehensive documentation of the methodologies used and the data collection processes, ensuring transparency and replicability. ETER contributes to a better understanding of the higher education landscape and is a valuable resource for researchers, policy makers and stakeholders in European higher education. Within ETER, we selected the variable Total academic personnel in headcount (HC) because it is widely used in empirical analysis and by policy makers as a proxy for the size of universities. It is therefore one of the most important variables, and detecting any possible errors in it is of paramount importance. Total academic personnel in HC, according to the ETER manual, includes
- i) the number of academic staff whose primary assignment is instruction, research or public service,
- ii) staff who hold an academic rank, like professor, assistant professor, lecturer or an equivalent title,
- iii) staff with other titles (like dean, head of department, etc.) if their principal activity is instruction or research, and
- iv) PhD students employed for teaching assistance or research.
We report our experiments on the largest EU countries present in ETER, i.e., Germany, France, Italy, Spain, Poland and Portugal, for a total of 1587 HEIs, in the time period from 2011 to 2020. Table 4 shows the subdivision by country. Table 5 reports the number of HEIs having complete data for each year.