I. Introduction
Table-to-Text is a subset of Natural Language Generation (NLG) that involves taking structured data from a table, chart, or comparable format and generating coherent, precise natural-language descriptions of it. This study is part of the broader Data-to-Text pipeline, in which input data comes from a variety of structured forms such as JavaScript Object Notation (JSON), database tables, and knowledge graphs. To make the data understandable to a Language Model (LM), it is transformed into a textual, linearized format that preserves the original structure and content; this linearization is normally carried out by people. The goal of Table-to-Text is therefore to produce fluent and accurate descriptions of the tabular data once it is in a form the LM can process. Despite tremendous advances in Natural Language Processing (NLP) over the past few decades, the majority of research, models, and datasets have concentrated on English, leaving many languages with large speaker populations, such as Igbo in Nigeria, underserved. Such Low-Resource Languages (LRLs) risk missing out on the benefits of rapidly advancing NLP technologies if research does not extend to include them [1,2,3,4,5,6]. There is therefore an urgent need for strategies that improve NLP performance across a broader range of languages. The complexity of Table-to-Text makes it a useful benchmark for measuring the reasoning abilities of LMs, as it frequently requires combining data from several table cells and performing basic arithmetic. Current multilingual baseline models can produce fluent descriptions of data in low-resource African languages, but these descriptions frequently lack faithfulness to the underlying table, introducing errors that must be corrected. Intermediate planning, in which a model generates some form of planning text before the final output, is a promising way to improve the faithfulness of generated text. This planning text serves as a content blueprint and has been shown to increase faithfulness in tasks such as summarization. Techniques such as Question-Answer (QA) pairs and entity chains used as intermediate plans have proven effective in various NLG tasks.
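As a minimal illustration of the linearization step described above, a table can be flattened into a single string for a sequence-to-sequence LM. The helper below is a sketch that mirrors the linearized inputs shown later in Table II; the function name and exact separators are assumptions rather than the released TATA preprocessing code.

```python
# Sketch of table linearization; separators mirror the linearized inputs in Table II.
def linearize_table(title: str, unit: str, cells: list[tuple[str, str]]) -> str:
    cell_str = " ".join(f"({label}, {value})" for label, value in cells)
    return f"{title} | {unit} | {cell_str}"

print(linearize_table(
    "Planning Status of Births",
    "Percent",
    [("Wanted then", "0.57"), ("Unwanted", "0.17"), ("Wanted later", "0.26")],
))
# Planning Status of Births | Percent | (Wanted then, 0.57) (Unwanted, 0.17) (Wanted later, 0.26)
```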
The goal of this study is therefore to apply intermediate planning approaches, particularly QA pairs, to multilingual Table-to-Text generation and to evaluate their effect on the understandability (fluency) and attribution (faithfulness) of the resulting descriptions. The primary research question is whether QA blueprints can improve the attribution of multilingual Table-to-Text generation. Applying QA blueprints, a strategy previously shown to improve the faithfulness of summaries, to Table-to-Text generation is novel, especially in a multilingual setting involving African languages. The findings show that QA blueprints improve the attributability of outputs for models trained and evaluated only on English data. However, their usefulness decreases in a multilingual setting. The difficulties originate from errors in machine-translating the English-generated blueprints into the target languages, which degrade the quality of the training dataset and create a significant disadvantage before training even begins. Additionally, models have difficulty producing descriptions that rely closely on these blueprints.
The remainder of the paper is organized as follows. Section II provides background, Section III surveys the state of the art, Section IV covers the pre-implementation work, Section V the post-implementation work, Section VI the experimental analysis, and Section VII the result analysis. Section VIII concludes the study with ideas for future research.
II. Background
In 2014, the notion of sequence-to-sequence (seq2seq) learning was established, in which an encoder-decoder neural network maps an input sequence to an output sequence. This design works well for tasks like machine translation and summarization. The flexibility of the seq2seq framework allows it to encode arbitrary sequences, including structured representations such as tables or images, making it a widely used approach for NLG. Historically, encoders and decoders were mainly Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks; however, the transformer architecture has since gained popularity due to its efficiency and performance. The proposal of attention mechanisms, notably self-attention, led to the creation of the transformer architecture. The transformer is a simpler model that relies solely on attention, avoiding the need for recurrence and convolutions. It is more parallelizable and faster to train, resulting in state-of-the-art performance on machine translation tasks. In a transformer-based encoder-decoder, the encoder converts an input sequence into a series of hidden states, while the decoder models a conditional distribution over the target sequence using these hidden states and all previous target tokens. This auto-regressive generation ensures that the decoder produces the next token based on both the encoder output and all prior decoder outputs, maintaining context and coherence in the output. T5, or Text-to-Text Transfer Transformer, is a sequence-to-sequence transformer model that employs an encoder-decoder design. T5 can handle tasks including summarization, classification, and regression framed as text-to-text problems. It uses transfer learning, in which the model is pretrained on a huge dataset, such as the Colossal Clean Crawled Corpus (C4), and then fine-tuned for specific downstream applications. During pretraining, spans in the input are masked and the model is trained to predict them, allowing it to capture language structure. T5 achieved state-of-the-art performance on various benchmarks upon its introduction in 2019. Transfer learning is similarly useful in multilingual NLP. mT5, the "Massively Multilingual pre-trained Text-to-Text Transformer", is a multilingual variant of T5 that covers 101 languages with a larger vocabulary and stronger training procedures. mT5 is pretrained on the mC4 dataset, a multilingual variant of C4, and improves performance by using GeGLU activations (Gated Linear Units with a GELU nonlinearity) in its feed-forward layers. The model comes in a variety of sizes, and training requires significant memory and computing resources.
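To make the encoder-decoder setup concrete, the sketch below loads an mT5 checkpoint with the Hugging Face transformers library and computes the teacher-forced loss for one (linearized table, verbalization) pair. The checkpoint name follows the public Hub release; the example strings are placeholders, and a recent transformers version providing the text_target argument is assumed.

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

source = "Planning Status of Births | Percent | (Wanted then, 0.57) (Unwanted, 0.17)"
target = "17 percent of births in Kenya are unwanted."

# Encoder input and decoder labels for standard cross-entropy fine-tuning.
batch = tokenizer(source, text_target=target, truncation=True,
                  max_length=512, return_tensors="pt")
loss = model(**batch).loss
```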
Despite advances in neural generation models, they continue to face issues including hallucination, repetition, and difficulty maintaining consistency with the input data. To increase the faithfulness of generated text, techniques such as content selection and planning have been used, in which the model produces an intermediate plan before the final output. Entity chaining, for example, involves generating an ordered set of entities from the target summary, which the model produces before the summary itself. Another approach employs QA pairs as intermediate plans, providing greater control over and justification of the model's output. Content selection and planning have also been shown to improve Table-to-Text outputs: prior to verbalization, a "plan" of what to convey and in which order is generated. This technique, which employed an LSTM-based encoder-decoder architecture with attention, produced state-of-the-art BiLingual Evaluation Understudy (BLEU) scores on datasets such as RotoWire and ToTTo. Modern text-to-text pretraining and fine-tuning frameworks, such as mT5, have since outperformed this method. Evaluating whether a generated text is understandable and attributable remains difficult. While human evaluators make the best judgments, the process is costly, time-consuming, and subjective, impeding rapid model development and study. Automated evaluation metrics such as BLEU and ROUGE, which measure n-gram overlap between predictions and references, have been widely used to assess Table-to-Text generation. However, these metrics correlate poorly with human judgments and are insufficient for judging adherence to the source table. To address this, PARENT (Precision and Recall of Entailed N-grams from the Table) was introduced in 2019, which incorporates the table alongside the reference when evaluating precision, rewarding correct table content in the output. FactKB is another metric that assesses the factual accuracy of summaries by comparing them to source texts rather than references. Despite these developments, automated measures remain inadequate for reliably judging performance on difficult multilingual Table-to-Text tasks. The continual development of new metrics aims to enhance the evaluation of NLG systems; however, there are considerable hurdles to ensuring their effectiveness and reliability.
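The surface-overlap metrics discussed above can be computed with the sacrebleu library, as in the brief sketch below; the hypothesis and reference strings are illustrative only, and the learned metrics (FactKB, and later STATA) require their own fine-tuned models and are not shown.

```python
import sacrebleu

hypotheses = ["17 percent of births in Kenya are unwanted."]
references = [["17 percent of births in Kenya are unwanted, and 26 percent are mistimed."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # n-gram precision with brevity penalty
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # character n-gram F-score
print(f"BLEU = {bleu.score:.2f}, CHRF = {chrf.score:.2f}")
```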
III. State of the Art
Pretraining on large volumes of data is required for modern neural models like Generative Pre-trained Transformer (GPT)-3. GPT-3, for example, was trained on 570 GB of text filtered from 45 TB of compressed plain text obtained from web crawls, Wikipedia, and books, according to [7]. This enormous training corpus covers widely spoken languages such as English, French, and German, each of which has a wealth of written material available online and in books. Languages with little internet presence or written material, known as Low-Resource Languages (LRLs), are underrepresented [8,9,10,11]. As a result, models trained largely on high-resource languages struggle with downstream tasks in LRLs because of this lack of exposure. To alleviate data scarcity in LRLs, [9] considered expanding and augmenting datasets using computational approaches or manually creating new datasets. This approach is critical for tasks like multilingual Table-to-Text generation, which is difficult due to the prevalence of English in existing datasets such as ToTTo. To address this issue, [12] established TATA, "A Multilingual Table-to-Text Dataset for African Languages". TATA contains Table-to-Text samples in nine languages, including four African languages (Hausa, Igbo, Swahili, and Yoruba) and Russian as a zero-shot test language, which measures how well a model generalizes to a language not observed during training. Despite its novel approach, TATA is very small, with just 8,700 samples compared with ToTTo's 120,761. ToTTo also differs from TATA in that it uses highlighted cells in tables to constrain the information in the verbalization, whereas TATA does not, making the task less constrained and more demanding. This results in a wider range of correct verbalizations for any particular table, potentially making it more difficult to attain high scores on TATA. When fine-tuned on TATA, the mT5 model demonstrated poor understandability and attributability, with only 44% of its outputs meeting these criteria, 9% lower than the reference rate. Furthermore, mT5 obtained low BLEURT, Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and CHRF scores on TATA, which correlated poorly with human judgments according to [12]. The work in [13] inspired the development of the Statistical Assessment of Table-to-Text in African Languages (STATA) metric, which was fine-tuned on human reviews of model outputs and references. STATA demonstrated a significantly greater correlation with human annotations, underscoring the limitations of typical automatic metrics for evaluating model performance on difficult datasets such as TATA. In the baseline model, mT5 takes linearized tables as input and learns to produce target verbalizations. To improve model performance, intermediate planning strategies for content selection and planning, previously used for summarization tasks, could be utilized. The goal of this study is therefore to implement these strategies and evaluate them using STATA and other automatic metrics. This approach seeks to improve the consistency and quality of outputs in multilingual Table-to-Text generation tasks, addressing the unique constraints posed by LRLs and ensuring improved model performance and evaluation accuracy [14,15,16,17,18,19,20,21,22].
IV. Pre-Implementation
This study aims to improve the effectiveness of models trained on the TATA dataset by addressing data quality issues and optimizing the handling of samples with no references, making the data more suitable for multilingual Table-to-Text generation. The strategy highlights the importance of clean, well-prepared datasets in building strong neural models. TATA contains samples without references, which [12] dealt with in two ways: skipping them, or marking inputs to distinguish between those with and without references. They found that omitting reference-less samples produced superior results, so this method is used here. TATA's columns include example id, title, unit of measure, chart type, translation status, table data, linearized input, and table text. The table text column contains a list of all references, and to expand the training data, samples are repeated once for each reference. Following cleaning and reference expansion, the training set has 7,060 rows, the validation set 754, and the test set 763. In the validation set, the first reference is used as the target; during testing, metrics are computed between the prediction and each reference, and the highest score is taken. A supplementary dataset containing only the English instances of the main set is constructed to fine-tune English-only models; its partitions consist of 902, 100, and 100 rows for training, validation, and testing, respectively. During data preprocessing, many rows were found to have problems in the table text field. Some rows contained acceptable references separated by multiple commas, a "START OF TEXT" tag intended for human annotators, or unrelated references, while other rows contained only repeated commas. For instance, the reference "Four percent of Tanzanian women aged 15-49 reported having two or more sexual partners in the past 12 months" appears incorrectly in example DM51-en-3 but correctly in SR196-en-3. To address this, stray commas, annotator tags, and misplaced references were removed, but the repaired rows were kept in the dataset to preserve training data. References following the "START OF TEXT" tag were assumed to be misplaced based on the English instances, though this could not be confirmed for the non-English cases.
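A condensed sketch of the cleaning and reference-expansion step is given below. The column name table_text and the literal annotator tag are assumptions based on the description above rather than the exact released TATA schema, and the regular expressions cover only the error types mentioned.

```python
import re
import pandas as pd

ANNOTATOR_TAG = "START OF TEXT"  # assumed literal form of the annotator tag

def clean_references(raw_refs: list[str]) -> list[str]:
    refs = []
    for ref in raw_refs:
        ref = ref.split(ANNOTATOR_TAG)[0]              # drop text following the annotator tag
        ref = re.sub(r",{2,}", ",", ref).strip(" ,")   # collapse stray repeated commas
        if ref:
            refs.append(ref)
    return refs

def expand_by_reference(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["table_text"] = df["table_text"].apply(clean_references)
    df = df[df["table_text"].map(len) > 0]             # skip samples with no references
    df = df.explode("table_text")                      # one training row per reference
    return df.rename(columns={"table_text": "target"}).reset_index(drop=True)
```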
V. Post-Implementation
An approach similar to that described by [23] is used to create QA blueprints for multilingual generation, with some changes. Each English reference, which is a single sentence, is first broken down into propositions. A proposition is a sub-sentential logical unit or fact contained within a sentence. Rather than the manual approach used by [23], this study uses a flan-t5-large model fine-tuned for "propositionizing" sentences to construct minimal sentences for each proposition. For instance, from the sentence "In Nigeria, young women with low empowerment would like an average of 6.8 children, 2 children more than young women with high empowerment", the following propositions are extracted: "In Nigeria, young women with low empowerment would like an average of 6.8 children" and "In Nigeria, young women with low empowerment would like to have an average of 2 more children than those with high empowerment". This method eliminates overlapping or redundant information because QA pairs arise directly from each proposition. Five QA pairs are produced for each proposition using a T5-Large model fine-tuned on the Stanford Question Answering Dataset (SQuAD); trial and error showed that five are enough to yield high-quality pairs. To ensure variation, generation uses sampling (do_sample=True). To clean the data, QA pairs whose questions lack a question mark or whose answers are empty strings are removed. A regular expression is used to identify and remove QA pairs that contain hallucinated numbers not found in the source reference. QA pairs in which the answer is entirely contained within the question are likewise discarded. For identical answers, the QA pair with the highest lexical similarity to the proposition is retained, and the QA pair with the closest lexical similarity to the reference is then chosen. [23] filter out QA pairs in which the answer does not occur at the end of the proposition, exploiting the thematic structure of natural language sentences. This step is omitted in this study because it risks removing more natural and statistically focused QA pairs, which are critical for accurate information extraction. On average, a QA blueprint created this way has two or three QA pairs. [23] found that concatenating answers before questions worked well. Answers and questions are separated by a period, and QA pairs by a pipe. The special tokens "Blueprint:" and "Verbalization:" are prefixed to the blueprint and verbalization, respectively, before the two are concatenated. [23] refer to this as a "global" blueprint for an End-to-End (E2E) model, which determines the content focus for the whole output. They also experiment with Multi-Task and Iterative models, but this study focuses on the E2E setup because the verbalizations are often short. The encoder-decoder model accepts the linearized table as input, predicts the blueprint, and then produces the output. Errors in blueprints propagate through the pipeline, influencing the final output [24,25,26,27,28,29,30,31,32,33,34].
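The sketch below shows how a training target might be assembled and filtered under the format just described. The filtering heuristics and literal prefix strings follow the description above, but their exact implementation here is an assumption, not the released pipeline.

```python
import re

def keep_pair(question: str, answer: str, reference: str) -> bool:
    """Filtering heuristics for one QA pair, as described above."""
    if not question.strip().endswith("?") or not answer.strip():
        return False                                    # malformed question or empty answer
    if answer.lower() in question.lower():
        return False                                    # answer fully contained in the question
    numbers = set(re.findall(r"\d+(?:\.\d+)?", question + " " + answer))
    return numbers <= set(re.findall(r"\d+(?:\.\d+)?", reference))  # no hallucinated numbers

def build_target(qa_pairs: list[tuple[str, str]], verbalization: str) -> str:
    """Concatenate answers before questions; pairs separated by a pipe."""
    blueprint = " | ".join(f"{answer}. {question}" for question, answer in qa_pairs)
    return f"Blueprint: {blueprint} Verbalization: {verbalization}"

print(build_target(
    [("What was the rate of antenatal care from a skilled provider in 2003?", "88%")],
    "The 2008-09 data indicate a rise since 2003 in medical antenatal care coverage.",
))
```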
One drawback is that QA pairs formed from reference verbalizations cannot include information absent from those references. TATA's source documents are linearized tables, so more detailed QA pairs cannot be built directly from them. For multilingual generation, two blueprint setups are created: one with English blueprints and output in the target language, and another with blueprints translated into the target language. A language tag is placed after "Verbalization" to specify the target language. Machine Translation (MT) of QA pairs propagates small translation errors through the pipeline. To assess this risk, translation quality is evaluated on the training set: Google Translate is used to translate each English sample's reference into the other languages in the training set, and the translations are then compared to the equivalent samples in those languages using automatic MT metrics. The results suggest that translation quality is higher for widely spoken languages such as French and Portuguese, but lower for the African languages. Swahili has the best translation quality among the African languages, even outperforming Arabic, while Yoruba has the lowest. This variation is to be expected, given that mT5 was trained on far more English and French samples.
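A brief sketch of this translation-quality check is shown below: machine-translated English references are scored against the gold references in each target language. The translation step itself is left abstract here, and the function name and example strings are placeholders rather than the actual evaluation script.

```python
import sacrebleu

def translation_quality(mt_outputs: list[str], gold_refs: list[str]) -> float:
    """CHRF between machine-translated references and a language's gold references."""
    return sacrebleu.corpus_chrf(mt_outputs, [gold_refs]).score

# Placeholder strings; in the study this is computed over every non-English
# training sample, yielding one score per language.
print(translation_quality(
    ["Placeholder machine translation of an English reference."],
    ["Placeholder gold reference in the target language."],
))
```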
VI. Experimental Analysis
[12] conducted their experiments using the mT5-Small and mT5-XXL models. However, due to the massive size of mT5-XXL and mT5-XL, fine-tuning these models was not possible with the computational resources available for this study. As a result, the largest model that could be trained was mT5-Large, with 1.2 billion parameters. Both mT5-Small and mT5-Large were fine-tuned with a conditional generation head, using the following hyperparameters and setup for each experiment: a constant learning rate of 0.001, dropout of 0.1, per-device batch size of 4, five epochs for the small model and three for the large, weight decay of 0.001 for the large model, and linearized tables (inputs) and references truncated to 512 tokens. This truncation figure was determined after analyzing the lengths of untruncated tokenized samples, the majority of which were shorter than 512 tokens. Validation loss was calculated every 100 or 500 steps, depending on the batch size, and the checkpoint with the lowest validation loss was selected. It was discovered that the fine-tuned models were prone to generating extremely repetitive outputs, particularly in non-English languages, and the use of blueprints compounded this problem. As a result, in some cases the blueprint could not be trimmed from the output before comparing it to the reference, which was highly undesirable because the blueprint is not meant to be evaluated against the reference. To address this, a repetition penalty was applied during inference (rather than fine-tuning) via Hugging Face's repetition_penalty parameter, which can be passed to model.generate(). This penalty, as defined by [35], penalizes tokens that have previously been generated: the probability distribution over the next token is adjusted according to whether the token has already appeared, with the severity of the penalty determined by the parameter θ. An investigation of varied repetition penalties was performed on the English-fine-tuned mT5-Small with blueprints, as the smaller model and dataset enabled faster analysis. It was found that as the repetition penalty increased, performance on all metrics improved until the penalty reached 1.4, at which point the metrics began to degrade. A θ of 1.0 is equivalent to no penalty, but a penalty of 1.2 was shown to be sufficient to stop most highly problematic repetitive outputs while not materially changing typical generations. This finding is consistent with [35], who found that a penalty of 1.2 strikes a reasonable balance between reducing repetition and retaining fluent, sensible outputs [36,37,38,39,40,41]. Where excessively repetitive blueprints were still present during testing, they marginally lowered the overall metrics, since a blueprint was being compared to a reference verbalization it was never meant to be scored against; however, these problematic candidates were not excluded from the analysis. Despite the improvements brought about by the repetition penalty, some degree of repetition persisted, demonstrating the intrinsic difficulty of managing repetitive outputs in multilingual models, particularly when computational resources are limited.
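For reference, the penalized sampling rule of [35] rescales the logit of any token that has already been generated. With logits x_i, temperature T, and g the set of previously generated tokens, the next-token distribution becomes

\[
p_i = \frac{\exp\bigl(x_i / (T \cdot I(i \in g))\bigr)}{\sum_j \exp\bigl(x_j / (T \cdot I(j \in g))\bigr)}, \qquad I(c) = \theta \ \text{if}\ c \in g \ \text{and}\ 1 \ \text{otherwise},
\]

so θ = 1.0 leaves the distribution unchanged while larger values suppress previously generated tokens. In the Hugging Face API this corresponds to a call of the form model.generate(input_ids, repetition_penalty=1.2), which is how the penalty was applied at inference time in this study.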
VII. Result Analysis
A. English Subset
In Table I, CHRF and BLEU scores are computed between the predicted and reference verbalizations, while FACTKB and STATA scores are computed from the predicted verbalization and the linearized input table. The results show that blueprints improve the attributability of outputs for a model fine-tuned on the English rows. Specifically, the small model's STATA score rises from 0.513 to 0.523. Given the very narrow range of values produced by STATA (roughly 0.49 to 0.72), a 0.01 increase is meaningful. This narrow range is characteristic of learned metrics, implying that if the metric were trained with a bigger model, the range of observed values would expand. The blueprint model also sees an increase in FACTKB score of 0.04. The mT5-Large model was additionally fine-tuned on the English subset; due to the small size of the dataset, it overfitted quickly and failed to give representative results. The English subset, which has 902 training samples, is approximately eight times smaller than the full dataset and is insufficient for fine-tuning the large model. As a result, these findings were discarded and are not reported. In the first example in Table II, neither the vanilla nor the blueprint verbalization is easily understood or attributed. Both allude to valid concepts, but the former makes no mention of any actual data, whilst the latter completely hallucinates the data [22]. The blueprint is partially correct, with "wanted then" ranked as the most frequent category at 0.57; however, this data is not properly reflected in the verbalization. In the second example, a very effective blueprint extracts essential details from the table, correctly identifying the percentage (88%) and the year (2003). Despite this, the verbalization fails to make effective use of the blueprint, noting the correct year but not the figure. This is still preferable to the vanilla verbalization, which mentions neither. Manual examination of model outputs reveals recurring anomalies. Models frequently produce phrases like "increased from 15 percent to 15 percent", demonstrating a misunderstanding of the idea of "increase", as the numbers are identical. Similar errors arise when comparing two numbers, even when both exist in the input table; for example, the phrase "mortality rate is 19, compared to 19" shows a lack of reasoning. It is worth noting that the 1.2 repetition penalty does not prevent this issue. Most verbalizations begin with phrases such as "The percentage of..." or "The proportion...", which match prevalent sentence forms in the training data. While this is not a problem in itself, it raises questions about the model's capacity to produce diverse verbalizations from the dataset. Furthermore, several verbalizations are incomplete, such as "The fraction of children under the age of 5 who have no use or too small for their age" or "Though a majority of women who got antenatal care from an experienced provider". These are valid as the first half of a verbalization, but they are incomplete and therefore uninformative. In general, the title and unit parts of the input tables are represented more regularly and accurately in the verbalizations than the actual data points. This observation is consistent with the findings of [12].
Table 1. mT5-Small fine-tuned and evaluated on the TATA English subset.

Model | CHRF | BLEU | FACTKB | STATA
mT5-Small | 0.33 | 0.15 | 0.24 | 0.513
mT5-Small (Blueprints) | 0.30 | 0.11 | 0.28 | 0.523
Table 2. Example outputs from the vanilla and blueprint models, with the reference and linearized input used to produce each blueprint and verbalization.

Example 1
Reference: 17 percent of births in Kenya are unwanted, and 26 percent are mistimed (wanted later).
Linearised input: Planning Status of Births | Percent | (Wanted then, 0.57) (Unwanted, 0.17) (Wanted later, 0.26)
Vanilla verbalisation: The proportion of births were wanted at the time of delivery.
Blueprint: wanted births. What was the most common form of births?
Blueprint verbalisation: The proportion of births were wanted at births, from a low of 7% in 1990 to a high of 11% in 2012.

Example 2
Reference: The 2008-09 data indicate a rise since 2003 in medical antenatal care coverage.
Linearised input: Trends in Receipt of Antenatal Care from a Skilled Medical Provider, Kenya 2003-2008 | Percentage of women with live birth in the past 5 years | (2003, 88) (2008-09-01 00:00:00, 92)
Vanilla verbalisation: Although the proportion of women who have received antenatal care from a skilled provider.
Blueprint: 88%. What was the rate of antenatal care from a skilled provider in 2003?
Blueprint verbalisation: The proportion of women with antenatal care from a skilled provider in 2003.
B. Multilingual Analysis
The examination of the multilingual models shows that the small baseline model trained by [12] obtained a CHRF score of 0.33, nearly identical to the 0.32 achieved by the baseline small model in this research, as shown in Table III, indicating satisfactory replication of the results. However, models trained with English blueprints fared badly. These models produce material in both English and the target language, typically mixed; for example, one generated statement blended English and Swahili. Translated blueprints outperformed English blueprints on all metrics but fell significantly short of models that used no blueprints at all. [12] reported a significant increase in TATA performance from mT5-Small to mT5-XXL (13B). This raises the question of why, in this study, mT5-Large (1.2B) does not outperform mT5-Small. One plausible hypothesis is model-wise double descent, in which performance degrades as model size increases up to a certain point before improving again as the model grows further. Even the full TATA dataset is small, and the large model converges quickly or overfits; expanding the model scale by a further factor of ten could again lead to fast convergence, but at a lower loss.
Table 3. Results of the fine-tuned multilingual models on the full test set.

Model | CHRF | BLEU | STATA
mT5-Small | 0.32 | 0.16 | 0.552
mT5-Small (Eng blueprints) | 0.29 | 0.09 | 0.525
mT5-Small (Trans blueprints) | 0.30 | 0.12 | 0.542
mT5-Large | 0.33 | 0.13 | 0.552
mT5-Large (Eng blueprints) | 0.24 | 0.04 | 0.519
mT5-Large (Trans blueprints) | 0.27 | 0.11 | 0.544
C. Per-Language Analysis
The multilingual models were evaluated separately for every language in the test set, with the better-performing translated blueprints used for the blueprint models. The per-language findings were inconsistent, with blueprints only occasionally improving metrics and only a minor performance difference between the small and large models. Some intriguing observations emerged. Yoruba and Igbo, two low-resource languages widely spoken in West Africa, particularly Nigeria, showed divergent results. Igbo did extraordinarily well, with the highest STATA score of any language, as shown in Table IV. Yoruba, on the other hand, had the lowest BLEU and CHRF scores while still performing rather well on STATA. Yoruba's complex characters and diacritics may have contributed to its low BLEU scores; [12] did not report similarly poor results for Yoruba. Surprisingly, lower-resource African languages like Igbo, Hausa, and Yoruba benefited greatly from increased model size: these languages achieve higher STATA scores with the large model, demonstrating that scale is especially important for low-resource languages. The findings imply that, while overall gains are minor, increasing model size can significantly improve performance for languages with restricted resources.
Table 4. Performance of the multilingual mT5 models by language (BLEU / CHRF / STATA).

Lang | Small | Small Blueprints | Large | Large Blueprints
En | 0.19 / 0.33 / 0.551 | 0.15 / 0.34 / 0.529 | 0.17 / 0.37 / 0.538 | 0.10 / 0.29 / 0.549
Sw | 0.21 / 0.39 / 0.589 | 0.17 / 0.36 / 0.569 | 0.16 / 0.38 / 0.581 | 0.13 / 0.30 / 0.585
Yo | 0.03 / 0.13 / 0.567 | 0.03 / 0.14 / 0.563 | 0.03 / 0.14 / 0.577 | 0.02 / 0.14 / 0.561
Fr | 0.17 / 0.36 / 0.528 | 0.11 / 0.33 / 0.526 | 0.14 / 0.38 / 0.526 | 0.12 / 0.31 / 0.529
Pt | 0.17 / 0.39 / 0.527 | 0.16 / 0.34 / 0.518 | 0.15 / 0.39 / 0.512 | 0.15 / 0.32 / 0.531
Ha | 0.17 / 0.33 / 0.526 | 0.12 / 0.33 / 0.523 | 0.12 / 0.33 / 0.546 | 0.12 / 0.29 / 0.515
Ar | 0.14 / 0.32 / 0.539 | 0.11 / 0.33 / 0.523 | 0.12 / 0.33 / 0.533 | 0.12 / 0.31 / 0.519
Ig | 0.20 / 0.35 / 0.596 | 0.17 / 0.32 / 0.584 | 0.16 / 0.34 / 0.605 | 0.15 / 0.27 / 0.558
D. Blueprint Analysis
Table V illustrates how closely predicted blueprints resemble the reference blueprints on the development set; candidates and references are split on "Verbalization" and only the blueprint portions are compared. The development set is chosen because only verbalizations are compared during testing. The scores are noticeably low, but achieving high CHRF or BLEU on the blueprints, or on the total output, is not the training's explicit objective; if it were, these statistics would be computed on the development set during fine-tuning, with the best checkpoint selected according to whichever maximized them. Multiple acceptable blueprints may exist for the same table, particularly when creating brief verbalizations from large tables. More importantly, blueprints must be tied to the input table, and verbalizations must be related to their blueprints.
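The comparison just described can be sketched as follows: each candidate and gold target is split on the delimiter so that only the blueprint portions are scored. The literal prefixes "Blueprint:" and "Verbalization:" are taken from the target format described in Section V (the multilingual variant appends a language tag, which this sketch does not handle); everything else is an illustrative assumption.

```python
import sacrebleu

def split_output(text: str) -> tuple[str, str]:
    """Split a model output or gold target into its blueprint and verbalization parts."""
    blueprint, _, verbalization = text.partition("Verbalization:")
    return blueprint.replace("Blueprint:", "").strip(), verbalization.strip()

def blueprint_chrf(predictions: list[str], targets: list[str]) -> float:
    pred_blueprints = [split_output(p)[0] for p in predictions]
    gold_blueprints = [split_output(t)[0] for t in targets]
    return sacrebleu.corpus_chrf(pred_blueprints, [gold_blueprints]).score
```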
Table 5. CHRF and BLEU between the gold and predicted blueprints on the validation set.

Model | CHRF | BLEU
Small Trans Blueprints (Multilingual) | 0.27 | 0.07
Small Blueprints (English) | 0.23 | 0.05
Table VI evaluates this relationship by showing how well the models create blueprints and how closely verbalizations follow them. CHRF and BLEU are computed between the linearized input and the blueprint, to measure how much information from the table appears in the blueprint, and between the blueprint and the verbalization, to determine how closely the output depends on the blueprint for content selection. The goal is not for blueprints to be maximally similar to input tables, or verbalizations to blueprints; rather, the model scores are interpreted relative to the corresponding properties of the dataset. The table also shows the best-case scores for the English and multilingual datasets, computed on the corresponding training sets, while the rows for the English and multilingual models show scores calculated on the respective test sets using the models' generated blueprints and verbalizations. It is clear that the English model cannot produce blueprints as closely connected to the linearized table as the gold blueprints (CHRF of 0.24 versus 0.28). The similarity between its blueprints and outputs is also quite low (CHRF 0.39, BLEU 0.20) compared with the dataset (CHRF 0.61, BLEU 0.24), suggesting that the model struggles to develop quality blueprints and does not stay faithful to its own plan. The multilingual model is more effective at producing verbalizations grounded in its blueprints, with a smaller relative drop. However, its verbalizations still do not depend heavily on its blueprints, implying that a form of constrained decoding could help focus the model on using words generated in its blueprint, which is an area for future research. It is also clear that the multilingual setting starts at a fundamental disadvantage before training even begins. Despite this, the multilingual blueprint models achieve slightly higher STATA scores than the English-only models, as they are trained on more data, allowing them to learn the overall task more effectively despite language-specific obstacles. This is also why the multilingual model scores higher on BLEU and CHRF between predicted and reference blueprints (Table V).
Table 6. CHRF and BLEU between linearized inputs and blueprints, and between blueprints and verbalizations, for the training data (dataset rows) and model outputs (model rows).

Setup | Input→Blueprint CHRF | Input→Blueprint BLEU | Blueprint→Verbalisation CHRF | Blueprint→Verbalisation BLEU
English dataset | 0.28 | 0.02 | 0.61 | 0.24
English model | 0.24 | 0.02 | 0.39 | 0.20
Multilingual dataset | 0.25 | 0.02 | 0.43 | 0.13
Multilingual model | 0.23 | 0.01 | 0.36 | 0.16
VIII. Conclusion and Future Works
Evidence suggests that blueprints slightly enhance the attributability of Table-to-Text outputs in English, but further testing with a larger dataset and model is required. In multilingual settings, using English blueprints reduces performance and promotes language mixing in the verbalizations. Translated blueprints perform better, but still worse than no blueprints at all, because translation mistakes result in low-quality samples for the model to learn from. This issue, as measured by BLEU, is particularly severe in the multilingual models. STATA scores rise on the English results while BLEU and CHRF scores fall, indicating that these measures are unsuitable for evaluating TATA because of their low correlation with human judgments. FACTKB performs considerably better but is still not recommended for TATA evaluation. In the absence of human assessors, STATA should serve as the primary metric for model performance on TATA. STATA, trained using mT5-Large, is available online; however, it should be retrained with mT5-XXL to ensure consistency. Increasing model size yields greater advantages for low-resource languages. Multilingual Table-to-Text generation remains a challenge for neural models. Future research should concentrate on using Large Language Models (LLMs) to produce more synthetic training data in multiple languages, investigating constrained decoding to improve blueprint utilization, and making TATA a more constrained task by highlighting the table cells used in reference verbalizations, as done in the ToTTo dataset.
References
- H. Habib, G. S. Kashyap, N. Tabassum, and T. Nafis, “Stock Price Prediction Using Artificial Intelligence Based on LSTM– Deep Learning Model,” in Artificial Intelligence & Blockchain in Cyber Physical Systems: Technologies & Applications, CRC Press, 2023, pp. 93–99. [CrossRef]
- N. Marwah, V. K. Singh, G. S. Kashyap, and S. Wazir, “An analysis of the robustness of UAV agriculture field coverage using multi-agent reinforcement learning,” International Journal of Information Technology (Singapore), vol. 15, no. 4, pp. 2317–2327, May 2023. [CrossRef]
- S. Wazir, G. S. Kashyap, K. Malik, and A. E. I. Brownlee, “Predicting the Infection Level of COVID-19 Virus Using Normal Distribution-Based Approximation Model and PSO,” Springer, Cham, 2023, pp. 75–91. [CrossRef]
- S. Naz and G. S. Kashyap, “Enhancing the predictive capability of a mathematical model for pseudomonas aeruginosa through artificial neural networks,” International Journal of Information Technology 2024, pp. 1–10, Feb. 2024. [CrossRef]
- G. S. Kashyap et al., “Detection of a facemask in real-time using deep learning methods: Prevention of Covid 19,” Jan. 2024, Accessed: Feb. 04, 2024. [Online]. Available: https://arxiv.org/abs/2401.15675v1.
- M. Kanojia, P. Kamani, G. S. Kashyap, S. Naz, S. Wazir, and A. Chauhan, “Alternative Agriculture Land-Use Transformation Pathways by Partial-Equilibrium Agricultural Sector Model: A Mathematical Approach,” Aug. 2023, Accessed: Sep. 16, 2023. [Online]. Available: https://arxiv.org/abs/2308.11632v1.
- B. Min et al., “Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey,” ACM Computing Surveys, vol. 56, no. 2, Sep. 2023. [CrossRef]
- L. Duong, T. Cohn, S. Bird, and P. Cook, “Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser,” in ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference, 2015, vol. 2, pp. 845–850. [CrossRef]
- M. Artetxe and H. Schwenk, “Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 597–610, Dec. 2019. [CrossRef]
- Z. Ahmad, R. Jindal, A. Ekbal, and P. Bhattachharyya, “Borrow from rich cousin: transfer learning for emotion detection using cross lingual embedding,” Expert Systems with Applications, vol. 139, p. 112851, Jan. 2020. [CrossRef]
- W. Zhang, S. M. Aljunied, C. Gao, Y. K. Chia, and L. Bing, “M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models,” Jun. 2023, Accessed: Aug. 23, 2023. [Online]. Available: https://arxiv.org/abs/2306.05179v1.
- S. Gehrmann et al., “TATA: A Multilingual Table-to-Text Dataset for African Languages,” in Findings of the Association for Computational Linguistics: EMNLP 2023, Oct. 2023, pp. 1719–1740. [CrossRef]
- F. Meyer and J. Buys, “Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation,” Mar. 2024, Accessed: Jun. 17, 2024. [Online]. Available: https://arxiv.org/abs/2403.07567v1.
- V. Kanaparthi, “Credit Risk Prediction using Ensemble Machine Learning Algorithms,” in 6th International Conference on Inventive Computation Technologies, ICICT 2023 - Proceedings, 2023, pp. 41–47. [CrossRef]
- V. K. Kanaparthi, “Examining the Plausible Applications of Artificial Intelligence & Machine Learning in Accounts Payable Improvement,” FinTech, vol. 2, no. 3, pp. 461–474, Jul. 2023. [CrossRef]
- V. Kanaparthi, “Examining Natural Language Processing Techniques in the Education and Healthcare Fields,” International Journal of Engineering and Advanced Technology, vol. 12, no. 2, pp. 8–18, Dec. 2022. [CrossRef]
- V. K. Kanaparthi, “Navigating Uncertainty: Enhancing Markowitz Asset Allocation Strategies through Out-of-Sample Analysis,” Dec. 2023. [CrossRef]
- V. Kanaparthi, “Exploring the Impact of Blockchain, AI, and ML on Financial Accounting Efficiency and Transformation,” Jan. 2024, Accessed: Feb. 04, 2024. [Online]. Available: https://arxiv.org/abs/2401.15715v1.
- V. Kanaparthi, “AI-based Personalization and Trust in Digital Finance,” Jan. 2024, Accessed: Feb. 04, 2024. [Online]. Available: https://arxiv.org/abs/2401.15700v1.
- V. Kanaparthi, “Evaluating Financial Risk in the Transition from EONIA to ESTER: A TimeGAN Approach with Enhanced VaR Estimations,” Jan. 2024. [CrossRef]
- V. Kanaparthi, “Robustness Evaluation of LSTM-based Deep Learning Models for Bitcoin Price Prediction in the Presence of Random Disturbances,” Jan. 2024. [CrossRef]
- V. Kanaparthi, “Transformational application of Artificial Intelligence and Machine learning in Financial Technologies and Financial services: A bibliometric review,” Jan. 2024. [CrossRef]
- S. Narayan et al., “Conditional Generation with a Question-Answering Blueprint,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 974–996, Dec. 2023. [CrossRef]
- Sathishkumar Chintala, “Enhancing Study Space Utilization at UCL: Leveraging IoT Data and Machine Learning,” Journal of Electrical Systems, vol. 20, no. 6s, pp. 2282–2291, 2024. [CrossRef]
- Arpita Soni, “Advancing Household Robotics: Deep Interactive Reinforcement Learning for Efficient Training and Enhanced Performance,” Journal of Electrical Systems, vol. 20, no. 3s, pp. 1349–1355, May 2024. [CrossRef]
- M. M. T. Ayyalasomayajula, A. Tiwari, R. K. Arora, and S. Khan, “Implementing Convolutional Neural Networks for Automated Disease Diagnosis in Telemedicine,” in 3rd IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics, ICDCECE 2024, 2024. [CrossRef]
- N. Kamuni, M. Jindal, A. Soni, S. R. Mallreddy, and S. C. Macha, “Exploring Jukebox: A Novel Audio Representation for Music Genre Identification in MIR,” in 2024 3rd International Conference on Artificial Intelligence for Internet of Things, AIIoT 2024, 2024. [CrossRef]
- N. Kamuni, M. Jindal, A. Soni, S. R. Mallreddy, and S. C. Macha, “A Novel Audio Representation for Music Genre Identification in MIR,” Apr. 2024, Accessed: May 05, 2024. [Online]. Available: https://arxiv.org/abs/2404.01058v1.
- A. Kumar, S. Dodda, N. Kamuni, and R. K. Arora, “Unveiling the Impact of Macroeconomic Policies: A Double Machine Learning Approach to Analyzing Interest Rate Effects on Financial Markets,” Mar. 2024, Accessed: May 05, 2024. [Online]. Available: https://arxiv.org/abs/2404.07225v1.
- A. Kumar, M. M. T. Ayyalasomayajula, D. Panwar, and Y. Vasa, “Optimizing Photometric Light Curve Analysis: Evaluating Scipy’s Minimize Function for Eclipse Mapping of Cataclysmic Variables,” Journal of Electrical Systems, vol. 20, no. 7s, pp. 2557–2566, May 2024. [CrossRef]
- S. Dodda, A. Kumar, N. Kamuni, M. Mohan, and T. Ayyalasomayajula, “Exploring Strategies for Privacy-Preserving Machine Learning in Distributed Environments,” Authorea Preprints, Apr. 2024. [CrossRef]
- A. Kumar, S. Dodda, N. Kamuni, and V. S. M. Vuppalapati, “The Emotional Impact of Game Duration: A Framework for Understanding Player Emotions in Extended Gameplay Sessions,” Mar. 2024, Accessed: May 05, 2024. [Online]. Available: https://arxiv.org/abs/2404.00526v1.
- A. Kumar, “Implementation core Business Intelligence System using modern IT Development Practices (Agile & DevOps),” International Journal of Management, vol. 8, no. 9, pp. 444–464, 2018, Accessed: Aug. 23, 2024. [Online]. Available: https://www.indianjournals.com/ijor.aspx?target=ijor:ijmie&volume=8&issue=9&article=032.
- R. Arora, S. Gera, and M. Saxena, “Mitigating security risks on privacy of sensitive data used in cloud-based ERP applications,” in Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development, INDIACom 2021, 2021, pp. 458–463. [CrossRef]
- N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, “CTRL: A Conditional Transformer Language Model for Controllable Generation,” Sep. 2019, Accessed: Jun. 17, 2024. [Online]. Available: https://arxiv.org/abs/1909.05858v2.
- S. Wazir, G. S. Kashyap, and P. Saxena, “MLOps: A Review,” Aug. 2023, Accessed: Sep. 16, 2023. [Online]. Available: https://arxiv.org/abs/2308.10908v1.
- P. Kaur, G. S. Kashyap, A. Kumar, M. T. Nafis, S. Kumar, and V. Shokeen, “From Text to Transformation: A Comprehensive Review of Large Language Models’ Versatility,” Feb. 2024, Accessed: Mar. 21, 2024. [Online]. Available: https://arxiv.org/abs/2402.16142v1.
- G. S. Kashyap, A. Siddiqui, R. Siddiqui, K. Malik, S. Wazir, and A. E. I. Brownlee, “Prediction of Suicidal Risk Using Machine Learning Models.” Dec. 25, 2021. Accessed: Feb. 04, 2024. [Online]. Available: https://papers.ssrn.com/abstract=4709789.
- G. S. Kashyap et al., “Revolutionizing Agriculture: A Comprehensive Review of Artificial Intelligence Techniques in Farming,” Feb. 2024. [CrossRef]
- G. S. Kashyap, K. Malik, S. Wazir, and R. Khan, “Using Machine Learning to Quantify the Multimedia Risk Due to Fuzzing,” Multimedia Tools and Applications, vol. 81, no. 25, pp. 36685–36698, Oct. 2022. [CrossRef]
- F. Alharbi and G. S. Kashyap, “Empowering Network Security through Advanced Analysis of Malware Samples: Leveraging System Metrics and Network Log Data for Informed Decision-Making,” International Journal of Networked and Distributed Computing, pp. 1–15, Jun. 2024. [CrossRef]