The head count in multi-head attention directly impacts the model’s capability to attend to various parts of the input sequence, with each head potentially capturing different linguistic features and relationships required to understand the underlying semantics [1]. An optimal head count is pivotal for the model to generalise well on unseen data, as too few heads might limit the complexity of learned representations, whilst too many could lead to redundant feature extraction [3].
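As a minimal, illustrative sketch (not the Atinuke configuration itself), the snippet below shows how the head count partitions the model dimension into per-head subspaces; the values of d_model and num_heads are assumed for demonstration only.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8        # assumed values for illustration
assert d_model % num_heads == 0    # each head attends within a d_model // num_heads subspace

attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 16, d_model)    # (batch, sequence length, model dimension)
out, weights = attention(x, x, x)  # self-attention: query = key = value
print(out.shape, weights.shape)    # torch.Size([2, 16, 512]) torch.Size([2, 16, 16])
```

Raising num_heads narrows each head’s subspace whilst keeping the overall projection parameter count fixed, which is one reason additional heads can become redundant rather than informative.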
The hidden dimension of the feed-forward neural network layers within each transformer block dictates the model’s ability to perform complex mappings from input to output space, serving as an abstraction layer which encapsulates more intricate relationships in the data [1].
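A sketch of a position-wise feed-forward sub-layer follows; the hidden dimension d_ff is the hyperparameter in question, and the setting d_ff = 4 × d_model is a common convention assumed here rather than the Atinuke value.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Per-token expansion into a wider hidden space and projection back."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the hidden dimension
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back to the model dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ffn = PositionwiseFeedForward(d_model=512, d_ff=2048)   # d_ff = 4 * d_model, assumed
print(ffn(torch.randn(2, 16, 512)).shape)               # torch.Size([2, 16, 512])
```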
The layer count or depth of the network is equally paramount, with deeper networks generally able to perform higher-level reasoning, though at the risk of increased computational demand and potential difficulties in training, such as vanishing or exploding gradients [4].
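The sketch below, again with assumed values, shows depth expressed as a single hyperparameter when stacking standard transformer encoder blocks; the residual connections and layer normalisation inside each block are what help keep gradients stable as the stack grows.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)    # depth as a single hyperparameter

x = torch.randn(2, 16, 512)
print(encoder(x).shape)                                 # torch.Size([2, 16, 512])
print(sum(p.numel() for p in encoder.parameters()))     # parameter count grows linearly with depth
```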
Dropout, applied within transformer blocks, is a regularisation mechanism; randomly omitting a subset of features during training forces the network to learn more robust features invariant to the input noise [5]. Carefully tuning the dropout rate is fundamental, as too high a rate can impede learning, whilst too low a rate fails to regularise effectively [6].
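The behaviour is illustrated by the minimal sketch below: dropout zeroes activations only during training and becomes the identity at inference, with the rate p (set here to an assumed 0.1) being the quantity to tune.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.1)   # p is the tunable rate; 0.1 is an assumed starting point
x = torch.ones(1, 8)

dropout.train()
print(dropout(x))             # roughly p of the values zeroed, the rest scaled by 1 / (1 - p)

dropout.eval()
print(dropout(x))             # identity at inference time
```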
Model dimensionality not only influences the model’s capacity but also determines scalability and computational efficiency, with higher dimensions typically requiring more training time and memory resources [2].
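A back-of-the-envelope sketch of this scaling is given below; it counts only the attention and feed-forward weight matrices of a standard encoder block, ignoring embeddings, biases and layer norms, so the figures are indicative rather than a statement about Atinuke’s exact footprint.

```python
def approx_transformer_params(d_model: int, d_ff: int, n_layers: int) -> int:
    """Rough weight count for a stack of standard encoder blocks."""
    attention = 4 * d_model * d_model    # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff    # expansion and contraction matrices
    return n_layers * (attention + feed_forward)

for d_model in (256, 512, 1024):
    print(d_model, approx_transformer_params(d_model, d_ff=4 * d_model, n_layers=6))
    # with d_ff tied to d_model, the count grows quadratically in d_model,
    # and training time and memory follow a similar trend
```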
This intricate balancing act amongst the architectural components of the Atinuke model embodies the current challenges faced in the design of neural network architectures, where the quest for improved performance must also contend with the constraints of computational resources and training efficiency [7]. Furthermore, the model was designed with transferability across different tasks and languages in mind, ensuring its learned representations are not overly task-specific [8]. Ultimately, the innovation in architectures like Atinuke lies in carefully engineering these hyperparameters to achieve a balance that caters to a diverse range of NLP tasks [9].