2.1. Database Generation
As mentioned in the Section above, previously published studies on the effect of alloy composition and microstructure in solder alloys have not systematically developed approaches with a data-oriented mindset. Therefore, in order to enable the application of alloy design concepts based on predictive models, the first step of this study was to address this issue. The data for the present study come from a compilation of published research articles focused on specific alloys/systems. These works were previously developed mostly by two international research teams devoted to the solidification investigation of several alternative Sn-based solders, i.e., the M2PS (Microstructure and Properties in Solidification Processes)/UFSCar, Brazil, and the GPS (Solidification Research Team)/Unicamp, Brazil, with some minor contributions from other groups.
A summary of the compiled database, with an overview of the number of registers per expected result (UTS, YTS, and EF) and per attribute (alloy composition and microstructural features), together with the reference to the research paper from which each entry was extracted, is given in Table 1. In total, data from 35 binary and ternary alloys of different systems were gathered. An exploratory analysis was carried out to evaluate the consistency of the database, based on the Pandas DataFrame resources [10] and Sweetviz [11].
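For reference, a minimal sketch of such an exploratory step is given below. The file name and the way the database is loaded are hypothetical assumptions for illustration only; the actual layout of the compiled data is the one summarized in Table 1.

import pandas as pd
import sweetviz as sv

# Load the compiled database (hypothetical file name; columns would hold the
# alloy compositions, microstructural features, UTS, YTS, and EF registers).
df = pd.read_csv("solder_alloys_database.csv")

# Quick consistency checks with the Pandas DataFrame resources.
print(df.info())          # data types and missing values per attribute
print(df.describe())      # basic statistics per attribute and expected result

# Automated exploratory report with Sweetviz.
report = sv.analyze(df)
report.show_html("database_overview.html")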
Most of the studies [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] employed directional solidification systems to allow microstructural and tensile data to be assessed. After solidifying the casting, transverse (perpendicular to the solidification direction) and longitudinal samples (at various sections from the cooled bottom of the castings) were removed from the alloy castings for the metallographic procedure using optical microscopy.
To measure λ1, the cross sections were considered together with the neighborhood criterion, which takes the spacing as the average distance between the geometric centers of the primary dendritic trunks in question, as defined by the triangle method. The secondary (λ2) and tertiary (λ3) dendritic arm spacings were measured by the linear intercept method on longitudinal sections (parallel to the heat extraction direction) and on transverse sections, respectively. The same linear intercept method was adopted in some of the studies [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] for determining λfine, λcoarse, λCu6Sn5, λAg3Sn, and λZn. Moreover, the tensile properties of the alloys were determined through tensile tests at different positions in the castings, with strain rates on the order of 10^-3 s^-1.
2.2. Regression Models
This Section presents a brief description of the main concepts behind all the regression models used in this study, focusing on how the key parameters can be tuned to enhance model quality. Most of the equations and descriptions are based on James et al. [30], whose book is a thorough review of statistical learning concepts. Other concepts, such as ElasticNet and the Multilayer Perceptron (MLP), are analyzed and described in more detail by Morettin and Singer [31], which can therefore be considered the basis for these specific topics. Regarding the organization of the content, the present Section evolves from simpler concepts, such as Linear Regression, up to more modern ones, namely Neural Networks.
Regression models are widely used in statistical and supervised learning. They aim to predict a response variable (Y) based on predictor variables (X) [30], as represented in Equation 1. Multiple regression models are basically an extension of simple linear regression, which is a basic method for supervised learning. In multiple regression there are several predictor variables, and the model aims to find the relationship between these variables and the response variable, as shown in Equation 2. The model is represented by an equation that includes an intercept and a slope for each predictor variable, and it is estimated using the least squares method, which minimizes the Residual Sum of Squares, RSS (Equation 3). In this study, the model applied was based on the Scikit Learn Linear Regression [32], where a further description of the algorithm can be found.
where X represents the predictor variable and Y is the response variable that one is trying to predict. On top of these two, to compose a linear model, β0 and β1 are unknown constants that represent the intercept with the y-axis and the slope with regard to the x-axis, respectively [30]. Finally, ε can be defined as a mean-zero random error term, added to represent the deviations that cannot be explained by the linear model. Regarding Equation 2, each pair (Xp, βp) represents a linear term with its own slope, with p ranging from 1 to the number of considered predictor variables. Finally, in Equation 3, RSS means the Residual Sum of Squares, and one might also notice ŷi, which is the value predicted by the MLR formula with the vector of estimated coefficients, i.e., the fitted intercept and slopes.
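As an illustration of how such a multiple linear regression could be fitted in practice with the Scikit Learn estimator cited above, a minimal sketch is shown below; the numerical values are purely illustrative and do not correspond to the actual database.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative predictor matrix X (rows = registers, columns = attributes)
# and response vector y (e.g., a tensile property); values are placeholders.
X = np.array([[2.0, 15.0], [2.5, 22.0], [3.0, 35.0], [3.5, 48.0]])
y = np.array([55.0, 52.0, 48.0, 44.0])

model = LinearRegression()   # ordinary least squares, i.e., minimizes the RSS of Equation 3
model.fit(X, y)

print("Intercept (beta_0):", model.intercept_)
print("Slopes (beta_1..beta_p):", model.coef_)
print("Prediction for a new register:", model.predict([[2.8, 30.0]]))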
James et al. [30] pointed out that linear regression models can be surprisingly competitive even when compared to more sophisticated non-linear models. For this reason, these authors also discussed methods to improve the linear approach in terms of prediction accuracy and interpretability. Prediction accuracy is an important factor in avoiding overfitting (Figure 1), in which case the predictive model may later yield poor predictions. This is a major concern especially when the number of predictor variables is similar in magnitude to the number of samples n. Concerning interpretability, the aim is to reduce the complexity of the model by removing variables that are irrelevant for the prediction [30].
To enhance the linear regression model, several regularization techniques have been introduced. Ridge regression adds a shrinkage penalty term to the least squares equation, as shown in Equation 4. This penalty term helps reduce the coefficients of irrelevant variables, leading to a more interpretable model. Lasso regression, on the other hand, imposes a penalty that forces some coefficients to be exactly zero, effectively performing variable selection (Equation 5). Finally, Elastic-net regression combines the penalties of both Ridge and Lasso regression, providing a trade-off between the two approaches (Equation 6). A further description of these concepts can be found in James et al. [30], especially for Ridge and Lasso, whereas ElasticNet is better developed in the work by Morettin and Singer [31]. In any case, all three models used are available in the Scikit Learn package [34,35,36].
where, in Equation 4, the added term is Ridge's shrinkage penalty, which gets smaller as the estimated coefficients βj get closer to zero. The same reasoning applies in Equation 5, where the penalty is instead based on the modulus (absolute value) of the coefficients. Finally, in Equation 6, the formula shows a balance between both penalties, which is given by the value of α.
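As a concrete illustration, the sketch below fits the three regularized variants with the Scikit Learn estimators mentioned above; the alpha and l1_ratio values, as well as the data, are arbitrary examples rather than the parameters tuned in this study.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Illustrative data only (rows = registers, columns = predictor variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

ridge = Ridge(alpha=1.0).fit(X, y)                    # squared-coefficient penalty (Equation 4)
lasso = Lasso(alpha=0.1).fit(X, y)                    # absolute-value penalty, may zero out coefficients (Equation 5)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of both penalties (Equation 6)

for name, model in [("Ridge", ridge), ("Lasso", lasso), ("ElasticNet", enet)]:
    print(name, np.round(model.coef_, 3))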
In addition to linear regression, other regression models have been developed to capture different characteristics of the data. Two such families are considered in this work: the spatial and the tree-based ones. Concerning the first, the most prominent model is the K-nearest neighbors (KNN). James et al. [30] demonstrated this model, especially regarding its use as a classifier. The KNN model is a non-parametric approach that predicts a value based on the average of the K closest neighbors. The number of neighbors and the weighting of the distances can be adjusted to improve performance. Equation 7 depicts the formula, and a description of the applied algorithm can be found in the Scikit Learn package documentation [37].
where yi is the Y value of the i-th neighbor among the K neighbors chosen by the user.
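A minimal sketch of such a K-nearest neighbors regressor, using the Scikit Learn estimator cited above, is shown below; the number of neighbors, the distance weighting, and the data are illustrative choices.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data only.
X = np.array([[10.0], [15.0], [20.0], [25.0], [30.0]])  # e.g., a single microstructural spacing
y = np.array([60.0, 57.0, 55.0, 50.0, 46.0])            # e.g., a tensile property

# Predict as the (distance-weighted) average of the K closest registers (Equation 7).
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn.fit(X, y)
print(knn.predict([[18.0]]))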
Similar to the KNN model, other regression models were later created to make more sophisticated use of the predictor space. The tree-based models, already mentioned above, belong to this category: the model basically stratifies or segments the X values into simple regions in which the calculations are performed [30]. Once this is done, to issue a prediction, the model calculates the average of the responses within the region into which the attributes of the respective register fall. Decision trees and random forests are tree-based models that segment the predictor space into regions and predict the response variable based on the average of the samples within each region. Random forests combine multiple decision trees to reduce variance and improve prediction accuracy.
Figure 2 depicts the Random Forest regression concept, where the outputs of several decision trees are collected to later issue the final prediction. In the present study, both regression model algorithms were based on the Scikit Learn package functions [38,39].
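The sketch below shows how the two tree-based estimators available in Scikit Learn could be fitted; the hyperparameters (tree depth, number of trees) and the data are illustrative assumptions, not the settings used in this study.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Illustrative data only.
rng = np.random.default_rng(1)
X = rng.uniform(0, 50, size=(60, 3))
y = 100.0 - 0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=2.0, size=60)

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)         # single segmentation of the predictor space
forest = RandomForestRegressor(n_estimators=200).fit(X, y)  # averages many decision trees (Figure 2)

sample = [[25.0, 10.0, 5.0]]
print("Decision tree prediction:", tree.predict(sample))
print("Random forest prediction:", forest.predict(sample))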
As mentioned by James et al. [30], the support vector regression (SVR) concept was introduced back in the 1990s and its popularity has been growing ever since. When this approach is analyzed as a classifier, it can be understood as a generalization of the intuitive concept of the maximal margin classifier. As a classifier, the most commonly used terminology is support vector machine (SVM). In summary, the SVM method tries to find the best hyperplane to separate groups, based on the maximum error one is willing to accept for its margins and on certain degrees of freedom. This method may sometimes require remapping the data into higher-dimensional spaces. Morettin and Singer [31] described the case of linear functions, in which the goal is to determine α and β such that the responses deviate from the linear function α + βᵀx by no more than a tolerance ε, which is considered the total acceptable error, while slack variables ξ represent a certain degree of freedom that can be added to the model. Adding the constant C as a positive figure expresses a compromise between flattening the function and tolerating deviations larger than ε. Considering a linear model, the prediction for a register with a given vector of predictor variables is written as a weighted sum over the training vectors; in this formula, the slope vector β is equal to the sum of the training vectors weighted by coefficients bounded by C. Finally, the other components of the formulation are the Lagrange multipliers, which play the role of these weights, and the kernel K, which replaces the inner products between vectors [31]. Although complex, the model can be easily applied using the Scikit Learn algorithm [41], and the formulation is summarized in Equations 8 and 9.
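A minimal Scikit Learn SVR sketch follows; the kernel choice and the C and epsilon values are illustrative assumptions rather than the settings tuned in this study.

import numpy as np
from sklearn.svm import SVR

# Illustrative data only.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=50)

# epsilon sets the acceptable error tube; C trades flatness against violations of that tube.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[3.0, 1.5]]))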
One might argue that the final model considered in this study is the least intuitive of all those discussed so far. The method, the Multi-Layer Perceptron (MLP), was named after the structure of the human brain [31]. In brief, the brain can be considered a three-dimensional pack of neurons. Each neuron receives input signals from neighboring neurons through its dendrites, processes them in its own manner, and then issues output signals to other neurons through its axon. In 1958, Rosenblatt introduced the concept of the perceptron as a new method for supervised learning [31]. Basically, the perceptron is a parallel of the neuron, receiving inputs that are then processed based on an activation function. With respect to this last point, several functions can be considered; although very interesting, a discussion of the impact of each type of activation function falls outside the scope of this work. Once grouped into a layer with other perceptrons and stacked with further layers, the model is trained on batches of the predictor variables in an effort to define the weights of each perceptron's activation based on the inputs received from the preceding layer. Therefore, one might point out that the main parameters to be investigated within this method are the number of layers, the number of perceptrons per layer, and, finally, the number of training iterations [42].
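The sketch below shows how such a Multi-Layer Perceptron regressor could be configured with Scikit Learn; the layer sizes, activation function, and iteration limit are illustrative, not the architecture used in this work.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Illustrative data only.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = X[:, 0] ** 2 + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=80)

# Standardizing the predictors helps the training of neural networks.
X_scaled = StandardScaler().fit_transform(X)

# Two hidden layers of 16 perceptrons each; max_iter bounds the number of training iterations.
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), activation="relu", max_iter=2000, random_state=0)
mlp.fit(X_scaled, y)
print(mlp.predict(X_scaled[:3]))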
Each regression model applied here has its own strengths and weaknesses, and the choice of model depends on the characteristics of the data and the specific goals of the analysis. What is key to point out is that understanding these models and their parameters may enhance the quality of the predictive models and lead to more informed interpretations of the results.
2.3. Database Split and Model Accuracy Measurement
There might be no dissenters to the view that one always needs to quantify the quality of a model's predictions. On the other hand, as mentioned by James et al. [30], some researchers have disregarded a more modern concept with regard to this crucial issue. For instance, previous projects have largely overlooked the general concept of the Mean Square Error (MSE), as presented in Equation 10 below. In this formula, the MSE becomes smaller as the predicted values get closer to the n true responses in the training data set.
Most statistical learning scholars seem to agree [30] that this can lead to wrong assumptions about model quality. For instance, one might take the smallest MSE as indicating the most precise model, without realizing that this analysis refers only to the training data set. James et al. [30] are explicit in their work that the best way to determine the accuracy of a given model is to analyze its predicted values against previously unseen test data. In this context, one should instead calculate the average squared prediction error with respect to the true responses of the test set [30].
where yi is the true response value of register i in the test data set, and ŷi is the value predicted from the register's attributes.
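In code, this test error is simply the average of the squared differences between the held-out true responses and the corresponding predictions, as sketched below with hypothetical arrays.

import numpy as np

# Hypothetical true responses of the test registers and the corresponding predictions.
y_test = np.array([48.0, 52.0, 45.0, 60.0])
y_pred = np.array([50.0, 51.0, 47.0, 57.0])

# Average squared prediction error over the unseen test set.
test_mse = np.mean((y_test - y_pred) ** 2)
print(test_mse)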
Figure 3 illustrates the occurrence of overfitting, where increasing the flexibility of the applied models continually reduces the training MSE, whereas the test MSE, i.e., the prediction accuracy on unseen data, eventually deteriorates. In the current study, the method to create "unseen" data for the trained model was based on sklearn.model_selection.train_test_split, using an 80% training / 20% test ratio [43]. The data were also prepared in terms of standardization [44], which shall be explained after the analysis shown in the next Section. Finally, each model prediction was evaluated using the Scikit Learn score method [45], which uses the test MSE formula, scaling it against the maximum error possible, i.e., normalizing the figure between 0 and 1. All details considered, an overview of all the methods and processes involved is summarized in the workflow in Figure 4.
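Putting these steps together, a minimal sketch of the split/standardize/fit/score workflow described above could look as follows; the chosen estimator and the data are placeholders, assumed only for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Illustrative data only (rows = registers, columns = composition/microstructure attributes).
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.2, size=100)

# 80% training / 20% test split, as adopted in this study [43].
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardization fitted on the training data only [44].
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train_s, y_train)

# Model quality on the unseen test data, as reported by the Scikit Learn score method [45].
print("Test score:", model.score(X_test_s, y_test))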