4.1. Motivation and approach overview
Counterfeiting occurs in several channels of online sales, including marketplaces and social media. One widespread illegal activity consists of selling unauthorized goods on specialized fake websites, luring customers with cheap, inferior versions of brand-name products. To increase visibility, such websites are often optimized to appear among the results returned by web search engines in response to a query containing the brand name. Also, for reasons of scale and economy, multiple fake websites are often managed by a single criminal entity. The research on automating web anti-counterfeiting efforts has thus a twofold aim: (i) detecting fake web shops, especially among web search results, and (ii) identifying the affiliate marketing programs behind fake web shops.
The latter problem is conceptually different from the former because it is addressed by clustering techniques, as opposed to classification techniques, and relies on different types of learning features. Whereas classification features are common characteristics of illicit web shops that are generally not shared by legitimate web shops (e.g., large discounts, a lack of contact information, and the use of untraceable payment methods), clustering features need to model the process by which multiple counterfeit websites are created and managed by the same entity. The assumption is that web shops belonging to the same network share similarities in terms of their structure, content, and network, since making truly unique versions of each site does not scale well. However, web shops under the same criminal entity may render very differently, especially if they sell entirely different products, while there may be seemingly similar web shops that are actually unrelated, posing a challenge to clustering algorithms.
While the ability to automatically distinguish between fake and genuine web shops has been well studied (e.g., [20,21]) and continues to be actively investigated ([22,23,24]), the subsequent task of recognizing affiliate programs among the detected fake web shops has received less attention to date, although it enables enforcement at scale and brings longer-lasting results. One notable exception is [25], where the authors make use of a conventional clustering algorithm together with a few clustering features, mainly extracted from the HTML and the URLs of the websites; however, they reported limited success. Single or combined clustering algorithms have also recently been applied to find networks of malicious websites in several other domains, leveraging similar or novel clustering features (e.g., [26,27,28,29,30,31]). Our work expands on previous research by integrating constraints and multiple clustering algorithms in the process of grouping fraudulent websites into connected networks, and by showing how such constraints can be partly acquired automatically.
One practical difficulty of clustering fraudulent websites is that there are no available datasets with associated feature matrices, and therefore base partitions must be constructed from raw data. In addition, the features of interest are heterogeneous and must be treated individually. Assuming the use of distance-based clustering algorithms (although other choices would be possible, such as density-based or spectral), this task can be addressed through a three-step pipeline consisting of feature selection, construction of a similarity matrix, and application of several clustering algorithms (or multiple variants of the same clustering algorithm, or the same variant with multiple features). For the last step, we will use both multiple algorithms and multiple features, thus extending the experimental setting commonly adopted in earlier studies. In this way, we will be able to consider a wider range of variables when measuring the performance of double-constrained consensus clustering for web anti-counterfeiting efforts. The specialized double-constrained consensus clustering architecture is shown in Figure 4. The partitions are generated from each of the p clustering features by q base clustering algorithms, and they are next merged through the framework. Compared to Figure 1, we have highlighted the generation of the base partitions and the automatic acquisition of constraints from the input dataset and external data.
4.2. Experiment design and preparation
In this section, we will describe, in turn, the goals of our experiment, the construction of a ground truth dataset, the choice of clustering features and corresponding distance matrices, the selection of base clustering algorithms, and the acquisition of constraints.
4.2.1. Goals
The experiments we conducted on grouping illicit web shops had two main objectives. The first was to gain a deeper understanding of the combined effect of features and algorithms on overall clustering effectiveness, because little work has been done in this domain on evaluating the relative performance of clustering algorithms and features. Most studies use a specific clustering algorithm with a specific combination of features. By contrast, we analyzed and compared the behavior of individual features across a range of algorithms and, dually, the behavior of individual algorithms across a range of features. The second goal was to evaluate the effectiveness of plain consensus clustering and constrained consensus clustering for web anti-counterfeiting, which has not been explored so far. This requires computing the base clusterings from the raw set of counterfeit web shops, instead of assuming that the clustering features are known (as in the UCI experiments).
4.2.2. Construction of the ground truth dataset
To the best of our knowledge, there are no available test collections of this kind. The first step was to generate a suitable set of counterfeit websites that holds potential for revealing affiliate programs. We relied on RI.SI.CO. [23], a machine learning system that can detect fake web shops in search engine results generated in response to brand search queries. The procedure was as follows. We first selected 20 famous luxury brands that are known to be targeted by counterfeiters [23]. The corresponding (`complicit’) queries, formed by adding `replica’ and `cheap’ to the brand name, were given as input to RI.SI.CO., which submitted them to three web search engines, collected 6,043 search results (about 100 for each query), and then identified 1,076 suspicious e-commerce webpages in the set of results. These 1,076 webpages were hosted on 302 distinct websites. We next selected one webpage per website as a representative, removing the redundant items. We also deleted the webpages that were no longer accessible (e.g., due to trademark infringement), leaving 217 items. The automatic classification performed by RI.SI.CO. is mostly accurate, but there may be false positives in the set of webpages labeled as fake. To increase the reliability of the results, we had a few web anti-counterfeiting experts manually remove the misclassified webpages from the remaining set, eventually ending up with 203 truly illegitimate webpages.
The next step was to group the 203 webpages (websites) into homogeneous clusters, which was performed by the same experts. Their task was facilitated by extracting a set of unique features associated with each counterfeit network (see Section 4.2.5), which were used to form initial seeds prior to manual inspection. Given the limited number of items, this effort required, on the whole, a non-negligible but reasonable amount of time. Other strategies to complement manual inspection in the identification of affiliate programs from a larger set of fake web shops have been proposed, such as the heuristic pattern matching of HTML content in [32] and the formulation of the problem as a classification task (with labeled data) in [25]. The clusters with one or very few elements were then removed by our experts, resulting in a set of 121 websites partitioned into six clusters.
The fact that a few clusters account for the majority of items confirms the presence of affiliate fake web shops in brand search results, at least for heavily targeted brands and complicit queries. We observed that affiliate web shops could be either mono-brand or multi-brand, covering a specific type of product (e.g., shoes) or even different types of products (e.g., shoes and jackets). As an illustration, Figure 5 shows the homepages of two fake web shops with top-selling shoe brands (i.e., Louboutin and Valentino) that were grouped together.
Given the limited lifespan of counterfeit websites, it is essential to take a snapshot of all the relevant information associated with them while they are still functional. In the final step, a set of clustering features (see Section 4.2.3) was extracted for each domain. The set of 121 domain names with associated clustering features and group information forms a ground truth dataset termed CAP (Counterfeit Affiliate Programs). We believe that CAP, although small in size, fills a gap in the research on online anti-counterfeiting.
4.2.3. Clustering features and distance matrices
Various clustering features have been proposed for this or related tasks, usually associated with the structure ([25,27,28,33]), content ([26,29,30,31]), and visual appearance ([26,34]) of malicious websites, or with information about their registration or network infrastructure ([25,26,31]). We followed two main criteria to select the features for the experiments: that they were representative of the main feature categories, and that they were present and easily acquirable for almost all web shops in CAP. For instance, we did not use any feature related to website registration because the WHOIS service available to us covered only a small fraction of the sample. Also, unlike [25], we did not include network features such as name servers or autonomous system numbers, because we found that they were very weak clustering signals, while the IP address can be regarded as a very strong signal and used to acquire constraints, as discussed in Section 4.2.5. The selected raw features, along with the actual clustering features extracted from them and the similarity function associated with each, are described below.
As a structural feature, we relied on the DOM tree associated with each of the webpages. Following [33], we encoded the structure of each DOM tree in CAP as a bit string through SimHash fingerprinting, and then computed the pairwise similarities with the Hamming distance. As a visual feature, we chose the website header of the web shops. We created five visual clustering features from the homepage screenshot, corresponding to website headers with a variable number of rows (from 30 to 150, with a step of 30). The similarity matrix for a selected header was obtained by first extracting, for each website in CAP, an image tensor containing the HSV values of the pixels in the region of interest associated with that header, and by subsequently computing the Chi-Square distance between the HSV histograms of any pair of elements in CAP. We finally used two novel clustering features related to specific content items of web shops, namely the privacy policy and the shipping policy. The rationale is that policies can be reused between affiliated fake web shops to reduce the effort involved in website authoring, often with only minor adjustments (e.g., the name of the site owner). We did not use other possible, and probably relevant, policies (e.g., `payment methods’ and `returns and refunds’) because these pieces of information were not provided by the majority of the elements in CAP. For both features, we extracted the textual content of the policies from each website in CAP, and then measured the pairwise similarity based on the number of shared sentences.
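To make the structural feature more concrete, the following is a minimal Python sketch, under our own assumptions (tag tokenization, MD5 as the per-token hash, 64-bit fingerprints), of a SimHash-style fingerprint of a DOM tag sequence and the Hamming distance between two fingerprints; it is not the exact procedure of [33].

```python
# Illustrative sketch (not the authors' code) of the structural feature:
# a SimHash-style fingerprint of a DOM tag sequence, compared via Hamming distance.
import hashlib

def simhash(tags, bits=64):
    """Fold the hashes of DOM tag tokens into a single `bits`-bit fingerprint."""
    weights = [0] * bits
    for tag in tags:
        h = int(hashlib.md5(tag.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(fp1, fp2):
    """Number of differing bits between two fingerprints."""
    return bin(fp1 ^ fp2).count("1")

# Example: two DOM tag sequences, e.g. obtained by a pre-order traversal
# of the parsed HTML (the parsing step is omitted here).
a = simhash(["html", "head", "body", "div", "div", "img", "footer"])
b = simhash(["html", "head", "body", "div", "img", "footer"])
print(hamming_distance(a, b))   # small distance -> structurally similar pages
```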
4.2.4. Base clustering algorithms
Moving on to the selection of base clusterings, we would first like to note that publicly available constrained clustering algorithms that accept a user-defined similarity matrix are very rare. As this was an essential prerequisite in the domain at hand, we decided to use base clustering algorithms that cannot take advantage of constraints (unlike the earlier experiments with the UCI datasets). On the other hand, the potentially unfair use of constraints in the consensus clustering framework (when comparing performance against clustering without constraints) is mitigated by the fact that, as we shall see, in web anti-counterfeiting constraints can be partly acquired automatically.
We relied on the hclust package in the R statistical programming language. It provides a set of eight distinct agglomerative hierarchical clustering algorithms, with the additional facility that users can define their own similarity matrix (which is, in fact, a `dissimilarity structure’ and requires a suitable transformation of the format of the similarity matrix). The algorithms differ in the procedure used to select which clusters are to be merged at each step, and will in general produce very different results. The algorithms, described in more detail in the package documentation, are: Average, Centroid, Complete, Mcquitty, Median, Single, Ward.D, and Ward.D2. The output of these algorithms in R is a dendrogram. To find the corresponding partition, we then cut each dendrogram into six disjoint subtrees (i.e., the number of clusters in CAP).
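The experiments used R's hclust; purely as an illustration of the same workflow (similarity matrix, conversion to a dissimilarity structure, dendrogram, cut into six clusters), the following Python sketch relies on scipy's hierarchical clustering, whose linkage methods largely mirror those of hclust. The similarity-to-distance mapping (1 - S) is our assumption.

```python
# Python analogue (illustrative only; the experiments used R's hclust) of turning a
# pairwise similarity matrix into a dissimilarity structure, building a dendrogram,
# and cutting it into a fixed number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_similarity(sim, method="average", k=6):
    """sim: symmetric similarity matrix with values in [0, 1], ones on the diagonal."""
    dist = 1.0 - np.asarray(sim, dtype=float)        # assumed similarity-to-distance mapping
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)        # analogue of R's as.dist()
    tree = linkage(condensed, method=method)          # analogue of hclust(); 'weighted' ~ Mcquitty
    return fcluster(tree, t=k, criterion="maxclust")  # analogue of cutree(tree, k)

# Example: a toy 4x4 similarity matrix cut into 2 clusters.
sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
print(cluster_from_similarity(sim, method="average", k=2))   # e.g. [1 1 2 2]
```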
4.2.5. Automatic acquisition of constraints
We now turn to the acquisition of constraints. While the clustering features introduced above are only indicative of membership (i.e., two domains sharing the same feature may or may not belong to an affiliate program), in the web anti-counterfeiting field it is sometimes possible to state that two different web domains are in fact linked to the same entity, by leveraging certain registration and network information as well as specific content on their websites. The latter type of information can be seen as high-confidence but infrequent features, as opposed to the frequent yet lower-confidence features used as proper clustering features in Section 4.2.3. In particular, a must-link constraint between two domain names can be created with some certainty when one of the following properties is satisfied.
- Redirection. Some fake websites redirect the visitor from the initial web domain to one or more additional sites, ultimately resolving to the final web page [32]. This is done either by URL redirection, whereby a fake web page is made available in parallel under more than one URL address, or by search redirection, in which fake websites hack high-ranking websites to redirect to their store based on the user’s search query [35]. If two fake websites share the same final domain after redirection, they almost certainly belong to one affiliate marketing network.
- Same IP address. Multiple websites can be hosted on one web server. If two fake websites have the same resolved IP address for their domain names, then it is very likely that they were created by the same entity [36].
- Same WHOIS registrant data. Although domain name registrant information (name, email, address) as provided by databases like WHOIS is largely incomplete due to several issues (including privacy regulations), shared registrant data is a clear indication that two fake websites should be linked together, because it means that they have been registered by the same legal person (juridical or natural).
- Same website contact information. Fake websites try to resemble genuine websites to increase visitor trust. This includes providing reassuring contact information such as an email address or telephone number, but also a physical address and links to the web pages of physical stores. If two fake websites share these data, then they probably belong to one affiliate marketing program.
- Same Google Analytics ID. Third-party analytics services are used by many e-commerce websites to better understand their customers. If two fake websites contain the same analytics ID within their source code, it means that they are reporting to the same analytics account and presumably are part of one affiliate program. Finding matching Google Analytics IDs has been used to group illicit websites into connected campaigns [37].
We automatically acquired the features necessary to assess the above properties (when available) for each of the 121 websites in CAP and then, through pairwise comparison, generated 53 must-links, 32 of which were non-redundant. This is a small fraction of the total number of must-links, but it may be enough to drive the process of constrained consensus clustering and significantly improve performance, as is shown in the next section.
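As an illustration of the pairwise comparison step, the sketch below derives must-links from a few of the strong signals listed above (final domain after redirection, resolved IP address, Google Analytics ID); the record layout and field names are hypothetical, not those of our actual pipeline.

```python
# Illustrative sketch (hypothetical field names) of deriving must-link constraints
# from high-confidence signals by pairwise comparison of website records.
from itertools import combinations

sites = [
    {"domain": "shop-a.example", "final_domain": "pay.example", "ip": "203.0.113.7", "ga_id": "UA-111"},
    {"domain": "shop-b.example", "final_domain": "pay.example", "ip": "203.0.113.7", "ga_id": None},
    {"domain": "shop-c.example", "final_domain": None,          "ip": "198.51.100.4", "ga_id": "UA-222"},
]

def must_links(sites, keys=("final_domain", "ip", "ga_id")):
    """Return the set of domain pairs sharing at least one strong signal."""
    links = set()
    for a, b in combinations(sites, 2):
        if any(a[k] is not None and a[k] == b[k] for k in keys):
            links.add((a["domain"], b["domain"]))
    return links

print(must_links(sites))   # {('shop-a.example', 'shop-b.example')}
```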
4.3. Results
We tested 64 combinations of clustering methods and features (8 × 8), and measured the performance of each by the F measure. For ease of interpretation, the results are shown in two distinct charts rather than a table.
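As an illustration of how a partition can be scored against the CAP ground truth, the following sketch computes a pair-counting F measure (precision and recall over pairs of items placed in the same cluster); this is one common definition and is shown only for clarity, as it may differ from the exact variant used in our evaluation.

```python
# Illustrative pair-counting F measure: precision/recall over item pairs that are
# co-clustered in the predicted vs. the ground-truth partition (one common definition).
from itertools import combinations

def pairwise_f(truth, pred):
    same_true = {(i, j) for i, j in combinations(range(len(truth)), 2) if truth[i] == truth[j]}
    same_pred = {(i, j) for i, j in combinations(range(len(pred)), 2) if pred[i] == pred[j]}
    tp = len(same_true & same_pred)
    precision = tp / len(same_pred) if same_pred else 0.0
    recall = tp / len(same_true) if same_true else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(pairwise_f([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.5
```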
Figure 6 reports the performance of each of the eight clustering methods across the eight clustering features. We also included the performance of a random partition (independent of the feature) for the sake of comparison. The figure clearly suggests that the result of individual methods may change to a great extent as features vary, and that there is no best method across all features, consistent with the observation that no clustering method is inherently better or worse than another. More specifically, Complete had the largest performance range (from 0.36 to 0.76), while Centroid, Average, Single, Median, and Mcquitty were relatively more stable as features changed (except for the DOM feature), with Ward.D and Ward.D2 exhibiting intermediate behavior. Looking at the relative performance of clustering methods, Centroid, Average, Single, Median, and Mcquitty obtained more comparable and higher results than Complete, Ward.D2, and Ward.D (in that order). Finally, every clustering method was markedly better than the random partition for every clustering feature.
Figure 7 reports the performance of each of the eight clustering features across the eight clustering methods, again including the random partition. Analogous to the behavior of clustering methods, the result of individual features changes as clustering methods vary, and there is no best feature across all methods. Feature Header120 had the largest performance range, from 0.30 to 0.69, with the other visual features exhibiting similar variations. DOM was also very unstable, while the two features pertaining to textual content were relatively more stable across clustering methods. In terms of feature comparison, Privacy and Shipping usually achieved good results. By contrast, visual features were in general less effective (although Header90 performed well for some methods), while DOM was comparatively inferior for some methods but also achieved the best overall results for three methods.
Figure 6 and Figure 7 show that the performance of all possible combinations of methods and features varies across a very wide interval, from a minimum of 0.30 to a maximum of 0.85. As there is no way to know in advance which combination will perform better, it is important to find ways to improve the average expected behavior for the set of methods and features at hand. Using consensus clustering and constrained consensus clustering may be an effective strategy, as we show next.
To evaluate CC and CCC, the procedure was as follows. For each clustering feature, we used the partitions generated by the eight clustering algorithms as base partitions and computed the CC partition. We then added the constraints and computed the corresponding CCC partitions, using a growing set of constraints, from 10 to 80 (with a step of 10). The full set of 80 constraints used in the experiments was obtained by supplementing the constraints acquired automatically, as described in Section 4.2.5, with constraints extracted from CAP using the same method as with the UCI datasets. In all, we generated 72 consensus partitions (one CC and eight CCC partitions for each of the eight features). Finally, we evaluated the performance of each partition with the F measure.
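The CC and CCC computations rely on the framework of Figure 1 and are not restated here. Purely for illustration, the sketch below shows a generic co-association-based alternative: co-membership is averaged over the base partitions, must-linked pairs are forced to full co-association, and the resulting matrix is clustered. This is a standard technique and not necessarily the framework used in our experiments.

```python
# Generic co-association consensus sketch (not the paper's framework): average
# co-membership over base partitions, force must-linked pairs to 1, then cluster
# 1 - co-association with average linkage into k clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def constrained_consensus(base_partitions, must_links=(), k=6):
    n = len(base_partitions[0])
    coassoc = np.zeros((n, n))
    for labels in base_partitions:
        labels = np.asarray(labels)
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= len(base_partitions)
    for i, j in must_links:                    # inject must-link constraints
        coassoc[i, j] = coassoc[j, i] = 1.0
    np.fill_diagonal(coassoc, 1.0)
    dist = squareform(1.0 - coassoc, checks=False)
    return fcluster(linkage(dist, method="average"), t=k, criterion="maxclust")

# Example: three base partitions of five items and one must-link (items 3 and 4).
print(constrained_consensus([[0, 0, 1, 1, 2], [0, 0, 1, 2, 2], [0, 1, 1, 2, 2]],
                            must_links=[(3, 4)], k=3))
```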
The results are reported in Table 4. For each clustering feature (listed in the first column), we first show the minimum, maximum, and average performance values obtained by the eight hclust algorithms for that feature. In the subsequent columns, we report the performance of CCC using the eight hclust algorithms (with that feature) as input, as the number of constraints grows from 0 to 80. In particular, CCC with zero constraints is equivalent to unconstrained consensus clustering (i.e., CC). We also show, in parentheses, the percentage improvement of CCC over the average performance of the eight base methods (for each single feature). Finally, the last row reports the mean performance values (averaged over the set of features) of the minimum, maximum, average (and thus means of means), and CCC (relative to the eight base algorithms and for each set of constraints, including unconstrained CC).
There are two main findings. The first is that CC works well, with improvements over the average result of the base clustering methods ranging from 8% to 26%, depending on which clustering feature we consider. The average improvement over all methods and features was 17% (last row, CC column). Not only are the CC results better than the average results of the base methods for all features, but very often they are also equal or close to those obtained by the best base method. In particular, CC matches the best result for the Header30, Header90, Privacy, and Shipping features, while it is very slightly inferior to the best method for Header60, Header120, and Header150. These observations confirm the effectiveness of CC for the domain at hand.
The second main finding is that the use of constraints within CC was very effective. Table 4 shows that the results of CCC were better than the average performance of the eight base algorithms for any set of constraints and for any feature, with a peak of 59% improvement on a single feature and 45% improvement on average performance, reached with 80 constraints. Also, and more importantly, CCC soon outperformed the best base method. For instance, with 30 constraints, CCC performed better than the best base method seven times out of eight. Comparing CCC to CC, we see that with a small number of constraints CCC was slightly worse than CC, while with 30 or more constraints CCC was systematically better than CC, both on average and for individual features. In this range of constraints, the improvement of CCC over CC grew monotonically, gaining up to 24% with 80 constraints (from 0.70 to 0.87).
Before concluding this section, we would like to note that, besides using clustering algorithms with individual features, we also tried to combine the single features into one overall feature by normalizing the distance matrices and taking their mean, as was done in [38], for instance. The results were unsatisfactory, probably due to the different distributions of the values produced by each feature.
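For completeness, this is a minimal sketch of the feature-combination attempt just described, assuming min-max normalization of each distance matrix (the normalization scheme used in [38] may differ).

```python
# Minimal sketch of combining per-feature distance matrices into a single matrix
# by normalizing each to [0, 1] and averaging (assumed min-max normalization).
import numpy as np

def combine_distances(distance_matrices):
    """Min-max normalize each distance matrix, then take their element-wise mean."""
    normalized = []
    for d in distance_matrices:
        d = np.asarray(d, dtype=float)
        span = d.max() - d.min()
        normalized.append((d - d.min()) / span if span > 0 else np.zeros_like(d))
    return np.mean(normalized, axis=0)

# Example with two toy 3x3 distance matrices on very different scales.
d1 = np.array([[0, 2, 4], [2, 0, 6], [4, 6, 0]])
d2 = np.array([[0, 0.1, 0.9], [0.1, 0, 0.5], [0.9, 0.5, 0]])
print(combine_distances([d1, d2]))
```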