Detecting fake web shops with PU learning

Senne Batsleer earned an advanced master’s degree in artificial intelligence at the Catholic University of Leuven this year. He wrote a thesis about detecting fake web shops in the .be domain with machine learning:

"The Detection of Fake Web Shops in the .be zone".

What was the aim of your thesis?

Senne: 'The aim of the thesis was to improve the detection of fake web shops in the .be zone. Online shopping is becoming increasingly popular, including in Belgium. According to a study by Comeos, 72% of Belgians made online purchases in 2020. Unfortunately, there are also fraudulent web shops that sell counterfeit goods or don't deliver any goods at all.'

'DNS Belgium wants to make the .be zone as safe as possible and is trying to take such fake web shops offline as quickly as possible, before shoppers become victims. There are more than 100,000 .be websites that host a web shop. Verifying all of them manually is consequently impossible.

That is why DNS Belgium has been tracking down fake web shops with artificial intelligence since 2019. Since then, DNS Belgium has taken more than 3,900 domain names of fake web shops offline. For all the websites they screened, they derived features from the HTML content and from the registration data of the corresponding domain names. They then used that dataset to train a Random Forest classifier.'

That sounds good. So why was your help needed after all?

Senne: 'Although these are indeed promising results, the battle against fraudulent web shops is not over yet. We use a test set to assess a supervised classifier. It contains examples that we did not use during training and for which we know for certain whether they are fake web shops or not. The Random Forest classifier achieves high precision and good recall on the test set. But DNS Belgium found that the classifier was having more and more trouble detecting new fake web shops and, in particular, was producing many false positives.'

Could you elaborate on that, Senne?

Senne: 'Precision is the percentage of web shops flagged as fake by the classifier that actually turn out to be fake. False positives are legitimate websites that the classifier nevertheless marks as fake. Recall is the percentage of all fake web shops that the classifier actually flags as fake. So in a certain sense it is the reverse reasoning of precision.'
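The two measures can be illustrated with a short calculation. This is a minimal sketch: the function name and the counts are hypothetical and are not taken from the thesis.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from confusion-matrix counts."""
    flagged = true_positives + false_positives        # everything the classifier marked as fake
    actual_fakes = true_positives + false_negatives   # every shop that really is fake
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_fakes if actual_fakes else 0.0
    return precision, recall


# Hypothetical counts: 90 fake shops flagged correctly,
# 10 legitimate shops flagged by mistake, 30 fake shops missed.
precision, recall = precision_recall(90, 10, 30)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.90 recall=0.75
```

Many false positives lower precision, while missed fake shops lower recall.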

Why do you think the Random Forest classifier performed worse on real data over time?

Senne: 'I see three possible explanations:

First, it could be that cybercriminals no longer use .be domain names because they know we are trying to track them down.
A second possibility is that the labelled dataset is insufficiently representative for the entire .be zone.
Finally, cybercriminals may have adapted their tactics and the classifier does not generalise enough to recognise these new forms of fake shops.'

How did you get started with this study?

Senne: 'I first read up on the subject, of course. Then I thoroughly analysed the labelled dataset and the features that DNS Belgium used for the existing classifier. The distributions of the number of images, the number of internal links and the total number of HTML tags on the websites made me suspect that our dataset contains clusters of very similar fake web shops. It is possible that the classifier memorises only these clusters and thus does not generalise sufficiently.'

'If these clusters are spread across the training and the test set, this could also explain why the classifier scores so well on the test set. To avoid overfitting on these clusters, I first clustered the fake shops and made sure that all domains from the same cluster ended up in the same fold during cross-validation. In this way, we always train on a collection of clusters while the validation is done on another collection of clusters.'
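The idea of keeping each cluster inside a single fold can be sketched as follows. This is an illustration under assumptions, not the thesis code: the function name, the greedy balancing strategy and the toy domain names are all hypothetical.

```python
from collections import Counter


def assign_clusters_to_folds(cluster_of, n_folds):
    """Map each domain to a fold so that every cluster stays in one fold.

    cluster_of: dict mapping domain name -> cluster id.
    Clusters are placed largest first, each into the currently smallest
    fold, which keeps the fold sizes roughly balanced.
    """
    cluster_sizes = Counter(cluster_of.values())
    fold_sizes = [0] * n_folds
    fold_of_cluster = {}
    for cluster, size in sorted(cluster_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of_cluster[cluster] = target
        fold_sizes[target] += size
    return {domain: fold_of_cluster[cluster] for domain, cluster in cluster_of.items()}


# Hypothetical toy data: 12 fake shops spread over 5 clusters.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4]
cluster_of = {f"shop{i}.be": c for i, c in enumerate(cluster_ids)}
fold_of = assign_clusters_to_folds(cluster_of, n_folds=3)

# No cluster may appear on both the training and the validation side.
for held_out in range(3):
    train = {cluster_of[d] for d, f in fold_of.items() if f != held_out}
    valid = {cluster_of[d] for d, f in fold_of.items() if f == held_out}
    assert train.isdisjoint(valid)
```

Because whole clusters move together, a validation fold only contains kinds of fake shops that the model has not seen during training in that round.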

Did this yield the desired result?

Senne: 'Yes. Based on the silhouette score, I arrived at 152 clusters. We grouped them so that we ended up with ten folds of roughly the same size. Then I randomly added the negative examples to the ten folds and again trained a Random Forest classifier. After tuning the hyperparameters, the recall on the test set dropped from 96% to 75% while the precision increased to 100%. This result confirms that small changes to fake web shops can cause problems for the current classifier, which no longer detects 1 out of 4 new types of fake web shops.'
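For readers unfamiliar with the silhouette score, the sketch below shows how it can be used to compare candidate clusterings. It is a simplified, hypothetical illustration on one-dimensional points; the clustering algorithm and the features used in the thesis are not specified here, and the data are invented.

```python
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(abs(point - q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return mean(scores)


# Hypothetical data with three visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2]
three_clusters = [0, 0, 0, 1, 1, 1, 2, 2]
two_clusters = [0, 0, 0, 0, 0, 0, 1, 1]

candidates = {3: three_clusters, 2: two_clusters}
best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k]))
print(best_k)
```

The number of clusters with the highest score is the one in which points sit close to their own cluster and far from the others.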

What else did you try so as to improve the classifier?

Senne: 'Perhaps the most important contribution of my thesis was moving from a supervised machine learning technique where all training examples have a label (fake or not fake) to a technique called Learning from Positive and Unlabelled Data or PU Learning for short.'

'With PU learning, you don't need negative labels. The ability to add domains with unknown labels enables us to dramatically increase the number of domains we train on. In practice, we used all the domains in the .be zone on which we found e-commerce technology.'

'In the widely used Python libraries, there are no implementations of PU learning methods yet. I therefore had to base my implementations on code from research papers. Inspired by recent research on fake web shop detection, I also implemented some additional features.'
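To give a feel for how PU learning can work without negative labels, here is a minimal sketch of one well-known family of methods, a bagging-style scheme. It is not the implementation from the thesis: the nearest-centroid base classifier, the function names and the toy feature vectors are all assumptions chosen to keep the example short.

```python
import random


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pu_bagging_scores(positives, unlabelled, n_rounds=50, seed=0):
    """Average out-of-bag score per unlabelled example (higher = more positive-like)."""
    rng = random.Random(seed)
    totals = [0.0] * len(unlabelled)
    counts = [0] * len(unlabelled)
    pos_centroid = centroid(positives)
    for _ in range(n_rounds):
        # Bootstrap sample of unlabelled examples, treated as provisional negatives.
        bag = [rng.randrange(len(unlabelled)) for _ in range(len(positives))]
        neg_centroid = centroid([unlabelled[i] for i in bag])
        in_bag = set(bag)
        for i, x in enumerate(unlabelled):
            if i in in_bag:
                continue  # only score examples the round did not train on
            # Toy base classifier: vote 1 if x is closer to the positive centroid.
            totals[i] += 1.0 if sq_dist(x, pos_centroid) < sq_dist(x, neg_centroid) else 0.0
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Hypothetical 2-D features: known fake shops cluster around (5, 5).
positives = [(5.0, 5.0), (5.2, 4.8), (4.9, 5.1), (5.1, 5.3)]
# Unlabelled shops: mostly around (0, 0), with two hidden fakes at the end.
unlabelled = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (0.3, 0.1),
              (-0.2, 0.1), (0.0, -0.2), (0.1, 0.0), (5.0, 4.9), (4.8, 5.2)]

scores = pu_bagging_scores(positives, unlabelled)
suspicious = [i for i, s in enumerate(scores) if s > 0.5]
print(suspicious)
```

Each round treats a random subset of the unlabelled shops as if it were negative. Averaging over many rounds dampens the effect of fake shops that happen to land in that subset.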

How did you compare the different PU learning methods?

Senne: 'I used roughly 80% of the unlabelled data for training and the other 20% for testing. To estimate how the model can generalise to changing tactics of the fake web shops, I used the 20,000 most recent domains as a test set and the older domains for training.'
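A time-based split of this kind can be sketched in a few lines. The field name `registered` and the sample records are hypothetical; the interview does not say which date was used to rank domains by recency, so ordering by registration date is an assumption.

```python
from datetime import date


def temporal_split(records, n_test):
    """Return (train, test), where test holds the n_test most recent records."""
    ordered = sorted(records, key=lambda record: record["registered"])
    if n_test <= 0:
        return ordered, []
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical records; the thesis used the 20,000 most recent domains as the test set.
records = [
    {"domain": "a.be", "registered": date(2019, 3, 1)},
    {"domain": "b.be", "registered": date(2021, 6, 15)},
    {"domain": "c.be", "registered": date(2020, 1, 20)},
    {"domain": "d.be", "registered": date(2021, 9, 2)},
    {"domain": "e.be", "registered": date(2018, 11, 30)},
]
train, test = temporal_split(records, n_test=2)
print([r["domain"] for r in test])  # ['b.be', 'd.be']
```

Testing only on the newest domains mimics the real situation, in which the classifier has to judge shops that appeared after it was trained.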

'For each method, I tuned the hyperparameters and manually verified the 500 domains with the highest score. Here it became immediately apparent that even for an attentive customer it is not always easy to distinguish a fake shop from a legitimate web shop or a web shop that is still under construction. Of the three PU learning classifiers that we trained, Robust Ensemble SVM detected the most fake web shops: of the 500 websites that the model identified as suspicious, 58 were ultimately web shops that I myself consider to be fraudulent.'
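Checking only the highest-scoring domains amounts to measuring precision among the top k results. The sketch below shows the calculation; the function name and the toy scores are hypothetical, while the figures 58 and 500 come from the interview.

```python
def precision_at_k(scored_domains, confirmed_fake, k):
    """Share of the k highest-scoring domains that were confirmed as fake."""
    top_k = sorted(scored_domains, key=lambda item: item[1], reverse=True)[:k]
    hits = sum(1 for domain, _ in top_k if domain in confirmed_fake)
    return hits / len(top_k)


# Hypothetical scores and manual verdicts.
scored = [("a.be", 0.9), ("b.be", 0.8), ("c.be", 0.4), ("d.be", 0.7)]
confirmed = {"a.be", "d.be"}
toy_result = precision_at_k(scored, confirmed, k=3)

# The figure reported above: 58 confirmed fake shops among the top 500.
reported = 58 / 500
print(f"{reported:.1%}")  # 11.6%
```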

Is that a big improvement on the supervised learning approach?

Senne: 'That is a very relevant question. Because we use different datasets, it is not easy to compare the two. PU learning performs similarly in terms of precision. In terms of recall, it seems that PU learning generalises to new types of fake shops a little better, but there are also fake web shops that were detected by Random Forest and not by PU learning. So using only PU learning is not the solution.'

How do you think DNS Belgium can continue to work on this?

Senne: 'DNS Belgium needs to extend its research to the entire .be zone and not just to the unlabelled data. This dataset contains only domain names on which Wappalyzer finds e-commerce technology. Wappalyzer is not infallible, however, so some fake web shops still slip through the net. DNS Belgium must therefore learn from the entire .be zone and not just from the web shops. One alternative is to train on the web shops and then make predictions on the entire zone.'

'Another alternative is to train on the entire zone (so not only on the web shops). PU learning is probably not ideal for this, as the percentage of fake web shops would then become very small. It might be better to switch to semi-supervised learning (with negative labels as well). Active learning would certainly be interesting here: instead of labelling random domains, the classifier then indicates itself which domains would be the most informative.'
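One common way for a classifier to indicate which domains are most informative is uncertainty sampling: ask for labels where the model is least sure. The sketch below assumes scores between 0 and 1 and treats 0.5 as maximal uncertainty. The function name and the scores are hypothetical, and the interview does not say which active learning strategy would be used.

```python
def most_informative(scores, n_queries):
    """Return the n_queries domains whose score lies closest to 0.5."""
    return sorted(scores, key=lambda domain: abs(scores[domain] - 0.5))[:n_queries]


# Hypothetical classifier scores (estimated probability of being a fake shop).
scores = {"a.be": 0.97, "b.be": 0.52, "c.be": 0.03, "d.be": 0.44, "e.be": 0.71}
to_label = most_informative(scores, n_queries=2)
print(to_label)  # ['b.be', 'd.be']
```

Labelling these borderline domains tends to teach the model more than labelling domains it already scores confidently.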

'However, I think the biggest improvement would be to add the recently detected fake web shops to the training data and to retrain the classifier regularly. Since the number of fake web shops found is very limited, I think each additional data point can bring about an improvement.'

Thesis: The Detection of Fake Web Shops in the .be zone (PDF, 1.74 MB)