News

Can we automate the categorisation of .be websites?

07 July 2023

Update

The code Thomas is using to categorise .be websites is now open source.

For his master's thesis, Thomas Daniels investigated whether we could classify websites automatically with machine learning (ML). He trained a self-learning model on examples that we had classified ourselves, so that it could then classify other websites.

At DNS Belgium, we classify a large number of .be websites every year because we like to know what .be domain names are used for, and we include this information in our annual report. To classify the websites in a structured manner, we have been using the categorisation model of the Registry-Registrar Data Group of CENTR for about five years. It defines 25 categories at the first level.

We did this manually up to last year. That is, a few people who have an interest in doing so compile a sample of several thousand random .be websites. We look at what the websites are about and assign each one to one of the 25 categories. This is a labour-intensive and far from edifying task, one that we can automate in future with help from Thomas.

How accurate are the sample and the machine?

With 'manual' categorisation, we naturally ask ourselves how accurate the conclusions drawn from a sample of around 2,000 websites are when extrapolated to the 1.3 million .be websites that currently exist.

The sample appears to be statistically large enough, but unfortunately says nothing about non-random subsets of the .be zone such as websites registered with a particular registrar.

For example, via manual categorisation of our sample, we found that 4.85% of all .be websites that are not low-content (low-content websites being those with rather superficial content) belong in the restaurant/cafe category. According to the model Thomas developed, the figure is 4.13%. That may seem like a small difference, but it corresponds to over 5,500 websites.
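To give a feel for how much uncertainty a sample of this size carries, here is a minimal sketch of the standard normal-approximation margin of error for a proportion. It is an illustration, not the calculation used at DNS Belgium. It assumes a simple random sample and that all 2,000 sampled websites count towards the share, which probably overstates the sample size, since the 4.85% figure covers only websites that are not low-content. The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval
    for a proportion p estimated from n samples (z = 1.96 for ~95%)."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Assumed inputs: the 4.85% share from the text and a sample of 2,000 sites.
share = 0.0485
sample_size = 2000

moe = margin_of_error(share, sample_size)
print(f"{share:.2%} +/- {moe:.2%}")
```

Under these assumptions the margin is a little under one percentage point, which is larger than the 0.72-point gap between the sample figure and the model figure.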

It is also not always clear which category a website belongs in. For example, suppose a hotel has a top restaurant that also welcomes non-guests and there is one website for both the hotel and the restaurant. Will you then categorise that site under tourism and accommodation or restaurants?

"At the base is a language model, with which we are all familiar since ChatGPT came onto the scene."

Benefits of machine learning

Our ML model can't solve that, but it works quickly and accurately enough to enable us to categorise all .be websites. Moreover, you don't just get a rating (or percentage) for a particular category, you can also retrieve the full list of all websites in a category.

You can therefore also calculate percentages for a particular subset, e.g. all domain names registered in 2023. And you can examine correlations with other variables, such as the probability that a domain name will be renewed.
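Once every website has a predicted category, computing shares for a subset is a simple filter and count. The sketch below uses made-up records. The domain names, the record layout and the function name `category_shares` are assumptions for illustration, not the actual data model.

```python
from collections import Counter

# Hypothetical records: (domain, predicted category, registration year).
sites = [
    ("example-one.be", "restaurant/cafe", 2023),
    ("example-two.be", "tourism and accommodation", 2023),
    ("example-three.be", "restaurant/cafe", 2021),
    ("example-four.be", "restaurant/cafe", 2023),
]


def category_shares(records, year):
    """Return each category's share among domains registered in `year`."""
    subset = [category for _, category, reg_year in records if reg_year == year]
    if not subset:
        return {}
    counts = Counter(subset)
    return {category: count / len(subset) for category, count in counts.items()}


shares_2023 = category_shares(sites, 2023)
for category, share in sorted(shares_2023.items()):
    print(f"{category}: {share:.1%}")
```

The same pattern works for any other subset, such as all domain names registered with a particular registrar, by changing the filter condition.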

How does machine learning work?

'Our approach is based on a language model, something we have all become familiar with since the advent of ChatGPT,' Thomas explains. 'The language model is used as part of a larger model that also takes into account the outbound links and some numerical characteristics of the website. Such a model can be thought of as a complex mathematical function you can use to convert input to output (categories in our case). The language model is pre-trained with lots of data (from more than 100 languages) so that it gains a general understanding of how language works.'

The collection of labelled examples is split into a training set - data that you use to train the model - and a (usually smaller) test set to check whether the model works correctly.

Once the model has been trained on the labelled data, you check whether it can generalise and handle data from websites that were not used to train it. We do this on the test set.
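The split described above can be sketched in a few lines. This is a generic illustration, not Thomas's actual code: the 80/20 ratio, the fixed seed, the placeholder data and the function name `split_examples` are all assumptions.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=42):
    """Shuffle labelled examples and split them into (training set, test set)."""
    shuffled = list(examples)               # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Placeholder data: 2,000 (website, category) pairs spread over 25 categories.
labelled = [(f"site-{i}.be", f"category-{i % 25}") for i in range(2000)]

train_set, test_set = split_examples(labelled)
print(len(train_set), len(test_set))
```

The important property is that no website appears in both sets, so the score on the test set reflects how well the model handles websites it has never seen.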

"If a computer classifies more than 80% correctly, it is doing better than humans."

We had three people review each website, first to make sure we were feeding the model correct data, but also to find out how accurately humans work. In the subsequent step, Thomas looked at the extent to which the results of the model matched the manual categorisation.

Thomas conducted a large number of experiments to ascertain which approach produced better results and found that the combination of three models worked best.
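The article does not say how the three models are combined. One common way to combine several classifiers is majority voting, sketched below purely as an illustration. The voting rule, the tie-breaking choice and the function name `majority_vote` are assumptions, not a description of the thesis.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the categories predicted by three models for one website.

    If at least two models agree, their category wins. If all three
    disagree, fall back to the first model's prediction (an arbitrary
    tie-breaking rule chosen for this sketch).
    """
    top_category, top_count = Counter(predictions).most_common(1)[0]
    if top_count == 1:  # all models disagree
        return predictions[0]
    return top_category


print(majority_vote(["restaurant/cafe", "tourism and accommodation", "restaurant/cafe"]))
```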

Does a machine do better than humans?

The key question, of course, remains whether an AI-driven self-learning machine classifies more accurately than our employees.

That is still not the case at the moment. 'If a computer classifies more than 80% correctly, it is doing better than humans,' says Thomas. 'If we count a category as correct only when two or three of the people chose it, we currently achieve 75% accuracy. If we instead count the category chosen by any one of the three people as correct, the model achieves an accuracy of 85.15%.'
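The two ways of scoring the model can be made concrete with a small sketch. The example data is invented and the function names are hypothetical. One assumption is worth flagging: when all three people disagree, the strict rule here counts the model's answer as wrong, and the thesis may treat such cases differently.

```python
from collections import Counter


def strict_correct(prediction, human_labels):
    """Correct only if at least two of the three people chose this category."""
    top_label, top_count = Counter(human_labels).most_common(1)[0]
    return top_count >= 2 and prediction == top_label


def lenient_correct(prediction, human_labels):
    """Correct if any one of the three people chose this category."""
    return prediction in human_labels


def accuracy(items, is_correct):
    """Fraction of (prediction, human_labels) pairs judged correct."""
    return sum(is_correct(pred, labels) for pred, labels in items) / len(items)


# Invented examples: (model prediction, [labels from three people]).
items = [
    ("restaurant/cafe", ["restaurant/cafe", "restaurant/cafe", "tourism and accommodation"]),
    ("tourism and accommodation", ["restaurant/cafe", "restaurant/cafe", "tourism and accommodation"]),
    ("restaurant/cafe", ["restaurant/cafe", "tourism and accommodation", "other"]),
    ("other", ["restaurant/cafe", "restaurant/cafe", "restaurant/cafe"]),
]

strict = accuracy(items, strict_correct)
lenient = accuracy(items, lenient_correct)
print(f"strict: {strict:.0%}, lenient: {lenient:.0%}")
```

The lenient score can never be lower than the strict one, because any prediction that matches the majority label also matches at least one person.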

'A machine has the advantage of being able to look at all .be websites, whereas people check a few thousand and magnify the errors they make in them by extrapolating to 1.3 million .be websites.'

The computer will be more accurate on a large scale, but we are never going to have data on that, because we are going to spare our people from screening every .be website.

 

Interested in Thomas' master's thesis?

With this article, we support the United Nations Sustainable Development Goals.