Machine learning has been trending for years, and many of our clients ask us for solutions based on it. From email triage to image recognition to recommendation systems, a wide range of applications can benefit from it.
However, machine learning comes with its share of complexity and challenges, and text classification is one of them. Concrete applications of text classification include the classification of legal documents, of product data, or of comments in bank transfers, intent detection in a chatbot, email triage, and so on. These applications might look very different, but through the lens of machine learning they are pretty similar. So let’s have a general look at the problem. The very first question you need to answer when you want to classify text data is:
"Are there texts available which are already classified? In other words, do we have labelled data?"
If the answer is yes, you can happily take the path of supervised learning. If the answer is no, you might hesitate between supervised and unsupervised learning. There is a nice lecture by Geoffrey Hinton about autoencoders applied to Wikipedia articles (see here for the lecture and Figure 1 and Figure 2 for the visualisations of topic clustering) which opens an interesting door to clustering in an unsupervised and, at least visually, convincing way:
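To give a feel for the first of these two visualisations, here is a minimal sketch of the PCA approach behind Figure 1, using scikit-learn on an invented four-document corpus (the documents and their topics are made up for illustration; Hinton's lecture uses Wikipedia articles):

```python
# Project TF-IDF vectors of a tiny toy corpus down to 2 dimensions,
# as in Figure 1, so that each document becomes a point in the plane.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "interest rates and bank loans",           # finance-flavoured
    "the central bank raised interest rates",  # finance-flavoured
    "the team won the football match",         # sport-flavoured
    "a great goal decided the football match", # sport-flavoured
]

X = TfidfVectorizer().fit_transform(docs).toarray()
coords = PCA(n_components=2).fit_transform(X)  # one (x, y) point per document

for doc, (x, y) in zip(docs, coords):
    print(f"({x:+.2f}, {y:+.2f})  {doc}")
```

Documents sharing vocabulary end up close together in the 2-d plot; an autoencoder as in Figure 2 replaces the linear PCA projection with a learned non-linear one.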
Figure 1: 2-d visualisation of topics using principal component analysis
Figure 2: 2-d visualisation of topics using autoencoders
However, if you are not into research and want to make things work within a limited timeframe, you should seriously consider the risks of the unsupervised path. Unsupervised learning can get extremely expensive, as the desired outcome is not fixed beforehand by labels. On the other hand, labelling some of the data is often not very expensive, and in many cases even relatively few labels yield very decent results. For the remaining part, let’s assume that we have labelled data. The next question is then:
"How much labelled data is available?"
We are living in the world of big data, but in practice the data set for a particular problem is often far from something like “all articles in Wikipedia” (see for example here for some other nice data sets). Furthermore, the data sets – in our case text data – are often very particular in terms of wording, patterns, and structure. Since more and more impressive applications of neural networks are trained on rather massive amounts of data, one is tempted to rephrase the question above as follows:
"Are state-of-the-art neural networks suitable for the problem or is it better to use more classical machine learning tools?"
This question reflects the dream of AI well: given as much labelled data and as much computing power as you want, a neural network with the right architecture should be able to learn everything by itself. Applied to email triage, for example, this means that such a model would learn that a certain part of an email is a footer and should be discarded during triage, as it has no influence on the main message content. However, the reality is often different, and the amount of available data is not sufficient to train the neural network properly; hand-crafted rules have to compensate for the missing data. We refer to this process of feeding the system with additional rule-based information as pre-processing.
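As a minimal sketch of such rule-based pre-processing, the snippet below strips an email footer before classification. The footer markers and the sample email are assumptions for illustration; real rules would be tuned to your own mail corpus:

```python
import re

# Hypothetical footer markers -- in practice these rules are derived
# from inspecting the actual emails in your data set.
FOOTER_PATTERN = re.compile(
    r"(?ms)^(?:--\s*$|Best regards|Kind regards|Sent from my).*"
)

def strip_footer(email_body: str) -> str:
    """Remove everything from the first footer marker onwards."""
    return FOOTER_PATTERN.sub("", email_body).rstrip()

body = (
    "Hello,\n"
    "please update my postal address to Example Street 1.\n"
    "Best regards,\nAlice\nACME Corp | www.example.com"
)
print(strip_footer(body))
```

Only the message content relevant for triage survives; the signature block, which would otherwise act as noise for the classifier, is discarded by the rule.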
Figure 3: Google’s first “Inception” model (2015), a deep neural network for image classification.
Nevertheless, neural networks should not be excluded by default. Even if you do not have enough data, you can for example use transfer learning, which means taking a pre-trained neural network and performing a final training pass on your own data set. One of the most famous domains of application for transfer learning is image recognition: in Keras alone you already find around 10 pre-trained models (see also the model in Figure 3). Pre-trained models for text, and particularly for Named Entity Recognition, are less available and are usually trained on English corpora, whereas in Switzerland we usually need models trained on French and German.
In conclusion, we found that for training a baseline model with the goal of getting a Minimum Viable Product (MVP), it is usually a better idea to use classical machine learning classification algorithms. They usually yield very good and even interpretable results in a fair amount of time. In the remaining part of this article we will focus mainly on such methods.
There are in general 3 steps to follow:
- Pre-processing of the data
- Building and training a model
- Model evaluation and class prediction
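The 3 steps above can be sketched with a classical scikit-learn pipeline; the toy corpus, labels, and the choice of TF-IDF plus logistic regression are assumptions for illustration, not the specific setup of our use case:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: pre-processing -- here reduced to TF-IDF vectorisation
# of a tiny invented corpus.
texts = [
    "invoice for your recent purchase",
    "payment reminder for the open invoice",
    "team lunch next friday",
    "are you coming to the party",
]
labels = ["billing", "billing", "social", "social"]

# Step 2: build and train a model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Step 3: evaluate / predict the class of an unseen text.
print(model.predict(["please pay the attached invoice"]))
```

In a real project, step 1 would also include rule-based cleaning such as footer removal, and step 3 would use a held-out test set and proper metrics rather than a single example.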
To obtain satisfying results, these 3 steps should be followed iteratively: the output of the last step gives important information on how to improve the model design, so you go back to step 1. In the next article, we explain these 3 steps in more detail, applied to the email triage use case.