Create a custom text classification model, and update your apps with it

In this codelab, you'll learn how to update the text classification model you built from the original blog comment spam dataset by enhancing it with comments of your own, so that you have a model that works with your data.

Prerequisites

This codelab is part of the Get started with mobile text classification pathway. The codelabs in this pathway are sequential. The app and the model you'll work on should have been built previously, while you were following along with the first pathway. If you haven't yet completed the previous activities, please stop and do so now:

  • Build a basic messaging style app
  • Build a comment spam machine learning model
  • Update your app to use a spam filtering Machine Learning model

What you'll learn

  • How to update the text classification model you built in the Get started with mobile text classification pathway
  • How to customize your model so it blocks the most prevalent spam in your app

What you'll need

  • The messaging app and spam filtering model you observed and built in the previous activities.

You can get the code for this by cloning this repo and loading the app from TextClassificationStep2. You can find this in the TextClassificationOnMobile->Android path.

The finished code is also available for you as TextClassificationStep3.

If you open the messaging app you built and try this message, it gives a very low spam score:

f111e21903d6fd1f.png

Misspellings like this are a common way to avoid spam filters. And while the message is innocuous, spammers will often add a link in the user ID (instead of the message itself, where having a link might trigger the filters).

In this lab, you'll see how to update the model with new data. When you're done, running the app with the same sentence will give the result below, where this message is identified as spam!

c96613a0a4d1fef0.png

To train the original model, a dataset was created as a CSV (lmblog_comments.csv) containing almost a thousand comments labelled either spam or not spam. (Open the CSV in any text editor if you want to inspect it.)

The first row of the CSV describes the columns, which are labeled commenttext and spam.

Every subsequent row follows this format:

64c0128548e1d082.png

The label to the right is true for spam, and false for not spam. In this case line 3 is considered to be spam.
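To get a feel for this layout before editing it, a minimal sketch like the following counts how many rows carry each label. It assumes a CSV with the commenttext and spam columns described above. The function name count_labels and the sample rows are illustrative only and are not taken from the real dataset.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count rows per label in a CSV with 'commenttext' and 'spam' columns."""
    reader = csv.DictReader(csv_file)
    # Normalize case and whitespace so 'True' and 'true ' count together.
    return Counter(row["spam"].strip().lower() for row in reader)


# Tiny in-memory sample in the same layout (made-up rows, for illustration).
sample = io.StringIO(
    "commenttext,spam\n"
    "thanks for the great article,false\n"
    "online trading now,true\n"
    "really helpful post,false\n"
)
print(count_labels(sample))
```

With a real file, you could pass open('lmblog_comments.csv', newline='') instead of the in-memory sample.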

To add your own comments (for example, if you have a lot of people spamming your site with messages about online trading), all you have to do is add examples of those spam comments at the bottom of the CSV file. For example:

online trading can be highly highly effective,true
online trading can be highly effective,true
online trading now,true
online trading here,true
online trading for the win,true

When you're done, save the file with a new name (for example lmblog_comments_extras.csv), and you'll be able to use it to train a new model.
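If you have many examples to add, editing by hand gets tedious. The sketch below shows one way to do the same thing with Python's csv module: it copies the original file to a new name and appends each new comment with the label true. The function name append_spam_examples is hypothetical, and the sketch assumes the two-column layout described above.

```python
import csv


def append_spam_examples(src_path, dst_path, comments):
    """Copy src_path to dst_path and append each comment labeled as spam."""
    with open(src_path, newline="", encoding="utf-8") as src:
        # Skip blank lines so they are not carried into the new file.
        rows = [row for row in csv.reader(src) if row]
    # The csv module quotes any comment that itself contains a comma.
    rows.extend([comment, "true"] for comment in comments)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst, lineterminator="\n").writerows(rows)
    return len(rows) - 1  # number of data rows, excluding the header
```

For example, append_spam_examples('lmblog_comments.csv', 'lmblog_comments_extras.csv', ['online trading now', 'online trading here']) would write the extended dataset under the new name and leave the original file untouched.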

For the rest of this codelab you'll use the example provided, edited and hosted on Google Cloud Storage with the online trading updates. You can change the URL in the code if you want to use your own dataset!

To retrain the model, you can simply re-use the code from earlier (SpamCommentsModelMaker.ipynb), but point it at the new CSV dataset, which is called lmblog_comments_extras.csv. If you want the full notebook with the updated contents, you can find it as SpamCommentsUpdateModelMaker.ipynb.

If you have access to Google Colab, you can launch that directly from here, otherwise get the code from the repo and run it in your notebook environment of choice.

Here's the updated code:

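import tensorflow as tf

# Download the updated dataset: the original comments plus the online trading examples.
training_data = tf.keras.utils.get_file(
    fname='comments-spam-extras.csv',
    origin='https://storage.googleapis.com/laurencemoroney-blog.appspot.com/lmblog_comments_extras.csv',
    extract=False)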

When you train, you should see that the model will still train to a high level of accuracy:

8886033d1f8161c.png

Go through the notebook to download the model, vocab, and labels files. In the next step you'll integrate them in Android.

  1. Open the app in Android Studio, and make sure Android is selected at the top of the project explorer.
  2. Find your assets folder containing the labels, model, and vocab files from the old version of the model.

91116524e9016ed4.png

  3. Right-click the assets folder.
  4. In the menu, select the option to open the folder with your operating system's file manager. (Reveal in Finder on Mac, as shown. It will be Show in Explorer on Windows, and Open in Files or similar on Linux.)

25f63f9629657e85.png

  5. This opens the directory containing the model, vocab, and labels in your operating system's file manager. Copy the new ones you created in the previous step over these.
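If you retrain often, the copy step above could also be scripted. The following is a minimal sketch, not part of the codelab's own tooling: the function name replace_model_files is hypothetical, and the file names model.tflite, vocab, and labels.txt are assumed to match what your notebook exported and what your assets folder already contains.

```python
import shutil
from pathlib import Path

# Assumed file names; adjust them if your export uses different ones.
MODEL_FILES = ("model.tflite", "vocab", "labels.txt")


def replace_model_files(export_dir, assets_dir, names=MODEL_FILES):
    """Copy freshly exported model files over the old ones in assets_dir."""
    export_dir, assets_dir = Path(export_dir), Path(assets_dir)
    # Check everything is present first, so a partial copy never happens.
    missing = [n for n in names if not (export_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {export_dir}: {', '.join(missing)}")
    for name in names:
        shutil.copy2(export_dir / name, assets_dir / name)
    return [assets_dir / name for name in names]
```

You would call it with the folder holding the downloaded files and the path of your app's assets folder, both of which depend on your own machine and project layout.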

You don't need to make any changes to the code in your app. Run it and give it a test, and you'll see results like the ones above, where the model has improved to detect the "onllline trading" text scenario.

A finished version of the code is available in the repo as TextClassificationStep3.

You can get the code for this by cloning this repo and loading the app from TextClassificationStep2. You can find this in the TextClassificationOnMobile->iOS path.

The finished code is also available for you as TextClassificationStep3.

If you worked through the previous codelab, you'll have an iOS version of TextClassificationStep2 that works with the base model. If you want to start from our existing version, just take that one from the repo. It will work with the first model that was trained on the comment spam data, and you might see results like this:

553b845565b5b822.png

Updating the app to use your new model is really simple. The easiest way is to just go to your file explorer, get the new versions of model.tflite, vocab, and labels.txt and copy them to your project directory.

Once you've done this, your app will work with the new model, and you can try it out. Here's an example of the same sentence, but using the new model:

9031ec260b1857a3.png

That's it! By retraining the model with new data, and adding it to both your Android and iOS apps, you've been able to update their functionality without writing any new code!

Next Steps

This model is just a toy one, trained on only about 1,000 items of data.

As you explore natural language processing, you may want to work with larger datasets. You could also set up a continuous training pipeline, so when new data comes in and is flagged as spam, it can automatically retrain a model on a backend, and then deploy that model using Firebase Model Hosting.

With a hosted model, your users seamlessly get an updated model without you needing to copy and paste it as an asset, recompile, and redistribute. You could also, for example, use Firebase Remote Config to manage the threshold value for flagging a message as spam, instead of the 0.8 that you have now.
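The idea of a configurable threshold can be sketched in a few lines. This is plain Python and does not call any Firebase API: the dictionary stands in for whatever values your app has fetched remotely, and the key name spam_threshold is hypothetical. It also assumes the app flags a message when its score is above the threshold.

```python
DEFAULT_SPAM_THRESHOLD = 0.8  # the value the app uses now, per the text above


def spam_threshold(remote_values):
    """Return the remotely configured threshold, falling back to the default."""
    try:
        value = float(remote_values.get("spam_threshold", DEFAULT_SPAM_THRESHOLD))
    except (TypeError, ValueError):
        return DEFAULT_SPAM_THRESHOLD
    # Ignore values outside the valid score range of 0 to 1.
    return value if 0.0 <= value <= 1.0 else DEFAULT_SPAM_THRESHOLD


def is_spam(score, remote_values):
    """Flag a message when its spam score exceeds the configured threshold."""
    return score > spam_threshold(remote_values)
```

Falling back to the default when the remote value is missing or malformed keeps the filter working even if the configuration cannot be fetched.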

There are so many possibilities, and we'll explore these in future codelabs in this course!