By Clément Chastagnol, Head of Data Science, Loïc Petit, Senior Data Engineer, and Jie Lu, Data Scientist at Sidetrade.
Are you a user of Facebook’s multi-lingual aligned word embeddings MUSE? If so, we have good news for you: you can now load them in your code 100x faster, with files that weigh 5x less!
What about them MUSE models?
MUSE embeddings are somewhat magical, because they allow you to very quickly build multi-lingual models without having to use a translation model, since the vectors for individual words are aligned across languages. To put things simply, it means that cat in English is represented by nearly the same point as chat in French or gatto in Italian.
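To make the idea of aligned vectors concrete, here is a minimal sketch of a cross-lingual lookup by cosine similarity. The 3-dimensional vectors are toy values invented for illustration, not real MUSE vectors, and `cosine` and `nearest_word` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
english = {"cat": [0.90, 0.10, 0.00], "car": [0.00, 0.20, 0.90]}
french = {"chat": [0.88, 0.12, 0.02], "voiture": [0.05, 0.18, 0.91]}

def nearest_word(word, source, target):
    """Return the target-language word whose vector is closest to source[word]."""
    return max(target, key=lambda w: cosine(source[word], target[w]))

closest_to_cat = nearest_word("cat", english, french)
closest_to_car = nearest_word("car", english, french)
```

Because both vocabularies live in the same space, the same similarity function works across languages without any translation step.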
In our experience, the performance of the multilingual model, trained on a given language (say English), and applied to data in another language (say French), is not as good as a model directly trained on French data. However, it’s good enough for a first iteration and can be used to bootstrap the annotation of a better training set.
Yes, but can you make them load faster?
While optimizing some of our code making use of the MUSE pre-trained embeddings (which can be downloaded here for 30 languages), we were looking for ways to speed up the loading time of the models. Each model weighs 600 MB on disk and takes around 15 seconds to load.
After trying to modify the code provided by Facebook in the demo notebook without getting any significant gain, we realized that, rather than using the exact files from Facebook, we could change the file format to a binary format.
Since the data is in a space-separated values format, we immediately thought about using msgpack for efficient loading. We already have experience using the msgpack library in Python in our code, after discovering it in a very thorough blog post by Uber on data compression, which highlighted its impressive performance.
So our solution is very simple:
- download the models
- parse them using the original Facebook code
- serialize them in a binary format using msgpack
- reuse them later
Loading the newly created binary format models now takes only 150ms, so a 100x improvement!
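The steps above can be sketched roughly as follows. This is not our exact code: the parser is a simplified stand-in for the one in Facebook's demo notebook, the function names and the payload layout (`words`, `shape`, `dtype`, `data`) are assumptions made for this example, and a tiny in-memory sample replaces a downloaded model file.

```python
import io

import msgpack
import numpy as np

def parse_vec(text_file):
    """Parse the text format: a 'count dim' header line, then 'word v1 v2 ...' lines."""
    words, rows = [], []
    next(text_file)  # skip the header line
    for line in text_file:
        word, values = line.rstrip().split(" ", 1)
        words.append(word)
        rows.append([float(v) for v in values.split(" ")])
    return words, np.array(rows, dtype=np.float64)

def to_msgpack(words, matrix):
    """Pack the vocabulary and the raw matrix bytes into a single msgpack blob."""
    payload = {
        "words": words,
        "shape": list(matrix.shape),
        "dtype": str(matrix.dtype),
        "data": matrix.tobytes(),
    }
    return msgpack.packb(payload, use_bin_type=True)

def from_msgpack(blob):
    """Rebuild the word list and the matrix without parsing any text."""
    payload = msgpack.unpackb(blob, raw=False)
    matrix = np.frombuffer(payload["data"], dtype=payload["dtype"])
    return payload["words"], matrix.reshape(payload["shape"])

# Tiny in-memory sample standing in for a downloaded model file.
sample = "2 3\ncat 0.1 0.2 0.3\nchat 0.1 0.2 0.4\n"
words, matrix = parse_vec(io.StringIO(sample))
blob = to_msgpack(words, matrix)
loaded_words, loaded_matrix = from_msgpack(blob)
```

Most of the gain comes from skipping the text-to-float conversion at load time: the matrix is stored as one contiguous bytes field, so reloading is essentially a buffer read. In practice you would write `blob` to a file once and read it back on later runs.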
Yes, but can you make them lighter?
We also added a conversion in step 3 above, using 16-bit floats instead of the default 64-bit float representation. This makes the models on disk weigh around 117 MB instead of 600 MB, an 80% reduction in size.
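As a back-of-the-envelope check, the sketch below computes the raw size of the vector matrix at both precisions. It assumes 200,000 words of 300 dimensions per model, which is our understanding of the published files and should be checked against the header of the file you download; the random matrix is only a stand-in to show the conversion call.

```python
import numpy as np

# Assumed model dimensions: check the header line of the downloaded file.
n_words, dim = 200_000, 300

# Raw size of the vector matrix alone, excluding the vocabulary strings.
bytes_f64 = n_words * dim * np.dtype(np.float64).itemsize
bytes_f16 = n_words * dim * np.dtype(np.float16).itemsize

# The conversion itself is a single astype call (random stand-in data).
sample64 = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1000, dim))
sample16 = sample64.astype(np.float16)

print(f"float64 matrix: {bytes_f64 / 1e6:.0f} MB")
print(f"float16 matrix: {bytes_f16 / 1e6:.0f} MB")
```

Under these assumptions the 16-bit matrix comes to 120 million bytes, the same ballpark as the size we observe on disk. Note that the 600 MB figure is for the original text files, where each number is written out as a string of characters, not for a 64-bit binary dump.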
This comes at a small cost in terms of precision, since a 16-bit float cannot represent every 64-bit value exactly and the conversion rounds each number. However, looking at the distribution of the conversion error below, we find it acceptable. Moreover, on one of our main tasks using these models (text classification using around fifteen classes), there was absolutely no difference in the predictions.
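If you want to check the conversion error on your own vectors, a measurement along these lines is enough. The sketch uses random values in [-1, 1] as a stand-in, so the numbers it produces are not the ones from our models; replace `vectors` with a real embedding matrix to reproduce the analysis.

```python
import numpy as np

# Random stand-in for an embedding matrix; swap in real vectors to reproduce.
rng = np.random.default_rng(0)
vectors = rng.uniform(-1.0, 1.0, size=(10_000, 300))

# Round-trip through 16 bits, then compare with the original 64-bit values.
roundtrip = vectors.astype(np.float16).astype(np.float64)
abs_err = np.abs(vectors - roundtrip)

max_err = abs_err.max()
mean_err = abs_err.mean()
median_err, p99_err = np.percentile(abs_err, [50, 99])

print(f"max={max_err:.2e} mean={mean_err:.2e} "
      f"median={median_err:.2e} p99={p99_err:.2e}")
```

A 16-bit float has an 11-bit significand, so for values of magnitude at most 1 the absolute rounding error stays below roughly 5e-4, which is small compared with typical differences between word vectors.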
Yes, but can you make them available?
Of course we can! You can find the models for the 30 available languages right here: https://gitlab.com/sidetrade-oss/binary-muse-embeddings
Have fun MUSEing!