Which approach is best for Anomaly Detection: Signal Processing or Machine Learning?
We tested both!
By Viviane Souty, Data Engineer at Sidetrade
Each year, Sidetrade hosts its much-loved Hackathon. It’s an opportunity for Sidetraders across all departments to come together and try to find innovative solutions to problems.
At the latest Hackathon, we conducted experiments to figure out the best way to solve the problem of Anomaly Detection. The core question we tried to answer was: given our dataset, which approach would be more suitable? Signal Processing? Machine Learning? Or both?
So, with our curiosity engaged, we set about experimenting!
Event detection using the Sidetrade Unpaid Invoice Tracker
We decided to use a tool that was easily at our disposal: Sidetrade’s Covid-19 Unpaid Invoice Tracker (Figure 1). This tool was launched in 2020 to track B2B late payment trends during the pandemic.
Based on 162 million invoices representing €300 billion in B2B transactions across 8.6 million buyer companies, the tracker measured the proportion of unpaid invoices, both in terms of invoice amount (sum of the invoice amounts per day) and in terms of invoice volume (number of invoices per day). It was also possible to filter the data by both country and type of industry.
Prior to the pandemic, there were a certain number of cyclical invoice payment behaviours we could come to expect. For example:
· During weekends, we expected fewer financial exchanges
· Many companies prepare their payments for the end of the month or the beginning of the next one
· At the end of the year:
o Most companies draw up their balance sheet for the year, and therefore update their payments or request their money
o There are also the Christmas holidays
· In summer, many people take (long) holidays
All of these cyclical behaviours have an impact on invoice payment transactions which can then be seen in the invoice tracker.
We knew that, because of the nature of the pandemic, a number of events would fall outside the usual cyclical late payment behaviours, which would provide us with an excellent opportunity to put our anomaly detection experiment to work.
An overview of the experiment
The exploratory study we conducted was three-fold:
1. Firstly, could we find some anomalies within the invoice tracker specifically linked to the COVID-19 crisis?
2. Secondly, could we easily detect these anomalies?
3. And thirdly, could we also find a way to alert on the start of an anomaly in near real time, or to forecast future behaviour?
The data we experimented with
For this experiment, we focused on the proportion of unpaid invoices by day and by activity sector in France (Figure 2): sum(amount_to_pay) / sum(order_amount).
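As an illustration, here is a minimal pandas sketch of how such a daily ratio could be computed from raw invoice data. Only amount_to_pay and order_amount come from the formula above; the activity_sector and payment_date column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sketch: share of the invoiced amount still to be paid, per day,
# for one activity sector. Column names other than amount_to_pay / order_amount
# are assumed for illustration.
def unpaid_ratio_by_day(invoices: pd.DataFrame, sector: str) -> pd.Series:
    subset = invoices[invoices["activity_sector"] == sector].copy()
    subset["payment_date"] = pd.to_datetime(subset["payment_date"])
    daily = subset.groupby(pd.Grouper(key="payment_date", freq="D"))
    return daily["amount_to_pay"].sum() / daily["order_amount"].sum()
```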
We collected the tracker values from 02 Jan 2020 to 27 Aug 2022 for this study. The activity sectors available for French companies were:
· Agro-agriculture
· Construction
· Energy, water, environment
· Finance, Real estate, Insurance
· Health, Life Sciences
· Industries, Manufacturing
· Information, Communication, Science
· Leisure, Accommodation, Catering
· Others
· Public, Teaching, Administration
· Trade, auto/motorcycle repair
· Transportation and Storage
We chose to process the data by activity sector and to apply the explored methods to agro-agriculture, leisure-accommodation-catering, transportation-and-storage and public-teaching-administration (Figure 2).
There were a number of key milestone dates during the pandemic in France that we needed to bear in mind for this experiment:
1. Dec 2019: Sanitary crisis begins to emerge in Asia
2. Mid-March to mid-May 2020: The strictest lockdowns were implemented globally and many activities were stopped, such as teaching, restaurants and sports… Economic concerns caused a global stock market crash
3. Nov. to mid-Dec. 2020: Lockdown
4. April to June 2021: Lockdown
The explored methods: Signal Processing and Machine Learning approaches
We chose two methods to test the feasibility of anomaly detection on the Sidetrade Invoice Tracker:
1. A 1D signal processing approach, using Python, as can be found in seismic event detection.
2. A 1D Machine Learning approach using clustering tools, including the Dataiku software
1. Signal processing approach
Each time series can be considered as a signal constructed from a multitude of mixed cyclical events (harmonic-based signal). As such, we were able to study these time series using a signal processing approach.
As proposed below, we used a fake signal to explain the method (Figure 3). This fake signal is built from cyclical variations with yearly (orange), monthly (green) and weekly (red) periods.
A 45-day anomaly and a starting anomaly have also been added to this signal (purple).
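For readers who want to reproduce the idea, here is a minimal NumPy sketch of such a fake signal; the amplitudes, noise level and anomaly dates are arbitrary choices.

```python
import numpy as np

# Fake daily signal: yearly + monthly + weekly cycles, plus noise and a 45-day anomaly.
t = np.arange(2 * 365)                                   # two years of daily samples

yearly  = 1.0 * np.sin(2 * np.pi * t / 365)
monthly = 0.5 * np.sin(2 * np.pi * t / 30)
weekly  = 0.3 * np.sin(2 * np.pi * t / 7)

signal = yearly + monthly + weekly + 0.1 * np.random.randn(t.size)
signal[400:445] += 1.5                                   # 45-day anomaly (arbitrary offset)
```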
A spectrum analysis aims to find recurrent periods, not the anomalies themselves. This meant that we could find the yearly, monthly and weekly periods in the spectrum of the fake signal, and then filter the signal accordingly to focus on what we wanted.
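A rough sketch of this spectrum analysis on the fake signal above, using NumPy's FFT; the three strongest peaks should sit close to the yearly, monthly and weekly periods.

```python
import numpy as np

# Rebuild the cyclical part of the fake signal from the previous sketch
t = np.arange(2 * 365)
signal = (np.sin(2 * np.pi * t / 365)
          + 0.5 * np.sin(2 * np.pi * t / 30)
          + 0.3 * np.sin(2 * np.pi * t / 7))

# Amplitude spectrum: peaks reveal the recurrent periods, not the anomalies
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(signal.size, d=1.0)              # cycles per day

top = np.argsort(spectrum[1:])[::-1][:3] + 1             # strongest non-zero frequencies
print(1.0 / freqs[top])                                  # periods in days, near 365, 30 and 7
```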
The simplest way to find anomalies is to look for unusual behaviours, i.e. to look for values that are far from the mean. However, what was normal two years ago might not be normal anymore.
So, we needed to look for unusual behaviours relative to what was happening around the targeted behaviour we were analyzing. That is why we thought the STA/LTA method might be very relevant for detecting unusual trends in the Unpaid Invoice Tracker.
Indeed, the STA/LTA is the ratio between the Short Time Average (STA) and the Long Time Average (LTA), and it can be computed with forward or backward sliding windows.
This method is widely used in seismic detection, and probably in other domains too (under a different name), as it is inexpensive and easy to implement.
The STA/LTA tends to one for pure noise. When an event is recorded, the STA increases faster than the LTA, so the STA/LTA also increases (Figure 4). In the present case, we applied the method to the absolute values of the Unpaid Invoice Tracker on backward sliding windows.
Indeed, backward sliding windows allow near real-time detection, whereas forward sliding windows would mean waiting until day i + LTA length before we could compute the STA/LTA of day i, ruling out near real-time processing.
Once the STA/LTA curve is built, we can use it as a detector by choosing a threshold above which we consider there is an anomaly (Figure 4). Each day, we can easily compute the STA/LTA value and, if it exceeds the threshold, we know that we are facing a new anomaly (Figure 5).
The parameters of this method can be chosen depending on the scale of the anomalies we want to find. For instance, if we want to find one-day anomalies within a month, we can choose 1 day and 30 days for the lengths of the STA and LTA, respectively.
However, if we want to find one-week anomalies within a three-month range, we would rather choose 7 days and 90 days for the lengths of the STA and LTA, respectively.
The anomaly threshold should be chosen depending on the confidence we want in detecting anomalies or on the intensity level of the anomalies we want to detect.
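Putting these pieces together, here is a minimal sketch of an STA/LTA detector on backward sliding windows; the toy series, the window lengths and the threshold value are illustrative choices, not the production configuration.

```python
import numpy as np
import pandas as pd

def sta_lta(series: pd.Series, sta_days: int, lta_days: int) -> pd.Series:
    """Ratio of the short-term to the long-term backward moving average of |series|."""
    x = series.abs()
    sta = x.rolling(window=sta_days, min_periods=sta_days).mean()
    lta = x.rolling(window=lta_days, min_periods=lta_days).mean()
    return sta / lta

# Toy daily series standing in for the tracker values, with an injected anomaly
tracker = pd.Series(0.1 * np.random.randn(730))
tracker.iloc[400:445] += 1.5

detector = sta_lta(tracker, sta_days=1, lta_days=30)
threshold = 1.5                                          # illustrative threshold
anomaly_days = detector.index[detector > threshold]      # days flagged as anomalous
```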
2. Machine Learning approach
The second approach we chose to test was a Machine Learning approach.
At Sidetrade we usually build proofs of concept with Dataiku; it is a tool available to everyone here and many people know how to use it. It is also very easy to learn for anyone not already familiar with it.
So, we decided to explore what we could easily do with the platform. We selected two models available in Dataiku, both clustering models configured automatically by Dataiku (Auto ML Clustering).
What are Clustering models?
Firstly, we knew that we were working with unlabelled data, which is a very common reason to start with clustering methods and see whether there are similar patterns in the data (Figure 6). Even without prior knowledge, clustering gives us some information, in particular a measure of similarity between the data points we have.
Clustering also naturally gives us a notion of outliers: either the data points at the edge of the groups (those that are the least similar within their group), or an entire group considered as anomalies.
We chose to start working on just a small subset of fields. Ultimately, we landed on only two:
1. payment_percentage
2. payment_date
We needed to create new dimensions by adding fields and information to the data. These can be features created from existing features, or other data we find and add.
If you are working on temporal data, as we were, adding cyclical information (cycle (sin), cycle (cos)) is useful because it helps the model better understand the data, e.g. that Monday comes after Sunday (see the sketch after the list below).
So with the payment_date feature we eventually had:
· payment_date [week cycle (sin)]
· payment_date [week cycle (cos)]
· payment_date [month cycle (sin)]
· payment_date [month cycle (cos)]
· payment_date [quarter cycle (sin)]
· payment_date [quarter cycle (cos)]
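Dataiku generates these cyclical features automatically; outside Dataiku, a rough equivalent could look like the sketch below (the month and quarter lengths are approximations).

```python
import numpy as np
import pandas as pd

# Hypothetical re-creation of the cyclical date features: each cycle is mapped onto a
# sine/cosine pair so the model "knows" that, e.g., Monday comes right after Sunday.
def add_date_cycles(df: pd.DataFrame, date_col: str = "payment_date") -> pd.DataFrame:
    d = pd.to_datetime(df[date_col])
    cycles = {
        "week":    (d.dt.dayofweek, 7),         # position in the week / week length
        "month":   (d.dt.day - 1, 31),          # position in the month / approx. length
        "quarter": (d.dt.dayofyear % 91, 91),   # rough position in the quarter
    }
    for name, (pos, length) in cycles.items():
        df[f"{date_col} [{name} cycle (sin)]"] = np.sin(2 * np.pi * pos / length)
        df[f"{date_col} [{name} cycle (cos)]"] = np.cos(2 * np.pi * pos / length)
    return df
```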
How did we evaluate the model?
For unsupervised learning:
· visualization, and manually checking some cases
· metrics, e.g. Silhouette: a measure of what would have happened in terms of inertia if each point had been assigned to the next-nearest cluster. A silhouette of 1 implies that each point is in the right cluster, and a silhouette of -1 means that each point is in the wrong cluster. The closer to 1, the better!
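Dataiku reports the silhouette for its clustering models; as a rough equivalent outside Dataiku, the metric can be computed with scikit-learn, here on a placeholder feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(1000, 7)                    # placeholder for our feature matrix

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))             # the closer to 1, the better
```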
Which models did we use?
The fastest approach seemed to be to take advantage of the tools offered by Dataiku, in particular models that require little configuration (here Auto ML Clustering), in order to build a POC as quickly as possible.
The two models we worked on were:
· Isolation Forest: an anomaly detection algorithm. It isolates observations by building a random forest of trees, each splitting the samples into different partitions. Anomalies tend to have much shorter paths from the root of the tree.
· KMeans (Selected number of clusters 5 and 3): The k-means algorithm clusters data by trying to separate samples in n groups, minimizing a criterion known as the ‘inertia’ of the groups.
Dataiku provides Auto ML models that adjust the parameters automatically, but of course we can modify them, or add some to the grid search, suggestions, etc.
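For readers without Dataiku, a roughly equivalent sketch of the two model families with scikit-learn might look like this; X again stands in for our feature matrix (payment_percentage plus the cyclical date features).

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

X = np.random.rand(1000, 7)                    # placeholder for our feature matrix

# Isolation Forest: the lower the score, the more anomalous the sample
iso = IsolationForest(random_state=0).fit(X)
anomaly_score = iso.score_samples(X)

# KMeans with 5 clusters: the smallest cluster can be read as the "anomaly" cluster
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
smallest_cluster = np.bincount(labels).argmin()
is_anomaly = labels == smallest_cluster
```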
Model by (activity sector & country) vs global
Using clustering methods to find really meaningful clusters can be very time-consuming, and you can be fooled very quickly. So the first models were trained on data divided by activity sector and country. The models were therefore trained on small datasets, which is very fast (under 5 minutes, with around 1,000 rows in the train dataset).
The results
We tested the approaches proposed above using the same dataset.
Indeed, we selected the Unpaid Invoice Tracker for French industries, and each industry was processed as a separate invoice tracker. Below we propose some examples of the results we obtained.
Signal processing approach
The signal processing approach was tested with a 60-day low-pass filter, a sliding window of one day or 30 days for the STA, and a sliding window of 2 years or 60 days for the LTA. The two-year length allows us to flatten the yearly cycle (Figure 7 & Figure 8).
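As an illustration of the 60-day low-pass step, here is a sketch using SciPy's Butterworth filter on a placeholder daily series; the filter order is an arbitrary choice, and the STA/LTA from the earlier sketch would then be computed on the smoothed signal.

```python
import numpy as np
from scipy.signal import butter, filtfilt

tracker = np.random.randn(970)                       # placeholder for the daily tracker values
cutoff = 1 / 60                                      # cut-off frequency: one cycle per 60 days
b, a = butter(N=4, Wn=cutoff, btype="low", fs=1.0)   # sampling rate: one sample per day
smoothed = filtfilt(b, a, tracker)                   # zero-phase low-pass filtering
```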
The STA/LTA detectors highlight unusual trends in the Unpaid Invoice Tracker (Figure 7). In particular, we were able to observe an increase of the detector simultaneous with the first Covid-19 lockdown in France (from mid-March to mid-May 2020, first gray shadow in Figure 7).
These unusual trends might be positive or negative, but we would need to look at the raw signal to investigate this further.
We can also note that the anomalies around the other lockdowns in France (from Nov. to mid-Dec. 2020 and from April to May 2021), did not seem as significant as the anomalies during the first lockdown.
A threshold adjustment might have detected the missing one, or perhaps the second and third lockdowns had less impact on the economy because they were less strict than the first.
We can also observe in Figure 7 the importance of the sliding window size. Let us focus on the Agro-agriculture industry (Figure 8).
Let’s look at the four highest anomalies detected using the two-year LTA window. We see that the beginning of these anomalies is not found at the same date depending on the STA window size.
Anomalies from the one-day detector appear earlier than anomalies from the 30-day detector. Indeed, as the 30-day detector uses more data, the STA is more flattened. We also note that the highest anomaly in the one-day detector is not the same as in the 30-day detector.
On the detector using the 60-day LTA window, only two anomalies in the Agro-agriculture industry remain clearly visible. As we reduce the 730-day LTA window to 60 days, we expect the difference between the STA and LTA values to be smaller, and therefore the STA/LTA values to be smaller too (Figure 8).
Results with ML methods:
Isolation Forest
We can see the anomaly scores given by the Isolation Forest model for the Transport sector (Figure 9) and the Agriculture sector (Figure 11). Anomalies correspond to the lowest scores. The detected anomalies are reported on the invoice tracker data (Transport: Figure 10 and Agriculture: Figure 12).
KMeans
With the KMeans model we can display clusters, which can correspond to a succession of patterns; the smallest cluster can be interpreted as the anomaly cluster (Figure 13 & Figure 14).
In the case of the Transport sector, we can recover the anomalies detected by the Isolation Forest model in cluster 4, especially during the first French lockdown (Figure 13). One can note that before 2021, the data are well identified as clusters 2 or 4. After that, the model has more difficulty classifying the data.
In the case of the Agriculture sector, it is more difficult to interpret the results of the KMeans clustering (Figure 14). There are no clear periods for the clusters; they seem to alternate all the time, showing no clear impact of the lockdowns.
The clustering can also give additional information when interpreting the trend of KPIs: we can determine which segment we are in and which one follows it. It can also be a starting point for comparing the different models with each other.
Conclusion
· We can really see a difference between activity sectors
· We can see potential in the data and a lot of future work
In this ML section we applied the models to raw data; it could also be interesting to combine the two approaches and train models on data “cleaned” by the signal processing approach.
Comparison
Some anomalies can be found using one method or the other. However, we see that the patterns in April 2020, 2021 and 2022 (yellow boxes) have a high anomaly score with the signal processing approach but were not picked up by the ML approach, most likely because they are very similar to each other. Indeed, this is an anomaly when looking at a one-year timeframe, but not when looking at several years, because it occurs every year. And the ML approach works better at looking for patterns.
We can also use both methods at the same time to adjust their respective parameters so that they find the same anomalies.
Limitations
Signal processing approach
Both filtering and the STA/LTA computation might trigger edge effects if they are not well configured. This can be corrected, but not in real time. The proposed signal processing approach cannot be used to find specific patterns. Some signal processing approaches, such as cross-correlation methods, do search for patterns, but they might be more expensive.
ML approach
Add more features: we used very few features, and it would be great to have more! We know there are others that could be interesting to add, such as features derived from the invoice payment delay.
Limitations across both approaches
The parameters need to be adjusted, and we must design a method to determine where the beginning and the end of an anomaly are once it is detected.
There is also not much data in the Unpaid Invoice Tracker: about two years, of which a year and a half is COVID-related.
Pros and cons of both approaches
In conclusion, we were pleased with the results of our experiment and with both methods, because we were able to:
1. Find anomalies on the Unpaid Invoice Tracker
2. Detect these anomalies using tools
3. Detect the start of an anomaly in near real time
In addition, both methods are highly parameterizable and able to target specific anomalies. The more models the better, in our opinion, for detecting anomalies!
There’s a number of different potential use cases for taking this approach including:
· Forecasting: when we know how to detect and understand anomalies, we will eventually be able to predict the behaviour of a KPI during an event
· Apply to other events: war in Ukraine, container losses, etc.
· Explore further the methods:
o Apply STA/LTA to more complex functions e.g.
§ k-mean characteristic function
§ coefficients of a wavelet decomposition, so that we can focus on various periods of time
· Fraud/intrusion detection (peak of traffic in order to crash a service) → security
· Outliers: manual errors when entering an invoice…
We are excited about integrating these findings within our Datalake and potentially adding new features around these use cases in the coming months. Have you tested these methods before? How did it go for you?
Leave a comment and follow us for more insights on Sidetrade’s Tech Hub!