Using Elasticsearch for anomaly detection in time series
Low code alerting for unusual events
I used a 14-day free trial of Elastic Cloud to try out its anomaly detection capabilities with publicly available data on crime in New York City. A dashboard view of this data, containing 5M arrest records from 2006 to 2021, is also available on the NYC Open Data website. I aggregated the arrests to a daily count per borough and loaded the result into my Elastic Cloud instance with a machine learning node enabled.
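The aggregation step is straightforward in pandas. This is a sketch with synthetic stand-in rows; the column names (ARREST_DATE, ARREST_BORO) are assumptions about the dataset's schema, not something taken from the post.

```python
import pandas as pd

# Synthetic stand-in for the NYPD arrest records. The column names
# ARREST_DATE and ARREST_BORO are assumptions for illustration.
records = pd.DataFrame({
    "ARREST_DATE": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "ARREST_BORO": ["K", "M", "K", "K"],
})
records["ARREST_DATE"] = pd.to_datetime(records["ARREST_DATE"])

# Aggregate to one row per day per borough with an arrest count.
daily = (records.groupby(["ARREST_DATE", "ARREST_BORO"])
                .size()
                .reset_index(name="arrest_count"))
print(daily)
```

The resulting day/borough/count table is what gets loaded into Elasticsearch.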
Elastic Cloud bundles Elasticsearch (7.16.3 in my trial) and Kibana, products customarily used to evaluate time series log data but usable with any data represented as JSON documents. Elasticsearch is a search engine built on the Lucene library that was initially open source under the Apache 2.0 license and was re-licensed by Elastic NV in early 2021. The license change was related to a dispute with Amazon, which has since rebranded its offering as OpenSearch. Over time the Elastic and Amazon OpenSearch products can be expected to diverge, but at the moment their capabilities are still quite similar.
My Elastic Cloud deployment consisted of multiple instances and zones, including the machine learning node.
To load the NYPD arrest data I used the Data Visualizer import menu, which is accessible within the machine learning product as an alternative to Logstash, to create an index and an index pattern pointing at it.
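For reference, the same index could be created directly against the cluster. Below is a sketch of a plausible mapping for the aggregated data, expressed as the request body for the create-index API; the index and field names are assumptions, not what the import wizard actually generated.

```python
# Hypothetical mapping for the aggregated arrest data. Field names
# (arrest_date, borough, arrest_count) are illustrative assumptions.
arrest_index_body = {
    "mappings": {
        "properties": {
            "arrest_date": {"type": "date"},
            "borough": {"type": "keyword"},
            "arrest_count": {"type": "long"},
        }
    }
}

# Against a live cluster this body could be sent with the official client:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("https://<cloud-endpoint>:9243", api_key="...")
#   es.indices.create(index="nypd-arrests", body=arrest_index_body)
```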
During import the data is processed and a schema is inferred from the data types observed in the data.
Once the data is loaded I can set up a machine learning job. I pick a time range and the metric I want to track, and I can also specify influencers to consider when an anomaly is detected. Influencers may help explain what caused an event, or which segment of the data is associated with it. A variety of menus are available to set up single-metric or multi-metric jobs.
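Under the hood, a job like this corresponds to a request body for the anomaly detection job API (PUT _ml/anomaly_detectors/&lt;job_id&gt;). The sketch below shows what such a body might look like for this use case; the field names (arrest_count, borough, arrest_date) are assumptions rather than the exact configuration the wizard produced.

```python
# Hypothetical anomaly detection job body for daily arrests per borough.
arrest_job = {
    "description": "Daily arrest counts per borough",
    "analysis_config": {
        # Bucket the data into one-day intervals.
        "bucket_span": "1d",
        # One detector per borough partition, looking for unusually
        # high daily totals.
        "detectors": [
            {
                "function": "high_sum",
                "field_name": "arrest_count",
                "partition_field_name": "borough",
            }
        ],
        # Influencers help explain which segment drove an anomaly.
        "influencers": ["borough"],
    },
    "data_description": {"time_field": "arrest_date"},
}
```

The GUI wizards build a body along these lines for you; the Advanced option exposes more of it directly.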
I selected a multi-metric job and generated a list of anomalies flagging higher-than-usual arrest frequencies per day. While a few anomalies can be found on individual days, the more notable observation is the general decrease in arrest frequency over time. The anomaly detection algorithm identifies trend and seasonality in the data and accounts for these patterns when identifying anomalies. Email or other types of alerts can be configured to notify users when anomalies have been detected.
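To make the trend/seasonality point concrete, here is a toy version of that idea in plain numpy: remove a linear trend and a weekly profile, then flag points whose residual z-score is extreme. Elastic's actual models are probabilistic and far more sophisticated, so this is only an illustrative sketch, not their algorithm.

```python
import numpy as np

def flag_anomalies(counts, period=7, z_threshold=3.0):
    """Toy seasonality-aware detector: detrend, remove the weekly
    profile, and flag residuals with |z| above the threshold."""
    counts = np.asarray(counts, dtype=float)
    t = np.arange(len(counts))

    # Fit and remove a linear trend (captures the long-term decline).
    slope, intercept = np.polyfit(t, counts, 1)
    detrended = counts - (slope * t + intercept)

    # Estimate and remove the average profile for each day of the week.
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    residual = detrended - seasonal[t % period]

    # Flag points whose residual is an outlier.
    z = (residual - residual.mean()) / residual.std()
    return np.where(np.abs(z) > z_threshold)[0]

# Eight weeks of synthetic counts: weekly pattern, declining trend, noise.
rng = np.random.default_rng(0)
weekly = np.tile([120, 100, 95, 90, 100, 140, 150], 8)
counts = weekly - 0.5 * np.arange(56) + rng.normal(0, 3, 56)
counts[30] += 80  # inject a spike
print(flag_anomalies(counts))  # the injected spike should be flagged
```

Because the weekly highs and the downward drift are modeled out first, an ordinarily busy Saturday is not flagged, while a genuine spike is.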
While Elasticsearch/Kibana provides a low-code environment for time series anomaly detection, it can be challenging to use. The options that can be configured through the GUI menus are somewhat limited, although the Advanced option allows fuller control. The ability to move results from the machine learning job output into a dashboard is also limited.
Another issue with Elasticsearch/Kibana is that the user persona for the tool is somewhat inconsistent. The target user wants a low-code environment and can evaluate the meaning of anomalies, yet must also be comfortable reading and editing JSON, or writing Painless scripts, to do any data cleaning or feature engineering needed before the anomaly detection runs.
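As an example of the kind of Painless scripting involved, an ingest pipeline with a script processor could expand one-letter borough codes before indexing. This is a hypothetical pipeline I wrote for illustration, not one from the post; the field name `borough` and the code-to-name mapping are assumptions.

```python
# Hypothetical ingest pipeline using a Painless script processor to
# expand NYPD one-letter borough codes into full names at index time.
borough_pipeline = {
    "description": "Expand borough codes to full names",
    "processors": [
        {
            "script": {
                "lang": "painless",
                "source": """
                    Map names = ['B': 'Bronx', 'K': 'Brooklyn',
                                 'M': 'Manhattan', 'Q': 'Queens',
                                 'S': 'Staten Island'];
                    if (ctx.borough != null && names.containsKey(ctx.borough)) {
                        ctx.borough = names.get(ctx.borough);
                    }
                """,
            }
        }
    ],
}
```

Small transformations like this are easy enough, but they still require the low-code user to drop into JSON and a scripting language.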
Additional investment in the GUI's ease of use would open up the potential user base for this powerful time series analysis product. Better integration between the machine learning components and the end-user dashboards would also be helpful.
So, what about the crime? It seems New York City has never been safer. New lows were reached during the early days of the COVID-19 pandemic, so it seems reasonable to expect arrests to rise again as mobility increases. The multi-year trend of decreasing arrests, though, is quite dramatic. If we can assume that arrest rates are a proxy for crime rates, this indicates great progress has been made over time in reducing crime in New York City.