Understanding the Basics of the Bayesian Bucket: How It Works and How to Use It
As a method of statistical inference, Bayesian inference applies Bayes' theorem to continually update the probability of a hypothesis as new evidence becomes available. Bayesian reasoning also has interesting applications in cybersecurity and the detection of malicious behavior. In this article, we discuss one such application, but to understand it, we first need to take a step back and look at how cyber attacks are detected today.
Like many intrusion detection systems in use today, such as Fail2ban, the CrowdSec Security Engine is built, at its core, on the leaky bucket algorithm. This algorithm works as follows: every time there is an event we are trying to detect, say a failed login attempt, the event enters a bucket with a certain capacity and leak speed. If a user fails once but then manages to log in, maybe due to a typo in their password, this event will eventually leak from the bucket, returning it to an empty state. If, however, the user fails multiple times, events keep entering the bucket without giving it time to leak them out, until it reaches capacity and overflows, triggering the Security Engine to report an attack and proceed with remediation.
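To make the mechanism concrete, here is a minimal, self-contained Python sketch of the leaky bucket idea. This is not CrowdSec's actual implementation; the capacity, leak interval, and event timestamps are purely illustrative.

```python
# A toy leaky bucket: events are timestamps, the bucket drains one event's
# worth of "level" every `leak_interval` seconds, and overflows past capacity.
class LeakyBucket:
    def __init__(self, capacity: int, leak_interval: float):
        self.capacity = capacity
        self.leak_interval = leak_interval
        self.level = 0.0
        self.last_event = None

    def pour(self, timestamp: float) -> bool:
        """Add one event; return True if the bucket overflows."""
        if self.last_event is not None:
            # Leak out whatever drained since the previous event.
            elapsed = timestamp - self.last_event
            self.level = max(0.0, self.level - elapsed / self.leak_interval)
        self.last_event = timestamp
        self.level += 1
        return self.level > self.capacity


bucket = LeakyBucket(capacity=5, leak_interval=10.0)
# Six failed logins within one second: the bucket fills faster than it leaks,
# so the last pour overflows.
results = [bucket.pour(t) for t in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)]
```

A single failed login followed by a long pause, on the other hand, leaks out before the bucket ever fills.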
While a wide range of cyber attacks, from basic brute force to targeted attacks like Log4Shell, can be detected this way, not all behaviors can be captured with leaky buckets. This is why we are continually improving the detection capabilities of our engine. With CrowdSec Security Engine 1.5, we released the conditional bucket, allowing users to detect more intricate behaviors such as successful brute-force attempts or impossible travel. With the 1.5.3 patch, we are releasing the Bayesian bucket, a new bucket type that uses Bayesian inference under the hood. Let's jump in and see how the Bayesian bucket works and how you can use it to detect more obscure behaviors.
Introduction to Bayesian inference
This section is mainly to bring people who have never done any type of Bayesian inference up to speed on the basics. Feel free to skip to the next section if you already feel confident about your knowledge of the topic.
Bayesian inference is fundamentally about improving predictions based on new information. As an example, imagine you're a night shift worker in a cellar without windows. In the morning, when you leave, you want to know whether it is currently raining outside. Let's assume you live in Ticino, the sunny part of Switzerland, where it rains only about 5% of the time. So, the probability that it is raining is P(it is raining) = 0.05.
Before you leave, you see your colleague enter. She has brought her umbrella with her. Assuming she doesn't just bring her umbrella to work every day, this gives you new information about the weather outside. In mathematical terms, the probability we are now interested in is P(it is raining|colleague brought her umbrella), read as "probability it is raining given that our colleague brought her umbrella". This is called a conditional probability, and it is the bread and butter of Bayesian inference. Using Bayes' theorem and some knowledge about our colleague's umbrella habits, we can actually calculate this probability, improving our prediction of whether it is raining or not. This process of applying Bayes' theorem is called Bayesian inference.
Say our colleague lives in a house with windows, so there's a 95% chance that when she leaves in the morning and it is raining, she brings her umbrella. This is the probability P(colleague brought her umbrella|it is raining). Since she works the day shift, she also checks the weather report so she doesn't get wet in the evening. Say there is a 10% chance that she brings her umbrella even if it is not raining right now. This is the probability P(colleague brought her umbrella|it is not raining).
Bayes' theorem now tells us that:

P(it is raining|colleague brought her umbrella) = P(colleague brought her umbrella|it is raining) * P(it is raining) / (P(colleague brought her umbrella|it is raining) * P(it is raining) + P(colleague brought her umbrella|it is not raining) * P(it is not raining))
Plugging in the numbers, we get P(it is raining|colleague brought her umbrella) = 0.33, more than six times the baseline probability, called the prior. It's still fairly unlikely, we're in Ticino after all, but we might start worrying about getting our hair wet when we return home.
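As a quick sanity check, the arithmetic can be reproduced in a few lines of Python (the variable names are mine, not part of any CrowdSec tooling):

```python
# Worked numbers from the umbrella example above.
p_rain = 0.05                    # prior: P(it is raining)
p_umbrella_given_rain = 0.95     # P(umbrella | raining)
p_umbrella_given_no_rain = 0.10  # P(umbrella | not raining)

# Bayes' theorem: P(raining | umbrella)
posterior = (p_umbrella_given_rain * p_rain) / (
    p_umbrella_given_rain * p_rain
    + p_umbrella_given_no_rain * (1 - p_rain)
)
print(round(posterior, 2))  # 0.33
```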
The same formula can be iteratively applied to any piece of information we receive. Another colleague might be using their bicycle less often when it rains, so we check if they brought their helmet, etc. Each time, we gradually improve our prediction. The Bayesian bucket implements this procedure, but instead of wondering about the rain, at CrowdSec, we are trying to catch malicious actors and instead of our colleagues' helmets and umbrellas, we look at the traces these malicious actors leave in our logs.
Inside the Bayesian bucket
Under the hood, the Bayesian bucket is a Bernoulli naive Bayes classifier. This is fancy math talk for the fact that we consider all events to be independent binary random variables when conditioned on the class. Since that sentence is probably still confusing as hell, let's take it apart step by step.
Independent here roughly means that none of the conditions we feed into the bucket gives us information about the other conditions. Dependence can arise, for example, when we have multiple conditions on the http-user-agent: a condition that the user agent is from a Windows machine and another that it is from a Mac are not independent, since a machine cannot be a Mac and run Windows at the same time. For most conditions, however, this shouldn't be an issue, and even if there is a correlation you are unaware of, naive Bayes is fairly robust to violations of the independence assumption.
The second restriction is that we only use Boolean conditions. This makes the scenarios very easy to configure but removes use cases where we want to make decisions based on the count of certain signals or based on enum types (such as Windows, Mac, Mobile, and Linux for the user agent). We are working on adding these in a future update.
For the scenario configuration, we use the following YAML format.
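The example below is illustrative: the scenario name, filter, condition expressions, and probability values are placeholders, and the exact schema is documented in the CrowdSec hub documentation. The bayesian_* parameters correspond to the quantities discussed in the next section.

```yaml
# Illustrative sketch of a Bayesian scenario; values and expressions are
# placeholders, not a tuned detection.
type: bayesian
name: you/http-bayesian-example
description: "Example Bayesian scenario for suspicious HTTP behavior"
filter: "evt.Meta.service == 'http'"
leakspeed: 10s
capacity: -1
bayesian_prior: 0.05
bayesian_threshold: 0.9
bayesian_conditions:
  - condition: any(queue.Queue, {.Meta.http_path == "/wp-login.php"})
    prob_given_evil: 0.8
    prob_given_benign: 0.05
  - condition: any(queue.Queue, {.Meta.http_status == "404"})
    prob_given_evil: 0.7
    prob_given_benign: 0.2
```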
We discuss how the parameters should be set in the next section. Internally, the bucket works like this: every time an event is poured into the bucket, it iterates through all the specified conditions. Each condition is evaluated against the current contents of the bucket, and the posterior P(evil|condition==state) is calculated using Bayes' theorem. This posterior then becomes the prior for the next condition. After all conditions have been evaluated, the resulting posterior is compared with the threshold specified in the YAML. If the posterior is below the threshold, it is reset to the prior specified in the YAML and the bucket waits for the next event. If the posterior exceeds the threshold, the bucket overflows, triggering the scenario in the Security Engine.
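The update loop described above can be sketched in Python. This is a simplified model, not the engine's actual code; here a condition is a predicate over the bucket's queue together with its two conditional probabilities, mirroring the YAML parameters discussed in the next section.

```python
# One Bayes step: P(evil | condition == observed). When the condition is not
# observed, we update with the complementary probabilities.
def bayesian_update(prior, prob_given_evil, prob_given_benign, observed):
    if observed:
        p_e, p_b = prob_given_evil, prob_given_benign
    else:
        p_e, p_b = 1 - prob_given_evil, 1 - prob_given_benign
    return (p_e * prior) / (p_e * prior + p_b * (1 - prior))


def pour(queue, conditions, prior, threshold):
    """Re-evaluate every condition against the queue; True means overflow."""
    posterior = prior
    for cond in conditions:
        observed = cond["predicate"](queue)
        # The posterior of one condition becomes the prior of the next.
        posterior = bayesian_update(
            posterior, cond["prob_given_evil"], cond["prob_given_benign"], observed
        )
    return posterior > threshold
```

With two conditions (say, hypothetical hits on wp-login.php and xmlrpc.php), a queue matching both pushes the posterior well past a 0.5 threshold, while an innocuous queue stays near the prior.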
How to set the parameters for the Bayesian bucket
The first step in configuring a Bayesian bucket is to get a sample set of IPs (let's call them evil IPs) that you want to catch. Take a good look at these sample IPs to see if you can spot patterns. For example, what kind of http-paths do they request? Next, come up with some conditions based on your observations. Think of this as configuring the filter field of a normal scenario; the conditions support the same expressions as that field. These conditions should be elements that differentiate the evil IPs from the other IPs you might see.
After you have set your conditions, define a top-level filter as normal, which includes all the conditions. For example, if all your conditions are HTTP-based you could go with
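For example, assuming your logs go through the standard CrowdSec HTTP parsers (the exact log_type values depend on your parser setup):

```yaml
# Catch all HTTP access and error log events in the bucket.
filter: "evt.Meta.log_type in ['http_access-log', 'http_error-log']"
```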
This is to ensure that all the events you want to detect end up in your bucket.
Now that you have specified the filter, you can set your parameters. We define benign IPs as the IPs that pass the filter but shouldn’t trigger the scenario. The parameters are now set by the following formula:
Bayesian prior: #evil_ips / (#evil_ips + #benign_ips)

And for each condition:

Prob given evil: #evil_ips_that_trigger_condition / #evil_ips

Prob given benign: #benign_ips_that_trigger_condition / #benign_ips
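Concretely, these estimates are just ratios of counts over your labeled samples. A short Python sketch, with entirely made-up IPs and counts for illustration:

```python
# Hypothetical labeled data: IPs that pass the filter, split into the evil
# sample set and the benign rest, plus which of them triggered one condition.
evil_ips = {"198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4"}
benign_ips = {f"203.0.113.{i}" for i in range(1, 97)}  # 96 benign IPs

evil_triggering = {"198.51.100.1", "198.51.100.2", "198.51.100.3"}
benign_triggering = {"203.0.113.1", "203.0.113.2"}

bayesian_prior = len(evil_ips) / (len(evil_ips) + len(benign_ips))   # 0.04
prob_given_evil = len(evil_triggering) / len(evil_ips)               # 0.75
prob_given_benign = len(benign_triggering) / len(benign_ips)         # ~0.02
```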
Note: It is not recommended to deviate from these formulas unless you have solid statistical reasoning to back it up, as hand-tuning the probabilities easily leads to overfitting.
The threshold should now be set according to your business needs. If you want fewer false positives, you can set the threshold higher. If you want fewer false negatives, you can set it lower. A good starting point is 0.5.
I understand that Bayesian reasoning is not everyone's cup of tea (statistics nerds aside), but I do hope you enjoyed this quick tour of our new Bayesian feature. If you are intrigued to try it out yourself and still have questions or need some help setting it up, feel free to contact me at email@example.com or ask around in our Discord. Your feedback will also help me see what kind of tooling might help our community improve their threat detection.
Download the CrowdSec Security Engine today and give it a go.