When we talk about the curse of dimensionality, we often refer to the difficulty that arises when analyzing and organizing data in high-dimensional spaces.
In this blog, I will talk about a less known problem related to decision trees, which has a lot of similarities to data analytics in high-dimensional spaces.
Decision trees can be explained by this simple three-step process:
- Learn rules from data
- Apply each rule at each node
- Classification is at the leafs of the tree
Assume a simple example: you are asked to create a rule from two input data sources (e.g. temperature and humidity). Each data source measurement is sampled in one of three states (e.g. low, medium and high). The final decision of the reasoning process is a TRUE/FALSE statement that depends on these two inputs.
As the figure above shows, this leads to 18 leaf nodes (red/green dots) and overall 31 nodes for only two variables! We have simplified the example by assuming only subset of AND Boolean functions (see later) for a given combination. For instance, we didn't model (X is low) OR (Y is either medium or high) -> as that would be mean connecting some other branches!
The depth of the tree grows linearly with the number of variables, but the number of branches grows exponentially with the number of states. Decision trees are useful when the number of states per variable is limited (e.g. binary YES/NO) but can become quite overwhelming when the number of states increases. In general case, for people that are interested in little bit of maths:
How many distinct decision trees do we have with 'n' Boolean attributes?
A = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^2^n
e.g.: With 6 Boolean attributes we get 18,446,744,073,709,551,616 !!!
Boolean functions can be expressed using this table
which in the case of n=2 gives 16
More about this you can find in this Berkeley course on AI, topic on decision trees, page 11 and 12
Waylay Rules Engine vs. Decision trees
In order to show the difference between how we model rules in waylay compared to CEP and Decision trees engines, I have taken the example from this blog: “How-to: Build a Complex Event Processing App on Apache Spark and Drools“.
In this example (you can find the blog at this link), the author considers running CEP for sepsis-shock analysis using Spark, HBase and Drools, evaluating the conditions at runtime. Sepsis analysis is described here:
- When a patient presents with two or more SIRS criteria but with hemodynamic stability (i.e. blood pressure at baseline), a clinical assessment must be made to determine the possibility of an infectious etiology.
- If an infection is suspected or confirmed, the patient is diagnosed with Sepsis and a lactate level is obtained to determine the degree of hypoperfusion and inflammation. A lactate level ≥ 4 mmol/L is considered diagnostic for Severe Sepsis, and aggressive management with broad spectrum antibiotics, intravenous fluids, and vasopressors should be initiated (aka EGDT).
- Patients that present a suspected or confirmed infection AND hemodynamic instability should immediately be treated for Septic Shock. While SIRS criteria will likely be present in these patients, aggressive management should not be delayed while waiting for laboratory values such as the WBC or lactate.
This is what the author says (part of the text put in bold by me):
"As you can see from the above picture, a patient would meet SIRS criteria if any two vitals go out of range in a 24-hour window. These vitals arrive at different times, so typically you would use HBase to re-construct a patient state every time we get a new vital reading and then run though the rules. For the sake of this example, lets assume that you are getting all the patient’s vitals in every event, and skip the snapshot/state-building step."
What is interesting is that he already mentioned one issue, which is about data coming in different time intervals. Waylay engine natively solves this problem by eviction policy on the sensor level, but for the moment, let’s stay focused at decision trees issues.
In order to solve the problem the author is using Drools rules expressed in the excel file (you can read more about how this is done at this link):
As described earlier, the excel sheet becomes incredibly big, as you can imagine from the inherited problem of branching trees. What is also not so great is that you start seeing conditions and even the code definitions (not the code itself) inside the excel sheet.
Here is how the author explains the excel, it speaks for itself:
In the above picture:
- All the light red-shaded cells are links back to code.
- All the orange-shaded cells contain the values set after a rule is successfully evaluated.
- All the green-shaded cells contain ranges/values on the incoming data to satisfy a given rule.
- All the blue-shaded cells are names of rules and their conditions.
Only once all that is done, you can start the Spark, HBASE and Drools integration.
How long do you think it takes to implement this use case? How do you test it? How do you visualise rule at runtime?
Download the Waylay rules engine introduction guide
Majority voting in decision trees?
One thing that often gets swept under the carpet is that decision trees have a problem of expressing majority "voting", like in the example of SIRS, where at least 2 out of 4 conditions needed to be TRUE in order to pass the SIRS criteria.
Of course you can make this condition in the code, but then what is the point in doing this visually (or in the excel sheet). From Drools document, here is how you deal with multiple conditions:
So, in the example above, you would still need to code a rule that would trigger if at least 2 out of 4 conditions are crossed.
Also note the warning on the memory requirements for large decision tables - as you might guess it comes from branching...
Creating the same example using WAYLAY
Can we do it better? Let’s do it the Waylay way:
The Waylay engine allows compact representation of logic. Combining variables is simplified by adding aggregation nodes. The relation between variables and their states is expressed via boolean logic. More about it here:
Now let's try the same example. This is our template created in 5 minutes, ready for testing (using test data) and deployment!
This is how we have created a majority rule example, one gate in the rule for SIRS criteria (3x3x3x2=54 combinations for that gate, where one of the nodes is actually a gate - PaC02OrRespRateIssue, with 3x3 aggregate). So this gate already shows 243 combinations!
Another subrule which we need to add combines the current and previous meassurements (Lactic acidosis, SBP <90 or SBP drop ≥ 40 mm Hg of normal):
Taking into account previous meassurements was easily modeled using the function node , with expression
<SBP.parameter>[-1] - <SBP.parameter> and threshold of 40, which was then combined with the threshold crossing on the current meassurement:
Once the rule is ready, we can easily visually inspect different conditions and check if all things are properly modelled:
We can also test the logic using the REST interface, by running the template against the test data set, or simply play the file live and see what happens, like in the video below, where the nurse got a phone call when the patient got in trouble.
And here is the video, enjoy:
How about nodered?
In this blog, we used Drools to show the complexity inhereted in using existing CEP rules engines for decision making. Readers are encouraged to try to model the same Sepsis analysis example using nodered or any other flow engine.