Formula Engine
Abstract
The Formula Engine is a set of tools that provides to evaluate formulas. This document describes the architecture and general behavior of the Formula Engine and the different components that compose it, and the design choices that where made as well as the limitations of the current implementation.
How to read this document
Graph topology
There is a multiple of graph's topology, this section will list them for futur QA purposes. We will consider a formula as a tree graph. Real life graphs are more complex, but we can consider them as trees for demo purposes. An edge can be backward or forward, but not both. It can also be cross or self referring.
Loops are formed by either a backward edge or several cross edges
- simple tree
graph LR;
A((A)) --> B;
B((B)) --> D((D));
B((B)) --> E((E));
A((A)) --> C((C));
C((C)) --> F((F));- forward edge
graph LR;
A((A)) --> B;
B((B)) --> D((D));
B((B)) --> E((E));
A((A)) --> C((C));
C((C)) --> F((F));
A((A)) --> E((E));- backward edge
graph LR;
A((A)) --> B;
B((B)) --> D((D));
B((B)) --> E((E));
A((A)) --> C((C));
C((C)) --> F((F));
F((F)) --> A((A));- cross edge
graph LR;
A((A)) --> B;
B((B)) --> D((D));
B((B)) --> E((E));
A((A)) --> C((C));
C((C)) --> F((F));
E((E)) --> F((F));- self referring edge
graph LR;
A((A)) --> B;
B((B)) --> D((D));
B((B)) --> E((E));
A((A)) --> C((C));
C((C)) --> F((F));
D((D)) --> D((D));How to read the formula graphs
The graphs provided as example respect the following convention:
graph LR;
A((A)) --> B((B));A and B are 2 formulas (on fields A and B), formula A needs the formula B to be evaluated. In other world we need to evaluate formula B before formula A.
graph LR;
A((A))
B((B));
subgraph rule#1
copyB{{copy field B}}
end
A --> copyB
copyB --> BA is a formula with 1 rule (rule#1) that copies field B
graph LR;
A((A))
A --> ifB
subgraph rule#1
ifB{if field B is empty}
ifB -- false --> copy{{copy field B}}
ifB -- true --> set{{set val}}
end
copy-->B((B));A is a formula with 1 rule (rule#1) that if field B is empty, set a value, otherwise copy field B.
Design choices
Granularity
The granularity of the formula engine is the formula itself. This means that the formula engine is not able to evaluate or optimize rules, actions or conditions.
In some cases, if the granularity had been at rule or action/condition level, it would have been possible to run some rules.
Action / condition granularity
In this example we set up a formula on A and B fields
flowchart LR;
A((A))
copyA{{copy field A}}
setValB{{ set value }}
B((B))
Bif{field B is empty}
A-->copyB
subgraph "rule #1 (target A)"
copyB{{copy field B}}
end
subgraph "rule #2 (target B)"
Bif -- true --> copyA
Bif -- false --> setValB
end
B --> Bif;
copyA-->A
copyB-->BIf at run time when evaluating formula on field A (rule#1), we need to evaluate field B first (rule#2), when evaluating field B two scenarios are possible:
- Field B is empty, and we evaluate B = A (back to rule#1), creating a loop (invalid)
- Field B is not empty, and we evaluate 'set value', which is not a loop (valid)
Rules granularity
In this example we set up a formula on A and B fields
flowchart LR;
A((A))
B((B))
rule1[rule#1]
rule2[rule#2]
rule3[rule#3]
A-->rule1
A-->rule2
rule1-->B
B-->rule3
rule3-->AAt run time, we would be able to evaluate rule 2 but not rule 1 or 3.
Formula granularity
Both condition/action and rule granularity were rejected because it would have been too complex to implement (would have needed a DFS at runtime), maintain, and use by the final user. The formula granularity with the same setup as above exclude both A and B formula from evaluation. Following the same use case as above, the effective dependency graph would be:
graph RL;
B((B)) --> A;
A((A)) --> B;This solution is also more secure because it is recursion-less, preventing infinite loops, and faster because it allows some parallelization.
Loop prevention
Circular dependencies are not allowed. If a formula depends on another formula, we are not able to know which one will be executed first.
graph RL;
B((B)) --> A;
A((A)) --> B;In the above example we need field A to evaluate field B, and field B to evaluate field A
In order to avoid circular dependencies, we have to evaluate the formulas graph topological order. If the topological sort has leftovers, it means there is at least one loop in the graph.
Topological sort
The topological sort is done via Khan's algorithm. It is a linear time algorithm that is able to detect loops in the graph. It is however unable to list the different loops
The main idea is to add to a stack and remove all the nodes that have no outgoing edges (ie leaves), from the graph. If the graph has new leaves after the removal, we repeat the operation until the graph does not have more leaves. Any node that is still in the graph is either in a loop or depends directly or indirectly on a loop, they will be called leftovers.
graph LR;
A((A))
B((B))
C((C))
D((D))
A-->B
A-->C
C-->DIn the above example, the topological sort would be: [B,D], [C], [A] with no leftovers.
graph LR;
A((A))
B((B))
C((C))
D((D))
A-->B
A-->C
C-->D
D-->CIn the above example, the topological sort would be: [B] with leftovers [A,C,D], because C and D are dependent on each other, and A is dependent on C.
graph LR;
A((A))
B((B))
C((C))
D((D))
E((E))
F((F))
G((G))
H((H))
A-->C
A-->B
A-->D
A-->E
B-->D
B-->E
E-->F
C-->F
F-->G
F-->H
H-->DIn the above example, the topological sort would be: [G, D], [H], [F], [C,E], [B], [A] with no leftovers.
Self reference exception
If a formula has a self reference, it will not be considered as a dependency. This behavior is intended to allow field initialization.
For example, if we have a formula on A
graph RL
A((A)) --> B;
D --> A;
subgraph rule #1
B{if A is empty};
B -- true --> D{{set value}};
endThe formula engine will not consider A as a dependency, and will evaluate it. The value of A in the condition will be whatever the item.values.A is at runtime
General behavior
Import
After the main process of the import is completed, a formula graph is generated from the database with only not-archived formulas and rules. The formula graph is then explored using Khan algorithm to check if there is a circular dependency. If there is one or multiple circular dependencies, then each formula in the cycle is marked as invalid, and each formula depending on a loop is marked as invalid. Then the topological order is saved on each valid formula.
In case of an error, additional info can be exported in the logs (details about the nodes, the loops etc).
Evaluation
The evaluation of the formula is done in 3 steps:
- Load the formulas structures in the topological order, with the execution tooling, and evaluate them item by item. Formula marked as invalid are skipped.
- Store change in a file
- The file is then read and its data imported in the database When evaluating a self reference formula, the value of the field is the one in the item.values at runtime.
Security considerations
When running a formula we should consider the issue of infinite loops, and its potential DOS attack. The formula engine is designed to prevent infinite loops, and to be able to detect them. The absence of any recursion prevents them.
Performance considerations
The table import has to do post import checks, load all formulas in memory and graph them. This is both memory and CPU heavy but had no impact on evaluation. The algorithm to graph and explore the formulas itself is O(E+V) where E is the number of edges and V the number of vertices (nodes).
When evaluating the formulas the impact on performance is minimal because the topological order is already known. The implementation of khan algorithm outputs steps, each step containing N formulas to evaluate. The formulas are calculated in parallel, and the steps are calculated sequentially.
Limitations
Topological sort maximum size
- Maximum number of steps in the topological sort. Default is 50.
Goings further
- https://en.wikipedia.org/wiki/Topological_sorting
- https://en.wikipedia.org/wiki/Kahn's_algorithm
- https://www.decisionmodels.com/calcsecretsc.htm
- https://en.wikipedia.org/wiki/Tarjan's_strongly_connected_components_algorithm
- https://notes.eddyerburgh.me/data-structures-and-algorithms/algorithms/graph-traversal