The problem of maintaining complex event filters
Deployments of OSS solutions into large organisations often struggle to ensure that the right information gets to the right people. Monitoring tools tend to generate large numbers of alarms, and various techniques are used to reduce these to manageable levels – with varying degrees of success.
For this discussion, let us take Fault Management Root-Cause Analysis (RCA) as an example of a modern-day filtering technique.
The RCA approach is mainly focussed on the needs of the resolver groups and doesn’t play so well in a Service-oriented Management paradigm. To illustrate this: consider a switch that fails, causing a number of connected servers to become unreachable. The resolver engineers do indeed need to know the cause so that they can fix it, but a service model based on an understanding of the service end points needs to key on the symptom events (“Server unresponsive”) rather than the root cause (“Switch down”). This means that, when service impact is an output of the system, root-cause analysis systems tend to need to present both the root cause and the symptoms to the users.
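To make this concrete, here is a minimal sketch – all node names, service names, and event fields are hypothetical – of a service model keyed on end points. The symptom events carry the end-point identity that the model can look up; the root-cause event does not appear in the model at all.

```python
# Hypothetical service model: service end points (servers) -> business services.
SERVICE_MODEL = {
    "server-01": ["Online Banking"],
    "server-02": ["Online Banking", "Payments API"],
    "server-03": ["Payments API"],
}

def impacted_services(event):
    """Return the business services affected by an event.

    Only symptom events such as 'Server unresponsive' name an end point that
    the service model can key on; the root cause ('Switch down') names a
    switch the model knows nothing about.
    """
    return SERVICE_MODEL.get(event.get("node"), [])

events = [
    {"node": "switch-07", "summary": "Switch down", "type": "root-cause"},
    {"node": "server-01", "summary": "Server unresponsive", "type": "symptom"},
    {"node": "server-02", "summary": "Server unresponsive", "type": "symptom"},
]

for ev in events:
    print(ev["summary"], "->", impacted_services(ev) or "no service impact derivable")
```

The root cause yields nothing for the service view, which is exactly why both the root cause (for the resolvers) and the symptoms (for service impact) end up being presented.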
Because the symptoms must be included, manually maintained filters are generally required to strip out the irrelevant events and/or to ensure that the remaining events are routed to the right groups – along the lines of the sketch below.
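The sort of filter this implies might look something like the following – the field names, patterns, and queue names are invented for illustration, not taken from any particular product.

```python
import re

def route(event):
    """Return the queue an event should be sent to, or None to discard it."""
    summary = event.get("summary", "")

    # Strip: root-cause events from the core network are handled elsewhere.
    if event.get("type") == "root-cause" and event.get("domain") == "core-network":
        return None

    # Route: unresponsive servers in the payments estate go to the payments team.
    if re.search(r"unresponsive", summary, re.IGNORECASE) and event.get("estate") == "payments":
        return "payments-ops"

    # Everything else falls through to a general queue.
    return "noc-general"

print(route({"type": "symptom", "summary": "Server unresponsive", "estate": "payments"}))  # payments-ops
```

Each new service, team, or naming convention adds another clause of this kind, and the clauses are maintained by hand.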
This illustrates the problem: the key goal of RCA – to automatically reduce the presented events to the set that the operators need to see – is not actually achieved automatically by Fault Management systems. Manually maintained filters are still needed to augment the logic, even when highly intelligent analysis techniques are used. This causes a number of problems, such as:
Lack of proper design. The construction of complex filters using pattern matching and boolean operations is not easy to do well, and is more properly understood as a programming task that should go through a formal design / review / implement / test / document cycle. This cycle is usually ignored, and filters are added in a piecemeal manner by non-programmers. This makes mistakes likely and hard to detect, and the effect of a badly coded filter is that important events can easily be discarded (see the sketch after this list).
Unclear allocation of ownership. It is often hard for an organisation to know where the responsibility for maintaining these filters lies.
Filter bloat. It is not uncommon for user-maintained filter sets to grow over time to the point where they are both detrimental to the performance of the system and impossible to debug or maintain.
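As a hypothetical illustration of the first problem, the filter below contains the kind of subtle mistake that piecemeal, untested changes make likely – here a boolean-precedence error that silently discards events it was never meant to touch. All field values are invented.

```python
def discard(event):
    """Intended rule: discard informational events from the lab or test domains."""
    # Written in a hurry as:
    return event["severity"] == "info" and event["domain"] == "lab" or event["domain"] == "test"
    # Because 'and' binds tighter than 'or', this actually means
    #   (severity == "info" and domain == "lab") or (domain == "test")
    # so EVERY event from the test domain is discarded, including critical ones.

def discard_fixed(event):
    # The intended rule needs explicit grouping.
    return event["severity"] == "info" and event["domain"] in ("lab", "test")

print(discard({"severity": "critical", "domain": "test"}))        # True: wrongly discarded
print(discard_fixed({"severity": "critical", "domain": "test"}))  # False: kept
```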
One alternative is to broaden the scope of the service model to include the service paths – but this is potentially very complex, requiring end-to-end path discovery and detailed knowledge of path protection topologies etc. Another alternative is to use true end-to-end transaction monitoring by way of probes installed in the client systems.
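A very rough sketch of what the first alternative implies – the data structures are entirely hypothetical – hints at why it is complex: every path between the end points, including protection paths, has to be discovered and kept current before impact can be computed.

```python
# Discovered end-to-end paths: service -> alternative (protected) paths,
# each path being the ordered list of network elements it traverses.
SERVICE_PATHS = {
    "Online Banking": [
        ["server-01", "switch-07", "router-03", "firewall-01"],  # primary path
        ["server-01", "switch-08", "router-04", "firewall-01"],  # protection path
    ],
}

def service_impacted(service, failed_element):
    """A service is impacted only if the failed element sits on every path."""
    paths = SERVICE_PATHS.get(service, [])
    return bool(paths) and all(failed_element in path for path in paths)

print(service_impacted("Online Banking", "switch-07"))    # False: protection path survives
print(service_impacted("Online Banking", "firewall-01"))  # True: on every path
```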
An altogether different approach, which is potentially much simpler to maintain, is to borrow the idea of trainable “spam filters” from the world of email. We shall investigate this approach more fully in subsequent blogs.
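As a taste of what is to come, the sketch below applies a tiny naive Bayes classifier – the standard technique behind simple spam filters – to event summaries, learning from operator feedback which events are worth presenting. It is purely illustrative: the class, fields, and training labels are all assumptions, and a real implementation would need per-queue models, feedback capture, decay, and much more.

```python
import math
from collections import Counter

class EventFilter:
    """A minimal naive Bayes 'keep or discard' classifier over event summaries."""

    def __init__(self):
        self.word_counts = {"keep": Counter(), "discard": Counter()}
        self.class_counts = Counter()

    def train(self, summary, label):
        """label is 'keep' or 'discard', derived from what operators actually did."""
        self.class_counts[label] += 1
        self.word_counts[label].update(summary.lower().split())

    def score(self, summary, label):
        # Log prior plus Laplace-smoothed log likelihood of the summary's words.
        prior = math.log((self.class_counts[label] + 1) / (sum(self.class_counts.values()) + 2))
        vocab = set(self.word_counts["keep"]) | set(self.word_counts["discard"])
        denom = sum(self.word_counts[label].values()) + len(vocab) + 1
        likelihood = sum(
            math.log((self.word_counts[label][w] + 1) / denom)
            for w in summary.lower().split()
        )
        return prior + likelihood

    def classify(self, summary):
        return max(("keep", "discard"), key=lambda lbl: self.score(summary, lbl))

f = EventFilter()
f.train("server unresponsive payments estate", "keep")
f.train("link flap cleared", "discard")
f.train("interface utilisation threshold cleared", "discard")
print(f.classify("server unresponsive in branch estate"))  # keep
print(f.classify("threshold cleared on interface"))        # discard
```

The attraction is that the “filter” is maintained by the operators’ everyday actions rather than by hand-written pattern-matching rules.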