« aalto.fi « www.tkk.fi « CSE

Skip to main content





Database Research Group

Research topics (online information filtering)

XML filtering

This research topic pertains to an important category of database applications: content-based delivery of XML documents on the internet. Clients (called subscribers) specify their interests (or profiles) by filters expressed in the XPath language, and XML routers in a network forward published XML documents continuously from data providers (or publishers) to subscribers. Each document obtained by a router is forwarded to a subset of neighboring nodes in the network, and the forwarding decisions will be made according to the subscriptions made by the clients as XPath filters.

The routers thus must solve the filtering problem of XML documents: given a set of XPath expressions (the filters) and a stream of XML documents, determine for each document in the stream the filters that match the document. The number of subscriptions, and thus the set of XPath filters to be evaluated, can be large; therefore it is important that the evaluation is efficient and scalable. Finite automata provide a natural and efficient way to represent and process XPath filters. Constructing a deterministic finite automaton for the entire set of XPath filters is not feasible, because the size of the automaton is often exponential in the combined size of the filters. Therefore, the automata-based solutions either rely on simulation of polynomial-size nondeterministic automata or on lazy construction of deterministic automata.

Our solution to the XML filtering problem is based on the classical Aho-Corasick multiple-pattern-matching automaton. Given a set of linear XPath filters without predicates, a backtracking version of the Aho-Corasick pattern-matching automaton is constructed for the set of all keywords (i.e., maximal substrings of XML elements not broken by descendant operators "//" or wildcards "*" extracted from the filters.

Matching with wildcards becomes easier if we know the schema or DTD of the XML documents to be matched and/or filtered, because then single wildcards in the patterns usually have only a small amount of possible substitutes. We have developed an optimization strategy, called pruning, for this case. Pruning has been shown to yield significant performance gains when applied together with our Aho-Corasick-based filtering algorithm as well as algorithms based on nondeterministic or lazy deterministic automata.


Publications: