Frequent subtree mining

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a measure of how often a pattern occurs in the trees of the database) is over a given threshold. It is a more general form of the maximum agreement subtree problem.

Definition

Frequent subtree mining is the problem of finding all patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern.
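As a minimal sketch of this support count, assume rooted, ordered, labeled trees stored as nested (label, children) tuples, and assume an induced, order-preserving notion of subtree, which is only one of several choices used in practice. The names matches_at, occurs_in, and support are illustrative and do not come from any library.

```python
# Assumption: a tree is a pair (label, children), children being a tuple of trees.

def matches_at(pattern, node):
    """True if pattern embeds with its root mapped onto node (order-preserving)."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    i = 0  # index of the next pattern child still to be placed
    for child in n_children:
        # Greedy left-to-right placement is enough for an order-preserving match.
        if i < len(p_children) and matches_at(p_children[i], child):
            i += 1
    return i == len(p_children)

def occurs_in(pattern, tree):
    """True if pattern matches at the root of tree or at any descendant."""
    if matches_at(pattern, tree):
        return True
    return any(occurs_in(pattern, child) for child in tree[1])

def support(pattern, database):
    """Number of trees in the database containing at least one occurrence."""
    return sum(1 for tree in database if occurs_in(pattern, tree))

# Hypothetical three-tree database.
t1 = ("a", (("b", ()), ("c", (("d", ()),))))
t2 = ("a", (("c", (("d", ()),)), ("b", ())))
t3 = ("x", (("a", (("b", ()), ("c", ()))),))
database = [t1, t2, t3]

pattern_abc = ("a", (("b", ()), ("c", ())))
pattern_cd = ("c", (("d", ()),))
print(support(pattern_abc, database))  # 2: t2 has b and c in the other order
print(support(pattern_cd, database))   # 2: the c node in t3 has no child d
```

Each tree contributes at most 1 to the count no matter how many times the pattern occurs inside it, which is what distinguishes this per-tree support from an occurrence count.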

Formal definition

The problem of frequent subtree mining has been formally defined as:

Given a threshold $\mathit{minfreq}$, a class of trees $C$, a transitive subtree relation $P \preceq T$ between trees $P, T \in C$, and a finite set of trees $D \subseteq C$, the frequent subtree mining problem is the problem of finding all trees $\mathcal{P} \subset C$ such that no two trees in $\mathcal{P}$ are isomorphic and

$$\forall P \in \mathcal{P}: \mathit{freq}(P, D) = \sum_{T \in D} d(P, T) \geq \mathit{minfreq},$$

where $d$ is an anti-monotone function such that if $P' \preceq P$ then $\forall T \in C: d(P', T) \geq d(P, T)$.
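One consequence of the anti-monotone condition can be worked out directly. As an illustrative assumption, take d to be the indicator of the subtree relation; transitivity of the relation then makes this d anti-monotone. Summing the inequality over the database shows that a pattern is never more frequent than any of its subtrees, which is the property that lets a miner discard every extension of an infrequent pattern.

```latex
% Assumption for illustration: d is the indicator of the subtree relation.
d(P, T) =
\begin{cases}
  1 & \text{if } P \preceq T \\
  0 & \text{otherwise}
\end{cases}

% Summing d(P', T) \geq d(P, T) over all T \in D (possible since D \subseteq C):
P' \preceq P \;\Longrightarrow\;
\mathit{freq}(P', D) = \sum_{T \in D} d(P', T)
\;\geq\; \sum_{T \in D} d(P, T) = \mathit{freq}(P, D)

% Contrapositive, usable as a pruning rule:
\mathit{freq}(P', D) < \mathit{minfreq}
\;\Longrightarrow\;
\mathit{freq}(P, D) < \mathit{minfreq}
```

With the indicator choice, freq(P, D) coincides with the support described in the informal definition.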

Algorithms

In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.
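The scope idea can be sketched in simplified form. The sketch assumes that a node's scope is the pair (l, u), where l is the node's preorder position and u is the preorder position of the last node in its subtree, and that a per-label list records the tree and scope of every occurrence of that label. The helper names are hypothetical, and the published algorithm tracks more per entry than is shown here.

```python
from collections import defaultdict

# Assumption: a tree is a pair (label, children), children being a tuple of trees.

def scopes(tree):
    """Return [(label, (l, u)), ...] in preorder for one tree."""
    result = []

    def visit(node):
        label, children = node
        pos = len(result)      # preorder position of this node
        result.append(None)    # reserve the slot; filled once the subtree is done
        for child in children:
            visit(child)
        result[pos] = (label, (pos, len(result) - 1))

    visit(tree)
    return result

def build_scope_lists(database):
    """Map each label to a list of (tree_id, scope) pairs."""
    lists = defaultdict(list)
    for tree_id, tree in enumerate(database):
        for label, scope in scopes(tree):
            lists[label].append((tree_id, scope))
    return dict(lists)

def is_descendant(ancestor_scope, scope):
    """True if the node with `scope` lies strictly below the node with `ancestor_scope`."""
    return ancestor_scope[0] < scope[0] and scope[1] <= ancestor_scope[1]

tree = ("a", (("b", ()), ("c", (("d", ()),))))
print(scopes(tree))
# [('a', (0, 3)), ('b', (1, 1)), ('c', (2, 3)), ('d', (3, 3))]
```

With this encoding an ancestor test reduces to comparing two integer pairs, so occurrences can be combined from the lists without traversing the trees again.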

Applications

Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining. Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining. Other domains in which frequent subtree mining is useful include computational biology, RNA structure analysis, pattern recognition, bioinformatics, and analysis of the KEGG GLYCAN database.

Challenges

Checking whether a transaction supports a given pattern (that is, contains it as a subgraph) is an instance of the subgraph isomorphism problem, which is NP-complete in general. Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".
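To give a rough sense of the combinatorial explosion, the sketch below counts how many distinct rooted, ordered, labeled trees exist for a given number of nodes. It relies on the standard fact that rooted ordered trees with n nodes are counted by the Catalan number C(n-1), and each node can carry any of k labels. The function names and the choice of 5 labels are assumptions made for illustration; a real miner examines only a pruned part of this space.

```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

def candidate_count(num_nodes: int, num_labels: int) -> int:
    """Distinct rooted, ordered trees with num_nodes nodes over num_labels labels."""
    return catalan(num_nodes - 1) * num_labels ** num_nodes

# Growth of the space of possible patterns for a hypothetical 5-label alphabet.
for n in (2, 4, 6, 8, 10):
    print(n, candidate_count(n, num_labels=5))
```

Even with a small alphabet the count grows exponentially in the pattern size, which is why pruning by support is essential in practice.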
