The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing communities, subsets characterized by being connected with one another with particular edge densities. For example, edges may be more common within communities than between communities. The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data.
Contents
Definition
The stochastic block model takes the following parameters:
The edge set is then sampled at random as follows: any two vertices
Special cases
If the probability matrix is a constant, in the sense that
The planted partition model is the special case that the values of the probability matrix
Returning to the general stochastic block model, a model is called strongly assortative if
Typical statistical tasks
Much of the literature on algorithmic community detection addresses three statistical tasks: detection, partial recovery, and exact recovery.
Detection
The goal of detection algorithms is simply to determine, given a sampled graph, whether the graph has latent community structure. More precisely, a graph might be generated, with some known prior probability, from a known stochastic block model, and otherwise from a similar Erdős–Rényi model. The algorithmic task is to correctly identify which of these two underlying models generated the graph.
Partial recovery
In partial recovery, the goal is to approximately determine the latent partition into communities, in the sense of finding a partition that is correlated with the true partition significantly better than a random guess.
Exact recovery
In exact recovery, the goal is to recover the latent partition into communities exactly. The community sizes and probability matrix may be known or unknown.
Statistical lower bounds and threshold behavior
Stochastic block models exhibit a sharp threshold effect reminiscent of percolation thresholds. Suppose that we allow the size
For partial recovery, the appropriate scaling is to take
For exact recovery, the appropriate scaling is to take
Algorithms
In principle, exact recovery can be solved in its feasible range using maximum likelihood, but this amounts to solving a constrained or regularized cut problem such as minimum bisection that is typically NP-complete. Hence, no known efficient algorithms will correctly compute the maximum-likelihood estimate in the worst case.
However, a wide variety of algorithms perform well in the average case, and many high-probability performance guarantees have been proven for algorithms in both the partial and exact recovery settings. Successful algorithms include spectral clustering of the vertices, semidefinite programming, and forms of belief propagation, among others.
Variants
Several variants of the model exist. One minor tweak allocates vertices to communities randomly, according to a categorical distribution, rather than in a fixed partition. More significant variants include the censored block model and the mixed-membership block model.