The iterative proportional fitting procedure (IPFP, also known as biproportional fitting in statistics, RAS algorithm in economics and matrix raking or matrix scaling in computer science) is an iterative algorithm for estimating cell values of a contingency table such that the marginal totals remain fixed and the estimated table decomposes into an outer product.
Contents
- Algorithm 1 (classical IPFP)
- Algorithm 2 (factor estimation)
- Algorithm 3 (RAS)
- Discussion and comparison of the algorithms
- Existence and uniqueness of MLEs
- Goodness of fit
- Interpretation
- Example
- Implementation
- References
IPF has been "re-invented" many times, e.g. by G.U. Yule in 1912 in relation to standardizing cross-tabulations and by Kruithof in 1937 in relation to telephone traffic ("Kruithof’s double factor method"). It was expanded upon by Deming and Stephan in 1940, who proposed IPFP as an algorithm leading to a minimizer of the Pearson X-squared statistic (which it does not) and did not prove convergence. It has since seen various extensions and related research. A rigorous proof of convergence by means of differential geometry is due to Fienberg (1970). He interpreted the family of contingency tables of constant crossproduct ratios as a particular (IJ − 1)-dimensional manifold of constant interaction and showed that the IPFP is a fixed-point iteration on that manifold. However, he assumed strictly positive observations. Generalization to tables with zero entries is still considered a hard and only partly solved problem.
An exhaustive treatment of the algorithm and its mathematical foundations can be found in the book of Bishop et al. (1975). The first general proof of convergence, built on non-trivial measure theoretic theorems and entropy minimization, is due to Csiszár (1975). Relatively new results on convergence and error behavior have been published by Pukelsheim and Simeone (2009). They proved simple necessary and sufficient conditions for the convergence of the IPFP for arbitrary two-way tables (i.e. tables with zero entries) by analysing an
Other general algorithms can be modified to yield the same limit as the IPFP, for instance the Newton–Raphson method and the EM algorithm. In most cases, IPFP is preferred due to its computational speed, numerical stability and algebraic simplicity.
Algorithm 1 (classical IPFP)
Given a two-way (I × J)-table of counts
Choose initial values
Notes:
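The alternating row and column fitting of the classical procedure can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `ipfp`, the parameters `max_iter` and `tol`, and the target marginals in the usage line are assumptions made for this sketch. It assumes strictly positive marginals whose row and column totals agree.

```python
def ipfp(row_targets, col_targets, max_iter=1000, tol=1e-10):
    """Fit an I x J table to the given marginals, starting from all ones.

    Hypothetical helper for illustration; assumes positive marginals with
    sum(row_targets) == sum(col_targets).
    """
    n_rows, n_cols = len(row_targets), len(col_targets)
    m = [[1.0] * n_cols for _ in range(n_rows)]  # initial values: all ones
    for _ in range(max_iter):
        # Row step: rescale each row so that it sums to its target.
        for i in range(n_rows):
            s = sum(m[i])
            m[i] = [v * row_targets[i] / s for v in m[i]]
        # Column step: rescale each column so that it sums to its target.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] *= col_targets[j] / s
        # After a column step only the row sums can still be off.
        err = max(abs(sum(m[i]) - row_targets[i]) for i in range(n_rows))
        if err < tol:
            break
    return m


# Illustrative marginals (assumed values): rows sum to 100, as do columns.
fitted = ipfp([30.0, 50.0, 20.0], [40.0, 60.0])
```

Starting from a table of ones, the fit is reached after a single row and column pass, because the fitted table is simply the outer product of the marginals divided by the grand total.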
Algorithm 2 (factor estimation)
Assume the same setting as in the classical IPFP. Alternatively, we can estimate the row and column factors separately: Choose initial values
Setting
Notes:
Obviously, the I-model is a particular case of the Q-model.
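A sketch of the factor-based variant, under the same assumptions as the classical setting (initial table of ones, positive marginals with equal totals). The names `fit_factors`, `a`, `b` and `n_iter` are hypothetical and chosen for this illustration. Only the I row factors and J column factors are updated, and the fitted table is formed once at the end as their outer product.

```python
def fit_factors(row_targets, col_targets, n_iter=50):
    """Estimate row factors a and column factors b with a[i] * b[j] as fit.

    Hypothetical helper for illustration only.
    """
    a = [1.0] * len(row_targets)
    b = [1.0] * len(col_targets)
    for _ in range(n_iter):
        # Row factors: make each row of the outer product hit its target.
        sum_b = sum(b)
        a = [r / sum_b for r in row_targets]
        # Column factors: make each column of the outer product hit its target.
        sum_a = sum(a)
        b = [c / sum_a for c in col_targets]
    return a, b


# Illustrative marginals (assumed values).
a, b = fit_factors([30.0, 50.0, 20.0], [40.0, 60.0])
table = [[a_i * b_j for b_j in b] for a_i in a]
```

Note that the factors themselves are not unique: multiplying every `a[i]` by a constant and dividing every `b[j]` by the same constant gives the same table.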
Algorithm 3 (RAS)
The Problem: Let
and
Define the diagonalization operator
where
Finally, we obtain
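The following sketch spells out RAS with explicit diagonal matrices, assuming the usual formulation: a positive start matrix is multiplied from the left by a diagonal matrix of row factors and from the right by a diagonal matrix of column factors until the prescribed row and column sums are met. The names `diag`, `matmul`, `ras`, `u`, `v` and the example numbers are assumptions for this illustration.

```python
def diag(vec):
    """Square matrix with vec on the diagonal and zeros elsewhere."""
    n = len(vec)
    return [[vec[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def ras(m, u, v, n_iter=200):
    """Scale m towards row sums u and column sums v (hypothetical helper)."""
    y = [row[:] for row in m]
    for _ in range(n_iter):
        r = [u[i] / sum(y[i]) for i in range(len(u))]                  # row factors
        y = matmul(diag(r), y)                                         # left-multiply
        s = [v[j] / sum(row[j] for row in y) for j in range(len(v))]   # column factors
        y = matmul(y, diag(s))                                         # right-multiply
    return y


# Illustrative start matrix and targets (assumed values, equal totals).
y = ras([[1.0, 2.0], [3.0, 4.0]], u=[4.0, 6.0], v=[5.0, 5.0])
```

The explicit matrix products are kept here only to mirror the matrix notation; multiplying by a diagonal matrix is the same as rescaling rows or columns directly.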
Discussion and comparison of the algorithms
Although RAS seems to be the solution of an entirely different problem, it is in fact identical to the classical IPFP. In practice, one would not implement actual matrix multiplication, since only diagonal matrices are involved. Reducing the operations to the necessary ones, it can easily be seen that RAS performs the same steps as IPFP. The vaguely demanded 'similarity' can be explained as follows: IPFP (and thus RAS) maintains the crossproduct ratios, i.e.
since
This property is sometimes called structure conservation and directly leads to the geometrical interpretation of contingency tables and the proof of convergence in the seminal paper of Fienberg (1970).
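Structure conservation can be checked numerically. In the sketch below, a table is rescaled by arbitrary positive row and column factors, which is all that a row or column fitting step does, and the crossproduct ratio of four cells is compared before and after. The helper name `cross_product_ratio` and all numbers are assumptions for this illustration.

```python
def cross_product_ratio(t, i, h, j, k):
    """Crossproduct ratio of the cells in rows i, h and columns j, k."""
    return (t[i][j] * t[h][k]) / (t[i][k] * t[h][j])


x = [[10.0, 20.0, 5.0],
     [4.0, 8.0, 16.0]]
a = [0.5, 3.0]         # arbitrary positive row factors
b = [2.0, 0.25, 7.0]   # arbitrary positive column factors

# Every cell is multiplied by its row factor and its column factor.
scaled = [[a[i] * b[j] * x[i][j] for j in range(3)] for i in range(2)]

before = cross_product_ratio(x, 0, 1, 0, 2)
after = cross_product_ratio(scaled, 0, 1, 0, 2)
```

The factors cancel in the ratio: each row factor and each column factor appears once in the numerator and once in the denominator.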
Nevertheless, direct factor estimation (Algorithm 2) is the more efficient way to carry out IPF: whereas classical IPFP needs
elementary operations in each iteration step (including a row and a column fitting step), factor estimation needs only
operations, making it at least one order of magnitude faster than classical IPFP.
Existence and uniqueness of MLEs
Necessary and sufficient conditions for the existence and uniqueness of MLEs are complicated in the general case, but sufficient conditions for 2-dimensional tables are simple:
If unique MLEs exist, IPFP exhibits linear convergence in the worst case (Fienberg 1970), but exponential convergence has also been observed (Pukelsheim and Simeone 2009). If a direct estimator (i.e. a closed form of
If all observed values are strictly positive, existence and uniqueness of MLEs and therefore convergence is ensured.
Goodness of fit
To check whether the assumption of independence is adequate, one uses the Pearson X-squared statistic
or alternatively the likelihood-ratio test (G-test) statistic
Both statistics are asymptotically
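As an illustration, both statistics can be computed from an observed table and its fit under independence. The sketch below uses the standard definitions (sum of squared deviations divided by fitted values for Pearson, twice the sum of observed times log of observed over fitted for the likelihood-ratio statistic) and the usual (I − 1)(J − 1) degrees of freedom for a two-way independence model. The function names and the observed counts are assumptions for this example, and the fit is computed in closed form rather than by iteration.

```python
import math


def independence_fit(x):
    """Fitted values under independence: row total * column total / grand total."""
    n_rows, n_cols = len(x), len(x[0])
    row = [sum(r) for r in x]
    col = [sum(x[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row)
    return [[row[i] * col[j] / total for j in range(n_cols)] for i in range(n_rows)]


def pearson_x2(x, m):
    """Pearson X-squared statistic for observed x and fitted m."""
    return sum((xv - mv) ** 2 / mv
               for xr, mr in zip(x, m) for xv, mv in zip(xr, mr))


def g_statistic(x, m):
    """Likelihood-ratio (G-test) statistic; cells with x == 0 contribute 0."""
    return 2.0 * sum(xv * math.log(xv / mv)
                     for xr, mr in zip(x, m) for xv, mv in zip(xr, mr) if xv > 0)


# Hypothetical observed counts, used only to exercise the functions.
observed = [[10.0, 20.0], [30.0, 40.0]]
fit = independence_fit(observed)
x2 = pearson_x2(observed, fit)
g = g_statistic(observed, fit)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
```

A p-value would then be obtained from the upper tail of the chi-squared distribution with `dof` degrees of freedom, which is left out here to keep the sketch free of third-party dependencies.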
Interpretation
If the rows correspond to different values of property A, the columns correspond to different values of property B, and the hypothesis of independence is not rejected, the properties A and B are considered independent.
Example
Consider a table of observations (taken from the entry on contingency tables):
For executing the classical IPFP, we first initialize the matrix with ones, leaving the marginals untouched:
Of course, the marginal sums do not correspond to the matrix anymore, but this is fixed in the next two iterations of IPFP. The first iteration deals with the row sums:
Note that, by definition, the row sums always constitute a perfect match after odd iterations, as do the column sums for even ones. The subsequent iteration updates the matrix column-wise:
Now, both row and column sums of the matrix match the given marginals again.
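The two iterations described above can be traced step by step. The marginals used here (row sums 52 and 48, column sums 87 and 13) are assumed for this sketch rather than taken from the table, and the variable names are illustrative.

```python
row_targets = [52.0, 48.0]   # assumed row marginals
col_targets = [87.0, 13.0]   # assumed column marginals

# Initialization: a 2 x 2 matrix of ones, marginals left untouched.
m = [[1.0, 1.0], [1.0, 1.0]]

# Iteration 1 (odd): scale each row to its target sum.
m = [[v * row_targets[i] / sum(m[i]) for v in m[i]] for i in range(2)]
after_rows = [row[:] for row in m]

# Iteration 2 (even): scale each column to its target sum.
col_sums = [m[0][j] + m[1][j] for j in range(2)]
m = [[m[i][j] * col_targets[j] / col_sums[j] for j in range(2)] for i in range(2)]

row_sums = [sum(row) for row in m]
col_sums = [m[0][j] + m[1][j] for j in range(2)]
```

With these marginals, the row step yields rows of (26, 26) and (24, 24), and the column step leaves both the row and the column sums matching their targets, so no further iterations are needed.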
The p-value of this matrix approximates to
Implementation
The R package mipfp (currently in version 3.1) provides a multi-dimensional implementation of the traditional iterative proportional fitting procedure. The package allows the updating of an N-dimensional array with respect to given target marginal distributions (which, in turn, can be multi-dimensional).
Python has an equivalent package, ipfn, which can be installed via pip. The package supports numpy and pandas input objects.