The block Wiedemann algorithm for computing kernel vectors of a matrix over a finite field is a generalisation of an algorithm due to Don Coppersmith.
Let $M$ be an $n \times n$ square matrix over some finite field $F$, let $x_{\mathrm{base}}$ be a random vector of length $n$, and let $x = M x_{\mathrm{base}}$. Consider the sequence of vectors $S = [x, Mx, M^2 x, \ldots]$ obtained by repeatedly multiplying the vector by the matrix $M$; let $y$ be any other vector of length $n$, and consider the sequence of finite-field elements $S_y = [y \cdot x,\; y \cdot Mx,\; y \cdot M^2 x, \ldots]$.
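A minimal sketch of building $S$ and $S_y$, assuming a small prime field and a dense matrix representation purely for illustration:

```python
import random

p = 101  # illustrative small prime; the algorithm works over any finite field
random.seed(1)

def mat_vec(M, v):
    """Multiply matrix M by vector v over GF(p)."""
    return [sum(a * b for a, b in zip(row, v)) % p for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % p

n = 4
M = [[random.randrange(p) for _ in range(n)] for _ in range(n)]
x_base = [random.randrange(p) for _ in range(n)]
x = mat_vec(M, x_base)           # x = M x_base
y = [random.randrange(p) for _ in range(n)]

S = [x]                          # S = [x, Mx, M^2 x, ...]
for _ in range(2 * n - 1):
    S.append(mat_vec(M, S[-1]))
S_y = [dot(y, v) for v in S]     # S_y = [y.x, y.Mx, y.M^2 x, ...]
```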
We know that the matrix $M$ has a minimal polynomial; by the Cayley–Hamilton theorem we know that this polynomial is of degree (which we will call $n_0$) no more than $n$. Say $\sum_{r=0}^{n_0} p_r M^r = 0$. Then $\sum_{r=0}^{n_0} y \cdot (p_r (M^r x)) = 0$; so the minimal polynomial of the matrix annihilates the sequence $S$ and hence $S_y$.
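A concrete check of this annihilation, using a hypothetical $2 \times 2$ example over GF(5): the matrix below satisfies $M^2 = M + I$, so the coefficients $(p_0, p_1, p_2) = (-1, -1, 1)$ of its minimal polynomial annihilate $S_y$ at every shift:

```python
p = 5  # working over GF(5); the matrix and vectors are illustrative choices

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) % p for row in M]

M = [[0, 1],
     [1, 1]]                       # satisfies M^2 = M + I, so z^2 - z - 1 annihilates M
coeffs = [(-1) % p, (-1) % p, 1]   # p_0, p_1, p_2 with p_0 I + p_1 M + p_2 M^2 = 0

x = [2, 3]
y = [1, 4]
S = [x]
for _ in range(6):
    S.append(mat_vec(M, S[-1]))
S_y = [sum(a * b for a, b in zip(y, v)) % p for v in S]

# the minimal polynomial annihilates S_y at every shift r
checks = [sum(coeffs[i] * S_y[i + r] for i in range(3)) % p for r in range(5)]
print(checks)  # [0, 0, 0, 0, 0]
```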
But the Berlekamp–Massey algorithm allows us to calculate, relatively efficiently, some sequence $q_0 \ldots q_L$ with $\sum_{i=0}^{L} q_i S_y[i+r] = 0 \;\; \forall r$. Our hope is that this sequence, which by construction annihilates $S_y$, actually annihilates $S$; in that case we have $\sum_{i=0}^{L} q_i M^i x = 0$. We then take advantage of the initial definition of $x$ to say $M \sum_{i=0}^{L} q_i M^i x_{\mathrm{base}} = 0$, and so $\sum_{i=0}^{L} q_i M^i x_{\mathrm{base}}$ is a hopefully non-zero kernel vector of $M$.
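The whole scalar pipeline can be sketched end to end. Everything here is an illustrative assumption: the field GF(7), the hand-picked singular matrix, and the choices of $x_{\mathrm{base}}$ and $y$. Berlekamp–Massey is written in its standard form, returning a connection polynomial $C$; the reversal $q_j = C_{L-j}$ turns it into the annihilating coefficients used above:

```python
p = 7  # worked example over GF(7); matrix and vectors chosen by hand

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) % p for row in M]

def berlekamp_massey(s):
    """Standard Berlekamp-Massey over GF(p): returns (C, L) with C[0] = 1 and
    sum_{i=0}^{L} C[i] * s[n-i] == 0 for all n >= L."""
    C, B = [1], [1]
    L, m, b = 0, 1, 1
    for n in range(len(s)):
        # discrepancy between s[n] and the current recurrence's prediction
        d = sum(C[i] * s[n - i] for i in range(min(L + 1, len(C)))) % p
        if d == 0:
            m += 1
            continue
        coef = d * pow(b, p - 2, p) % p   # d / b in GF(p)
        T = C[:]
        if len(B) + m > len(C):
            C += [0] * (len(B) + m - len(C))
        for i, Bi in enumerate(B):
            C[i + m] = (C[i + m] - coef * Bi) % p
        if 2 * L <= n:
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
    return C, L

# A singular 3x3 matrix over GF(7) (row 3 = row 1 + row 2), so a kernel exists.
M = [[1, 2, 0],
     [0, 1, 1],
     [1, 3, 1]]
x_base = [1, 0, 0]
x = mat_vec(M, x_base)            # x = M x_base
y = [1, 1, 0]

S = [x]                           # 2n terms of the vector sequence
for _ in range(5):
    S.append(mat_vec(M, S[-1]))
S_y = [sum(a * b for a, b in zip(y, v)) % p for v in S]

C, L = berlekamp_massey(S_y)
q = [C[L - j] for j in range(L + 1)]   # reverse: sum_j q_j S_y[j+r] = 0 for all r

# w = sum_j q_j M^j x_base; if the hope holds (q annihilates S, not just S_y),
# then M w = 0.  If w = 0 or M w != 0, retry with fresh random x_base and y.
w, v = [0] * 3, x_base
for qj in q:
    w = [(wi + qj * vi) % p for wi, vi in zip(w, v)]
    v = mat_vec(M, v)
print(w, mat_vec(M, w))   # [5, 1, 6] [0, 0, 0] -- w is a kernel vector of M
```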
The natural implementation of sparse matrix arithmetic on a computer makes it easy to compute the sequence $S$ in parallel for a number of vectors equal to the width of a machine word – indeed, computing $S$ for that many vectors normally takes no longer than computing it for one. With several processors, you can compute the sequence $S$ for a different set of random vectors on each processor in parallel.
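Over GF(2) this word-level parallelism is just XOR. A sketch, assuming a sparse column-list representation and a 64-bit word size for illustration:

```python
import random

n = 8
random.seed(0)
# Sparse matrix over GF(2): column j stores the row indices holding a 1.
cols = [sorted(random.sample(range(n), 3)) for _ in range(n)]

# One 64-bit word per coordinate: bit k of block[i] is entry i of the k-th
# vector, so `block` holds 64 random vectors at once.
block = [random.getrandbits(64) for _ in range(n)]

def matvec_block(cols, block):
    """Multiply the sparse GF(2) matrix by all 64 vectors at once:
    each XOR updates one coordinate of every vector simultaneously."""
    out = [0] * len(block)
    for j, rows in enumerate(cols):
        w = block[j]
        for i in rows:
            out[i] ^= w
    return out

out = matvec_block(cols, block)
```

Iterating `matvec_block` yields the sequence $S$ for 64 vectors in roughly the time a single-vector implementation would spend on one.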
It turns out, by a generalisation of the Berlekamp–Massey algorithm that produces a sequence of small matrices, that you can take the sequence produced for a large number of vectors and generate a kernel vector of the original large matrix. You need to compute $y_i \cdot M^t x_j$ for some $i = 0 \ldots i_{\max}$, $j = 0 \ldots j_{\max}$, $t = 0 \ldots t_{\max}$, where $i_{\max}$, $j_{\max}$ and $t_{\max}$ need to satisfy $t_{\max} > d/i_{\max} + d/j_{\max} + O(1)$ and the $y_i$ are a series of vectors of length $n$; but in practice you can take the $y_i$ as a sequence of unit vectors and simply write out the first $i_{\max}$ entries in your vectors at each time $t$.
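A sketch of this recording step with unit-vector $y_i$: since $y_i = e_i$, each value $y_i \cdot M^t x_j$ is simply coordinate $i$ of $M^t x_j$, so every step records a small $i_{\max} \times j_{\max}$ matrix by copying leading entries. The field, the dimensions, and the heuristic sequence length are all illustrative assumptions:

```python
import random

p = 101                      # illustrative prime field
n, i_max, j_max = 6, 2, 2    # illustrative dimensions
t_max = n // i_max + n // j_max + 2   # heuristic length from the bound above

random.seed(3)
M = [[random.randrange(p) for _ in range(n)] for _ in range(n)]
X = [[random.randrange(p) for _ in range(n)] for _ in range(j_max)]  # x_0..x_{j_max-1}

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) % p for row in M]

# With y_i = e_i, y_i . M^t x_j is coordinate i of M^t x_j, so each step
# stores an i_max-by-j_max matrix of leading entries.
records = []
for t in range(t_max + 1):
    records.append([[X[j][i] for j in range(j_max)] for i in range(i_max)])
    X = [mat_vec(M, x) for x in X]
```

The list `records` is exactly the sequence of small matrices that the matrix generalisation of Berlekamp–Massey consumes.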