Supriya Ghosh (Editor)

Statistically improbable phrase


A statistically improbable phrase (SIP) is a phrase or set of words that occurs at a higher relative frequency in a document (or collection of documents) than in some larger corpus. Amazon.com uses this concept to determine keywords for a given book or chapter, since the keywords of a book or chapter are likely to appear disproportionately often within that section. In his book Dataclysm, Christian Rudder applied the same concept to data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender.

Example

In a document about computers, the most common word is likely to be "the". But because "the" is the most commonly used word in the English language, almost any document will use it very frequently, so its high count says nothing distinctive about this one. A word like "program", however, might occur in the document at a much higher rate than its average rate in the English language. It is a word that is unlikely to occur often in an arbitrary document but does occur often in this one, so "program" would be a statistically improbable phrase.
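The comparison in this example can be sketched in a few lines of Python. This is only an illustration, not the method Amazon.com or anyone else actually uses: the function names `ngrams` and `improbability_scores`, the simple regular-expression tokenizer, and the additive smoothing (so that phrases absent from the corpus do not cause division by zero) are all assumptions made here. Each phrase is scored by its relative frequency in the document divided by its smoothed relative frequency in the larger corpus; a score well above 1 marks a candidate statistically improbable phrase.

```python
from collections import Counter
import re


def ngrams(text, n):
    """Lowercase the text, split it into word tokens, and return its n-grams as tuples."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def improbability_scores(document, corpus, n=1, smoothing=1.0):
    """Score every n-gram of `document` by (rate in document) / (smoothed rate in corpus).

    Hypothetical helper: the additive smoothing is an assumption of this sketch.
    """
    doc_counts = Counter(ngrams(document, n))
    corpus_counts = Counter(ngrams(corpus, n))
    doc_total = sum(doc_counts.values())
    corpus_total = sum(corpus_counts.values())
    vocab_size = len(set(doc_counts) | set(corpus_counts))

    scores = {}
    for gram, count in doc_counts.items():
        doc_rate = count / doc_total
        # Add `smoothing` to every count so unseen n-grams get a small nonzero rate.
        corpus_rate = (corpus_counts[gram] + smoothing) / (
            corpus_total + smoothing * vocab_size
        )
        scores[" ".join(gram)] = doc_rate / corpus_rate
    return scores


# Toy data: "the" is frequent everywhere, "program" only in the document.
document = "the program runs the program and the program stops"
corpus = "the cat sat on the mat and the dog sat on the rug while the bird sang"

scores = improbability_scores(document, corpus, n=1)
for phrase, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{phrase}: {score:.2f}")
```

With `n=2` the same function scores two-word phrases instead of single words. On real text the background corpus would need to be far larger than the document for the ratios to be meaningful.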

The statistically improbable phrases of Darwin's On the Origin of Species are: temperate productions, genera descended, transitional gradations, unknown progenitor, fossiliferous formations, our domestic breeds, modified offspring, doubtful forms, closely allied forms, profitable variations, enormously remote, transitional grades, very distinct species and mongrel offspring.
