Zipf's law

Zipf's law (/zɪf/, German: [t͡sɪpf]) is an empirical law that often holds, approximately, when a list of measured values is sorted in decreasing order. It states that the value of the $n$th entry is inversely proportional to $n$.

The best-known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language: ${\text{word frequency}}\propto {\frac {1}{\text{word rank}}}.$ It is usually found that the most common word occurs approximately twice as often as the next most common one, three times as often as the third most common, and so on.

For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852 occurrences).^[2]

Zipf's law is often used in the following more general form, called the Zipf-Mandelbrot law: ${\text{frequency}}\propto {\frac {1}{({\text{rank}}+b)^{a}}}$ where $a$ and $b$ are fitted parameters, with $a\approx 1$ and $b\approx 2.7$.^[1]
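As a rough illustration of the proportionality, the following Python sketch compares the three Brown Corpus counts quoted above with what the plain $1/\text{rank}$ law would predict from the rank-1 count. The function name `zipf_mandelbrot_weight` and the variable names are chosen here for illustration only; the defaults $a=1$, $b=0$ are an assumption made so that the weight reduces to the plain law, and are not fitted values.

```python
# Counts quoted above for the three most frequent Brown Corpus words.
observed = [("the", 69_971), ("of", 36_411), ("and", 28_852)]


def zipf_mandelbrot_weight(rank, a=1.0, b=0.0):
    """Unnormalized weight 1 / (rank + b)**a.

    With a = 1 and b = 0 (the defaults assumed here) this is the plain
    1/rank law; other values give the Zipf-Mandelbrot form.
    """
    return 1.0 / (rank + b) ** a


top_count = observed[0][1]
comparison = []
for rank, (word, count) in enumerate(observed, start=1):
    # Scale the rank-1 count by the weight of this rank relative to rank 1.
    predicted = top_count * zipf_mandelbrot_weight(rank) / zipf_mandelbrot_weight(1)
    comparison.append((rank, word, count, predicted))
    print(f"{rank}  {word:<4} observed={count:>6}  predicted={predicted:>9.1f}  "
          f"ratio={count / predicted:.2f}")
```

On these three counts the observed frequency of "of" is within a few percent of the prediction, while "and" occurs noticeably more often than the plain law predicts, which is consistent with the law holding only approximately.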

This law is named after the American linguist George Kingsley Zipf,^[3]^[4]^[5] and is still an important concept in quantitative linguistics. It has been found to apply to many other types of data studied in the physical and social sciences.

In mathematical statistics, the concept has been formalized as the Zipfian distribution: a family of related discrete probability distributions whose rank-frequency distribution is an inverse power-law relation. These distributions are related to Benford's law and the Pareto distribution.
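One simple member of this family can be sketched in Python by normalizing the inverse power-law weights over a finite number of ranks, so that $p(k) = k^{-s} / \sum_{j=1}^{N} j^{-s}$ for $k = 1, \dots, N$. The names `zipf_pmf`, `n_items`, and `s` are assumptions made for this sketch, and the finite-support form shown is only one of the related distributions the paragraph refers to.

```python
def zipf_pmf(n_items, s=1.0):
    """Return probabilities p(1), ..., p(n_items) with p(k) proportional to k**-s.

    The weights are divided by their sum so that the probabilities add to 1.
    """
    weights = [k ** -s for k in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


pmf = zipf_pmf(1000)           # 1000 ranks, exponent s = 1
print(sum(pmf))                # should be 1 up to rounding error
print(pmf[0] / pmf[1])         # rank 1 relative to rank 2
```

With $s = 1$, the ratio of the first two probabilities is 2, matching the statement that the top-ranked item occurs about twice as often as the second.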

Some sets of time-dependent empirical data deviate somewhat from Zipf's law. Such empirical distributions are said to be quasi-Zipfian.

