Benford’s Law states that numbers from various sources, which we think should have random elements, are distributed anti-intuitively. The leading digits of the numbers are not distributed evenly. Instead, about 30% begin with ‘1’, falling off to about 5% that begin with ‘9’. The first-digit distribution for leading digits 1 through 9 is (30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6) percent. One of the examples Benford himself noted was “sizes of river basins”. Is it mysterious that for every basin size that starts with ‘9’, there are 6 or 7 that start with ‘1’?
By the way, there is a clue to the resolution of the mystery in the subject heading. What is the first digit going to be for the next 96 entries of that numerical sequence? And again for the next 996 entries when the sequence continues to 998, 999, 1000, 1001, 1002, 1003 … ? And again at 10,000 and 100,000 and … ?
- The leading digits of numbers generated from a random number generator (RNG) are equally distributed ‘1’s through ‘9’s, approximately 11% each.
- Many examples of real-life sets of data, for example “city populations”, are clearly random.
- Benford’s Law says the leading digits for collections of natural data, like city populations or sizes of river basins, are 30% for ‘1’s, declining to 5% for ‘9’s.
- So why aren’t they 11% each?
There seem to be two kinds of random. There are RNG randoms and there are natural randoms. What’s up with that?
RNG Randoms … It’s All About That Upper Bound
A typical software random number generator (RNG) produces random numbers from 0 to 1, maybe .019832 then .863817 then .000758 etc. (The numbers are actually pseudo-random, but good enough for our purposes.) Multiply each by one million so they are easier to deal with, and our examples become 19,832 then 863,817 then 758. This is equivalent to choosing random numbers from the range 0 to 1,000,000. Now generate a large number of such numbers. Count the numbers that have leading ‘1’s (as in 19,832), then do the same for leading ‘2’s through ‘9’s. Each of the 9 collections has approximately the same number of elements, as you would expect, about 11% each. That is the kind of random behaviour we expect, although as we will see it is not how numbers are distributed in the real world of counting.
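The experiment above is easy to try. Here is a minimal sketch in Python; the seed, the sample size of 200,000 and the helper names `leading_digit` and `digit_shares` are my own choices for illustration, not anything prescribed.

```python
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.16e}"[0])

def digit_shares(values):
    """Map each leading digit 1..9 to its share of the (positive) sample."""
    counts = Counter(leading_digit(v) for v in values if v > 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable
sample = [rng.random() * 1_000_000 for _ in range(200_000)]  # range 0 to 1,000,000

shares = digit_shares(sample)
for d in range(1, 10):
    print(d, f"{shares[d]:.1%}")
```

Each digit should come out close to 11%, give or take sampling noise.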
Now consider a set of real-world randoms. Here is a list of the 300 brightest stars. Surely the distances to those stars are random. Count the number of stars whose distance starts with ‘1’. You expect 30-35 or so, but they keep coming and coming and you ultimately arrive at 89 of them, about 30%! Then when you count the ‘9’s, you find they total about 5%. Other examples abound. Mercy mercy bring on the smelling salts. If the distances are random, shouldn’t each digit lead about 11% of the time, like the randoms from our RNG?
What happens to leading digits as we just count? Imagine we take a walk and at each metre completed, note the leading digits of the distance we’ve travelled. We walk 1, 2 .. 99 metres. As we reach the 99-metre mark, the leading digits are equally divided ‘1’ through ‘9’, 11% each. For metres 100 through 199, the number of leading ‘1’s piles up, until at the 199-metre point, fully 55% of the distances we walked begin with a ‘1’. They decline through metre 999 to 11%, then climb to 55% at 1999, and the loop repeats again at 10,000 and at each new power of 10, illustrated below.
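The walk can be replayed with a few lines of Python. This is only a sketch of the counting described above; the function name `share_of_leading_ones` and the stopping points are picked for illustration.

```python
def share_of_leading_ones(metres: int) -> float:
    """Share of the distances 1..metres whose first digit is 1."""
    ones = sum(1 for n in range(1, metres + 1) if str(n)[0] == "1")
    return ones / metres

# Stop the walk just before and just after the '1's pile up.
for stop in (99, 199, 999, 1999, 9999, 19999):
    print(stop, f"{share_of_leading_ones(stop):.1%}")
```

At 99 metres the share is 11/99, about 11.1%. At 199 metres it is 111/199, about 55.8%, and the same swing repeats at every later power of 10.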
For simple counting then, the distribution of leading ‘1’s loops from 11% to 55% and back. The precise distribution depends where you stop walking. These observations lead to the Law of RNG Randoms.
For RNG randoms, the distribution of leading digits is a function of the upper bound of the RNG range. The distributions are even at 11% only if the bound maximum is a power of 10.
A typical RNG generates randoms from 0 to 1 (1 is the zeroth power of 10). In our example above we drew effectively from 0 to 1,000,000, leading digits all appearing at 11% frequency. If we increase the top end of the range to a number that is not a power of 10, say 2,000,000, the numbers with leading ‘1’s will make up at least half of that range, settling into the familiar (55.5, 5.5 … 5.5) we first saw when taking our walk. If instead we contract the range to say 400,000 to 499,999, we see zero leading ‘1’s because all the leading digits are ‘4’s. RNG distributions are hugely dependent on the range. We usually see (and expect) “equal distributions” only because the top of the range used is typically a power of 10.
(RNG numbers also depend on the base of the arithmetic you use. In base 10, any time the top end of the range passes a power of 10, at 1,000 or 1,000,000,000 or any other, the RNG distribution will begin to re-accumulate leading digits ‘1’. If we used a different base, say 16, they would still re-accumulate in loops, but at powers of 16.)
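To see the base dependence, here is a small sketch that counts leading ‘1’s while counting in base 16. The helper names `first_digit` and `share_of_leading_ones` are hypothetical, chosen just for this example.

```python
def first_digit(n: int, base: int) -> int:
    """Leading digit of the positive integer n when written in the given base."""
    while n >= base:
        n //= base  # drop the lowest digit until only the leading one is left
    return n

def share_of_leading_ones(limit: int, base: int) -> float:
    """Share of the integers 1..limit whose leading digit in `base` is 1."""
    ones = sum(1 for n in range(1, limit + 1) if first_digit(n, base) == 1)
    return ones / limit

# 15, 255 sit just below a power of 16; 31, 511 sit where the '1's have piled up.
for limit in (15, 31, 255, 511):
    print(limit, f"{share_of_leading_ones(limit, 16):.1%}")
```

In base 16 there are fifteen possible leading digits, so the low point of the loop is about 6.7% rather than 11%, and the pile-ups happen just after 16, 256 and so on.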
Here are leading digit distributions for various ranges. Note the leading ‘1’s loop from 11.1 to 55.5.
| Range | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 0 to 100K | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 |
| 0 to 200K | 55.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 |
| 0 to 300K | 37.0 | 37.0 | 3.7 | 3.7 | 3.7 | 3.7 | 3.7 | 3.7 | 3.7 |
| 0 to 400K | 27.8 | 27.8 | 27.8 | 2.8 | 2.8 | 2.8 | 2.8 | 2.8 | 2.8 |
| 0 to 500K | 22.2 | 22.2 | 22.2 | 22.2 | 2.2 | 2.2 | 2.2 | 2.2 | 2.2 |
| 0 to 600K | 18.5 | 18.5 | 18.5 | 18.5 | 18.5 | 1.8 | 1.8 | 1.8 | 1.8 |
| 0 to 700K | 15.8 | 15.8 | 15.8 | 15.8 | 15.8 | 15.8 | 1.6 | 1.6 | 1.6 |
| 0 to 800K | 13.9 | 13.9 | 13.9 | 13.9 | 13.9 | 13.9 | 13.9 | 1.4 | 1.4 |
| 0 to 900K | 12.3 | 12.3 | 12.3 | 12.3 | 12.3 | 12.3 | 12.3 | 12.3 | 1.2 |
| 0 to 1000K | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 |
| 0 to 2000K | 55.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 |
| 328K to 789K | 0.0 | 0.0 | 16.6 | 21.5 | 21.5 | 21.5 | 19.0 | 0.0 | 0.0 |
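Rows for a fixed range can also be worked out by exact counting rather than by sampling. The sketch below counts whole numbers only and starts at 1, since 0 has no leading digit; that is my assumption, and a sampled table will differ slightly in the last decimal. The function name `leading_digit_percentages` is hypothetical.

```python
def leading_digit_percentages(low: int, high: int) -> list[float]:
    """Percentage of the integers low..high (inclusive, low >= 1) led by each digit 1..9."""
    total = high - low + 1
    result = []
    for d in range(1, 10):
        count = 0
        power = 1
        while d * power <= high:
            # Integers led by d at this magnitude run from d*power to (d+1)*power - 1.
            first = max(low, d * power)
            last = min(high, (d + 1) * power - 1)
            if first <= last:
                count += last - first + 1
            power *= 10
        result.append(100 * count / total)
    return result

for top in (100_000, 200_000, 300_000):
    row = leading_digit_percentages(1, top)
    print(f"1 to {top:,}:", " ".join(f"{p:.1f}" for p in row))
```

Because it only does arithmetic on the range boundaries, the same function works for any range, including one that starts well above zero.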
Natural Randoms … the Upper Bound Goes Away
So how do we get from RNG randoms to natural randoms with the classic Benford distribution? Consider the set of all walks taken today on the planet. Surely those walks have random lengths. One walk will end at 521 metres, another at 2,891 metres, I walk 27 metres to my car, and some guy runs a triple marathon of 120,000 metres. Leading ‘1’s will range from 11% to 55%, depending on how long the walks are. It cannot be too surprising the number settles into the 30% range. It’s just counting. The Benford distribution is simply a logarithmic scale, represented below on a slide rule. The distribution of first digits is the same as the widths of the gridlines on these sliders.
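The slide-rule picture can be checked numerically. On a logarithmic scale covering one decade, the slot between d and d + 1 has width log10(d + 1) − log10(d). A short sketch (the variable name `widths` is mine):

```python
import math

# Width of the slot between d and d + 1 on a one-decade logarithmic scale.
widths = {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}

for d, w in widths.items():
    print(d, f"{w:.1%}")
```

The nine widths come out as 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1% and 4.6%, the same figures quoted for Benford’s Law at the top.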
Take another real-life measure of some physical dimension, say the surface areas of river basins, in square kilometres. For river basins, the top of the range for areas is about 5,500,000 (Amazon). Our unit of measure, the square kilometre, is arbitrary. We might prefer the same upper bound expressed as about 2,100,000 square miles, or 1.4 billion acres, or whatever. In addition, our definition of what constitutes a ‘river basin’ is flexible, widening or narrowing the range, so in reality our real-life random numbers are drawn from many ranges 0 to X. In mathematical terms, the distributions of the random numbers must be scale-invariant.
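Scale invariance is simple to poke at. The sketch below uses made-up data, not real basin measurements: I assume areas spread evenly on a logarithmic scale from 1 to 1,000,000 square kilometres, then restate them in square miles and acres and recount the leading ‘1’s. All names here are hypothetical.

```python
import random

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.16e}"[0])

def share_of_ones(values) -> float:
    """Share of the values whose leading digit is 1."""
    digits = [leading_digit(v) for v in values]
    return digits.count(1) / len(digits)

rng = random.Random(7)  # arbitrary fixed seed
# ASSUMPTION: synthetic areas in km^2, log-uniform over six decades. Not real data.
areas_km2 = [10 ** (6 * rng.random()) for _ in range(200_000)]

UNITS = {
    "square kilometres": 1.0,
    "square miles": 0.386102,  # 1 km^2 is about 0.386102 sq mi
    "acres": 247.105,          # 1 km^2 is about 247.105 acres
}

ones_by_unit = {
    name: share_of_ones([a * factor for a in areas_km2])
    for name, factor in UNITS.items()
}
for name, share in ones_by_unit.items():
    print(f"{name}: {share:.1%} leading '1's")
```

Whatever the unit, the share of leading ‘1’s should stay near 30%, which is what scale invariance means in practice.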
It is clear the upper bound of sets of natural randoms is not a power of 10, therefore the leading digits of natural randoms will not be equal. What is the upper bound? Do they have upper bounds? What is the upper bound for the “surface areas of lakes”? It depends what unit of measurement we use. And what number base we use. And what we decide is a “lake”. And what planet we are on. And don’t forget scale invariance. If we make our unit of measurement approach zero, the upper bound will approach infinity. Gulp. Practically, that upper bound can be anything at all. The term “upper bound” is meaningless as it applies to natural randoms. That leads to the Law of Natural Randoms.
While RNG randoms originate from a single fixed range, natural randoms originate from all ranges concurrently, and leading ‘1’s appear 30% of the time.
Modelling Natural Randoms
How might we approximate Benford random numbers with a standard RNG? What if, instead of generating numbers in the range 0 to Some Fixed Number, we replace the top end with a random number at each generation? If the top end is sufficiently random, we might be able to model an ‘unbounded random’. A sufficiently random upper bound would provide us with a proxy for all ranges concurrently.
The following table shows results for samples of one million numbers. Rn is defined as an RNG random from the range 0 to 100,000. r is an RNG random from the range 0 to 1.
| # | Top of Range | Largest Random | % of ‘1’s | % of ‘2’s | % of ‘3’s | % of ‘9’s |
|---|---|---|---|---|---|---|
| 1 | fixed at 57,349 | 57,349 | 19.4 | 19.4 | 19.4 | 1.9 |
| 3 | R1 x R2 | 9.9 x 10 ^ 9 | 24.1 | 18.3 | 14.5 | 3.4 |
| 4 | R1 x R2 x R3 | 3.1 x 10 ^ 14 | 30.2 | 17.9 | 12.4 | 4.6 |
| 5 | R1 ^ 2 | 9.9 x 10 ^ 9 | 19.0 | 14.0 | 12.3 | 7.3 |
| 6 | 10 ^ (r x 5) | 100,000 | 30.1 | 17.6 | 12.5 | 4.5 |
Result #1 can be read as “randoms generated from range 0 to 57,349”. The distribution of leading digits is skewed towards ‘1’ through ‘5’ (less for ‘5’ than the others) because the higher 5-digit numbers do not come into play.
#2 is read “randoms generated from range 0 to 100,000”. Since the top of range is a power of 10, this is exactly what a typical RNG produces.
#3 is read “randoms generated from range 0 to <some random number (0 to 100,000) >”. This starts to ‘naturally randomize’ the randoms, making them less like RNG randoms, and more like Benford randoms, but doesn’t fully do the job. The frequency of ‘1’s is 24%, higher than the RNG 11%, but short of the Benford 30%.
#4 is read “randoms generated from range 0 to <random number (0 to 100,000) x random number (0 to 100,000) >”. This is very close to the pure Benford distribution. Note the huge ‘largest random’. You can lower that number by using a smaller range than 0 to 100,000 for the RNG.
#5 is read “randoms generated from range 0 to <random number (0 to 100,000) squared>”. This is slightly ‘less naturally random’ than #4, with another huge largest random.
#6 is read “randoms generated from range 0 to <10 to the power of (random number (0 to 1) x 5)>”. This is a good proxy for Benford randoms, and the randoms stay within the range you choose. Here the largest random will be 10 ^ 5, or 100,000.
So by making the top of the range of our RNG sufficiently random, we can produce sets of numbers that mimic the “all ranges concurrently” behaviour of natural randoms.
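Here is a rough sketch of the random-top idea in Python. I read “a random from range 0 to top” as `top * r`, with `r` a fresh random from 0 to 1. That reading, the seed, the sample size of 200,000 and all the names below are my assumptions, so treat the output as indicative rather than as a reproduction of the table.

```python
import random

TOP = 100_000
rng = random.Random(42)  # arbitrary fixed seed

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.16e}"[0])

def r() -> float:
    """An RNG random from range 0 to 1."""
    return rng.random()

def R() -> float:
    """An RNG random from range 0 to 100,000 (the 'Rn' of the text)."""
    return rng.random() * TOP

def share_of_ones(draw, n: int = 200_000) -> float:
    """Share of leading '1's among n draws (any exact zeros are skipped)."""
    digits = [leading_digit(v) for v in (draw() for _ in range(n)) if v > 0]
    return digits.count(1) / len(digits)

# Each scheme draws a top of range first, then a random from 0 to that top.
schemes = {
    "top fixed at 100,000": lambda: r() * TOP,
    "top = R": lambda: r() * R(),
    "top = R1 x R2": lambda: r() * R() * R(),
    "top = 10 ^ (r x 5)": lambda: r() * 10 ** (5 * r()),
}

results = {name: share_of_ones(draw) for name, draw in schemes.items()}
for name, share in results.items():
    print(f"{name}: {share:.1%} leading '1's")
```

The fixed top should give about 11%, one random top about 24%, and the last two schemes should land close to the Benford 30%.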
Implications for Sets of Data, Random or Not
Nobody should be shocked that first digits for sets of random data have the classic Benford distribution. Nobody should be surprised the data is scale invariant. It’s just counting.
If the sample is sufficiently large, and the data does not Benford, then the data is not random, or the range is not sufficiently broad. Weights of all Englishmen will not Benford, although weights of all living creatures in England will. In this list of allegedly ‘naturally random’ sources, a few look suspect. I don’t know what “Design” constitutes. The sample of 560 is large enough, but the leading digit distributions seem to approach randomness, but not reach it. The “Design” distributions:
Many of those sources are not truly random. The upper range of the data just happens to occur after a buildup of ‘1’s, for example atomic weights, which have a range of about 1 to 200. Disproportionately high ‘1’s will occur any time the top bound of the data just turns the corner on another power of 10. Weights of Englishmen would behave similarly, if measured in pounds.
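A quick way to tell a genuine Benford set from one that merely has a surplus of ‘1’s is to compare all nine digits, not just the first. The sketch below uses a stand-in data set, the whole numbers 1 to 200, as a hypothetical example of values spread flatly over a range that has just turned the corner on a power of 10. It is not real atomic-weight data, and the names are my own.

```python
import math
from collections import Counter

# Benford share for each leading digit d: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_shares(values):
    """Share of each leading digit 1..9 among positive whole numbers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

def largest_gap(values) -> float:
    """Largest absolute difference between observed and Benford shares."""
    shares = first_digit_shares(values)
    return max(abs(shares[d] - BENFORD[d]) for d in range(1, 10))

# ASSUMPTION: stand-in data, every whole number from 1 to 200.
stand_in = list(range(1, 201))
shares = first_digit_shares(stand_in)

print(f"leading '1's: {shares[1]:.1%}")
print(f"leading '9's: {shares[9]:.1%}")
print(f"largest gap from Benford: {largest_gap(stand_in):.1%}")
```

The ‘1’s reach 55.5%, far above Benford’s 30%, while the digits ‘3’ to ‘9’ all sit at a flat 5.5%. Plenty of ‘1’s, but no logarithmic decline.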
The atomic weights distribution:
- The random numbers generated by Random Number Generators (RNG), are distributed differently than random numbers found in our counting system or in nature.
- Law of RNG Randoms. For RNG randoms, the distribution of leading digits is a function of the upper bound of the RNG range. The distributions are even at 11%, only if the bound maximum is a power of 10 (which it usually is).
- Law of Natural Randoms. While RNG randoms originate from a single fixed range, natural randoms originate essentially from all ranges concurrently, and leading ‘1’s appear 30% of the time.
- Counting in a base 10 arithmetic uses lots of ‘1’s! Just plain counting, leading ‘1’s loop from 11% to 55% and settle in the 30% range. Benford’s Law is not mysterious. It is not cosmically significant. If the set of data is broad enough and random, it will Benford. If it does not Benford, then either the range is too narrow, the sample is too small, or the data is not random.
- At least some ‘miraculous’ Benford cases, like atomic weights, are not good examples despite having a surfeit of leading ‘1’s. They are sets of data that just happen to have an upper bound that is accumulating ‘1’s, perhaps in the 200 range (weights of Englishmen in pounds) or the 20,000 range (cost of used cars in the USA).
- You can produce sets of numbers which have the Benford distribution with an RNG, if you use a sufficiently random top of range from which to generate your randoms. For example, generate your random from the range 0 to (Random1 x Random2).