probability - Making 400k random choices from 400k samples seems to always end up with 63% distinct choices, why?

Sunday, 26 August 2018

probability - Making 400k random choices from 400k samples seems to always end up with 63% distinct choices, why?

I have a very simple simulation program, the sequence is:

Create an array of 400k elements

Use a PRNG to pick an index, and mark the element (repeat 400k times)

Count number of marked elements.

An element may be picked more than once, but counted as only one "marked element".

The PRNG is properly seeded. No matter how many times I run the simulation, I always end up getting around 63% (252k) marked elements.

What is the math behind this? Or was there a fault in my PRNG?

Answer

No, your program is correct. The probability that a particular element is not marked
at all, is about $\frac{1}{e}$ . This comes from the poisson-distribution which is
a very well approximation for large samples (400k is very large). So $1-\frac{1}{e}$
is the fraction of marked elements.

Blog

Sunday, 26 August 2018