Thursday 14 August 2014

probability - Variation of coupon collector problem with distinct coupons in each type



I am interested in yet another variation of the coupon collector problem. Specifically, we have an urn with $n_i$ "distinct" coupons of type $i$, where $i = 1,\ldots, K$ (for example, a type can be thought of as a color but within that color the coupons are numbered). The goal is to sample sufficiently with replacement to get a collection having at least $m_i$ distinct coupons from type $i$ for $i = 1,\ldots, K$ (any $m_i$ out of the $n_i$ different coupons within that type would work). Note that this is different from the variation in:



N. Shank and H. Yang, "Coupon Collector Problem for Non-Uniform Coupons and Random Quotas," Electronic Journal of Combinatorics 20(2), 2013




since the coupons here are numbered and the analysis needs to ensure the $m_i$ coupons from type $i$ are distinct. So the question is:



For a given number of samples $n$, what is the probability that we achieve the requirement above (i.e., obtain at least $m_i$ distinct coupons of each type $i = 1,\ldots,K$)? Upper and lower bounds would also be useful. Also, what is the expected number of samples $E[N]$ needed to achieve this requirement?


Answer



I use the same notation as the cited article (so that the results can be directly compared): consider an urn which, for $i\in\{1,\ldots,n\}$, contains $x_i\geq 1$ distinguishable coupons of type $i$, altogether
$X_n:=x_1+\ldots+x_n$ coupons.



Coupons are drawn with replacement from this urn until, for each $i$, $m_i$ (where $1\leq m_i\leq x_i$) mutually different coupons of type $i$ have been drawn.



Let $m:=(m_1,\ldots,m_n)$, let $T(m)$ be the random time at which this happens, and let $p_{x_i}(m_i,t):=\sum_{k=0}^{m_i-1}{x_i \choose k}\,t^k$.



Let $Y_1(k),\ldots,Y_n(k)$ be the random variables $Y_i(k):=$ "number of different coupons of type $i$" that have been drawn at "time" $k$.



I use generating functions and start from the following basic



Proposition

The generating function of (the joint distribution of)
$Y_1(k),\ldots,Y_n(k)$ is given by:

\begin{equation*}
\mathbb{E}\, t_1^{Y_1(k)}\ldots t_n^{Y_n(k)}=\frac{k!} {X_n^k}[t^k] (1+(e^t-1)\,t_1)^{x_1}\cdot\ldots\cdot(1+(e^t-1)\,t_n)^{x_n}
\end{equation*}



Proof Let $j_1+\ldots +j_n\leq k$.
Clearly
$$\mathbb{P}(Y_1(k)=j_1,\ldots,Y_n(k)=j_n)=\frac{1}{X_n^k}\cdot {x_1 \choose j_1}\cdots {x_n \choose j_n}\cdot Sur(k,j_1+\ldots+j_n)$$
where $Sur(k,r)$ denotes the number of surjective mappings from $\{1,\ldots,k\}$ onto $\{1,\ldots,r\}$: the binomial factors count the choices of which $j_i$ coupons of each type actually occur, and $Sur(k,j_1+\ldots+j_n)$ counts the sequences of $k$ draws in which each chosen coupon appears at least once. It is known that
$Sur(k,r)=k!\,[t^k]\,(e^t-1)^r$ (since such a surjective mapping corresponds uniquely to an ordered partition of
$\{1,\ldots,k\}$ into $r$ non-empty subsets, and $e^t-1$ is the exponential generating function for non-empty sets). The assertion
about the g.f. follows. End of proof.
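As a quick sanity check (my addition, not part of the proof), the surjection identity $Sur(k,r)=k!\,[t^k](e^t-1)^r$ can be verified numerically against the classical inclusion-exclusion count, e.g. in Python with sympy:

```python
from math import comb, factorial
import sympy as sp

t = sp.symbols('t')

def sur_gf(k, r):
    # Sur(k, r) = k! [t^k] (e^t - 1)^r, via a truncated Taylor expansion
    poly = sp.series((sp.exp(t) - 1)**r, t, 0, k + 1).removeO()
    return int(factorial(k) * sp.expand(poly).coeff(t, k))

def sur_ie(k, r):
    # classical inclusion-exclusion count of surjections {1..k} -> {1..r}
    return sum((-1)**j * comb(r, j) * (r - j)**k for j in range(r + 1))

assert all(sur_gf(k, r) == sur_ie(k, r)
           for k in range(1, 8) for r in range(1, k + 1))
print("Sur(k, r) = k! [t^k] (e^t - 1)^r checked for all k <= 7")
```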



(I) The distribution of $T(m)$



Since $\{\,T(m)\leq k\}=\{\,Y_1(k)\geq m_1,\ldots,Y_n(k)\geq m_n\,\}$ the above gives
\begin{equation*}
\mathbb{P}(T(m)\leq k) =\frac{k!} {X_n^k}[t^k]\, \big(e^{t x_1} -p_{x_1}(m_1,e^t-1)\big)\cdot\ldots\cdot\big(e^{t x_n} -p_{x_n}(m_n,e^t-1)\big)
\end{equation*}




From g.f. to probability:
we have to compute \begin{align*}\mathbb{P}(\,Y_1(k)\geq m_1,\ldots,Y_n(k)\geq m_n\,)
&=\sum_{j_1\geq m_1,\ldots, j_n\geq m_n} \mathbb{P}(Y_1(k)=j_1,\ldots,Y_n(k)=j_n)\\
&=\sum_{j_1\geq m_1,\ldots, j_n\geq m_n} [t_1^{j_1}\ldots t_n^{j_n}]\,\mathbb{E}\, t_1^{Y_1(k)}\ldots t_n^{Y_n(k)}.
\end{align*}
Since the function after $[t^k]$ is a product of factors, each containing only one of the $t_i$ variables, we can treat these factors individually.
E.g., to account for $\mathbb{P}(Y_1(k)\geq m_1,\ldots)$ we have to sum up all the coefficients $[t_1^j]$ with $j\geq m_1$ of the first factor.
We may equivalently sum up all coefficients (i.e. put $t_1=1$) and subtract the sum of the first $m_1$ coefficients. Doing that
"inside" (and leaving the $t^k$ extraction "outside"; coefficients may be extracted in arbitrary order), we arrive at the factor $e^{t x_1}-p_{x_1}(m_1,e^t-1)$, etc.




(II) The expectation of $T(m)$



Finally, using $\mathbb{E} T(m)=\sum_{k\geq 0} \mathbb{P}(T(m)>k)$, writing ${k! \over X_n^k} =X_n\,\int_0^\infty s^k e^{-X_ns}\,ds$ and interchanging summation and integration (so that $\sum_{k\geq 0}[t^k]f(t)\,s^k$ collapses to $f(s)$) leads to



$$\mathbb{E}(T(m))=X_n\int_0^\infty \big(1-\prod_{i=1}^n \left(1-p_{x_i}(m_i,e^s-1)\,e^{-x_i s}\right)\big)\,ds$$
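Since each factor $p_{x_i}(m_i,e^s-1)\,e^{-x_i s}$ tends to $0$ exponentially fast (because $m_i-1<x_i$), the integrand decays exponentially and the integral is easy to evaluate numerically. A minimal sketch, assuming scipy's quad for the quadrature (again my addition, with arbitrary example parameters):

```python
import random
from math import comb, exp
from scipy.integrate import quad

def expected_T(xs, ms):
    # E T(m) = X_n * int_0^inf (1 - prod_i (1 - p_{x_i}(m_i, e^s - 1) e^{-x_i s})) ds
    Xn = sum(xs)
    def p(x, m, u):
        return sum(comb(x, k) * u**k for k in range(m))
    def integrand(s):
        prod = 1.0
        for x, m in zip(xs, ms):
            prod *= 1.0 - p(x, m, exp(s) - 1.0) * exp(-x * s)
        return 1.0 - prod
    # the integrand decays at least like e^{-s}, so a finite cutoff suffices
    val, _ = quad(integrand, 0, 50)
    return Xn * val

def expected_T_mc(xs, ms, trials=20_000):
    # simulate drawing with replacement until every quota m_i is met
    coupons = [(i, j) for i, x in enumerate(xs) for j in range(x)]
    total = 0
    for _ in range(trials):
        seen = [set() for _ in xs]
        draws = 0
        while any(len(s) < m for s, m in zip(seen, ms)):
            i, j = random.choice(coupons)
            seen[i].add(j)
            draws += 1
        total += draws
    return total / trials

xs, ms = [3, 2], [2, 1]
print(expected_T(xs, ms))     # formula, up to quadrature error
print(expected_T_mc(xs, ms))  # simulated mean, should agree
```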

