Author: Barak Oshri


Problem Setting

<aside> 👉 An internet company places sensors over a map and monitors traffic. This is too much data to store for further analysis.

The company wants to estimate the number of unique users with property $P$ over the network.


We call this estimation problem the $DistinctOnSubpopulation_p$ problem, or $Distinct_p$ problem. When $P$ is all items, then this is $DistinctElements$, or just $Distinct$. But $P$ can be any arbitrary predicate:

Importantly, we don't specify $P$ ahead of time.

<aside> 💡 Motivation for Sisu. We want to summarize count distinct objects for growing database tables and then do set operations on them, such as count distinct of intersections, to infer join key entity relationships.



Identify a method for combining samples from each sensor so that for any property $P$ we can derive an unbiased estimate of $Distinct_p$, and the variance is bounded by that of the individual sensors' sampling variance. This method is called the Theta-Sketch Framework.