Author: Barak Oshri

## Introduction

### Problem Setting

<aside>
ðŸ‘‰ An internet company places sensors over a map and monitors traffic. This is too much data to store for further analysis.

The company wants to estimate the number of unique users with property $P$ over the network.

</aside>

We call this estimation problem the $DistinctOnSubpopulation_p$ problem, or $Distinct_p$ problem. When $P$ is all items, then this is $DistinctElements$, or just $Distinct$. But $P$ can be any arbitrary predicate:

- Number of people in an age range
- Specified country
- People who accessed certain websites in a time period

Importantly, we don't specify $P$ ahead of time.

<aside>
ðŸ’¡ **Motivation for Sisu.** We want to summarize count distinct objects for growing database tables and then do set operations on them, such as count distinct of intersections, to infer join key entity relationships.

</aside>

**Goal**

Identify a method for combining samples from each sensor so that for any property $P$ we can derive an unbiased estimate of $Distinct_p$, and the variance is bounded by that of the individual sensors' sampling variance. This method is called the *Theta-Sketch Framework*.

- The technical condition for unbiasedness that each sensor must satisfy is called $1$-
*Goodness*.
- If the sampling procedure also satisfies
*monotonicity*, then the variance of the output satisfies the desired bound.