A Framework for Estimating Stream Expression Cardinalities

Author: Barak Oshri

Introduction

Problem Setting

<aside> 👉 An internet company places sensors over a map and monitors traffic. This is too much data to store for further analysis.

The company wants to estimate the number of unique users with property $P$ over the network.

</aside>

We call this estimation problem the $DistinctOnSubpopulation_p$ problem, or $Distinct_p$ problem. When $P$ is all items, then this is $DistinctElements$, or just $Distinct$. But $P$ can be any arbitrary predicate:

Number of people in an age range
Specified country
People who accessed certain websites in a time period

Importantly, we don't specify $P$ ahead of time.

<aside> 💡 Motivation for Sisu. We want to summarize count distinct objects for growing database tables and then do set operations on them, such as count distinct of intersections, to infer join key entity relationships.

</aside>

Goal

Identify a method for combining samples from each sensor so that for any property $P$ we can derive an unbiased estimate of $Distinct_p$, and the variance is bounded by that of the individual sensors' sampling variance. This method is called the Theta-Sketch Framework.

The technical condition for unbiasedness that each sensor must satisfy is called $1$-Goodness.
If the sampling procedure also satisfies monotonicity, then the variance of the output satisfies the desired bound.