DataSketches functions
Apache DataSketches is a high-performance
library of stochastic streaming algorithms (sketch algorithms) that produce
compact probabilistic summaries called sketches. A sketch is a small, stateful
data structure that processes massive data as a stream and can provide
approximate answers with mathematical guarantees much faster than traditional
exact methods. DataSketches functions allow querying these serialized sketches
from Trino. Support for the
Theta Sketch framework
is available through theta_sketch_union() and
theta_sketch_cardinality(), typically used to replace expensive
COUNT(DISTINCT ...) aggregations when sketches are precomputed and stored.
Configuration
Because the DataSketches functions are provided by a connector, they are not available by default. To enable them, you must configure a catalog properties file to register the functions with the specified catalog name.
Create a catalog properties file etc/catalog/datasketches.properties that
references the datasketches connector:
connector.name=datasketches
The DataSketches functions are available in the theta schema. For the
preceding example, the functions use the datasketches.theta catalog and
schema prefix.
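As a minimal sketch of a fully qualified call, the following query assumes the catalog file is named datasketches.properties as in the preceding example. The table hive.analytics.daily_user_sketches and its VARBINARY column user_sketch are hypothetical names used only for illustration:

```sql
-- Hypothetical table and column; only the datasketches.theta prefix
-- comes from the configuration described above.
SELECT
  datasketches.theta.theta_sketch_cardinality(
    datasketches.theta.theta_sketch_union(user_sketch)
  ) AS approx_users
FROM hive.analytics.daily_user_sketches;
```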
To avoid needing to reference the functions with their fully qualified name,
configure the sql.path SQL environment
property in the config.properties file to
include the catalog and schema prefix:
sql.path=datasketches.theta
You can configure multiple catalogs to use the same functions with different DataSketches configurations. In this case, reference the functions by their fully qualified names rather than relying on the SQL path.
Note
Trino does not create new sketches. Build Theta sketches upstream (for example,
in Spark, Hive, or Pig using the Apache DataSketches Theta APIs) and store the
serialized sketch bytes as a VARBINARY column. The Trino functions operate
on serialized Theta sketches only; other sketch families are not supported.
Functions
- theta_sketch_union(sketch[, nominal_entries, seed]) varbinary
Returns a serialized sketch as varbinary that is the union of the input sketches. The optional nominal_entries and seed parameters let you specify a non-default sketch size and seed when merging sketches created with custom settings.
- theta_sketch_cardinality(sketch) double
Returns the estimated number of distinct values (the cardinality) represented by the sketch.
- theta_sketch_cardinality(sketch, seed) double
Returns the estimated number of distinct values represented by the sketch, using the supplied
seed. Use this when the sketch was created with a non-default seed.
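The optional parameters can be illustrated with a short, hedged example. It assumes sketches that were built upstream with a custom seed and a custom nominal entries value, and it assumes that the same seed must then be passed to theta_sketch_cardinality(). The table hive.analytics.daily_user_sketches, the columns event_date and user_sketch, and the values 16384 and 12345 are all hypothetical:

```sql
-- Hypothetical table, columns, and parameter values.
-- The argument order follows the signature above: sketch, nominal_entries, seed.
SELECT
  event_date,
  theta_sketch_cardinality(
    theta_sketch_union(user_sketch, 16384, 12345),
    12345  -- assumed to match the seed used when the sketches were built
  ) AS approx_users
FROM hive.analytics.daily_user_sketches
GROUP BY event_date;
```

The seed and nominal entries values should match the settings that were used upstream to build the stored sketches.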
Examples
The following query reads precomputed customer sketches from
tpch.sf100000.orders, unions them per order date, and produces an approximate
distinct customer count alongside exact spend. Using sketches avoids a heavy
COUNT(DISTINCT ...) over billions of rows while retaining predictable error
bounds.
SELECT
  o_orderdate AS date,
  theta_sketch_cardinality(theta_sketch_union(o_custkey_sketch)) AS unique_user_count,
  SUM(o_totalprice) AS user_spent
FROM tpch.sf100000.orders
GROUP BY o_orderdate;
For comparison, the exact equivalent requires the raw keys and a costly distinct aggregation:
SELECT
  o_orderdate AS date,
  COUNT(DISTINCT o_custkey) AS unique_user_count,
  SUM(o_totalprice) AS user_spent
FROM tpch.sf100000.orders_raw_keys
GROUP BY o_orderdate;
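Because sketches can be merged, the same stored sketches can also be rolled up to a coarser grain without access to the raw keys. The following sketch reuses the table and the o_custkey_sketch column from the first example and groups by month instead of by day. The order_month alias is chosen here for illustration:

```sql
-- Union the sketches per month, then estimate distinct customers.
-- Adding up the daily estimates instead would count a customer once for
-- every day they ordered; the union counts each customer once per month.
SELECT
  date_trunc('month', o_orderdate) AS order_month,
  theta_sketch_cardinality(theta_sketch_union(o_custkey_sketch)) AS unique_user_count
FROM tpch.sf100000.orders
GROUP BY date_trunc('month', o_orderdate);
```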