DataSketches functions#

Apache DataSketches is a high-performance library of stochastic streaming algorithms (sketch algorithms) that produce compact probabilistic summaries called sketches. A sketch is a small, stateful data structure that processes massive data as a stream and can provide approximate answers with mathematical guarantees much faster than traditional exact methods. DataSketches functions allow querying these serialized sketches from Trino. Support for the Theta Sketch framework is available through theta_sketch_union() and theta_sketch_cardinality(), typically used to replace expensive COUNT(DISTINCT ...) aggregations when sketches are precomputed and stored.

Configuration#

Because the DataSketches functions are provided by a connector, they are not available by default. To enable them, you must configure a catalog properties file to register the functions with the specified catalog name.

Create a catalog properties file etc/catalog/datasketches.properties that references the datasketches connector:

connector.name=datasketches

The DataSketches functions are available with the theta schema name. For the preceding example, the functions use the datasketches.theta catalog and schema prefix.

To avoid needing to reference the functions with their fully qualified name, configure the sql.path SQL environment property in the config.properties file to include the catalog and schema prefix:

sql.path=datasketches.theta

Configure multiple catalogs to use the same functions with different DataSketches configurations. In this case, the functions must be referenced using their fully qualified name, rather than relying on the SQL path.

Note

Trino does not create new sketches. Build Theta sketches upstream (for example, in Spark, Hive, or Pig using the Apache DataSketches Theta APIs) and store the serialized sketch bytes as a VARBINARY column. The Trino functions operate on serialized Theta sketches only; other sketch families are not supported.

Functions#

theta_sketch_union(sketch[, nominal_entries, seed]) varbinary#

Returns a serialized sketch as varbinary, which is a merged collection of sketches. The optional nominal_entries and seed parameters let you specify non-default sketch size and seed when merging sketches created with custom settings.

theta_sketch_cardinality(sketch) double#

Returns the estimated value of the sketch.

theta_sketch_cardinality(sketch, seed) double

Returns the estimated value of the sketch using the supplied seed. Use this when the sketch was created with a non-default seed.

Examples#

The following query reads precomputed customer sketches from tpch.sf100000.orders, unions them per order date, and produces an approximate distinct customer count alongside exact spend. Using sketches avoids a heavy COUNT(DISTINCT ...) over billions of rows while retaining predictable error bounds.

SELECT
  o_orderdate AS date,
  theta_sketch_cardinality(theta_sketch_union(o_custkey_sketch)) AS unique_user_count,
  SUM(o_totalprice) AS user_spent
FROM tpch.sf100000.orders
GROUP BY o_orderdate;

For comparison, the exact equivalent requires the raw keys and a costly distinct aggregation:

SELECT
  o_orderdate AS date,
  COUNT(DISTINCT o_custkey) AS unique_user_count,
  SUM(o_totalprice) AS user_spent
FROM tpch.sf100000.orders_raw_keys
GROUP BY o_orderdate;