Liquid Clustering and CLUSTER BY clause

japila-books · Feb 2, 2024 · 2cc7ee5
1 parent b361d20
commit 2cc7ee5

Showing 5 changed files with 113 additions and 3 deletions.
diff --git a/docs/liquid-clustering/ClusterByParserUtils.md b/docs/liquid-clustering/ClusterByParserUtils.md
@@ -0,0 +1,3 @@
+# ClusterByParserUtils
+
+`ClusterByParserUtils` is...FIXME
diff --git a/docs/liquid-clustering/ClusterByPlan.md b/docs/liquid-clustering/ClusterByPlan.md
@@ -0,0 +1,31 @@
+---
+title: ClusterByPlan
+---
+
+# ClusterByPlan Leaf Logical Operator
+
+`ClusterByPlan` is a `LeafNode` ([Spark SQL]({{ book.spark_sql }}/logical-operators/LeafNode)).
+
+`ClusterByPlan` is used to create a [ClusterByParserUtils](ClusterByParserUtils.md).
+
+!!! note "To be removed"
+    [They say the following]({{ delta.github }}/spark/src/main/scala/org/apache/spark/sql/delta/skipping/clustering/temp/ClusterBySpec.scala#L74-L75):
+
+    > This class will be removed when we integrate with OSS Spark's CLUSTER BY implementation.
+    >
+    > See https://github.com/apache/spark/pull/42577
+
+## Creating Instance
+
+`ClusterByPlan` takes the following to be created:
+
+* <span id="clusterBySpec"> [ClusterBySpec](ClusterBySpec.md)
+* <span id="startIndex"> Start Index
+* <span id="stopIndex"> Stop Index
+* <span id="parenStartIndex"> `parenStartIndex`
+* <span id="parenStopIndex"> `parenStopIndex`
+* <span id="ctx"> Antlr's `ParserRuleContext`
+
+`ClusterByPlan` is created when:
+
+* `DeltaSqlAstBuilder` is requested to [parse CLUSTER BY clause](../sql/DeltaSqlAstBuilder.md#visitClusterBy)
diff --git a/docs/liquid-clustering/ClusterBySpec.md b/docs/liquid-clustering/ClusterBySpec.md
@@ -1,13 +1,64 @@
 # ClusterBySpec
 
-## toProperty { #toProperty }
+!!! note "To be removed"
+    [They say the following]({{ delta.github }}/spark/src/main/scala/org/apache/spark/sql/delta/skipping/clustering/temp/ClusterBySpec.scala#L35-L36):
+
+    > This class will be removed when we integrate with OSS Spark's CLUSTER BY implementation.
+    >
+    > See https://github.com/apache/spark/pull/42577
+
+## Creating Instance
+
+`ClusterBySpec` takes the following to be created:
+
+* <span id="columnNames"> Column names (`NamedReference`s)
+
+`ClusterBySpec` is created when:
+
+* `DeltaCatalog` is requested to [convertTransforms](../DeltaCatalog.md#convertTransforms)
+* `ClusterBySpec` is requested to [apply](#apply) and [fromProperty](#fromProperty)
+* `DeltaSqlAstBuilder` is requested to [parse CLUSTER BY clause](../sql/DeltaSqlAstBuilder.md#visitClusterBy)
+
+## Creating ClusterBySpec { #apply }
+
+```scala
+apply[_: ClassTag](
+  columnNames: Seq[Seq[String]]): ClusterBySpec
+```
+
+`apply` creates a [ClusterBySpec](#creating-instance) for the given `columnNames` (converted to `FieldReference`s).
+
+!!! note "No usage found"
+
+## (Re)Creating ClusterBySpec from Table Property { #fromProperty }
+
+```scala
+fromProperty(
+  columns: String): ClusterBySpec
+```
+
+`fromProperty` creates a [ClusterBySpec](#creating-instance) from the given `columns` (a JSON-encoded `ClusterBySpec`).
+
+!!! note
+    `fromProperty` is the inverse of [toProperty](#toProperty).
+
+---
+
+`fromProperty` is used when:
+
+* `ClusteredTableUtilsBase` is requested for a [ClusterBySpec](ClusteredTableUtilsBase.md#getClusterBySpecOptional)
+
+## Converting ClusterBySpec to Table Property { #toProperty }
 
 ```scala
 toProperty(
   clusterBySpec: ClusterBySpec): (String, String)
 ```
 
-`toProperty`...FIXME
+`toProperty` returns a pair of [clusteringColumns](ClusteredTableUtilsBase.md#clusteringColumns) (the key) and the given `ClusterBySpec` in JSON format (the value).
+
+!!! note
+    `toProperty` is the inverse of [fromProperty](#fromProperty).
 
 ---
 

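The `toProperty` / `fromProperty` pair documented in `ClusterBySpec.md` above can be pictured as a JSON round trip. The following is only a minimal sketch, not Delta Lake's code: `SimpleClusterBySpec` is a hypothetical stand-in for `ClusterBySpec`, the `"clusteringColumns"` key is assumed from the link text, the JSON shape (an array of multi-part column names) is an assumption, and the hand-rolled encoding assumes column names contain no quotes or brackets.

```scala
// Hypothetical stand-in for ClusterBySpec (not the Delta Lake class):
// every clustering column is a multi-part name, e.g. Seq("address", "city").
final case class SimpleClusterBySpec(columnNames: Seq[Seq[String]])

object SimpleClusterBySpec {
  // Assumed property key, named after the `clusteringColumns` link in the docs.
  val PropClusteringColumns = "clusteringColumns"

  private val ColumnPattern = """\[([^\[\]]*)\]""".r // one inner [...] group
  private val PartPattern = "\"([^\"]*)\"".r         // one quoted name part

  // Encodes the column names as a JSON array of arrays and pairs it with the key.
  def toProperty(spec: SimpleClusterBySpec): (String, String) = {
    val json = spec.columnNames
      .map(_.map(part => "\"" + part + "\"").mkString("[", ",", "]"))
      .mkString("[", ",", "]")
    PropClusteringColumns -> json
  }

  // Decodes the JSON produced by toProperty back into a spec.
  def fromProperty(columns: String): SimpleClusterBySpec = {
    val body = columns.trim.drop(1).dropRight(1) // strip the outer [ ]
    val names = ColumnPattern.findAllMatchIn(body).map { m =>
      PartPattern.findAllMatchIn(m.group(1)).map(_.group(1)).toList
    }.toList
    SimpleClusterBySpec(names)
  }
}

object RoundTripDemo {
  def main(args: Array[String]): Unit = {
    val spec = SimpleClusterBySpec(Seq(Seq("id"), Seq("address", "city")))
    val (key, value) = SimpleClusterBySpec.toProperty(spec)
    println(s"$key -> $value")
    // fromProperty undoes toProperty
    assert(SimpleClusterBySpec.fromProperty(value) == spec)
  }
}
```

A real implementation would use a JSON library rather than regular expressions; the point here is only that the property value carries enough information to rebuild the spec.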
diff --git a/docs/liquid-clustering/index.md b/docs/liquid-clustering/index.md
@@ -6,10 +6,17 @@ subtitle: Clustered Tables
 
 # Liquid Clustering
 
-**Liquid Clustering** is...FIXME
+**Liquid Clustering** is an optimization technique in Delta Lake that...FIXME
 
 !!! info "Not Recommended for Production Use"
     1. A clustered table is currently in preview and is disabled by default.
     1. A clustered table is not recommended for production use (e.g., unsupported incremental clustering).
 
+Liquid Clustering is used for delta tables that were created with the `CLUSTER BY` clause.
+
 Liquid Clustering is controlled using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.
+
+## Limitations
+
+1. Liquid Clustering cannot be used with partitioning (`PARTITIONED BY`)
+1. Liquid Clustering cannot be used with bucketing (`CLUSTERED BY INTO BUCKETS`)
diff --git a/docs/sql/DeltaSqlAstBuilder.md b/docs/sql/DeltaSqlAstBuilder.md
@@ -39,6 +39,24 @@ visitClone(
 
 `visitClone` creates a [CloneTableStatement](../commands/clone/CloneTableStatement.md) logical operator.
 
+## visitClusterBy { #visitClusterBy }
+
+```scala
+visitClusterBy(
+  ctx: ClusterByContext): LogicalPlan
+```
+
+`visitClusterBy` creates a [ClusterByPlan](../liquid-clustering/ClusterByPlan.md) (with a [ClusterBySpec](../liquid-clustering/ClusterBySpec.md)) for the `CLUSTER BY` clause.
+
+```sql
+CLUSTER BY (interleave [, interleave]*)
+```
+
+`interleave`s are the column names to cluster by.
+
+!!! note
+    `CLUSTER BY` is similar to `ZORDER BY` syntax-wise.
+
 ## visitDescribeDeltaHistory { #visitDescribeDeltaHistory }
 
 ```scala