Liquid Clustering and CLUSTER BY clause
jaceklaskowski committed Feb 2, 2024
1 parent b361d20 commit 2cc7ee5
Showing 5 changed files with 113 additions and 3 deletions.
3 changes: 3 additions & 0 deletions docs/liquid-clustering/ClusterByParserUtils.md
@@ -0,0 +1,3 @@
# ClusterByParserUtils

`ClusterByParserUtils` is...FIXME
31 changes: 31 additions & 0 deletions docs/liquid-clustering/ClusterByPlan.md
@@ -0,0 +1,31 @@
---
title: ClusterByPlan
---

# ClusterByPlan Leaf Logical Operator

`ClusterByPlan` is a `LeafNode` ([Spark SQL]({{ book.spark_sql }}/logical-operators/LeafNode)).

`ClusterByPlan` is used to create a [ClusterByParserUtils](ClusterByParserUtils.md).

!!! note "To be removed"
    [They say the following]({{ delta.github }}/spark/src/main/scala/org/apache/spark/sql/delta/skipping/clustering/temp/ClusterBySpec.scala#L74-L75):

    > This class will be removed when we integrate with OSS Spark's CLUSTER BY implementation.
    >
    > See https://github.com/apache/spark/pull/42577

## Creating Instance

`ClusterByPlan` takes the following to be created:

* <span id="clusterBySpec"> [ClusterBySpec](ClusterBySpec.md)
* <span id="startIndex"> Start Index
* <span id="stopIndex"> Stop Index
* <span id="parenStartIndex"> `parenStartIndex`
* <span id="parenStopIndex"> `parenStopIndex`
* <span id="ctx"> Antlr's `ParserRuleContext`

`ClusterByPlan` is created when:

* `DeltaSqlAstBuilder` is requested to [parse CLUSTER BY clause](../sql/DeltaSqlAstBuilder.md#visitClusterBy)
55 changes: 53 additions & 2 deletions docs/liquid-clustering/ClusterBySpec.md
@@ -1,13 +1,64 @@
# ClusterBySpec

!!! note "To be removed"
    [They say the following]({{ delta.github }}/spark/src/main/scala/org/apache/spark/sql/delta/skipping/clustering/temp/ClusterBySpec.scala#L35-L36):

    > This class will be removed when we integrate with OSS Spark's CLUSTER BY implementation.
    >
    > See https://github.com/apache/spark/pull/42577

## Creating Instance

`ClusterBySpec` takes the following to be created:

* <span id="columnNames"> Column names (`NamedReference`s)

`ClusterBySpec` is created when:

* `DeltaCatalog` is requested to [convertTransforms](../DeltaCatalog.md#convertTransforms)
* `ClusterBySpec` is requested to [apply](#apply) and [fromProperty](#fromProperty)
* `DeltaSqlAstBuilder` is requested to [parse CLUSTER BY clause](../sql/DeltaSqlAstBuilder.md#visitClusterBy)

## Creating ClusterBySpec { #apply }

```scala
apply[_: ClassTag](
  columnNames: Seq[Seq[String]]): ClusterBySpec
```

`apply` creates a [ClusterBySpec](#creating-instance) for the given `columnNames` (converted to `FieldReference`s).
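
The conversion described above can be pictured with a small, self-contained sketch. `FieldRef`, `ClusterBySpecSketch` and `fromNameParts` are hypothetical stand-ins introduced only for illustration; they are not the Spark SQL or Delta Lake classes, and the sketch assumes that every inner `Seq[String]` holds the name parts of a single (possibly nested) column.

```scala
// Hypothetical stand-ins for illustration only (not Spark SQL or Delta Lake classes).
final case class FieldRef(parts: Seq[String]) {
  // Dotted name, e.g. Seq("address", "city") becomes "address.city"
  override def toString: String = parts.mkString(".")
}

final case class ClusterBySpecSketch(columnNames: Seq[FieldRef])

object ClusterBySpecSketch {
  // Same shape as the `apply` described above: one inner Seq per column,
  // each converted to a single field reference.
  def fromNameParts(columnNames: Seq[Seq[String]]): ClusterBySpecSketch =
    ClusterBySpecSketch(columnNames.map(parts => FieldRef(parts)))
}

object ApplySketchDemo {
  def main(args: Array[String]): Unit = {
    val spec = ClusterBySpecSketch.fromNameParts(Seq(Seq("id"), Seq("address", "city")))
    println(spec.columnNames.mkString(", ")) // prints: id, address.city
  }
}
```

The sketch uses a separately named factory method, so it does not need the `[_: ClassTag]` context bound that appears in the real signature.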

!!! note "No usage found"

## (Re)Creating ClusterBySpec from Table Property { #fromProperty }

```scala
fromProperty(
  columns: String): ClusterBySpec
```

`fromProperty` creates a [ClusterBySpec](#creating-instance) for the given `columns` (being a JSON-ified `ClusterBySpec`).

!!! note
    `fromProperty` is the inverse of [toProperty](#toProperty).

---

`fromProperty` is used when:

* `ClusteredTableUtilsBase` is requested for a [ClusterBySpec](ClusteredTableUtilsBase.md#getClusterBySpecOptional)

## Converting ClusterBySpec to Table Property { #toProperty }

```scala
toProperty(
  clusterBySpec: ClusterBySpec): (String, String)
```

`toProperty` returns a pair of [clusteringColumns](ClusteredTableUtilsBase.md#clusteringColumns) and the given `ClusterBySpec` (in JSON format).

!!! note
    `toProperty` is the inverse of [fromProperty](#fromProperty).

---
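
The round trip between `toProperty` and `fromProperty` can be sketched as follows. This is a minimal illustration and not the Delta Lake implementation: `ClusteringPropertySketch` is a hypothetical object, the property key is assumed to be the literal `clusteringColumns` (taken from the link text above), and the array-of-arrays JSON layout is an assumption. The hand-rolled encoding only supports a non-empty list of column names that contain no quotes, commas or brackets.

```scala
// Hypothetical sketch of the toProperty / fromProperty round trip.
// Real code would rely on a JSON library; this one only handles simple names.
object ClusteringPropertySketch {
  // Assumed property key (based on the `clusteringColumns` link text).
  val Key = "clusteringColumns"

  // Seq(Seq("id"), Seq("address", "city")) becomes [["id"],["address","city"]]
  def toProperty(columnNames: Seq[Seq[String]]): (String, String) = {
    val json = columnNames
      .map(parts => parts.map(p => "\"" + p + "\"").mkString("[", ",", "]"))
      .mkString("[", ",", "]")
    Key -> json
  }

  // Matches one inner array, i.e. brackets with no nested brackets inside.
  private val InnerArray = """\[([^\[\]]*)\]""".r

  def fromProperty(columns: String): Seq[Seq[String]] =
    InnerArray.findAllMatchIn(columns).map { m =>
      m.group(1).split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")).toList
    }.toList

  def main(args: Array[String]): Unit = {
    val columnNames = Seq(Seq("id"), Seq("address", "city"))
    val (key, value) = toProperty(columnNames)
    println(s"$key -> $value") // prints: clusteringColumns -> [["id"],["address","city"]]
    assert(fromProperty(value) == columnNames) // the two methods are inverses
  }
}
```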

9 changes: 8 additions & 1 deletion docs/liquid-clustering/index.md
@@ -6,10 +6,17 @@ subtitle: Clustered Tables

# Liquid Clustering

**Liquid Clustering** is an optimization technique in Delta Lake that...FIXME

!!! info "Not Recommended for Production Use"
    1. A clustered table is currently in preview and is disabled by default.
    1. A clustered table is not recommended for production use (e.g., incremental clustering is not supported).

Liquid Clustering is used for delta tables that were created with the `CLUSTER BY` clause.

Liquid Clustering is controlled using the [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.
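
A minimal sketch of enabling the preview and creating a clustered table could look as follows. It assumes that setting the property to `true` is what enables the feature. The table name `demo_events`, its columns and the `runSql` parameter are hypothetical; `runSql` stands in for something like `spark.sql(_)` in a Spark session with Delta Lake configured.

```scala
object ClusteredTableSketch {
  // Configuration property named in the text above.
  val PreviewFlag = "spark.databricks.delta.clusteredTable.enableClusteringTablePreview"

  // `runSql` is a placeholder for a SQL executor, e.g. `sql => spark.sql(sql)`.
  def createClusteredTable(runSql: String => Unit): Unit = {
    // Assumption: `true` turns the (disabled by default) preview on.
    runSql(s"SET $PreviewFlag=true")
    // Hypothetical table, clustered by the `id` column.
    runSql("CREATE TABLE demo_events (id BIGINT, event_date DATE) USING delta CLUSTER BY (id)")
  }

  def main(args: Array[String]): Unit =
    // Without a Spark session at hand, just print the statements.
    createClusteredTable(sql => println(sql))
}
```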

## Limitations

1. Liquid Clustering cannot be used with partitioning (`PARTITIONED BY`)
1. Liquid Clustering cannot be used with bucketing (`CLUSTERED BY INTO BUCKETS`)
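
The two limitations above can be expressed as a simple check. `TableLayoutSketch` and `LiquidClusteringLimitsSketch` are hypothetical names used only to illustrate the rules; they do not reflect how Delta Lake enforces them or the wording of its error messages.

```scala
// Hypothetical model of what a CREATE TABLE statement asks for (not a Delta Lake class).
final case class TableLayoutSketch(
    clusterBy: Seq[String] = Nil,
    partitionedBy: Seq[String] = Nil,
    bucketed: Boolean = false)

object LiquidClusteringLimitsSketch {
  // Lists which of the limitations above a layout runs into (empty when none apply).
  def violations(layout: TableLayoutSketch): Seq[String] =
    if (layout.clusterBy.isEmpty) Nil // no CLUSTER BY, nothing to check
    else {
      val partitioning =
        if (layout.partitionedBy.nonEmpty) Seq("CLUSTER BY cannot be combined with PARTITIONED BY")
        else Nil
      val bucketing =
        if (layout.bucketed) Seq("CLUSTER BY cannot be combined with CLUSTERED BY INTO BUCKETS")
        else Nil
      partitioning ++ bucketing
    }

  def main(args: Array[String]): Unit = {
    println(violations(TableLayoutSketch(clusterBy = Seq("id"))))
    println(violations(TableLayoutSketch(clusterBy = Seq("id"), partitionedBy = Seq("event_date"))))
  }
}
```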
18 changes: 18 additions & 0 deletions docs/sql/DeltaSqlAstBuilder.md
@@ -39,6 +39,24 @@ visitClone(

`visitClone` creates a [CloneTableStatement](../commands/clone/CloneTableStatement.md) logical operator.

## visitClusterBy { #visitClusterBy }

```scala
visitClusterBy(
  ctx: ClusterByContext): LogicalPlan
```

`visitClusterBy` creates a [ClusterByPlan](../liquid-clustering/ClusterByPlan.md) (with a [ClusterBySpec](../liquid-clustering/ClusterBySpec.md)) for the `CLUSTER BY` clause.

```sql
CLUSTER BY (interleave [, interleave]*)
```

`interleave`s are the column names to cluster by.

!!! note
    `CLUSTER BY` is similar to `ZORDER BY` syntax-wise.
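
To illustrate what the clause carries, the following sketch extracts the `interleave` column names from a SQL text with a regular expression. It is only a rough stand-in for the ANTLR-based parsing that `visitClusterBy` takes part in: `ClusterByClauseSketch` is a hypothetical object, quoted identifiers are not handled, and dotted names are simply split into their parts.

```scala
object ClusterByClauseSketch {
  // Matches `CLUSTER BY ( ... )` case-insensitively and captures the text in parentheses.
  private val Clause = """(?i)CLUSTER\s+BY\s*\(([^)]*)\)""".r

  // Returns the name parts of every column, e.g. "a, b.c" gives List(List(a), List(b, c)).
  // None means the SQL text has no CLUSTER BY clause.
  def parse(sql: String): Option[Seq[Seq[String]]] =
    Clause.findFirstMatchIn(sql).map { m =>
      m.group(1)
        .split(",")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map(name => name.split("\\.").toList)
        .toList
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical table definition
    val sql = "CREATE TABLE demo_events (id BIGINT, event_date DATE) USING delta CLUSTER BY (id, event_date)"
    println(parse(sql)) // prints: Some(List(List(id), List(event_date)))
  }
}
```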

## visitDescribeDeltaHistory { #visitDescribeDeltaHistory }

```scala