Liquid clustering, CREATE TABLE and DomainMetadata
jaceklaskowski committed Feb 3, 2024
1 parent a949a5e commit 5eab298
Showing 6 changed files with 134 additions and 6 deletions.
15 changes: 15 additions & 0 deletions docs/DomainMetadata.md
@@ -0,0 +1,15 @@
# DomainMetadata

`DomainMetadata` is an [Action](Action.md) that...FIXME

## Creating Instance

`DomainMetadata` takes the following to be created:

* <span id="domain"> Domain Name
* <span id="configuration"> Configuration
* <span id="removed"> `removed` flag

`DomainMetadata` is created when:

* `JsonMetadataDomain` is requested to `toDomainMetadata`
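
As a purely illustrative sketch (not taken from this page), a `DomainMetadata` action serialized to a delta commit (JSON) file could look as follows. The `delta.clustering` domain name and the configuration value are example assumptions:

```json
{
  "domainMetadata": {
    "domain": "delta.clustering",
    "configuration": "{\"clusteringColumns\":[[\"id\"]]}",
    "removed": false
  }
}
```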
65 changes: 65 additions & 0 deletions docs/commands/create-table/CreateDeltaTableCommand.md
@@ -142,6 +142,58 @@ handleCreateTable(

`handleCreateTable`...FIXME

#### Executing Post-Commit Updates { #runPostCommitUpdates }

```scala
runPostCommitUpdates(
sparkSession: SparkSession,
txnUsedForCommit: OptimisticTransaction,
deltaLog: DeltaLog,
tableWithLocation: CatalogTable): Unit
```

??? warning "Procedure"
    `runPostCommitUpdates` is a procedure (returns `Unit`) so _what happens inside stays inside_ (paraphrasing the [former advertising slogan of Las Vegas, Nevada](https://idioms.thefreedictionary.com/what+happens+in+Vegas+stays+in+Vegas)).

`runPostCommitUpdates` prints out the following INFO message to the logs:

```text
Table is path-based table: [tableByPath]. Update catalog with mode: [operation]
```

`runPostCommitUpdates` requests the given [DeltaLog](#deltaLog) to [update](../../SnapshotManagement.md#update).

`runPostCommitUpdates` [updates the catalog](#updateCatalog).

In the end, when `delta.universalFormat.enabledFormats` table property contains `iceberg`, `runPostCommitUpdates` requests the `UniversalFormatConverter` to `convertSnapshot`.
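
As an example (a sketch only, not covered above; the table name and schema are assumptions), the Iceberg universal format could be requested at table creation time with the `delta.universalFormat.enabledFormats` table property:

```sql
CREATE TABLE delta_table (id INT)
USING delta
TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
```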

#### Updating Table Catalog { #updateCatalog }

```scala
updateCatalog(
spark: SparkSession,
table: CatalogTable,
snapshot: Snapshot,
didNotChangeMetadata: Boolean): Unit
```

??? warning "Procedure"
    `updateCatalog` is a procedure (returns `Unit`) so _what happens inside stays inside_ (paraphrasing the [former advertising slogan of Las Vegas, Nevada](https://idioms.thefreedictionary.com/what+happens+in+Vegas+stays+in+Vegas)).

??? note "`didNotChangeMetadata` Not Used"
    The given `didNotChangeMetadata` input argument is not used.
`updateCatalog` prints out the following INFO message to the logs:

```text
Table is path-based table: [tableByPath]. Update catalog with mode: [operation]
```

`updateCatalog`...FIXME

## Provided Metadata { #getProvidedMetadata }

@@ -171,3 +223,16 @@ Metadata | Value
`getProvidedMetadata` is used when:

* `CreateDeltaTableCommand` is requested to [handleCreateTable](#handleCreateTable) and [replaceMetadataIfNecessary](#replaceMetadataIfNecessary)

## Logging

Enable `ALL` logging level for `org.apache.spark.sql.delta.commands.CreateDeltaTableCommand` logger to see what happens inside.

Add the following line to `conf/log4j2.properties`:

```text
logger.CreateDeltaTableCommand.name = org.apache.spark.sql.delta.commands.CreateDeltaTableCommand
logger.CreateDeltaTableCommand.level = all
```

Refer to [Logging](../../logging.md).
15 changes: 15 additions & 0 deletions docs/commands/create-table/index.md
@@ -1 +1,16 @@
# CREATE TABLE

```sql
CREATE TABLE (IF NOT EXISTS)? [table]...

(CREATE OR)? REPLACE TABLE [table]...
```

=== "SQL"

    ```sql
    CREATE TABLE IF NOT EXISTS delta_table
    USING delta
    AS
    SELECT * FROM VALUES (1, 2, 3)
    ```
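
A `REPLACE TABLE` variant (a sketch based on the grammar above; the table and the query are examples only):

=== "SQL"

    ```sql
    CREATE OR REPLACE TABLE delta_table
    USING delta
    AS
    SELECT * FROM VALUES (1, 2, 3)
    ```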
2 changes: 1 addition & 1 deletion docs/liquid-clustering/ClusteredTableUtilsBase.md
@@ -6,7 +6,7 @@ title: ClusteredTableUtils

## clusteringColumns { #PROP_CLUSTERING_COLUMNS }

`ClusteredTableUtilsBase` defines `clusteringColumns` value as the name of the table property that stores the clustering columns of a delta table.

`clusteringColumns` is used when:

42 changes: 37 additions & 5 deletions docs/liquid-clustering/index.md
@@ -12,11 +12,43 @@ subtitle: Clustered Tables
1. A clustered table is currently in preview and is disabled by default.
1. A clustered table is not recommended for production use (e.g., unsupported incremental clustering).

Liquid Clustering can be enabled using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.

```sql
SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview=true
```

Liquid Clustering can be applied to delta tables that were created with `CLUSTER BY` clause.

```sql
CREATE TABLE IF NOT EXISTS delta_table
USING delta
CLUSTER BY (id)
AS
SELECT * FROM VALUES 1, 2, 3 t(id)
```

The clustering columns of a delta table are stored (_persisted_) in a table catalog (as [clusteringColumns](ClusteredTableUtilsBase.md#clusteringColumns) table property).

```sql
DESC EXTENDED delta_table
```

```text
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|id |int |NULL |
| | | |
|# Detailed Table Information| | |
|Name |spark_catalog.default.delta_table | |
|Type |MANAGED | |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_table | |
|Provider |delta | |
|Owner |jacek | |
|Table Properties |[clusteringColumns=[["id"]],delta.feature.clustering=supported,delta.feature.domainMetadata=supported,delta.minReaderVersion=1,delta.minWriterVersion=7]| |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
```
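
The table properties (incl. `clusteringColumns`) can also be displayed with `SHOW TBLPROPERTIES`:

```sql
SHOW TBLPROPERTIES delta_table
```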

## Limitations

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -185,6 +185,7 @@ nav:
- AddCDCFile: AddCDCFile.md
- AddFile: AddFile.md
- CommitInfo: CommitInfo.md
- DomainMetadata: DomainMetadata.md
- FileAction: FileAction.md
- Metadata: Metadata.md
- Protocol: Protocol.md
