Log Cleanup
jaceklaskowski committed Jan 28, 2024
1 parent 44e814b commit ec1c4d1
Showing 10 changed files with 142 additions and 92 deletions.
25 changes: 24 additions & 1 deletion docs/FileNames.md
@@ -12,7 +12,7 @@ deltaFile(
version: Long): Path
```

`deltaFile` creates a `Path` ([Hadoop]({{ hadoop.api }}/Path.html)) to a file in the `path` directory.
`deltaFile` creates a `Path` ([Apache Hadoop]({{ hadoop.api }}/org/apache/hadoop/fs/Path.html)) to a file in the `path` directory.

The format of the file is as follows:

@@ -34,6 +34,29 @@ Examples:
* `SnapshotManagement` is requested for the [LogSegment for a given version](SnapshotManagement.md#getLogSegmentForVersion) (and [validateDeltaVersions](SnapshotManagement.md#validateDeltaVersions))
* [DESCRIBE DETAIL](commands/describe-detail/index.md) command is executed (and [describeDeltaTable](commands/describe-detail/DescribeDeltaDetailCommand.md#describeDeltaTable))

## Creating Hadoop Path To Compacted Delta File { #compactedDeltaFile }

```scala
compactedDeltaFile(
path: Path,
fromVersion: Long,
toVersion: Long): Path
```

!!! note "Not used"

`compactedDeltaFile` creates a `Path` ([Apache Hadoop]({{ hadoop.api }}/org/apache/hadoop/fs/Path.html)) to a file in the `path` directory.

The format of the file is as follows:

```text
[fromVersion with leading 0s, up to 20 digits].[toVersion with leading 0s, up to 20 digits].compacted.json
```

Examples:

* `00000000000000000001.00000000000000012345.compacted.json`
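
The naming scheme above can be sketched with Scala's `f` interpolator. This is a minimal illustration on plain strings rather than Hadoop `Path`s; `CompactedDeltaFileName` and `compactedDeltaFileName` are hypothetical names used for this example only, not Delta Lake's API.

```scala
object CompactedDeltaFileName {
  // Hypothetical helper (not Delta Lake's API): zero-pads both versions
  // to 20 digits and appends the .compacted.json suffix.
  def compactedDeltaFileName(fromVersion: Long, toVersion: Long): String =
    f"$fromVersion%020d.$toVersion%020d.compacted.json"
}
```

Calling `CompactedDeltaFileName.compactedDeltaFileName(1L, 12345L)` gives a name of the same shape as the example above.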

<!---
## Review Me
73 changes: 0 additions & 73 deletions docs/MetadataCleanup.md

This file was deleted.

31 changes: 19 additions & 12 deletions docs/checkpoints/Checkpoints.md
@@ -15,47 +15,51 @@ dataPath: Path

Hadoop [Path]({{ hadoop.api }}/org/apache/hadoop/fs/Path.html) to the data directory of the [delta table](#self)

### <span id="doLogCleanup"> doLogCleanup
### Cleaning Up Expired Logs { #doLogCleanup }

```scala
doLogCleanup(): Unit
doLogCleanup(
snapshotToCleanup: Snapshot): Unit
```

Performs log cleanup (to remove stale log files)
??? warning "Procedure"
`doLogCleanup` is a procedure (returns `Unit`) so _what happens inside stays inside_ (paraphrasing the [former advertising slogan of Las Vegas, Nevada](https://idioms.thefreedictionary.com/what+happens+in+Vegas+stays+in+Vegas)).

Performs [log cleanup](../log-cleanup/index.md)

See:

* [DeltaLog](../DeltaLog.md#doLogCleanup)
* [MetadataCleanup](../log-cleanup/MetadataCleanup.md#doLogCleanup)

Used when:

* `Checkpoints` is requested to [checkpointAndCleanUpDeltaLog](#checkpointAndCleanUpDeltaLog)
* `Checkpoints` is requested to [checkpoint](#checkpoint) (and [checkpointAndCleanUpDeltaLog](#checkpointAndCleanUpDeltaLog))

### <span id="logPath"> logPath
### logPath { #logPath }

```scala
logPath: Path
```

Hadoop [Path]({{ hadoop.api }}/org/apache/hadoop/fs/Path.html) to the log directory of the [delta table](#self)

### <span id="metadata"> Metadata
### Metadata

```scala
metadata: Metadata
```

[Metadata](../Metadata.md) of the [delta table](#self)

### <span id="snapshot"> snapshot
### snapshot

```scala
snapshot: Snapshot
```

[Snapshot](../Snapshot.md) of the [delta table](#self)

### <span id="store"> store
### store

```scala
store: LogStore
@@ -75,7 +79,7 @@ store: LogStore

* [Loading checkpoint metadata in](#loadMetadataFromFile)

## <span id="checkpoint"> Checkpointing
## Checkpointing { #checkpoint }

```scala
checkpoint(): Unit
@@ -87,7 +91,7 @@ checkpoint(

`checkpoint` requests the [LogStore](../DeltaLog.md#store) to [overwrite](../storage/LogStore.md#write) the [_last_checkpoint](#LAST_CHECKPOINT) file with the JSON-encoded checkpoint metadata.

In the end, `checkpoint` [cleans up the expired logs](../MetadataCleanup.md#doLogCleanup) (if enabled).
In the end, `checkpoint` [cleans up the expired logs](../log-cleanup/MetadataCleanup.md#doLogCleanup) (if enabled).

---

@@ -103,6 +107,9 @@ checkpointAndCleanUpDeltaLog(
snapshotToCheckpoint: Snapshot): Unit
```

??? warning "Procedure"
`checkpointAndCleanUpDeltaLog` is a procedure (returns `Unit`) so _what happens inside stays inside_ (paraphrasing the [former advertising slogan of Las Vegas, Nevada](https://idioms.thefreedictionary.com/what+happens+in+Vegas+stays+in+Vegas)).

`checkpointAndCleanUpDeltaLog` does the following (in order):

1. [writeCheckpointFiles](#writeCheckpointFiles)
@@ -164,7 +171,7 @@ lastCheckpoint: Option[CheckpointMetaData]
`lastCheckpoint` is used when:

* `SnapshotManagement` is requested to [load the latest snapshot](../SnapshotManagement.md#getSnapshotAtInit)
* `MetadataCleanup` is requested to [listExpiredDeltaLogs](../MetadataCleanup.md#listExpiredDeltaLogs)
* `MetadataCleanup` is requested to [listExpiredDeltaLogs](../log-cleanup/MetadataCleanup.md#listExpiredDeltaLogs)

### loadMetadataFromFile { #loadMetadataFromFile }

4 changes: 4 additions & 0 deletions docs/log-cleanup/.pages
@@ -0,0 +1,4 @@
title: Log Cleanup
nav:
- index.md
- ...
73 changes: 73 additions & 0 deletions docs/log-cleanup/MetadataCleanup.md
@@ -0,0 +1,73 @@
# MetadataCleanup

`MetadataCleanup` is an abstraction of [metadata cleaners](#implementations) that can [clean up](#doLogCleanup) expired checkpoints and delta logs of a [delta table](#self).

<span id="self"></span>
`MetadataCleanup` can only be used with [DeltaLog](../DeltaLog.md) (or its subtypes).

## Implementations

* [DeltaLog](../DeltaLog.md)

## Table Properties

### enableExpiredLogCleanup { #enableExpiredLogCleanup }

`MetadataCleanup` uses [delta.enableExpiredLogCleanup](../table-properties/DeltaConfigs.md#ENABLE_EXPIRED_LOG_CLEANUP) table property to control [log cleanup](#doLogCleanup).

### logRetentionDuration { #deltaRetentionMillis }

`MetadataCleanup` uses [delta.logRetentionDuration](../table-properties/DeltaConfigs.md#LOG_RETENTION) table property for [cleanUpExpiredLogs](#cleanUpExpiredLogs) (to determine `fileCutOffTime`).

## Cleaning Up Expired Logs { #doLogCleanup }

??? note "Checkpoints"

```scala
doLogCleanup(): Unit
```

`doLogCleanup` is part of the [Checkpoints](../checkpoints/Checkpoints.md#doLogCleanup) abstraction.

`doLogCleanup` runs [cleanUpExpiredLogs](#cleanUpExpiredLogs) when [enabled](#enableExpiredLogCleanup).

### cleanUpExpiredLogs { #cleanUpExpiredLogs }

```scala
cleanUpExpiredLogs(): Unit
```

`cleanUpExpiredLogs` calculates a `fileCutOffTime` based on the [current time](../DeltaLog.md#clock) and the [logRetentionDuration](#deltaRetentionMillis) table property.

`cleanUpExpiredLogs` prints out the following INFO message to the logs:

```text
Starting the deletion of log files older than [date]
```

`cleanUpExpiredLogs` [finds the expired delta logs](#listExpiredDeltaLogs) (based on the `fileCutOffTime`) and deletes the files (using Hadoop's [FileSystem.delete]({{ hadoop.api }}/org/apache/hadoop/fs/FileSystem.html#delete(org.apache.hadoop.fs.Path,%20boolean)) non-recursively). `cleanUpExpiredLogs` counts the deleted files (and reports the count in the summary INFO message).

In the end, `cleanUpExpiredLogs` prints out the following INFO message to the logs:

```text
Deleted [numDeleted] log files older than [date]
```
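
The flow described above can be sketched on a local directory with `java.nio.file` in place of Hadoop's `FileSystem` (Scala 2.13 or later). This is an illustration under stated assumptions, not Delta Lake's implementation: `LogCleanupSketch` and its parameters are hypothetical, the cutoff is assumed to be simply the current time minus the retention period, and every regular file older than the cutoff is treated as expired (the real code first [finds the expired delta logs](#listExpiredDeltaLogs)).

```scala
import java.nio.file.{Files, Path}
import java.util.Date
import scala.jdk.CollectionConverters._

object LogCleanupSketch {
  // Assumption: the cutoff is "now minus retention", both in milliseconds.
  def fileCutOffTime(nowMillis: Long, retentionMillis: Long): Long =
    nowMillis - retentionMillis

  // Deletes regular files in logDir that were last modified before the cutoff
  // and returns how many files were deleted.
  def cleanUpExpiredLogs(logDir: Path, nowMillis: Long, retentionMillis: Long): Int = {
    val cutOff = fileCutOffTime(nowMillis, retentionMillis)
    println(s"Starting the deletion of log files older than ${new Date(cutOff)}")

    // Materialize the listing before closing the underlying directory stream.
    val listing = Files.list(logDir)
    val expired =
      try {
        listing.iterator().asScala
          .filter(p => Files.isRegularFile(p))
          .filter(p => Files.getLastModifiedTime(p).toMillis < cutOff)
          .toList
      } finally listing.close()

    // Files.deleteIfExists removes a single file (non-recursive).
    val numDeleted = expired.count(p => Files.deleteIfExists(p))
    println(s"Deleted $numDeleted log files older than ${new Date(cutOff)}")
    numDeleted
  }
}
```

Passing the current time in as a parameter (rather than reading the system clock inside the method) keeps the sketch easy to test, in the same spirit as the [clock](../DeltaLog.md#clock) used for the current time.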

### Finding Expired Log Files { #listExpiredDeltaLogs }

```scala
listExpiredDeltaLogs(
fileCutOffTime: Long): Iterator[FileStatus]
```

`listExpiredDeltaLogs` [loads the most recent checkpoint](../checkpoints/Checkpoints.md#lastCheckpoint) if available.

If the last checkpoint is not available, `listExpiredDeltaLogs` returns an empty iterator.

`listExpiredDeltaLogs` requests the [LogStore](../DeltaLog.md#store) for the [paths](../storage/LogStore.md#listFrom) (in the same directory) that are lexicographically greater than or equal to the ``0``th checkpoint file (per [checkpointPrefix](../FileNames.md#checkpointPrefix) format), and keeps only the [checkpoint](../FileNames.md#isCheckpointFile) and [delta](../FileNames.md#isDeltaFile) files in the [log directory](../DeltaLog.md#logPath).

In the end, `listExpiredDeltaLogs` creates a `BufferingLogDeletionIterator` that...FIXME
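
The listing step (everything before the `BufferingLogDeletionIterator`) can be sketched on plain file names. This is a rough illustration, not Delta Lake's code: `ListExpiredSketch`, `listFrom` and `candidates` are hypothetical, the `checkpointPrefix` format (a 20-digit version followed by `.checkpoint`) is an assumption, and the two patterns are simplified stand-ins for `isDeltaFile` and `isCheckpointFile`.

```scala
object ListExpiredSketch {
  // Simplified stand-ins for isDeltaFile and isCheckpointFile (assumed patterns).
  private val deltaFilePattern = """\d+\.json"""
  private val checkpointFilePattern = """\d+\.checkpoint(\.\d+\.\d+)?\.parquet"""

  // Assumed checkpointPrefix format: 20-digit version followed by ".checkpoint".
  def checkpointPrefix(version: Long): String = f"$version%020d.checkpoint"

  // Mimics a listFrom-style call: names in sorted order, starting at `from`.
  def listFrom(names: Seq[String], from: String): Iterator[String] =
    names.sorted.iterator.filter(_ >= from)

  // Returns nothing without a checkpoint; otherwise lists from the 0th
  // checkpoint prefix and keeps only delta and checkpoint file names.
  def candidates(names: Seq[String], lastCheckpointVersion: Option[Long]): Iterator[String] =
    lastCheckpointVersion match {
      case None => Iterator.empty
      case Some(_) =>
        listFrom(names, checkpointPrefix(0L))
          .filter(n => n.matches(deltaFilePattern) || n.matches(checkpointFilePattern))
    }
}
```

Because versions are zero-padded to a fixed width, lexicographic order matches version order, which is why starting the listing at the ``0``th checkpoint prefix covers every delta and checkpoint file. Other entries that sort after the prefix (such as `_last_checkpoint`) are dropped by the filter.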

## Logging

Enable `ALL` logging level for the [Implementations](#implementations) logger to see what happens inside.
12 changes: 12 additions & 0 deletions docs/log-cleanup/index.md
@@ -0,0 +1,12 @@
---
hide:
- toc
---

# Log Cleanup

**Log Cleanup** (_Metadata Cleanup_) is used to remove expired metadata log files in the [transaction log](../DeltaLog.md) of a delta table.

Log Cleanup can be executed as part of [table checkpointing](../checkpoints/Checkpoints.md#doLogCleanup) when enabled using [delta.enableExpiredLogCleanup](../table-properties/DeltaConfigs.md#delta.enableExpiredLogCleanup) table property.

Log files are considered expired when they are older than the [delta.logRetentionDuration](../table-properties/DeltaConfigs.md#logRetentionDuration) table property.
2 changes: 1 addition & 1 deletion docs/storage/LogStore.md
@@ -52,7 +52,7 @@ Used when:
* `Checkpoints` is requested to [findLastCompleteCheckpoint](../checkpoints/Checkpoints.md#findLastCompleteCheckpoint)
* `DeltaHistoryManager` is requested to [getEarliestDeltaFile](../DeltaHistoryManager.md#getEarliestDeltaFile), [getEarliestReproducibleCommit](../DeltaHistoryManager.md#getEarliestReproducibleCommit) and [getCommits](../DeltaHistoryManager.md#getCommits)
* `DeltaLog` is requested to [getChanges](../DeltaLog.md#getChanges)
* `MetadataCleanup` is requested to [listExpiredDeltaLogs](../MetadataCleanup.md#listExpiredDeltaLogs)
* `MetadataCleanup` is requested to [listExpiredDeltaLogs](../log-cleanup/MetadataCleanup.md#listExpiredDeltaLogs)
* `SnapshotManagement` is requested to [listFrom](../SnapshotManagement.md#listFrom)
* `DelegatingLogStore` is requested to [listFrom](DelegatingLogStore.md#listFrom)
* `DeltaFileOperations` utility is used to [listUsingLogStore](../DeltaFileOperations.md#listUsingLogStore)
2 changes: 1 addition & 1 deletion docs/table-properties/DeltaConfig.md
@@ -33,7 +33,7 @@ In the end, `fromMetaData` converts the text representation to the proper type u
* `Checkpoints` utility is used to [buildCheckpoint](../checkpoints/Checkpoints.md#buildCheckpoint)
* `DeltaErrors` utility is used to [logFileNotFoundException](../DeltaErrors.md#logFileNotFoundException)
* `DeltaLog` is requested for [checkpointInterval](../DeltaLog.md#checkpointInterval) and [deletedFileRetentionDuration](../DeltaLog.md#tombstoneRetentionMillis) table properties, and to [assert a table is not read-only](../DeltaLog.md#assertRemovable)
* `MetadataCleanup` is requested for the [enableExpiredLogCleanup](../MetadataCleanup.md#enableExpiredLogCleanup) and the [deltaRetentionMillis](../MetadataCleanup.md#deltaRetentionMillis)
* `MetadataCleanup` is requested for the [enableExpiredLogCleanup](../log-cleanup/MetadataCleanup.md#enableExpiredLogCleanup) and the [deltaRetentionMillis](../log-cleanup/MetadataCleanup.md#deltaRetentionMillis)
* `OptimisticTransactionImpl` is requested to [commit](../OptimisticTransactionImpl.md#commit)
* `Snapshot` is requested for the [numIndexedCols](../Snapshot.md#numIndexedCols)

10 changes: 7 additions & 3 deletions docs/table-properties/DeltaConfigs.md
@@ -198,14 +198,18 @@ Used when:
* `Protocol` is requested to [assertTablePropertyConstraintsSatisfied](../Protocol.md#assertTablePropertyConstraintsSatisfied)
* `DeletionVectorUtils` is requested to [deletionVectorsWritable](../deletion-vectors/DeletionVectorUtils.md#deletionVectorsWritable)

### <span id="ENABLE_EXPIRED_LOG_CLEANUP"> enableExpiredLogCleanup { #enableExpiredLogCleanup }
### <span id="ENABLE_EXPIRED_LOG_CLEANUP"><span id="enableExpiredLogCleanup"> enableExpiredLogCleanup { #delta.enableExpiredLogCleanup }

**delta.enableExpiredLogCleanup**

Whether to clean up expired log files and checkpoints
Controls [Log Cleanup](../log-cleanup/index.md)

Default: `true`

Used when:

* `MetadataCleanup` is requested for [whether to clean up expired log files and checkpoints](../log-cleanup/MetadataCleanup.md#enableExpiredLogCleanup)

### <span id="ENABLE_FULL_RETENTION_ROLLBACK"> enableFullRetentionRollback { #enableFullRetentionRollback }

**delta.enableFullRetentionRollback**
@@ -239,7 +243,7 @@ Examples: `2 weeks`, `365 days` (`months` and `years` are not accepted)

Used when:

* `MetadataCleanup` is requested for the [deltaRetentionMillis](../MetadataCleanup.md#deltaRetentionMillis)
* `MetadataCleanup` is requested for the [deltaRetentionMillis](../log-cleanup/MetadataCleanup.md#deltaRetentionMillis)

### <span id="MIN_READER_VERSION"> minReaderVersion { #minReaderVersion }

2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -166,6 +166,7 @@ nav:
- developer-api.md
- ... | generated-columns/**.md
- installation.md
- ... | log-cleanup/**.md
- LIMIT Pushdown:
- limit-pushdown/index.md
- Logging: logging.md
@@ -201,7 +202,6 @@ nav:
- SnapshotManagement: SnapshotManagement.md
- SnapshotDescriptor.md
- ReadChecksum: ReadChecksum.md
- MetadataCleanup: MetadataCleanup.md
- VerifyChecksum: VerifyChecksum.md
- Optimistic Transactions:
- OptimisticTransaction: OptimisticTransaction.md
