Improve ACK docs (#1866)
moscicky authored Jun 10, 2024
1 parent 0c7e112 commit 5c14ec2
Showing 10 changed files with 90 additions and 22 deletions.
4 changes: 2 additions & 2 deletions docs/docs/configuration/buffer-persistence.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Publishing buffer persistence
# Publishing buffer persistence [deprecated]

Hermes Frontend API has an option to register callbacks triggered during different phases of a message's lifetime:

@@ -15,7 +15,7 @@ to disk. Map structure is continuously persisted to disk, as it is stored in off

When Hermes Frontend starts up it scans filesystem in search of existing persisted map. If found, it is read and any
persisted events are sent to Message Store. This way recovering after crash is fully automatic. If Hermes process or
server crashes, nothing is lost.
server crashes, events that were flushed to disk are recovered.

There is additional protection against flooding subscribers with outdated events. When reading events from persisted
storage, Hermes filters out messages older than N hours, where N is a system parameter and is set to 3 days by default.
65 changes: 52 additions & 13 deletions docs/docs/user/publishing.md
Original file line number Diff line number Diff line change
@@ -134,22 +134,21 @@ Failure statuses:
Each topic can define level of acknowledgement (ACK):

* leader ACK - only one Kafka node (leader) needs to acknowledge reception of message
* all ACK - all nodes that hold copy of message need to acknowledge reception of message
* all ACK - at least [min.insync.replicas](https://kafka.apache.org/documentation/#brokerconfigs_min.insync.replicas) nodes must acknowledge reception of message

For most of the topic leader ACK is enough. This guarantees roughly 99.999..% reception rate. Only in rare cases, during
Kafka cluster rebalancing or nodes outage Kafka might confirm that message was received, while it was not saved and it
will be lost.
ACK configuration has the following consequences:

What does it mean in practice? Numbers differ per case and they are affected by multiple factors like frequency of
rebalancing taking place on Kafka clusters, Kafka version etc. In our production environment using ACK leader means we falsely
believe message was received by Kafka once per 20 million events. This is a very rough estimate that should show you
the scale, if you need numbers to base your decision on - please conduct own measurements.
- with `ACK leader`, message writes are replicated asynchronously, so the acknowledgment latency is low. However, a message write may be lost
when there is a topic leadership change, e.g. due to a rebalance or broker restart.
- with `ACK all`, message writes are synchronously replicated to replicas. Write acknowledgment latency will be much higher than with leader ACK,
and it will also have higher variance due to tail latency. However, messages will be persisted as long as the whole replica set does not go down simultaneously.

If you need 100% guarantee that message was saved, force all replicas to send ACK. The downside of this is much longer
response times, they tend to vary a lot as well. Thanks to Hermes buffering (described in paragraphs below), we are able
to guarantee some sane response times to our clients even in *ACK all* mode.
Publishers are advised to select topic ACK level based on their latency and durability requirements.

## Buffering
Hermes also provides a feature called Buffering (described in paragraphs below) which provides consistent write latency
despite long Kafka response times. Note, however, that this mode may decrease message durability for the `ACK all` setting.

## Buffering [deprecated]

Hermes administrator can set maximum time, for which Hermes will wait for Kafka acknowledgment. By default, it is set to
65ms. After that time, **202** response is sent to client. Event is kept in Kafka producer buffer and its delivery will
@@ -161,14 +160,54 @@ Kafka is back online.

### Buffer persistence

By default events are buffered in memory only. This raises the question about what happens in case of Hermes node failure
By default, events are buffered in memory only. This raises the question about what happens in case of Hermes node failure
(or force kill of process). Hermes Frontend API exposes callbacks that can be used to implement persistence model of
buffered events.

Default implementation uses [OpenHFT ChronicleMap](https://github.com/OpenHFT/Chronicle-Map) to persist unsent messages
to disk. Map structure is continuously persisted to disk, as it is stored in offheap memory as
[memory mapped file](https://en.wikipedia.org/wiki/Memory-mapped_file).

Using buffering with the `ACK all` setting means that durability of events may be lowered when a **202** status code is received. If a Hermes instance
is killed before the message is spilled to disk, or the data on disk becomes corrupted, the message is lost. Thus `ACK all` with a **202** status code
is similar to `ACK leader`, because a single node failure could cause the message to be lost.

### Deprecation notice
The buffering mechanism in Hermes is considered deprecated and is set to be removed in the future.

## Remote DC fallback

Hermes supports a remote datacenter fallback mechanism for [multi datacenter deployments](https://hermes-pubsub.readthedocs.io/en/latest/configuration/kafka-and-zookeeper/#multiple-kafka-and-zookeeper-clusters).

Fallback is configured on a per-topic basis, using the `fallbackToRemoteDatacenterEnabled` property:

```http request
PUT /topics/my.group.my-topic
{
"fallbackToRemoteDatacenterEnabled": true
}
```

Using this setting automatically disables the buffering mechanism for the topic.

When using this setting for a topic, Hermes will try to send a message to a local datacenter Kafka first and will fall back to remote datacenter Kafka
if the local send fails.

Hermes also provides a speculative fallback mechanism which will send messages to remote Kafka if the local Kafka is not responding in a timely manner.
Speculative send is performed after `frontend.kafka.fail-fast-producer.speculativeSendDelay` elapses.

When using remote DC fallback, Hermes attempts to send a message to Kafka for the duration of `frontend.handlers.maxPublishRequestDuration` property. If after
`maxPublishRequestDuration` Hermes has not received an acknowledgment from Kafka, it will respond with **500** status code to the client.

Table below summarizes remote fallback configuration options:

| Option | Scope | Default value |
|--------------------------------------------------------|--------|---------------|
| fallbackToRemoteDatacenterEnabled | topic | false |
| frontend.kafka.fail-fast-producer.speculativeSendDelay | global | 250ms |
| frontend.handlers.maxPublishRequestDuration | global | 500ms |
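
The timing rules above can be read as a small decision model. The following TypeScript sketch is only an illustration under stated assumptions, not Hermes code: `simulatePublish`, `LocalResult` and `PublishOutcome` are hypothetical names, the defaults are copied from the table, and the model assumes that a remote send starts as soon as the local send fails or once `speculativeSendDelay` elapses (whichever comes first), and that a slow local send can still be acknowledged after the speculative send has started.

```typescript
// Illustrative model only; every name here is hypothetical, not a Hermes API.
interface FallbackConfig {
  speculativeSendDelayMs: number; // frontend.kafka.fail-fast-producer.speculativeSendDelay
  maxPublishRequestDurationMs: number; // frontend.handlers.maxPublishRequestDuration
}

// What the local Kafka does with the message, measured from request start.
type LocalResult =
  | { kind: "ack"; atMs: number }
  | { kind: "error"; atMs: number }
  | { kind: "silent" };

type PublishOutcome =
  | { outcome: "acknowledged"; by: "local" | "remote"; atMs: number }
  | { outcome: "http500" };

// Defaults as listed in the table above.
const documentedDefaults: FallbackConfig = {
  speculativeSendDelayMs: 250,
  maxPublishRequestDurationMs: 500,
};

function simulatePublish(
  local: LocalResult,
  remoteLatencyMs: number | null, // null: remote Kafka never acknowledges
  cfg: FallbackConfig = documentedDefaults
): PublishOutcome {
  const localAckAt = local.kind === "ack" ? local.atMs : Infinity;
  const localErrorAt = local.kind === "error" ? local.atMs : Infinity;

  // Assumption: the remote send starts on local failure or when the
  // speculative delay elapses, whichever happens first.
  const remoteStartAt = Math.min(localErrorAt, cfg.speculativeSendDelayMs);

  // No remote send is needed if the local ACK arrived before it would start.
  const remoteAckAt =
    remoteLatencyMs === null || localAckAt <= remoteStartAt
      ? Infinity
      : remoteStartAt + remoteLatencyMs;

  const firstAckAt = Math.min(localAckAt, remoteAckAt);
  if (firstAckAt > cfg.maxPublishRequestDurationMs) {
    return { outcome: "http500" }; // no ACK within maxPublishRequestDuration
  }
  return {
    outcome: "acknowledged",
    by: localAckAt <= remoteAckAt ? "local" : "remote",
    atMs: firstAckAt,
  };
}
```

Under this model and the default values, a local Kafka that never responds combined with a remote Kafka that needs 100 ms is acknowledged by the remote side at 350 ms, while a remote latency of 300 ms exceeds `maxPublishRequestDuration` and results in a **500**.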

## Partition assignment
`Partition-Key` header can be used by publishers to specify Kafka `key` which will be used for partition assignment for a message. This will ensure
that all messages with given `Partition-Key` will be sent to the same Kafka partition.
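
Why a fixed key leads to a fixed partition can be shown with a minimal TypeScript sketch. It is an illustration only: `partitionFor` and `fnv1a32` are hypothetical helpers, and the FNV-1a hash is a stand-in chosen for brevity, not the hash Kafka or Hermes actually uses. The point is only that a deterministic hash of the key, taken modulo the partition count, always yields the same partition for the same key.

```typescript
// Stand-in hash (32-bit FNV-1a over UTF-16 code units).
// Assumption: used here for illustration only; Kafka's partitioner hashes key bytes differently.
function fnv1a32(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit value.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical helper: maps a Partition-Key value to a partition index.
function partitionFor(partitionKey: string, partitionCount: number): number {
  if (!Number.isInteger(partitionCount) || partitionCount <= 0) {
    throw new RangeError("partitionCount must be a positive integer");
  }
  return fnv1a32(partitionKey) % partitionCount;
}

// The same key always maps to the same partition.
const first = partitionFor("user-42", 12);
const second = partitionFor("user-42", 12);
console.log(first === second);
```

Messages that share a `Partition-Key` therefore keep their relative order within that partition, as long as the number of partitions does not change.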
11 changes: 10 additions & 1 deletion hermes-console/src/components/console-alert/ConsoleAlert.vue
Original file line number Diff line number Diff line change
@@ -4,6 +4,8 @@
title?: string;
text: string;
type: 'error' | 'success' | 'warning' | 'info';
link?: string;
linkDescription?: string;
}>();
</script>

@@ -15,7 +17,14 @@
:type="props.type"
border="start"
:icon="icon ?? `\$${type}`"
></v-alert>
>
<a
v-if="link != null && linkDescription != null"
:href="link"
target="_blank"
>{{ linkDescription }}</a
>
</v-alert>
</template>
<style scoped lang="scss"></style>
Original file line number Diff line number Diff line change
@@ -103,7 +103,7 @@ function initializeForm(form: Ref<TopicForm>): void {
trackingEnabled: false,
contentType: loadedConfig.value.topic.defaults.contentType,
maxMessageSize: defaultMaxMessageSize,
ack: loadedConfig.value.topic.defaults.ack,
ack: '',
schema: '',
};
}
Original file line number Diff line number Diff line change
@@ -53,6 +53,7 @@ export interface FormValidators {
retentionTimeDuration: FieldValidator<number>[];
maxMessageSize: FieldValidator<number>[];
offlineRetentionTime: FieldValidator<number>[];
ack: FieldValidator<string>[];
}

export interface RawDataSources {
Original file line number Diff line number Diff line change
@@ -41,6 +41,7 @@ function formValidators(): FormValidators {
retentionTimeDuration: [required(), min(0), max(7)],
maxMessageSize: [required(), min(0)],
offlineRetentionTime: [required(), min(0)],
ack: [required()],
};
}

3 changes: 2 additions & 1 deletion hermes-console/src/dummy/topic-form.ts
Original file line number Diff line number Diff line change
@@ -46,6 +46,7 @@ export const dummyTopicFormValidator = {
retentionTimeDuration: [required(), min(0), max(7)],
maxMessageSize: [required(), min(0)],
offlineRetentionTime: [required(), min(0)],
ack: [required()],
};

export const dummyContentTypes = [
@@ -123,7 +124,7 @@ export const dummyInitializedTopicForm = {
trackingEnabled: false,
contentType: dummyAppConfig.topic.defaults.contentType,
maxMessageSize: defaultMaxMessageSize,
ack: dummyAppConfig.topic.defaults.ack,
ack: '',
schema: '',
};

8 changes: 4 additions & 4 deletions hermes-console/src/i18n/en-US/index.ts
Original file line number Diff line number Diff line change
@@ -327,10 +327,10 @@ const en_US = {
modificationDate: 'Modification date',
tooltips: {
acknowledgement:
'Specifies the strength of guarantees that acknowledged message was indeed persisted. In ' +
'"Leader" mode ACK is required only from topic leader, which is fast and gives 99.99999% guarantee. It might ' +
'be not enough when cluster is unstable. "All" mode means message needs to be saved on all replicas before ' +
'sending ACK, which is quite slow but gives 100% guarantee that message has been persisted.',
'Specifies the strength of guarantees that acknowledged message was indeed persisted. ' +
'With `ACK leader` message writes are replicated asynchronously, thus the acknowledgment latency will be low. However, message write may be lost when there is a topic leadership change - e.g. due to rebalance or broker restart. ' +
'With `ACK all` message writes are synchronously replicated to replicas. Write acknowledgement latency will be much higher than with leader ACK,' +
' it will also have higher variance due to tail latency. However, messages will be persisted as long as the whole replica set does not go down simultaneously.',
retentionTime:
'For how many hours/days message is available for subscribers after being published.',
authorizedPublishers:
6 changes: 6 additions & 0 deletions hermes-console/src/i18n/en-US/topic-form.ts
Original file line number Diff line number Diff line change
@@ -32,6 +32,12 @@ const messages = {
days: 'DAYS',
},
ack: 'Kafka ACK level',
ackHelpTitle: 'ACK level is very important',
ackHelpText:
'Set ACK level according to your durability and latency requirements, see: ',
ackHelpLink:
'https://hermes-pubsub.readthedocs.io/en/latest/user/publishing/#acknowledgment-level',
ackHelpLinkDescription: 'ACK docs.',
contentType: 'Content type',
maxMessageSize: {
label: 'Max message size',
11 changes: 11 additions & 0 deletions hermes-console/src/views/topic/topic-form/TopicForm.vue
Original file line number Diff line number Diff line change
@@ -218,8 +218,19 @@
:label="$t('topicForm.fields.retentionTime.duration')"
/>
<console-alert
:title="$t('topicForm.fields.ackHelpTitle')"
:text="$t('topicForm.fields.ackHelpText')"
type="info"
class="mb-4"
:link="$t('topicForm.fields.ackHelpLink')"
:link-description="$t('topicForm.fields.ackHelpLinkDescription')"
>
</console-alert>
<select-field
v-model="form.ack"
:rules="validators.ack"
:label="$t('topicForm.fields.ack')"
:items="dataSources.ackModes"
/>