compression: additional limit for effective dict #12087

Draft
wants to merge 4 commits into
base: main
from

Conversation

stevemilk
Contributor

@stevemilk stevemilk marked this pull request as draft September 25, 2024 13:56

m.RLock()
if _, ok := usedPatterns[p.code]; !ok && len(usedPatterns) >= effectiveDictLimit {
continue
Collaborator


This `continue` jumps out of the mutex section without releasing the lock: it's a deadlock.

Contributor Author


Oh, sorry, I just noticed you already fixed it.
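
For reference, one way to avoid the problem flagged in this review thread is to take and release the read lock inside a small helper, so that a `continue` in the caller can never skip the unlock. This is only a minimal sketch, not the code from this PR: the `pattern` type, `admit`, and `selectPatterns` are hypothetical names, and the loop is single-goroutine for simplicity.

```go
package main

import (
	"fmt"
	"sync"
)

// pattern is a hypothetical stand-in for the compressor's pattern type.
type pattern struct {
	code uint64
}

// admit reports whether p may be used, given the patterns already in use.
// The read lock is released via defer, so no early return here (and no
// `continue` in the caller) can leave the lock held.
func admit(m *sync.RWMutex, used map[uint64]struct{}, p pattern, limit int) bool {
	m.RLock()
	defer m.RUnlock()
	if _, ok := used[p.code]; !ok && len(used) >= limit {
		return false
	}
	return true
}

// selectPatterns walks the candidates and records the ones that pass admit.
// With several goroutines, the check and the insert would need to happen
// under one write lock to keep the limit exact; that is omitted here.
func selectPatterns(candidates []pattern, limit int) []uint64 {
	var m sync.RWMutex
	used := make(map[uint64]struct{})
	var out []uint64
	for _, p := range candidates {
		if !admit(&m, used, p, limit) {
			continue // safe: admit has already released the read lock
		}
		m.Lock()
		if _, ok := used[p.code]; !ok {
			used[p.code] = struct{}{}
			out = append(out, p.code)
		}
		m.Unlock()
	}
	return out
}

func main() {
	got := selectPatterns([]pattern{{1}, {2}, {1}, {3}}, 2)
	fmt.Println(got) // prints [1 2]
}
```

With a limit of 2, pattern 3 is rejected because two distinct patterns are already in use, while the repeated pattern 1 is still accepted.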

@AskAlexSharov
Collaborator

AskAlexSharov commented Sep 27, 2024

compression ratio seems impacted:

2.4G	v1-060700-060800-transactions.seg.compressed_s1_dict_8k_min20_max128_sof1_dictInt256k_old
3.1G	v1-060700-060800-transactions.seg.compressed_dict_8k_min20_max128_sof1_dictInt128k_new

I didn't debug deeply yet. I'm not sure whether we drop only the low-scored part of the dictionary, or whether we can also drop high-scored patterns.

I used the following command:

go run ./cmd/erigon snapshots uncompress /erigon-data/snapshots/v1-060700-060800-transactions.seg | DictReducerSoftLimit=1000000 MinPatternLen=20 MaxPatternLen=128 SamplingFactor=1 MaxDictPatterns=262144 EffectiveDictLimit=65536 go run ./cmd/erigon snapshots compress --datadir=/erigon-data/erigon3/  /erigon-backup/v1-060700-060800-transactions.seg.compressed_s1_dict_8k_min20_max128_sof1_dictInt256k_new --pprof --pprof.port=6061 > ~/log3.txt 2>&1 &

@stevemilk
Contributor Author

MaxDictPatterns = MaxDictPatterns * 2
EffectiveDictLimit = MaxDictPatterns / 2

With the above config, init-dict will include more patterns than before, and some high-score patterns may be excluded during dict reduction: patterns are hit in random order, so a low-score pattern may occupy space in the effective dict before a high-score pattern is seen.

I will reconsider the solution.
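
The effect described above can be illustrated with a small sketch. It is an illustration under assumptions, not the compressor's actual code: `pattern`, `admitFirstCome`, and `admitByScore` are hypothetical names, and the scores are made up. It compares admitting patterns in the order they are hit against admitting the highest-scored ones.

```go
package main

import (
	"fmt"
	"sort"
)

// pattern is a hypothetical stand-in: a dictionary code plus its score.
type pattern struct {
	code  int
	score int
}

// admitFirstCome admits patterns in the order they are hit until `limit`
// distinct patterns are in use; later patterns are rejected regardless of score.
func admitFirstCome(hits []pattern, limit int) map[int]bool {
	used := make(map[int]bool)
	for _, p := range hits {
		if !used[p.code] && len(used) >= limit {
			continue
		}
		used[p.code] = true
	}
	return used
}

// admitByScore admits the `limit` highest-scored distinct patterns.
func admitByScore(hits []pattern, limit int) map[int]bool {
	seen := make(map[int]bool)
	var uniq []pattern
	for _, p := range hits {
		if !seen[p.code] {
			seen[p.code] = true
			uniq = append(uniq, p)
		}
	}
	sort.SliceStable(uniq, func(i, j int) bool { return uniq[i].score > uniq[j].score })
	if len(uniq) > limit {
		uniq = uniq[:limit]
	}
	used := make(map[int]bool)
	for _, p := range uniq {
		used[p.code] = true
	}
	return used
}

func main() {
	// Low-score patterns 1 and 2 happen to be hit first.
	hits := []pattern{{1, 5}, {2, 3}, {3, 90}, {4, 80}}
	fmt.Println("first come:", admitFirstCome(hits, 2))
	fmt.Println("by score:  ", admitByScore(hits, 2))
}
```

With a limit of 2, first-come admission keeps the low-score patterns 1 and 2 and rejects 3 and 4, while score-based admission keeps 3 and 4.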

@stevemilk stevemilk marked this pull request as draft September 30, 2024 02:04
@stevemilk
Contributor Author

The previous version incorrectly included some unused patterns.
I've updated it; however, the compression ratio will still be impacted to some extent.

Here's the reason:
In the current design, init-dict is reduced to reduced-dict after compression, retaining only the used patterns. For example, if len(init-dict) = 1024 and len(reduced-dict) = 512, with a compression ratio of 0.3, setting effectiveDictLimit to 512 keeps the ratio the same. However, if effectiveDictLimit is less than len(reduced-dict), some used patterns will be dropped, potentially increasing the compression ratio (that is, producing a larger output).
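
The arithmetic in this example can be sketched as follows. This is a simplified model under assumptions, not the PR's implementation: `reduceDict` is a hypothetical helper that only counts patterns, and the limit of 256 is an extra value chosen to show the dropping case.

```go
package main

import "fmt"

// reduceDict models the reduction step: patterns with a non-zero use count
// survive (init-dict -> reduced-dict), and a hard limit caps how many of
// those are kept. It returns how many patterns are kept and how many used
// patterns are dropped because of the limit.
func reduceDict(uses []int, limit int) (kept, dropped int) {
	reduced := 0
	for _, u := range uses {
		if u > 0 {
			reduced++
		}
	}
	if reduced <= limit {
		return reduced, 0
	}
	return limit, reduced - limit
}

func main() {
	// init-dict has 1024 patterns, of which 512 are actually used.
	uses := make([]int, 1024)
	for i := 0; i < 512; i++ {
		uses[i] = 1
	}
	for _, limit := range []int{512, 256} {
		kept, dropped := reduceDict(uses, limit)
		fmt.Printf("limit=%d kept=%d dropped=%d\n", limit, kept, dropped)
	}
}
```

With a limit of 512 nothing is dropped, so the ratio is unchanged; with a limit of 256, half of the used patterns are dropped, and the data they covered has to be stored uncompressed.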

I believe effectiveDictLimit serves as a hard limit to prevent the resulting dictionary from becoming too large, rather than a means to improve the compression ratio. I conducted some investigation but would appreciate your insights or suggestions for a better solution. @AskAlexSharov

@AskAlexSharov
Collaborator

Thanks. I will test it another day.
