dmu_objset: replace `dnode_hash` impl with `cityhash4` #16483

Harry-Chen · 2024-08-27T08:01:36Z

Motivation and Context

As mentioned in PR #16131, replacing CRC-based hash with cityhash4 could slightly improve the performance by eliminating memory access. Replacing the algorithm is safe since the hash result is not persisted.

Moreover, cityhash4 yields more instructions and larger stack frame on 32-bit arches. Replacing it with specialized versions (i.e. setting some arguments to 0) can also provide boost.

See: #16131

Description

replace CRC-based hash with cityhash4.
add specialized version cityhash2 and cityhash3.
replace invocations to cityhash4 with specialized versions whenever possible.

Please see commit message for more details.

How Has This Been Tested?

Compiled on my machine. Let's see the ZTS results on CI.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

rincebrain · 2024-08-27T20:43:23Z

I have no particular objection, I'm just wondering if you have a test case where you saw this showing up a lot in your performance bottlenecks?

Harry-Chen · 2024-08-28T04:38:34Z

@rincebrain I did not run a benchmark. But I can do a micro-benchmark if needed. And there is no obvious evidence (e.g. profiling result) indicating this hashing is a bottleneck either.

I am doing this just because @amotin mentioned it, and it did make the code cleaner. As for performance, the replacement would eliminate all memory access to the CRC constant table, as well as emitting fewer instructions. So intuitively I think it would provide some (maybe slight) boost, if observable.

rincebrain · 2024-08-30T03:33:38Z

This seems like a fascinating question to ask @allanjude with #16486 and #16487. (Mostly because I'm suspecting that with that work he's hitting workloads with enough scale to notice a difference.)

amotin

It looks much cleaner for me, but I slightly worry about number of 64bit multiplications in cityhash4() and their time on older CPUs. I wonder if we could create reduced versions of cityhash4() for smaller number of arguments, since I see few other use cases not using all 4.

Harry-Chen · 2024-09-05T16:51:00Z

It looks much cleaner for me, but I slightly worry about number of 64bit multiplications in cityhash4() and their time on older CPUs. I wonder if we could create reduced versions of cityhash4() for smaller number of arguments, since I see few other use cases not using all 4.

Thanks for your review. Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0? I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?

amotin · 2024-09-05T16:58:24Z

Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0?

Yes.

I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?

Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.

Harry-Chen · 2024-09-05T17:12:36Z

Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.

Just have a quick (and rough) try on compiler explorer. On x86_64 with gcc 14, specializing cityhash2(x, y) with cityhash4(x, y, 0, 0) reduces imul insn count from 6 to 4 (with total #insn from 28 to 21). When using -m32 it reduces 12 imul to 8 (#insn 102 -> 67). Additionally, the specialized version largely reduces stack memory access on 32-bit platforms. On other architectures (e.g. arm, riscv) situations are similar.

So I think we indeed need a specialized cityhash2, or even cityhash1. The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.

amotin · 2024-09-05T17:26:45Z

The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.

I don't think the code size should be very important. But effect on stack size of the calling function could be verified.

Harry-Chen · 2024-09-05T17:57:56Z

But effect on stack size of the calling function could be verified.

Per my understanding, inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined. Inlining only has (potential) negative effects on compilation speed, code size and i-cache.

amotin · 2024-09-06T15:08:37Z

inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined

It may be true in some cases due to compiler's better ability to optimize things. But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.

Harry-Chen · 2024-09-06T16:54:39Z

But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.

Thanks for explaining, I understand what you mean now. It seems not that easy to diagnosis stack usage in compiling kernel modules, especially due to the inability to add -fstack-usage into CFAGS. Decomping is also annoying since gcc has already performed a lot of inlining, so tracing down certain symbols is not straightforward.

I don't think the code size should be very important.

If so, I can just create and export specialized cityhash1 and cityhash2.

amotin · 2024-09-06T17:06:36Z

If so, I can just create and export specialized cityhash1 and cityhash2.

I would not object. And may be cityhash3 also.

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callsites invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 Same tendency applies to the count of instructions. Therefore 1-arg version is defined as a macro to the 2-arg one. On all 64-bit arches, the differences are negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callsites invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 Same tendency applies to the count of instructions. Therefore 1-arg version is defined as a macro to the 2-arg one. On all 64-bit arches, the differences are negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callsites invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 Same tendency applies to the count of instructions. Therefore 1-arg version is defined as a macro to the 2-arg one. On all 64-bit arches, the differences are negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

Harry-Chen · 2024-09-07T15:39:07Z

@amotin I modified the PR to include specialization versions of cityhash4. It passes build check and should not affect any ztest. Please review again when available.

include/cityhash.h

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

Harry-Chen · 2024-09-09T13:04:17Z

@amotin After re-evaluating, I decided to add all 1/2/3-arg versions of cityhash, since reducing each argument could save some more 64b multiplications and should be worthwhile.

Also @behlendorf would you please review and consider merging?

amotin · 2024-09-09T15:23:30Z

module/zfs/dmu_objset.c

-	crc ^= (osv>>14) ^ (obj>>24);
-
-	return (crc);
+	return (cityhash2((uint64_t)os, obj));


Suggested change

return (cityhash2((uint64_t)os, obj));

return (cityhash2((uintptr_t)os, obj));

This is what done at dbuf_hash(). I worry compiler may not like conversion from 32bit pointer to 64bit integer.

Even if we keep it uintptr_t, it would be cast to uint64_t when passing to cityhash2. And compilers will not issue any warning on this cast even with -Wall -Wextra -pedantic.

There are architectures with 128bit pointers. And I wonder if this code may produce there "cast from pointer to integer of different size". Though I won't argue more about this, not my strongest area. Leave to somebody else. Just think coherency with dbuf_hash() would be good.

Thanks for pointing out the 128b pointer case. I have changed it back and left two explicit casts in the code.

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

mcmilk · 2024-09-16T16:51:41Z

I like the idea, but without benchmarks I would not say yes here.
Maybe you can write some small test utility for testing.

I will test compile this then on PPC64 and AARCH64 also and publish the results.
If this PR has some time... I can maybe write some utility myself. But I don't know when I get to do this.

Harry-Chen · 2024-09-16T18:03:01Z

@mcmilk I wrote a really small & simple micro-benchmark here: https://gist.github.com/Harry-Chen/0d1eee1a4d8a7685b50c31316ca75671.

And here's some rough test results (order: crc64, cityhash2, cityhash4, unit: us) on some of my devices:

x86_64 (i9-14900K): 0.668, 0.623, 0.623
x86_64 (compiler explorer): 0.934, 0.544, 0.542
i686 (i9-14900K with -m32): 3.768, 0.812, 0.818 (crc64 is weirdly slow)
i686 (compiler explorer with -m32): 1.740, 1.640, 1.639
aarch64 (Kunpeng 920): 1.528, 0.946, 0.946
ppc64el (POWER8NVL): 1.489, 1.406, 1.406

To me it is safe to conclude cityhash does make hashing faster (if observable). Unfortunately I do not have more platforms (e.g. armhf, mipsel, riscv64) to test, on which specializing cityhash with <4 arguments might help further.

gmelikov · 2024-09-16T19:15:40Z

Just for fun ran it on x86_64 amd ryzen 7840u,
best run: crc64 0.658us, city2 0.383us, city4 0.383us

gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.986us, confidence interval +- 1.555265%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.406us, confidence interval +- 0.142701%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.141682%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.827us, confidence interval +- 2.392147%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.407us, confidence interval +- 0.312449%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.316596%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.658us, confidence interval +- 0.198083%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.383us, confidence interval +- 0.331032%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.383us, confidence interval +- 0.315814%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

amotin · 2024-09-16T19:27:04Z

I wonder why is the difference between city2 and city4 is so insignificant. It would be good to verify that inlining optimization does what is expected, now that all the functions are in the same file and compiler has ability to inline more. May be some noinline parameters could be fair.

amotin · 2024-09-16T19:47:44Z

Yea. At least building with Clang on FreeBSD dnode_hash_city2() and dnode_hash_city4() appeared identical until addition of __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.071us, confidence interval +- 1.403153%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.054205%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.715us, confidence interval +- 1.291311%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.001us, confidence interval +- 0.045569%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.611us, confidence interval +- 0.055383%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 0.936527%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.000us, confidence interval +- 0.047267%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.056379%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 1.383868%)
[==========] 3 benchmarks ran.

The difference may be not huge, but reproducible. And on 32bits is predictably even bigger:

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.134us, confidence interval +- 0.154648%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.319us, confidence interval +- 1.332820%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.536us, confidence interval +- 0.477121%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.129us, confidence interval +- 0.057105%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.241us, confidence interval +- 0.643688%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.532us, confidence interval +- 0.456539%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.136us, confidence interval +- 0.815983%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.281us, confidence interval +- 1.198090%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.564us, confidence interval +- 0.764337%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

mcmilk · 2024-09-16T20:40:28Z

Thank you for the benchmark 👍

On my local Ryzen 7 PRO 5850U, fixed to 1,4GHz it looks like this:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.028878%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.026899%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.024906%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.029074%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.029488%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.032090%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

On ARM Neoverse-N1 it looks like this:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.409us, confidence interval +- 2.150333%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.079us, confidence interval +- 0.106291%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.074us, confidence interval +- 0.153203%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[mcmilk@lea ~/src/hash]$ ./a.out 
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.397us, confidence interval +- 0.093639%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.077us, confidence interval +- 0.118394%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.072us, confidence interval +- 0.164758%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

On PPC64LE (POWER10, pvr 0080 0200) - these is no big difference:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092565%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100753%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.099303%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092653%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100670%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.102227%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

So yes, it looks okay.

rincebrain · 2024-09-16T21:24:23Z

Just for fun:

$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 6.005us, confidence interval +- 2.068187%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 4.971us, confidence interval +- 0.022014%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 5.015us, confidence interval +- 2.424526%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
$ grep Model /proc/cpuinfo
Model		: Raspberry Pi 4 Model B Rev 1.4

Harry-Chen · 2024-09-17T03:33:32Z

Yea. At least building with Clang on FreeBSD dnode_hash_city2() and dnode_hash_city4() appeared identical until addition of __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):

Thanks! I did not read the emitted assembly of different tests very carefully. So yes, specializing cityhash does some good stuff.

behlendorf · 2024-09-18T18:51:49Z

Good stuff, thanks for providing the micro benchmark for this. If you can rebase this on master to resolve the conflict it should be good to go.

As mentioned in PR openzfs#16131, replacing CRC-based hash with cityhash4 could slightly improve the performance by eliminating memory access. Replacing algorightm is safe since the hash result is not persisted. See: openzfs#16131 Signed-off-by: Shengqi Chen <[email protected]>

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

Harry-Chen · 2024-09-19T04:31:17Z

I've rebased. Seems now CI takes much longer time...

Specializing cityhash4 on 32-bit architectures can reduce the size of stack frames as well as instruction count. This is a tiny but useful optimization, since some callers invoke it frequently. When specializing into 1/2/3/4-arg versions, the stack usage (in bytes) on some 32-bit arches are listed as follows: - x86: 32, 32, 32, 40 - arm-v7a: 20, 20, 28, 36 - riscv: 0, 0, 0, 16 - power: 16, 16, 16, 32 - mipsel: 8, 8, 8, 24 And each actual argument (even if passing 0) contributes evenly to the number of multiplication instructions generated: - x86: 9, 12, 15 ,18 - arm-v7a: 6, 8, 10, 12 - riscv / power: 12, 18, 20, 24 - mipsel: 9, 12, 15, 19 On 64-bit architectures, the tendencies are similar. But both stack sizes and instruction counts are significantly smaller thus negligible. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #16131 Closes #16483

So that we can get actual benefit from last commit. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #16131 Closes #16483

Harry-Chen force-pushed the dnode_cityhash branch 2 times, most recently from a74282e to 45b276f Compare August 27, 2024 08:17

behlendorf requested a review from amotin August 27, 2024 19:48

behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 27, 2024

amotin approved these changes Sep 5, 2024

View reviewed changes

Harry-Chen force-pushed the dnode_cityhash branch from 45b276f to 82abe67 Compare September 7, 2024 14:30

Harry-Chen force-pushed the dnode_cityhash branch from 82abe67 to c9072e2 Compare September 7, 2024 14:42

Harry-Chen force-pushed the dnode_cityhash branch from c9072e2 to e0f7ed0 Compare September 7, 2024 15:00

amotin approved these changes Sep 8, 2024

View reviewed changes

include/cityhash.h Outdated Show resolved Hide resolved

include/cityhash.h Outdated Show resolved Hide resolved

Harry-Chen force-pushed the dnode_cityhash branch from e0f7ed0 to 2f96aa2 Compare September 9, 2024 12:53

amotin reviewed Sep 9, 2024

View reviewed changes

Harry-Chen force-pushed the dnode_cityhash branch from 2f96aa2 to af3228c Compare September 10, 2024 17:50

mcmilk approved these changes Sep 16, 2024

View reviewed changes

behlendorf approved these changes Sep 18, 2024

View reviewed changes

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 18, 2024

Harry-Chen added 3 commits September 19, 2024 10:52

cityhash: replace invocations with specialized versions when possible

e958069

So that we can get actual benefit from last commit. See more discussion at openzfs#16483. Acked-by: Alexander Motin <[email protected]> Signed-off-by: Shengqi Chen <[email protected]>

Harry-Chen force-pushed the dnode_cityhash branch from af3228c to e958069 Compare September 19, 2024 02:55

behlendorf closed this in 1c35206 Sep 19, 2024

Harry-Chen deleted the dnode_cityhash branch September 24, 2024 07:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dmu_objset: replace `dnode_hash` impl with `cityhash4` #16483

dmu_objset: replace `dnode_hash` impl with `cityhash4` #16483

Harry-Chen commented Aug 27, 2024 •

edited

Loading

rincebrain commented Aug 27, 2024

Harry-Chen commented Aug 28, 2024

rincebrain commented Aug 30, 2024 •

edited

Loading

amotin left a comment •

edited

Loading

Harry-Chen commented Sep 5, 2024

amotin commented Sep 5, 2024

Harry-Chen commented Sep 5, 2024

amotin commented Sep 5, 2024

Harry-Chen commented Sep 5, 2024 •

edited

Loading

amotin commented Sep 6, 2024

Harry-Chen commented Sep 6, 2024

amotin commented Sep 6, 2024

Harry-Chen commented Sep 7, 2024

Harry-Chen commented Sep 9, 2024

amotin Sep 9, 2024

Harry-Chen Sep 10, 2024

amotin Sep 10, 2024

Harry-Chen Sep 10, 2024

mcmilk commented Sep 16, 2024

Harry-Chen commented Sep 16, 2024 •

edited

Loading

gmelikov commented Sep 16, 2024 •

edited

Loading

amotin commented Sep 16, 2024

amotin commented Sep 16, 2024 •

edited

Loading

mcmilk commented Sep 16, 2024

rincebrain commented Sep 16, 2024

Harry-Chen commented Sep 17, 2024

behlendorf commented Sep 18, 2024

Harry-Chen commented Sep 19, 2024

	return (cityhash2((uint64_t)os, obj));
	return (cityhash2((uintptr_t)os, obj));

dmu_objset: replace dnode_hash impl with cityhash4 #16483

dmu_objset: replace dnode_hash impl with cityhash4 #16483

Conversation

Harry-Chen commented Aug 27, 2024 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

rincebrain commented Aug 27, 2024

Harry-Chen commented Aug 28, 2024

rincebrain commented Aug 30, 2024 • edited Loading

amotin left a comment • edited Loading

Choose a reason for hiding this comment

Harry-Chen commented Sep 5, 2024

amotin commented Sep 5, 2024

Harry-Chen commented Sep 5, 2024

amotin commented Sep 5, 2024

Harry-Chen commented Sep 5, 2024 • edited Loading

amotin commented Sep 6, 2024

Harry-Chen commented Sep 6, 2024

amotin commented Sep 6, 2024

Harry-Chen commented Sep 7, 2024

Harry-Chen commented Sep 9, 2024

amotin Sep 9, 2024

Choose a reason for hiding this comment

Harry-Chen Sep 10, 2024

Choose a reason for hiding this comment

amotin Sep 10, 2024

Choose a reason for hiding this comment

Harry-Chen Sep 10, 2024

Choose a reason for hiding this comment

mcmilk commented Sep 16, 2024

Harry-Chen commented Sep 16, 2024 • edited Loading

gmelikov commented Sep 16, 2024 • edited Loading

amotin commented Sep 16, 2024

amotin commented Sep 16, 2024 • edited Loading

mcmilk commented Sep 16, 2024

rincebrain commented Sep 16, 2024

Harry-Chen commented Sep 17, 2024

behlendorf commented Sep 18, 2024

Harry-Chen commented Sep 19, 2024

dmu_objset: replace `dnode_hash` impl with `cityhash4` #16483

dmu_objset: replace `dnode_hash` impl with `cityhash4` #16483

Harry-Chen commented Aug 27, 2024 •

edited

Loading

rincebrain commented Aug 30, 2024 •

edited

Loading

amotin left a comment •

edited

Loading

Harry-Chen commented Sep 5, 2024 •

edited

Loading

Harry-Chen commented Sep 16, 2024 •

edited

Loading

gmelikov commented Sep 16, 2024 •

edited

Loading

amotin commented Sep 16, 2024 •

edited

Loading