Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dmu_objset: replace dnode_hash impl with cityhash4 #16483

Closed
wants to merge 3 commits into from

Conversation

Harry-Chen
Copy link
Contributor

@Harry-Chen Harry-Chen commented Aug 27, 2024

Motivation and Context

As mentioned in PR #16131, replacing CRC-based hash with cityhash4 could slightly improve the performance by eliminating memory access. Replacing the algorithm is safe since the hash result is not persisted.

Moreover, cityhash4 yields more instructions and larger stack frame on 32-bit arches. Replacing it with specialized versions (i.e. setting some arguments to 0) can also provide boost.

See: #16131

Description

  1. replace CRC-based hash with cityhash4.
  2. add specialized version cityhash2 and cityhash3.
  3. replace invocations to cityhash4 with specialized versions whenever possible.

Please see commit message for more details.

How Has This Been Tested?

Compiled on my machine. Let's see the ZTS results on CI.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@Harry-Chen Harry-Chen force-pushed the dnode_cityhash branch 2 times, most recently from a74282e to 45b276f Compare August 27, 2024 08:17
@behlendorf behlendorf requested a review from amotin August 27, 2024 19:48
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 27, 2024
@rincebrain
Copy link
Contributor

I have no particular objection, I'm just wondering if you have a test case where you saw this showing up a lot in your performance bottlenecks?

@Harry-Chen
Copy link
Contributor Author

@rincebrain I did not run a benchmark. But I can do a micro-benchmark if needed. And there is no obvious evidence (e.g. profiling result) indicating this hashing is a bottleneck either.

I am doing this just because @amotin mentioned it, and it did make the code cleaner. As for performance, the replacement would eliminate all memory access to the CRC constant table, as well as emitting fewer instructions. So intuitively I think it would provide some (maybe slight) boost, if observable.

@rincebrain
Copy link
Contributor

rincebrain commented Aug 30, 2024

This seems like a fascinating question to ask @allanjude with #16486 and #16487. (Mostly because I'm suspecting that with that work he's hitting workloads with enough scale to notice a difference.)

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much cleaner for me, but I slightly worry about number of 64bit multiplications in cityhash4() and their time on older CPUs. I wonder if we could create reduced versions of cityhash4() for smaller number of arguments, since I see few other use cases not using all 4.

@Harry-Chen
Copy link
Contributor Author

It looks much cleaner for me, but I slightly worry about number of 64bit multiplications in cityhash4() and their time on older CPUs. I wonder if we could create reduced versions of cityhash4() for smaller number of arguments, since I see few other use cases not using all 4.

Thanks for your review. Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0? I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?

@amotin
Copy link
Member

amotin commented Sep 5, 2024

Do you mean performing some manual "specialization" on cityhash4 by replacing some variables with 0?

Yes.

I think if we can make cityhash4 a fully inline function, maybe the compiler is smart enough to do this for us?

Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.

@Harry-Chen
Copy link
Contributor Author

Yes, I think modern compilers could be clever enough for that. The only question is whether I want it to be inlined everywhere. Though it might be not that bad.

Just have a quick (and rough) try on compiler explorer. On x86_64 with gcc 14, specializing cityhash2(x, y) with cityhash4(x, y, 0, 0) reduces imul insn count from 6 to 4 (with total #insn from 28 to 21). When using -m32 it reduces 12 imul to 8 (#insn 102 -> 67). Additionally, the specialized version largely reduces stack memory access on 32-bit platforms. On other architectures (e.g. arm, riscv) situations are similar.

So I think we indeed need a specialized cityhash2, or even cityhash1. The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.

@amotin
Copy link
Member

amotin commented Sep 5, 2024

The question is now whether we should inline the hash function everywhere (code size will slightly grow, but the compiler can always do its best), or provide and export different symbol names.

I don't think the code size should be very important. But effect on stack size of the calling function could be verified.

@Harry-Chen
Copy link
Contributor Author

Harry-Chen commented Sep 5, 2024

But effect on stack size of the calling function could be verified.

Per my understanding, inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined. Inlining only has (potential) negative effects on compilation speed, code size and i-cache.

@amotin
Copy link
Member

amotin commented Sep 6, 2024

inlining calls to a function will always lead to less (or at least equal) stack usage than caller + callee combined

It may be true in some cases due to compiler's better ability to optimize things. But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.

@Harry-Chen
Copy link
Contributor Author

But it leaks on-stack variables of the child function, that would exist only during its execution, into the parent, where they may increase stack usage for other children the parent calls.

Thanks for explaining, I understand what you mean now. It seems not that easy to diagnosis stack usage in compiling kernel modules, especially due to the inability to add -fstack-usage into CFAGS. Decomping is also annoying since gcc has already performed a lot of inlining, so tracing down certain symbols is not straightforward.

I don't think the code size should be very important.

If so, I can just create and export specialized cityhash1 and cityhash2.

@amotin
Copy link
Member

amotin commented Sep 6, 2024

If so, I can just create and export specialized cityhash1 and cityhash2.

I would not object. And may be cityhash3 also.

Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 7, 2024
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callsites invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

Same tendency applies to the count of instructions.
Therefore 1-arg version is defined as a macro to the 2-arg one.

On all 64-bit arches, the differences are negligible.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 7, 2024
So that we can get actual benefit from last commit.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 7, 2024
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callsites invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

Same tendency applies to the count of instructions.
Therefore 1-arg version is defined as a macro to the 2-arg one.

On all 64-bit arches, the differences are negligible.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 7, 2024
So that we can get actual benefit from last commit.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 7, 2024
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callsites invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

Same tendency applies to the count of instructions.
Therefore 1-arg version is defined as a macro to the 2-arg one.

On all 64-bit arches, the differences are negligible.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 7, 2024
So that we can get actual benefit from last commit.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
@Harry-Chen
Copy link
Contributor Author

@amotin I modified the PR to include specialization versions of cityhash4. It passes build check and should not affect any ztest. Please review again when available.

include/cityhash.h Outdated Show resolved Hide resolved
include/cityhash.h Outdated Show resolved Hide resolved
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 9, 2024
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callers invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

And each actual argument (even if passing 0) contributes evenly
to the number of multiplication instructions generated:

- x86: 9, 12, 15 ,18
- arm-v7a: 6, 8, 10, 12
- riscv / power: 12, 18, 20, 24
- mipsel: 9, 12, 15, 19

On 64-bit architectures, the tendencies are similar. But both stack
sizes and instruction counts are significantly smaller thus negligible.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 9, 2024
So that we can get actual benefit from last commit.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
@Harry-Chen
Copy link
Contributor Author

@amotin After re-evaluating, I decided to add all 1/2/3-arg versions of cityhash, since reducing each argument could save some more 64b multiplications and should be worthwhile.

Also @behlendorf would you please review and consider merging?

crc ^= (osv>>14) ^ (obj>>24);

return (crc);
return (cityhash2((uint64_t)os, obj));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (cityhash2((uint64_t)os, obj));
return (cityhash2((uintptr_t)os, obj));

This is what done at dbuf_hash(). I worry compiler may not like conversion from 32bit pointer to 64bit integer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we keep it uintptr_t, it would be cast to uint64_t when passing to cityhash2. And compilers will not issue any warning on this cast even with -Wall -Wextra -pedantic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are architectures with 128bit pointers. And I wonder if this code may produce there "cast from pointer to integer of different size". Though I won't argue more about this, not my strongest area. Leave to somebody else. Just think coherency with dbuf_hash() would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out the 128b pointer case. I have changed it back and left two explicit casts in the code.

Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 10, 2024
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callers invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

And each actual argument (even if passing 0) contributes evenly
to the number of multiplication instructions generated:

- x86: 9, 12, 15 ,18
- arm-v7a: 6, 8, 10, 12
- riscv / power: 12, 18, 20, 24
- mipsel: 9, 12, 15, 19

On 64-bit architectures, the tendencies are similar. But both stack
sizes and instruction counts are significantly smaller thus negligible.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Harry-Chen added a commit to Harry-Chen/zfs that referenced this pull request Sep 10, 2024
So that we can get actual benefit from last commit.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
@mcmilk
Copy link
Contributor

mcmilk commented Sep 16, 2024

I like the idea, but without benchmarks I would not say yes here.
Maybe you can write some small test utility for testing.

I will test compile this then on PPC64 and AARCH64 also and publish the results.
If this PR has some time... I can maybe write some utility myself. But I don't know when I get to do this.

@Harry-Chen
Copy link
Contributor Author

Harry-Chen commented Sep 16, 2024

@mcmilk I wrote a really small & simple micro-benchmark here: https://gist.github.com/Harry-Chen/0d1eee1a4d8a7685b50c31316ca75671.

And here's some rough test results (order: crc64, cityhash2, cityhash4, unit: us) on some of my devices:

  • x86_64 (i9-14900K): 0.668, 0.623, 0.623
  • x86_64 (compiler explorer): 0.934, 0.544, 0.542
  • i686 (i9-14900K with -m32): 3.768, 0.812, 0.818 (crc64 is weirdly slow)
  • i686 (compiler explorer with -m32): 1.740, 1.640, 1.639
  • aarch64 (Kunpeng 920): 1.528, 0.946, 0.946
  • ppc64el (POWER8NVL): 1.489, 1.406, 1.406

To me it is safe to conclude cityhash does make hashing faster (if observable). Unfortunately I do not have more platforms (e.g. armhf, mipsel, riscv64) to test, on which specializing cityhash with <4 arguments might help further.

@gmelikov
Copy link
Member

gmelikov commented Sep 16, 2024

Just for fun ran it on x86_64 amd ryzen 7840u,
best run: crc64 0.658us, city2 0.383us, city4 0.383us

gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.986us, confidence interval +- 1.555265%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.406us, confidence interval +- 0.142701%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.141682%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.827us, confidence interval +- 2.392147%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.407us, confidence interval +- 0.312449%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.406us, confidence interval +- 0.316596%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
gmelikov@minime:/tmp/test$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.658us, confidence interval +- 0.198083%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.383us, confidence interval +- 0.331032%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.383us, confidence interval +- 0.315814%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

@amotin
Copy link
Member

amotin commented Sep 16, 2024

I wonder why is the difference between city2 and city4 is so insignificant. It would be good to verify that inlining optimization does what is expected, now that all the functions are in the same file and compiler has ability to inline more. May be some noinline parameters could be fair.

@amotin
Copy link
Member

amotin commented Sep 16, 2024

Yea. At least building with Clang on FreeBSD dnode_hash_city2() and dnode_hash_city4() appeared identical until addition of __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.071us, confidence interval +- 1.403153%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.054205%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.715us, confidence interval +- 1.291311%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.001us, confidence interval +- 0.045569%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.611us, confidence interval +- 0.055383%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 0.936527%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.000us, confidence interval +- 0.047267%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.610us, confidence interval +- 0.056379%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.645us, confidence interval +- 1.383868%)
[==========] 3 benchmarks ran.

The difference may be not huge, but reproducible. And on 32bits is predictably even bigger:

mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.134us, confidence interval +- 0.154648%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.319us, confidence interval +- 1.332820%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.536us, confidence interval +- 0.477121%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.129us, confidence interval +- 0.057105%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.241us, confidence interval +- 0.643688%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.532us, confidence interval +- 0.456539%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
mav@srv:~/10% ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.136us, confidence interval +- 0.815983%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.281us, confidence interval +- 1.198090%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.564us, confidence interval +- 0.764337%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

@mcmilk
Copy link
Contributor

mcmilk commented Sep 16, 2024

Thank you for the benchmark 👍

On my local Ryzen 7 PRO 5850U, fixed to 1,4GHz it looks like this:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.028878%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.026899%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.024906%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 2.163us, confidence interval +- 0.029074%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.221us, confidence interval +- 0.029488%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.221us, confidence interval +- 0.032090%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

On ARM Neoverse-N1 it looks like this:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.409us, confidence interval +- 2.150333%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.079us, confidence interval +- 0.106291%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.074us, confidence interval +- 0.153203%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[mcmilk@lea ~/src/hash]$ ./a.out 
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 1.397us, confidence interval +- 0.093639%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 1.077us, confidence interval +- 0.118394%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 1.072us, confidence interval +- 0.164758%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

On PPC64LE (POWER10, pvr 0080 0200) - these is no big difference:

[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092565%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100753%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.099303%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 0.979us, confidence interval +- 0.092653%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 0.924us, confidence interval +- 0.100670%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 0.924us, confidence interval +- 0.102227%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

So yes, it looks okay.

@rincebrain
Copy link
Contributor

Just for fun:

$ ./hash_bench
[==========] Running 3 benchmarks.
[ RUN      ] dnode_hash.crc64
[       OK ] dnode_hash.crc64 (mean 6.005us, confidence interval +- 2.068187%)
[ RUN      ] dnode_hash.city2
[       OK ] dnode_hash.city2 (mean 4.971us, confidence interval +- 0.022014%)
[ RUN      ] dnode_hash.city4
[       OK ] dnode_hash.city4 (mean 5.015us, confidence interval +- 2.424526%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.
$ grep Model /proc/cpuinfo
Model		: Raspberry Pi 4 Model B Rev 1.4

@Harry-Chen
Copy link
Contributor Author

Yea. At least building with Clang on FreeBSD dnode_hash_city2() and dnode_hash_city4() appeared identical until addition of __noinline to cityhash*() fixed the optimization (AMD EPYC 7402):

Thanks! I did not read the emitted assembly of different tests very carefully. So yes, specializing cityhash does some good stuff.

@behlendorf
Copy link
Contributor

Good stuff, thanks for providing the micro benchmark for this. If you can rebase this on master to resolve the conflict it should be good to go.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 18, 2024
As mentioned in PR openzfs#16131, replacing CRC-based hash with cityhash4
could slightly improve the performance by eliminating memory access.
Replacing algorightm is safe since the hash result is not persisted.

See: openzfs#16131

Signed-off-by: Shengqi Chen <[email protected]>
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callers invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

And each actual argument (even if passing 0) contributes evenly
to the number of multiplication instructions generated:

- x86: 9, 12, 15 ,18
- arm-v7a: 6, 8, 10, 12
- riscv / power: 12, 18, 20, 24
- mipsel: 9, 12, 15, 19

On 64-bit architectures, the tendencies are similar. But both stack
sizes and instruction counts are significantly smaller thus negligible.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
So that we can get actual benefit from last commit.

See more discussion at openzfs#16483.

Acked-by: Alexander Motin <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
@Harry-Chen
Copy link
Contributor Author

I've rebased. Seems now CI takes much longer time...

behlendorf pushed a commit that referenced this pull request Sep 19, 2024
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callers invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

And each actual argument (even if passing 0) contributes evenly
to the number of multiplication instructions generated:

- x86: 9, 12, 15 ,18
- arm-v7a: 6, 8, 10, 12
- riscv / power: 12, 18, 20, 24
- mipsel: 9, 12, 15, 19

On 64-bit architectures, the tendencies are similar. But both stack
sizes and instruction counts are significantly smaller thus negligible.

Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Closes #16131
Closes #16483
behlendorf pushed a commit that referenced this pull request Sep 19, 2024
So that we can get actual benefit from last commit.

Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Closes #16131
Closes #16483
@Harry-Chen Harry-Chen deleted the dnode_cityhash branch September 24, 2024 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants