[BUG] performance test is failing following timestamp sync change sof#8480 #1153

btian1 · 2024-01-16T02:19:30Z

Describe the bug
A clear and concise description of what the bug is.
CI build performance test is failing

To Reproduce
use local mtrace and dmesg file can reproduce as well

Reproduction Rate
100%.

Expected behavior
A clear and concise description of what you expected to happen.

Impact
CI performance build test should pass.

Environment

Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: since ww02.2
- SOF: since ww02.2
Name of the topology file
- Topology: default
Name of the platform(s) on which the bug is observed.
- Platform: mtl

btian1 · 2024-01-16T03:04:27Z

passed: https://sof-ci.ostc.intel.com/#/result/planresultdetail/36559
failed: https://sof-ci.ostc.intel.com/#/result/planresultdetail/36620

passed mtrace logging:

failed mtrace logging:

Current performance testing are requesting skip trace until find trace timestamp between [0, 1] as the start
of valid mtrace performance logging, due to current sof-test test case may collect some traces belonging to previous
test case due to mtrace is configured in deferred mode.

So once Rander aligned kernel and DSP timestamp, DSP timestamp only restart(from 0) on DSP power off/on, this means
only the first test case(or test case require DSP restart) have timestamp with [0,1].

with current option, until mtrace logging end, there may no timestamp between [0,1], so raise this exception and cause performance test failed.

Current solution is remove this option and python script will directly try to find logs include "comp_copy" as valid module performance logging, then collect and calculate its performance data.

This is also working with previous mtrace logs, no other negative impact observed so far.

btian1 · 2024-01-16T03:06:42Z

PR submitted for review and discuss:

lib.sh: perf_analyze(): drop --skip-to-first-trace #1150

marc-hb · 2024-01-16T17:35:04Z

Thanks for collecting logs. Next time please avoid screenshots: https://docs.zephyrproject.org/latest/develop/getting_started/index.html#use-copy-paste

marc-hb · 2024-01-16T20:02:35Z

Thanks for collecting these logs and describing the change, this is great progress.

So once Rander aligned kernel and DSP timestamp, DSP timestamp only restart(from 0) on DSP power off/on
this means only the first test case(or test case require DSP restart) have timestamp with [0,1].

The initial timestamp sync message (0x44000000|0x31400008) is sent only every time the DSP boots. This sync message can be seen in all the logs you shared. So we can see that the DSP boots in all the logs you shared. Yet the timestamp does not always restarts from zero. So no: "DSP boot" != "timestamp from 0"; it's more complicated.

I investigated this a bit and I think I found the key fact and key difference that you have been constantly missing so far: booting from IMR.

Current solution is remove this option and python script will directly try to find logs include "comp_copy" as valid module performance logging, then collect and calculate its performance data.

Very interesting, so you're saying that the --skip-trace option was ALREADY useless, even BEFORE Rander's timestamp change? This would be another, very important fact. @aiChaoSONG was --skip-trace always useless from the beginning? If not, what other change made --skip-trace already useless before Rander's timestamp change?

aiChaoSONG · 2024-01-17T02:28:14Z

@aiChaoSONG was --skip-trace always useless from the beginning? If not, what other change made --skip-trace already useless before Rander's timestamp change?

The several lines of trace at the beginning in a test belongs to the previous test, that's why we(or maybe I) want to skip them. Before Rander's patch, there's a clear division you can tell start from where, the traces belong to current test, but after Rander's patch, you can't.
In CI, we set --skip-to-first-trace, so all traces are skipped after Rander's patch, that's why the test case failed. My proposal is to not use the --skip-to-first-trace option here

sof-test/case-lib/lib.sh

Line 969 in bc3ced5

    
           perf_cmd="sof_perf_analyzer.py --skip-to-first-trace --kmsg=$LOG_ROOT/dmesg.txt --out2html $LOG_ROOT/sof_perf.html $LOG_ROOT/mtrace.txt"

, so we have some trace to analyze, though, several lines belong to previous test, but it doesn't harm.

marc-hb · 2024-01-17T02:41:06Z

Thanks @aiChaoSONG , this is useful. So you added the --skip-to-first-trace just for trying to be very conservative / risk-averse but you never saw this option actually pollute test results? Or barely? Or only in people's local workspace but not in CI? Thanks for sharing.

I had a look at these but I couldn't tell:

but after Rander's patch, you can't.

I think I finally found why. Too bad I had to analyze this myself.

marc-hb · 2024-01-17T02:51:17Z

but after Rander's patch, you can't.

Actually, we could/should probably use something like this

ASoC: SOF: debug: Introduce debug option to test DSP ops linux#4726

This would help not just with the logs, it would also ensure the FW itself is measured in a "clean" state.

I mean in the future, we obviously need a shorter-term solution now.

aiChaoSONG · 2024-01-17T03:04:29Z

@marc-hb If test case is run by developer manually, before Rander's patch, timestamp is always start from 0, so with/without this option won't make any difference. But after Rander's patch, it will fail because the timestamp may not start from 0, and due to the mechanism we use to skip trace, no trace will left for analysis in the end. So removing the use of --skip-to-first-trace is a simple way to make the test case functioning, you see, we don't have good solution to remove traces from last test after Rander's change

marc-hb · 2024-01-17T03:10:55Z

we don't have good solution to remove traces from last test after Rander's change

That's the impression I got, thanks for confirming. But the main question is still: how much pollution will this typically cause? If too much / too often, is there some trick we could use to reduce it?

Another option could be to disable IMR boot when testing performance, I think it shows the zero timestamp again for a few lines. Any idea how hard disabling IMR would that be?

aiChaoSONG · 2024-01-17T05:29:55Z

how much pollution will this typically cause?

Not that much. Normally the residue traces are related to trigger stop IPC, dma put and component reset (as you can see). There could be 10 ~ 20 lines, but they won't affect performance measurement, because we focus on trace lines that are within the component log context and printed by comp_copy function in performance measurement. So, even with the traces from previous test, it is all good.

Any idea how hard disabling IMR would that be?

We have a kernel option sof_debug, one of its bit is used to control boot from IMR or not. But we have to set it only for performance test in deployment, which is more complex.

I think remove the use of --skip-to-first-trace is still the fastest and simplest way to workaround the current dilemma with barely negative impact.

The --skip-to-first-trace option was added as a precaution to exclude log pollution from before the start of a test but how much pollution it excludes was never assessed; we just don't know. Hopefully not much? Replace, see below. mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"), mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots, not on when booting from IMR. Booting from IMR has now been enabled by default in most configurations. That combination breaks --skip-to-first-trace. The only practical short-term solution is to turn off --skip-to-first-trace and take the risk of old logs polluting the test results, which is what this commit does. If results pollution proves to be a problem, we could in the longer term try to disable IMR boot just before running this test using sof_debug as discussed in thesofproject#1153 This would also measure from a possibly cleaner DSP state. New, DSP "debug ops" being proposed in thesofproject/linux#4726 might also help. Fixes: thesofproject#1153 Signed-off-by: Baofeng Tian <[email protected]> Signed-off-by: Chao Song <[email protected]>

The --skip-to-first-trace option was added as a precaution to exclude log pollution from before the start of a test, normally the residue traces are related to trigger stop IPC, dma put and component reset (as you can see). There could be 10 ~ 20 lines. mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"), mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots, not on when booting from IMR. Booting from IMR has now been enabled by default in most configurations. That combination breaks --skip-to-first-trace. The only practical short-term solution is to turn off --skip-to-first-trace and take the risk of old logs polluting the test results, which is what this commit does. If results pollution proves to be a problem, we could in the longer term try to disable IMR boot just before running this test using sof_debug as discussed in thesofproject#1153 This would also measure from a possibly cleaner DSP state. New, DSP "debug ops" being proposed in thesofproject/linux#4726 might also help. Fixes: thesofproject#1153 Signed-off-by: Baofeng Tian <[email protected]> Signed-off-by: Chao Song <[email protected]>

marc-hb · 2024-01-18T06:07:50Z

https://github.com/orgs/community/discussions/66741

Let's close when the test is passing.

marc-hb · 2024-04-19T20:16:37Z

test has been passing for a while https://sof-ci.ostc.intel.com/#/result/planresultdetail/40027

btian1 self-assigned this Jan 16, 2024

marc-hb mentioned this issue Jan 16, 2024

lib.sh: perf_analyze(): drop --skip-to-first-trace #1150

Merged

marc-hb transferred this issue from thesofproject/sof Jan 16, 2024

marc-hb changed the title ~~[BUG] performance test is failing~~ [BUG] performance test is failing following timestamp sync change sof#8480 Jan 16, 2024

marc-hb closed this as completed in 14c049b Jan 18, 2024

marc-hb reopened this Jan 18, 2024

marc-hb closed this as completed Apr 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] performance test is failing following timestamp sync change sof#8480 #1153

[BUG] performance test is failing following timestamp sync change sof#8480 #1153

btian1 commented Jan 16, 2024 •

edited by marc-hb

Loading

btian1 commented Jan 16, 2024 •

edited

Loading

btian1 commented Jan 16, 2024 •

edited by marc-hb

Loading

marc-hb commented Jan 16, 2024 •

edited

Loading

marc-hb commented Jan 16, 2024

aiChaoSONG commented Jan 17, 2024 •

edited

Loading

marc-hb commented Jan 17, 2024 •

edited

Loading

marc-hb commented Jan 17, 2024 •

edited

Loading

aiChaoSONG commented Jan 17, 2024

marc-hb commented Jan 17, 2024

aiChaoSONG commented Jan 17, 2024 •

edited

Loading

marc-hb commented Jan 18, 2024 •

edited

Loading

marc-hb commented Apr 19, 2024

[BUG] performance test is failing following timestamp sync change sof#8480 #1153

[BUG] performance test is failing following timestamp sync change sof#8480 #1153

Comments

btian1 commented Jan 16, 2024 • edited by marc-hb Loading

btian1 commented Jan 16, 2024 • edited Loading

btian1 commented Jan 16, 2024 • edited by marc-hb Loading

marc-hb commented Jan 16, 2024 • edited Loading

marc-hb commented Jan 16, 2024

aiChaoSONG commented Jan 17, 2024 • edited Loading

marc-hb commented Jan 17, 2024 • edited Loading

marc-hb commented Jan 17, 2024 • edited Loading

aiChaoSONG commented Jan 17, 2024

marc-hb commented Jan 17, 2024

aiChaoSONG commented Jan 17, 2024 • edited Loading

marc-hb commented Jan 18, 2024 • edited Loading

marc-hb commented Apr 19, 2024

btian1 commented Jan 16, 2024 •

edited by marc-hb

Loading

btian1 commented Jan 16, 2024 •

edited

Loading

btian1 commented Jan 16, 2024 •

edited by marc-hb

Loading

marc-hb commented Jan 16, 2024 •

edited

Loading

aiChaoSONG commented Jan 17, 2024 •

edited

Loading

marc-hb commented Jan 17, 2024 •

edited

Loading

marc-hb commented Jan 17, 2024 •

edited

Loading

aiChaoSONG commented Jan 17, 2024 •

edited

Loading

marc-hb commented Jan 18, 2024 •

edited

Loading