Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] performance test is failing following timestamp sync change sof#8480 #1153

Closed
btian1 opened this issue Jan 16, 2024 · 12 comments
Closed
Assignees

Comments

@btian1
Copy link
Contributor

btian1 commented Jan 16, 2024

Describe the bug
A clear and concise description of what the bug is.
CI build performance test is failing

To Reproduce
use local mtrace and dmesg file can reproduce as well

Reproduction Rate
100%.

Expected behavior
A clear and concise description of what you expected to happen.

Impact
CI performance build test should pass.

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
    • Kernel: since ww02.2
    • SOF: since ww02.2
  2. Name of the topology file
    • Topology: default
  3. Name of the platform(s) on which the bug is observed.
    • Platform: mtl

image

@btian1 btian1 self-assigned this Jan 16, 2024
@btian1
Copy link
Contributor Author

btian1 commented Jan 16, 2024

passed: https://sof-ci.ostc.intel.com/#/result/planresultdetail/36559
failed: https://sof-ci.ostc.intel.com/#/result/planresultdetail/36620

passed mtrace logging:
image

failed mtrace logging:
image

Current performance testing are requesting skip trace until find trace timestamp between [0, 1] as the start
of valid mtrace performance logging, due to current sof-test test case may collect some traces belonging to previous
test case due to mtrace is configured in deferred mode.

So once Rander aligned kernel and DSP timestamp, DSP timestamp only restart(from 0) on DSP power off/on, this means
only the first test case(or test case require DSP restart) have timestamp with [0,1].

with current option, until mtrace logging end, there may no timestamp between [0,1], so raise this exception and cause performance test failed.

Current solution is remove this option and python script will directly try to find logs include "comp_copy" as valid module performance logging, then collect and calculate its performance data.

This is also working with previous mtrace logs, no other negative impact observed so far.

@btian1
Copy link
Contributor Author

btian1 commented Jan 16, 2024

PR submitted for review and discuss:

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 16, 2024

Thanks for collecting logs. Next time please avoid screenshots: https://docs.zephyrproject.org/latest/develop/getting_started/index.html#use-copy-paste

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 16, 2024

Thanks for collecting these logs and describing the change, this is great progress.

So once Rander aligned kernel and DSP timestamp, DSP timestamp only restart(from 0) on DSP power off/on
this means only the first test case(or test case require DSP restart) have timestamp with [0,1].

The initial timestamp sync message (0x44000000|0x31400008) is sent only every time the DSP boots. This sync message can be seen in all the logs you shared. So we can see that the DSP boots in all the logs you shared. Yet the timestamp does not always restarts from zero. So no: "DSP boot" != "timestamp from 0"; it's more complicated.

I investigated this a bit and I think I found the key fact and key difference that you have been constantly missing so far: booting from IMR.

Current solution is remove this option and python script will directly try to find logs include "comp_copy" as valid module performance logging, then collect and calculate its performance data.

Very interesting, so you're saying that the --skip-trace option was ALREADY useless, even BEFORE Rander's timestamp change? This would be another, very important fact. @aiChaoSONG was --skip-trace always useless from the beginning? If not, what other change made --skip-trace already useless before Rander's timestamp change?

@marc-hb marc-hb transferred this issue from thesofproject/sof Jan 16, 2024
@marc-hb marc-hb changed the title [BUG] performance test is failing [BUG] performance test is failing following timestamp sync change sof#8480 Jan 16, 2024
@aiChaoSONG
Copy link

aiChaoSONG commented Jan 17, 2024

@aiChaoSONG was --skip-trace always useless from the beginning? If not, what other change made --skip-trace already useless before Rander's timestamp change?

The several lines of trace at the beginning in a test belongs to the previous test, that's why we(or maybe I) want to skip them. Before Rander's patch, there's a clear division you can tell start from where, the traces belong to current test, but after Rander's patch, you can't.
In CI, we set --skip-to-first-trace, so all traces are skipped after Rander's patch, that's why the test case failed. My proposal is to not use the --skip-to-first-trace option here

perf_cmd="sof_perf_analyzer.py --skip-to-first-trace --kmsg=$LOG_ROOT/dmesg.txt --out2html $LOG_ROOT/sof_perf.html $LOG_ROOT/mtrace.txt"
, so we have some trace to analyze, though, several lines belong to previous test, but it doesn't harm.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 17, 2024

Thanks @aiChaoSONG , this is useful. So you added the --skip-to-first-trace just for trying to be very conservative / risk-averse but you never saw this option actually pollute test results? Or barely? Or only in people's local workspace but not in CI? Thanks for sharing.

I had a look at these but I couldn't tell:

but after Rander's patch, you can't.

I think I finally found why. Too bad I had to analyze this myself.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 17, 2024

but after Rander's patch, you can't.

Actually, we could/should probably use something like this

This would help not just with the logs, it would also ensure the FW itself is measured in a "clean" state.

I mean in the future, we obviously need a shorter-term solution now.

@aiChaoSONG
Copy link

@marc-hb If test case is run by developer manually, before Rander's patch, timestamp is always start from 0, so with/without this option won't make any difference. But after Rander's patch, it will fail because the timestamp may not start from 0, and due to the mechanism we use to skip trace, no trace will left for analysis in the end. So removing the use of --skip-to-first-trace is a simple way to make the test case functioning, you see, we don't have good solution to remove traces from last test after Rander's change

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 17, 2024

we don't have good solution to remove traces from last test after Rander's change

That's the impression I got, thanks for confirming. But the main question is still: how much pollution will this typically cause? If too much / too often, is there some trick we could use to reduce it?

Another option could be to disable IMR boot when testing performance, I think it shows the zero timestamp again for a few lines. Any idea how hard disabling IMR would that be?

@aiChaoSONG
Copy link

aiChaoSONG commented Jan 17, 2024

how much pollution will this typically cause?

Not that much. Normally the residue traces are related to trigger stop IPC, dma put and component reset (as you can see). There could be 10 ~ 20 lines, but they won't affect performance measurement, because we focus on trace lines that are within the component log context and printed by comp_copy function in performance measurement. So, even with the traces from previous test, it is all good.

Any idea how hard disabling IMR would that be?

We have a kernel option sof_debug, one of its bit is used to control boot from IMR or not. But we have to set it only for performance test in deployment, which is more complex.

I think remove the use of --skip-to-first-trace is still the fastest and simplest way to workaround the current dilemma with barely negative impact.

btian1 added a commit to btian1/sof-test that referenced this issue Jan 17, 2024
The --skip-to-first-trace option was added as a precaution to exclude
log pollution from before the start of a test but how much pollution
it excludes was never assessed; we just don't know.
Hopefully not much? Replace, see below.

mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace
depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"),
mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots,
not on when booting from IMR. Booting from IMR has now been enabled by default
in most configurations. That combination breaks --skip-to-first-trace.

The only practical short-term solution is to turn off --skip-to-first-trace and take
the risk of old logs polluting the test results, which is what this commit does.

If results pollution proves to be a problem, we could in the longer term try to disable
IMR boot just before running this test using sof_debug as discussed in thesofproject#1153
This would also measure from a possibly cleaner DSP state. New, DSP "debug ops"
being proposed in thesofproject/linux#4726 might also help.

Fixes: thesofproject#1153

Signed-off-by: Baofeng Tian <[email protected]>
Signed-off-by: Chao Song <[email protected]>
btian1 added a commit to btian1/sof-test that referenced this issue Jan 17, 2024
The --skip-to-first-trace option was added as a precaution to exclude
log pollution from before the start of a test, normally the residue traces are
related to trigger stop IPC, dma put and component reset (as you can see).
There could be 10 ~ 20 lines.

mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace
depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"),
mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots,
not on when booting from IMR. Booting from IMR has now been enabled by default
in most configurations. That combination breaks --skip-to-first-trace.

The only practical short-term solution is to turn off --skip-to-first-trace and take
the risk of old logs polluting the test results, which is what this commit does.

If results pollution proves to be a problem, we could in the longer term try to disable
IMR boot just before running this test using sof_debug as discussed in thesofproject#1153
This would also measure from a possibly cleaner DSP state. New, DSP "debug ops"
being proposed in thesofproject/linux#4726 might also help.

Fixes: thesofproject#1153

Signed-off-by: Baofeng Tian <[email protected]>
Signed-off-by: Chao Song <[email protected]>
@marc-hb
Copy link
Collaborator

marc-hb commented Jan 18, 2024

https://github.com/orgs/community/discussions/66741

Let's close when the test is passing.

@marc-hb marc-hb reopened this Jan 18, 2024
@marc-hb
Copy link
Collaborator

marc-hb commented Apr 19, 2024

test has been passing for a while https://sof-ci.ostc.intel.com/#/result/planresultdetail/40027

@marc-hb marc-hb closed this as completed Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants