The design and optimization of API Benchmark #284

Open
Xreki opened this issue Nov 27, 2019 · 1 comment

Comments

@Xreki
Collaborator

Xreki commented Nov 27, 2019

No description provided.

@Xreki
Collaborator Author

Xreki commented Nov 27, 2019

Program = feed + abs + fetch

  • Profile data
------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::GpuMemcpySync:GPU->CPU     10          65.3952     39.898246 (0.610110)    25.496945 (0.389890)    6.46307     6.75515     6.53952     0.434411
thread0::fetch                      10          42.5865     37.686449 (0.884939)    4.900031 (0.115061)     4.15256     4.90003     4.25865     0.282896
thread0::TensorCopySync:GPU->CPU    10          41.8076     37.494616 (0.896837)    4.313012 (0.103163)     4.13309     4.31301     4.18076     0.277722
thread0::abs                        10          0.6688      0.450827 (0.674083)     0.217973 (0.325917)     0.052712    0.134069    0.06688     0.00444274
thread0::feed                       10          0.079468    0.064448 (0.810993)     0.015020 (0.189007)     0.005744    0.01502     0.0079468   0.000527895

{
  name: "abs",
  device: "GPU",
  precision: { stable: "True", diff: 0.00000 },
  speed: { repeat: 10, start: 1, end: 9, total: 5.08994, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
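
As a reading aid for the report above, here is a minimal sketch that recomputes the `Ave.` and `Ratio.` columns from `Calls` and `Total`. The interpretation is an assumption inferred from the numbers, not something the profiler output states: `Ave.` is taken as `Total / Calls`, and `Ratio.` as `Total` divided by the sum of all listed totals. The names `ROWS` and `summarize` are hypothetical helpers for illustration.

```python
# (event, calls, total_ms) copied from the profiling report above.
ROWS = [
    ("GpuMemcpySync:GPU->CPU", 10, 65.3952),
    ("fetch", 10, 42.5865),
    ("TensorCopySync:GPU->CPU", 10, 41.8076),
    ("abs", 10, 0.6688),
    ("feed", 10, 0.079468),
]


def summarize(rows):
    """Recompute per-event average time and share of the summed totals.

    Assumption: Ratio = Total / sum(Total over all listed events).
    """
    grand_total = sum(total for _, _, total in rows)
    return {
        name: {"ave": total / calls, "ratio": total / grand_total}
        for name, calls, total in rows
    }


summary = summarize(ROWS)
for name, stats in summary.items():
    print(f"{name:26s} ave={stats['ave']:.5f} ms  ratio={stats['ratio']:.6f}")
```

Under this reading, the two copy events and `fetch` together make up nearly all of the reported time, while `abs` itself is well under 1%. The events may overlap in wall-clock time (the copies appear to run as part of `fetch`), so the ratios are best read as relative weights rather than disjoint shares.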
  • The CPU->GPU transfer of the feed data already starts when the feed data is set in the Executor; it does not happen inside the feed op.
    image

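  • The GPU->CPU transfer of the fetch data happens inside the fetch op. After the GPU operation at the bottom of the timeline finishes, the cuda_api layer still takes a long time.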
    image
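
The fetch observation can be cross-checked against the profiling report with a small sketch that splits each event's time into CPU-side and GPU-side parts per call. The figures are copied from the report; `ROWS` and `per_call_split` are hypothetical names, and reading "CPU Time" as host-side time (including time spent in CUDA API calls) is an assumption.

```python
# (event, calls, cpu_ms, gpu_ms) copied from the profiling report.
ROWS = [
    ("fetch", 10, 37.686449, 4.900031),
    ("TensorCopySync:GPU->CPU", 10, 37.494616, 4.313012),
    ("abs", 10, 0.450827, 0.217973),
]


def per_call_split(calls, cpu_ms, gpu_ms):
    """Return (cpu ms per call, gpu ms per call, cpu share of the total)."""
    total = cpu_ms + gpu_ms
    return cpu_ms / calls, gpu_ms / calls, cpu_ms / total


split = {name: per_call_split(calls, cpu, gpu) for name, calls, cpu, gpu in ROWS}
for name, (cpu_call, gpu_call, cpu_share) in split.items():
    print(f"{name:26s} cpu={cpu_call:.4f} ms  gpu={gpu_call:.4f} ms  cpu share={cpu_share:.3f}")
```

For `fetch` this gives roughly 3.77 ms of CPU-side time against 0.49 ms of GPU time per call, which is consistent with the timeline showing a long stretch in the cuda_api layer after the GPU work has finished.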
