
WIP: EDGE3D update #478 (Draft)

pguthrey wants to merge 7 commits into develop from feature/guthrey1/edge_3d_update

Conversation

@pguthrey (Contributor) commented Sep 9, 2024

Summary

  • This PR is a refactoring
  • It does the following:
    • Explores performance improvements to EDGE3D

pguthrey force-pushed the feature/guthrey1/edge_3d_update branch 3 times, most recently from 32723a7 to 702238f on September 11, 2024 22:29
@rhornung67 (Member)

@pguthrey is this ready for review?

pguthrey marked this pull request as draft on October 13, 2024 17:18
@pguthrey (Contributor, Author)

> @pguthrey is this ready for review?

Not just yet. I need to experiment more with different implementations and will come back to this later.
Also, there are additional apps that are broken with respect to the OpenMP target variant, so I think I will pull that out into another branch and work with others to fix those examples (unless that variant is no longer supported).

pguthrey force-pushed the feature/guthrey1/edge_3d_update branch from 050da7e to c9b19a6 on November 23, 2024 03:46
@pguthrey (Contributor, Author) commented Nov 27, 2024

Here are the results of these changes. Good improvement for CUDA. The HIP improvement looks too good to be true. I checked that the results match the previous algorithm, but I might look more into what is going on with HIP.

| CUDA | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_CUDA | Lambda_CUDA | RAJA_CUDA |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 4.42E+01 | 4.45E+01 | 4.44E+01 | 8.98E-01 | 9.08E-01 | 9.04E-01 | 1.53E-01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.05E+01 | 3.06E+01 | 3.06E+01 | 6.08E-01 | 6.11E-01 | 6.11E-01 | 7.06E-02 | 7.02E-02 | 7.04E-02 |
| speedup current/new | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 2.2 | 2.2 | 2.2 |
| speedup openmp/kernel | | | | 1.0 | 1.0 | 1.0 | 8.6 | 8.7 | 8.6 |

| HIP | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_HIP | Lambda_HIP | RAJA_HIP |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 1.74E+01 | 1.74E+01 | 1.74E+01 | 4.36E-01 | 3.79E-01 | 4.44E-01 | 1.34E-01 | 9.64E-02 | 9.73E-02 |
| new impl | 1.56E+01 | 1.54E+01 | 1.54E+01 | 3.51E-01 | 3.70E-01 | 4.01E-01 | 1.03E-03 | 9.90E-05 | 9.10E-05 |
| speedup current/new | 1.1 | 1.1 | 1.1 | 1.2 | 1.0 | 1.1 | 130.8 | 974.0 | 1069.0 |
| speedup new openmp/kernel | | | | 1.0 | 0.9 | 0.9 | 341.7 | 3541.4 | 3852.7 |

@MrBurmark (Member)

Perhaps there is still register spilling with CUDA, or something like that, making a dramatic difference. We'll have to look at the generated instructions to see what happened.

@pguthrey (Contributor, Author)

> Perhaps there is still register spilling with CUDA, or something like that, making a dramatic difference. We'll have to look at the generated instructions to see what happened.

That makes some sense. If I add up the values needed for the symmetric matrix and the vectors, I get

12*13/2 + 3*12 + 3*12 = 78 + 36 + 36 = 150 > 128
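For reference, a minimal sketch of where that count comes from, assuming a 12x12 symmetric system stored as a packed upper triangle plus two 3x12 array blocks. All names here are illustrative, not the PR's code; only the arithmetic is taken from the comment above:

```cpp
// Illustrative only: the 12*13/2 + 3*12 + 3*12 = 150 value count.
constexpr int EB = 12;                  // assumed local basis size
constexpr int NSYM = EB * (EB + 1) / 2; // 78 unique entries of a symmetric 12x12

// Row-major packed index of entry (i, j) with i <= j.
constexpr int sym_idx(int i, int j)
{
  return i * EB - i * (i - 1) / 2 + (j - i);
}

double sym[NSYM];    // 78 values for the symmetric matrix
double vec_a[3][EB]; // 36 values (hypothetical vector block)
double vec_b[3][EB]; // 36 values (hypothetical vector block)
                     // total: 150 values live per thread
```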

@pguthrey (Contributor, Author)

Updated results after fixing a bug.

| ATS2 | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_CUDA | Lambda_CUDA | RAJA_CUDA |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 4.42E+01 | 4.45E+01 | 4.44E+01 | 8.98E-01 | 9.08E-01 | 9.04E-01 | 1.53E-01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.18E+01 | 3.18E+01 | 3.18E+01 | 6.35E-01 | 6.40E-01 | 6.41E-01 | 1.63E-01 | 1.63E-01 | 1.63E-01 |
| speedup current/new | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 0.9 | 0.9 | 0.9 |
| speedup openmp/kernel | | | | 1.0 | 1.0 | 1.0 | 3.9 | 3.9 | 3.9 |

| ATS4 | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_HIP | Lambda_HIP | RAJA_HIP |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 1.74E+01 | 1.74E+01 | 1.74E+01 | 4.36E-01 | 3.79E-01 | 4.44E-01 | 1.34E-01 | 9.64E-02 | 9.73E-02 |
| new impl | 15.460 | 15.529 | 15.517 | 0.288 | 0.170 | 0.166 | 0.006 | 0.005 | 0.005 |
| speedup current/new | 1.1 | 1.1 | 1.1 | 1.5 | 2.2 | 2.7 | 21.6 | 18.4 | 18.7 |
| speedup new openmp/kernel | | | | 1.0 | 1.7 | 1.7 | 46.3 | 55.1 | 55.3 |

pguthrey self-assigned this on Dec 31, 2024
pguthrey requested a review from MrBurmark on December 31, 2024 01:09
@MrBurmark (Member)

It actually made CUDA slower?

Inline review comment on:

```cpp
const rajaperf::Real_type detj_tol,
const rajaperf::Int_type quad_type,
const rajaperf::Int_type quad_order,
rajaperf::Real_type (&matrix)[EB][EB])
```
@MrBurmark (Member)

There is still a full matrix?

@pguthrey (Contributor, Author)

The interface that this is used for requires having a full matrix. However, when we are computing the work at each quadrature point, we are using a symmetric matrix.
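To illustrate what that separation could look like, a hedged sketch reusing the packed layout from the earlier sketch (illustrative names, not the PR's code): the per-quadrature-point work touches only the packed symmetric entries, and the full interface matrix is populated once at the end.

```cpp
// Illustrative only: fill the full EB x EB interface matrix from the
// packed symmetric accumulator after all quadrature-point work is done.
for (int i = 0; i < EB; ++i) {
  for (int j = i; j < EB; ++j) {
    const double v = sym[sym_idx(i, j)];
    matrix[i][j] = v;
    matrix[j][i] = v; // mirror across the diagonal
  }
}
```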

@pguthrey (Contributor, Author)

I can follow up with a study on how beneficial it would be to never create the full matrix. If that has a major impact, it may be enough incentive to rewrite how things are done in the ultimate use case.

@pguthrey (Contributor, Author)

Did a quick study measuring the performance of an implementation that has no 12x12 matrices. It's indistinguishable from the performance seen in this PR.

@pguthrey (Contributor, Author)

> It actually made CUDA slower?

I would not say that is a statistically significant difference; it could be within sampling tolerance.

@pguthrey (Contributor, Author) commented Jan 3, 2025

| Rows | CPU runs | CPU runs | Raw CUDA/HIP | RAJA |
|---|---|---|---|---|
| **ATS2** | Base_Seq | RAJA_Seq | Base_CUDA | RAJA_CUDA |
| current impl | 4.42E+01 | 4.44E+01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.18E+01 | 3.18E+01 | 1.63E-01 | 1.63E-01 |
| speedup current/new | 1.4 | 1.4 | 0.9 | 0.9 |
| speedup gpu v 1 cpu | | | 195.0 | 194.7 |
| **ATS4** | Base_Seq | RAJA_Seq | Base_HIP | RAJA_HIP |
| current impl | 1.74E+01 | 1.74E+01 | 1.34E-01 | 9.73E-02 |
| new impl | 15.460 | 15.517 | 0.006 | 0.005 |
| speedup current/new | 1.1 | 1.1 | 21.6 | 18.7 |
| speedup gpu v 1 cpu | | | 2485.5 | 2983.5 |
| **Venado** | Base_Seq | RAJA_Seq | Base_CUDA | RAJA_CUDA |
| current impl | 3.070 | 13.027 | 1.401 | 1.428 |
| new impl | 9.597 | 9.352 | 1.013 | 1.033 |
| speedup current/new | 0.3 | 1.4 | 1.4 | 1.4 |
| speedup gpu v 1 cpu | | | 9.5 | 9.1 |
| Speedup ATS4 vs ATS2 | | | 26.186 | 31.424 |
| Speedup Venado vs ATS2 | | | 0.161 | 0.158 |
