
WIP: EDGE3D update #478 (Draft)

pguthrey wants to merge 7 commits into develop from feature/guthrey1/edge_3d_update

Conversation

@pguthrey (Contributor) commented Sep 9, 2024

Summary

  • This PR is a refactoring
  • It does the following:
    • Explores performance improvements to EDGE3D

pguthrey force-pushed the feature/guthrey1/edge_3d_update branch 3 times, most recently from 32723a7 to 702238f on September 11, 2024 22:29
@rhornung67 (Member)

@pguthrey is this ready for review?

pguthrey marked this pull request as draft on October 13, 2024 17:18
@pguthrey (Contributor, Author)

> @pguthrey is this ready for review?

Not just yet. I need to experiment more with different implementations and will come back to this later.
Also, there are additional apps that are broken with respect to the OpenMP target variant, so I think I will pull that out into another branch and work with others to fix those examples (unless that variant is no longer supported).

pguthrey force-pushed the feature/guthrey1/edge_3d_update branch from 050da7e to c9b19a6 on November 23, 2024 03:46
@pguthrey (Contributor, Author) commented Nov 27, 2024

Here are the results of these changes. Good improvement for CUDA. The HIP improvement looks too good to be true. I checked that the results match the previous algorithm, but I might look more into what is going on with HIP.

| CUDA | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_CUDA | Lambda_CUDA | RAJA_CUDA |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 4.42E+01 | 4.45E+01 | 4.44E+01 | 8.98E-01 | 9.08E-01 | 9.04E-01 | 1.53E-01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.05E+01 | 3.06E+01 | 3.06E+01 | 6.08E-01 | 6.11E-01 | 6.11E-01 | 7.06E-02 | 7.02E-02 | 7.04E-02 |
| speedup current/new | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 2.2 | 2.2 | 2.2 |
| speedup openmp/kernel | | | | 1.0 | 1.0 | 1.0 | 8.6 | 8.7 | 8.6 |

| HIP | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_HIP | Lambda_HIP | RAJA_HIP |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 1.74E+01 | 1.74E+01 | 1.74E+01 | 4.36E-01 | 3.79E-01 | 4.44E-01 | 1.34E-01 | 9.64E-02 | 9.73E-02 |
| new impl | 1.56E+01 | 1.54E+01 | 1.54E+01 | 3.51E-01 | 3.70E-01 | 4.01E-01 | 1.03E-03 | 9.90E-05 | 9.10E-05 |
| speedup current/new | 1.1 | 1.1 | 1.1 | 1.2 | 1.0 | 1.1 | 130.8 | 974.0 | 1069.0 |
| speedup new openmp/kernel | | | | 1.0 | 0.9 | 0.9 | 341.7 | 3541.4 | 3852.7 |

@MrBurmark (Member)

Perhaps there is still register spilling with CUDA, or something like that, making a dramatic difference. We'll have to look at the generated instructions to see what happened.

@pguthrey (Contributor, Author)

> Perhaps there is still register spilling with CUDA, or something like that, making a dramatic difference. We'll have to look at the generated instructions to see what happened.

That makes some sense. If I add up the values needed for the symmetric matrix and the vectors, I get

12*13/2 + 3*12 + 3*12 = 78 + 36 + 36 = 150 > 128
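For reference, a minimal sketch of where that count comes from, assuming a 12x12 symmetric system stored as a packed upper triangle plus two 3x12 array blocks. All names here are illustrative, not the PR's code; only the arithmetic is taken from the comment above:

```cpp
// Illustrative only: the 12*13/2 + 3*12 + 3*12 = 150 value count.
constexpr int EB = 12;                  // assumed local basis size
constexpr int NSYM = EB * (EB + 1) / 2; // 78 unique entries of a symmetric 12x12

// Row-major packed index of entry (i, j) with i <= j.
constexpr int sym_idx(int i, int j)
{
  return i * EB - i * (i - 1) / 2 + (j - i);
}

double sym[NSYM];    // 78 values for the symmetric matrix
double vec_a[3][EB]; // 36 values (hypothetical vector block)
double vec_b[3][EB]; // 36 values (hypothetical vector block)
                     // total: 150 values live per thread
```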

@pguthrey (Contributor, Author)

Updated results after fixing a bug.

| ATS2 | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_CUDA | Lambda_CUDA | RAJA_CUDA |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 4.42E+01 | 4.45E+01 | 4.44E+01 | 8.98E-01 | 9.08E-01 | 9.04E-01 | 1.53E-01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.18E+01 | 3.18E+01 | 3.18E+01 | 6.35E-01 | 6.40E-01 | 6.41E-01 | 1.63E-01 | 1.63E-01 | 1.63E-01 |
| speedup current/new | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 0.9 | 0.9 | 0.9 |
| speedup openmp/kernel | | | | 1.0 | 1.0 | 1.0 | 3.9 | 3.9 | 3.9 |

| ATS4 | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_HIP | Lambda_HIP | RAJA_HIP |
|---|---|---|---|---|---|---|---|---|---|
| variant | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 1.74E+01 | 1.74E+01 | 1.74E+01 | 4.36E-01 | 3.79E-01 | 4.44E-01 | 1.34E-01 | 9.64E-02 | 9.73E-02 |
| new impl | 15.460 | 15.529 | 15.517 | 0.288 | 0.170 | 0.166 | 0.006 | 0.005 | 0.005 |
| speedup current/new | 1.1 | 1.1 | 1.1 | 1.5 | 2.2 | 2.7 | 21.6 | 18.4 | 18.7 |
| speedup new openmp/kernel | | | | 1.0 | 1.7 | 1.7 | 46.3 | 55.1 | 55.3 |

pguthrey self-assigned this on Dec 31, 2024
pguthrey requested a review from MrBurmark on December 31, 2024 01:09
@MrBurmark (Member)

It actually made CUDA slower?

Inline review comment on:

```cpp
const rajaperf::Real_type detj_tol,
const rajaperf::Int_type quad_type,
const rajaperf::Int_type quad_order,
rajaperf::Real_type (&matrix)[EB][EB])
```
@MrBurmark (Member)

There is still a full matrix?

@pguthrey (Contributor, Author)

The interface that this is used for requires having a full matrix. However, when we are computing the work at each quadrature point, we are using a symmetric matrix.
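To illustrate what that separation could look like, a hedged sketch reusing the packed layout from the earlier sketch (illustrative names, not the PR's code): the per-quadrature-point work touches only the packed symmetric entries, and the full interface matrix is populated once at the end.

```cpp
// Illustrative only: fill the full EB x EB interface matrix from the
// packed symmetric accumulator after all quadrature-point work is done.
for (int i = 0; i < EB; ++i) {
  for (int j = i; j < EB; ++j) {
    const double v = sym[sym_idx(i, j)];
    matrix[i][j] = v;
    matrix[j][i] = v; // mirror across the diagonal
  }
}
```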

@pguthrey (Contributor, Author)

I can follow up with a study on how beneficial it would be to never create the full matrix. If that has a major impact, it may be enough incentive to rewrite how things are done in the ultimate use case.

@pguthrey (Contributor, Author)

Did a quick study measuring the performance of an implementation that has no 12x12 matrices. It's indistinguishable from the performance seen in this PR.

@pguthrey (Contributor, Author)

> It actually made CUDA slower?

I would not say that is a statistically significant difference; it could be within sampling tolerance.

@pguthrey (Contributor, Author) commented Jan 3, 2025

| Rows | CPU runs | CPU runs | Raw CUDA/HIP | RAJA |
|---|---|---|---|---|
| **ATS2** | Base_Seq | RAJA_Seq | Base_CUDA | RAJA_CUDA |
| current impl | 4.42E+01 | 4.44E+01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.18E+01 | 3.18E+01 | 1.63E-01 | 1.63E-01 |
| speedup current/new | 1.4 | 1.4 | 0.9 | 0.9 |
| speedup gpu v 1 cpu | | | 195.0 | 194.7 |
| **ATS4** | Base_Seq | RAJA_Seq | Base_HIP | RAJA_HIP |
| current impl | 1.74E+01 | 1.74E+01 | 1.34E-01 | 9.73E-02 |
| new impl | 15.460 | 15.517 | 0.006 | 0.005 |
| speedup current/new | 1.1 | 1.1 | 21.6 | 18.7 |
| speedup gpu v 1 cpu | | | 2485.5 | 2983.5 |
| **Venado** | Base_Seq | RAJA_Seq | Base_CUDA | RAJA_CUDA |
| current impl | 3.070 | 13.027 | 1.401 | 1.428 |
| new impl | 9.597 | 9.352 | 1.013 | 1.033 |
| speedup current/new | 0.3 | 1.4 | 1.4 | 1.4 |
| speedup gpu v 1 cpu | | | 9.5 | 9.1 |
| Speedup ATS4 vs ATS2 | | | 26.186 | 31.424 |
| Speedup Venado vs ATS2 | | | 0.161 | 0.158 |
