Skip to content

Resolve the segmentation fault occurring in the pw float implementation #6130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 43 commits into
base: develop
Choose a base branch
from

Conversation

A-006
Copy link
Collaborator

@A-006 A-006 commented Apr 8, 2025

Linked Issue

Fix #6111

Unit Tests and/or Case Tests for My Changes

  • Added unit tests for the pw_basis float type.
  • Included integration tests for the pw_basis float type.

What's Changed?

  • Fixed a bug in the runtime handling of the pw_basis float type.
  • Enabled GPU support for pw_basis without requiring the compilation flag ENABLE_FFTW_FLOAT.
  • ATTENTION!!! In order to reduce the test time of the GPU/CPU,i just add the PW_CG PW_CG_GPU float type to test the float version FFT can be normally used.

@mohanchen mohanchen added the Refactor Refactor ABACUS codes label Apr 21, 2025
@kogareru1z
Copy link

Could you please fix the single-precision calculation issue for the LCAO basis? When I tested with ABACUS 3.9.0.3, I encountered the same problem as with PW, but the error does not occur in version 3.10.

@A-006
Copy link
Collaborator Author

A-006 commented Apr 23, 2025

Could you please fix the single-precision calculation issue for the LCAO basis? When I tested with ABACUS 3.9.0.3, I encountered the same problem as with PW, but the error does not occur in version 3.10.

Alright, I'll try to address this issue later. However, I recall that the LCAO basis does not support a single version; I will verify this information

@A-006
Copy link
Collaborator Author

A-006 commented Apr 24, 2025

Could you please fix the single-precision calculation issue for the LCAO basis? When I tested with ABACUS 3.9.0.3, I encountered the same problem as with PW, but the error does not occur in version 3.10.

Currently, the LCAO does not support a single version. Could you please inform me whether you intend to use the GPU LCAO single? Actually, this feature will be implemented by my partner. After the pull request (PR) is merged, you will be able to set the GPU LCAO to single; however, in reality, it will run as GPU LCAO double.

@kogareru1z
Copy link

Could you please fix the single-precision calculation issue for the LCAO basis? When I tested with ABACUS 3.9.0.3, I encountered the same problem as with PW, but the error does not occur in version 3.10.

Currently, the LCAO does not support a single version. Could you please inform me whether you intend to use the GPU LCAO single? Actually, this feature will be implemented by my partner. After the pull request (PR) is merged, you will be able to set the GPU LCAO to single; however, in reality, it will run as GPU LCAO double.

I tested version 3.10 in single precision, but it still runs in double precision. If that’s the case, I don’t need this feature right now. Thank you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two lines

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two lines

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README has already been resolved.

@@ -98,7 +98,7 @@ void PW_Basis::gatherp_scatters(std::complex<T>* in, std::complex<T>* out) const
template <typename T>
void PW_Basis::gathers_scatterp(std::complex<T>* in, std::complex<T>* out) const
{
//ModuleBase::timer::tick(this->classname, "gathers_scatterp");
ModuleBase::timer::tick(this->classname, "gathers_scatterp");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it cost a lot of counting numbers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time ticker has been deleted.

@@ -2,6 +2,7 @@ remove_definitions(-D__DEEPKS)
remove_definitions(-D__CUDA)
remove_definitions(-D__ROCM)
remove_definitions(-D__EXX)
remove_definitions(-DUSE_PAW)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will delete PAW

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command has been deleted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't write README in this way, need to discuss

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README has been rewritten.

@@ -257,7 +257,11 @@ LR::ESolver_LR<T, TR>::ESolver_LR(ModuleESolver::ESolver_KS_LCAO<T, TR>&& ks_sol
this->gint_->reset_DMRGint(1);

// move pw basis
delete this->pw_rho; // newed in ESolver_FP::ESolver_FP
if (this->pw_rho_flag)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to discuss

@@ -26,4 +27,31 @@ AddTest(
TARGET radial_proj_test
LIBS parameter base device ${math_libs}
SOURCES radial_proj_test.cpp ../radial_proj.cpp
)

AddTest(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the aim of this test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error with the float type originated from the structure_factor function, which did not have a test for float inputs. We have now added the test to prevent this issue from occurring again.

delete this->pw_rho;
this->pw_rho_flag = false;
}
if ( PARAM.globalv.double_grid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra blank found

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blank has been deleted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Refactor ABACUS codes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Please enable floatfftw in the cmake to use float fft when precision is set to single using GPU
3 participants