Fenix_Init hangs after an error when app contains (multiple) derived robust communicators #25

Open
rfvander opened this issue Mar 23, 2017 · 4 comments


rfvander commented Mar 23, 2017

The AMR PRK derives two robust communicators from the original robust communicator. In this implementation we use MPI_COMM_WORLD as the input communicator for Fenix_Init, and provide NULL for the output communicator address.

@rfvander rfvander added the bug label Mar 23, 2017

rfvander commented Mar 23, 2017

I initially thought the code hangs in the preinit part, but I take that back. The code hangs in Fenix_Init; I don't know where yet.

rfvander commented

In the successful run with the Stencil PRK, when I kill one rank, the Fenix library reports that some processes failed and that the communicator was revoked. In the hanging AMR PRK run, no communicator is reported as revoked.


rfvander commented Mar 24, 2017

More debug info. In the successful Stencil PRK, if I use newcomm as the resilient output communicator and replace all subsequent references to MPI_COMM_WORLD with newcomm, the code runs successfully after an error. However, if I use NULL for the resilient output communicator and use MPI_Comm_dup to create newcomm out of MPI_COMM_WORLD after Fenix_Init, the code hangs after an error.

@rfvander rfvander added enhancement and removed bug labels Mar 24, 2017
rfvander commented

We need to maintain a list of communicators derived from the robust output communicator, and revoke all of them when an error occurs.
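
To make the proposal concrete, here is a minimal sketch of such a list in plain C. It is not Fenix code: every name in it (`derived_registry`, `registry_add`, `registry_revoke_all`, `mock_comm`, `mock_revoke`, `MAX_DERIVED`) is hypothetical, and `mock_comm` stands in for `MPI_Comm` so that the sketch compiles and runs without MPI. The revoke step is passed in as a callback, on the assumption that a real implementation would plug in whatever revocation call the fault-tolerant MPI layer provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: none of these names are part of Fenix or MPI. */
#define MAX_DERIVED 16

/* Stand-in for an MPI communicator so the sketch runs without MPI. */
typedef struct {
    int revoked;
} mock_comm;

/* Callback that revokes one communicator; returns 0 on success. */
typedef int (*revoke_fn)(mock_comm *comm);

/* Fixed-size list of communicators derived from the resilient one. */
typedef struct {
    mock_comm *comms[MAX_DERIVED];
    size_t count;
} derived_registry;

void registry_init(derived_registry *r) {
    assert(r != NULL);
    r->count = 0;
}

/* Record a derived communicator. Returns 0, or -1 if the list is full. */
int registry_add(derived_registry *r, mock_comm *comm) {
    assert(r != NULL && comm != NULL);
    if (r->count >= MAX_DERIVED) return -1;
    r->comms[r->count++] = comm;
    return 0;
}

/* Forget a communicator the application has freed.
   Returns 0 if it was tracked, -1 otherwise. Order is not preserved. */
int registry_remove(derived_registry *r, mock_comm *comm) {
    assert(r != NULL);
    for (size_t i = 0; i < r->count; i++) {
        if (r->comms[i] == comm) {
            r->comms[i] = r->comms[r->count - 1];
            r->count--;
            return 0;
        }
    }
    return -1;
}

/* On error: revoke every tracked communicator, then empty the list.
   Returns how many revocations reported success. */
size_t registry_revoke_all(derived_registry *r, revoke_fn revoke) {
    assert(r != NULL && revoke != NULL);
    size_t ok = 0;
    for (size_t i = 0; i < r->count; i++) {
        if (revoke(r->comms[i]) == 0) ok++;
    }
    r->count = 0;
    return ok;
}

/* Mock revocation used only for illustration. */
int mock_revoke(mock_comm *comm) {
    comm->revoked = 1;
    return 0;
}
```

In a real build the callback would presumably wrap the ULFM revocation routine, and the library would have to register communicators as the application creates them (for example, when it calls MPI_Comm_dup on the resilient communicator). Both points are assumptions about a possible design, not a description of what Fenix does.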
