Table "assigned_resources" may be inconsistent, leading to phoenix ignoring some nodes #177

Open
bzizou opened this issue Jul 3, 2020 · 0 comments

bzizou commented Jul 3, 2020

Symptoms:
By default, phoenix does not reboot Suspected nodes that still have jobs running. It does this by excluding any node that has a resource in the CURRENT state in the assigned_resources table. We noticed that our phoenix instance consistently ignores some nodes that no longer have any jobs running on them.
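
To illustrate the symptom, here is a minimal sketch of the exclusion rule as described above. It is not phoenix's actual code: the function name, the in-memory row format, and the resource-to-node mapping are all assumptions made for illustration, and luke18 / resource 392 are made-up data.

```python
def rebootable_nodes(suspected_nodes, assigned_rows, resource_to_node):
    """Return the suspected nodes that have no resource in the CURRENT state.

    assigned_rows: iterable of (resource_id, assigned_resource_index) pairs.
    resource_to_node: hypothetical mapping from resource_id to node name.
    """
    # A node counts as busy as soon as one of its resources has a CURRENT row.
    busy = {
        resource_to_node[resource_id]
        for resource_id, index in assigned_rows
        if index == "CURRENT" and resource_id in resource_to_node
    }
    return [node for node in suspected_nodes if node not in busy]


# Resource 391 still has a CURRENT row, so its node is skipped
# even if the job that used it is already finished.
assigned_rows = [(391, "CURRENT"), (392, "LOG")]
resource_to_node = {391: "luke17", 392: "luke18"}
print(rebootable_nodes(["luke17", "luke18"], assigned_rows, resource_to_node))
```

With such a rule, a single stale CURRENT row is enough to keep a node out of the reboot list indefinitely.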

The suspected bug:
A close look at our OAR database revealed that, for at least one job, we had the following error:
2020-05-24 00:02:14> EXIT_VALUE_OAREXEC:[bipbip 36324341] error of oarexec, exit value = 61; the job 36324341 is in Error and the node luke17 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started
The luke17 node was never rebooted by phoenix after this date.
We also found that the corresponding resource was still in the CURRENT state in the assigned_resources table, although the moldable job description for that job was already in the LOG state:

 moldable_job_id | resource_id | assigned_resource_index
-----------------+-------------+-------------------------
        36324736 |         391 | CURRENT

 moldable_id | moldable_job_id | moldable_walltime | moldable_index
-------------+-----------------+-------------------+----------------
    36324736 |        36324341 |              3600 | LOG

Removing the inconsistent CURRENT entry solved the problem.
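
For anyone who wants to check their own database, here is a hedged sketch of how such entries could be found and removed. The column names come from the query output above; the table name `moldable_job_descriptions` for the second table is an assumption, as is the rule that a CURRENT assignment whose moldable description is already LOG is always stale. A real OAR database is not SQLite; the in-memory database is only there to make the example runnable. Back up the real table before deleting anything.

```python
import sqlite3

# Assumption: an assignment is stale when it is still CURRENT although the
# matching moldable job description has already moved to LOG.
FIND_STALE = """
SELECT a.moldable_job_id, a.resource_id, m.moldable_job_id
FROM assigned_resources AS a
JOIN moldable_job_descriptions AS m
  ON m.moldable_id = a.moldable_job_id
WHERE a.assigned_resource_index = 'CURRENT'
  AND m.moldable_index = 'LOG'
"""

DELETE_STALE = """
DELETE FROM assigned_resources
WHERE assigned_resource_index = 'CURRENT'
  AND moldable_job_id IN (
      SELECT moldable_id FROM moldable_job_descriptions
      WHERE moldable_index = 'LOG')
"""


def build_demo_db():
    """Create an in-memory database holding the two rows shown in this issue."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assigned_resources (
            moldable_job_id INTEGER,
            resource_id INTEGER,
            assigned_resource_index TEXT);
        CREATE TABLE moldable_job_descriptions (
            moldable_id INTEGER,
            moldable_job_id INTEGER,
            moldable_walltime INTEGER,
            moldable_index TEXT);
        INSERT INTO assigned_resources VALUES (36324736, 391, 'CURRENT');
        INSERT INTO moldable_job_descriptions VALUES (36324736, 36324341, 3600, 'LOG');
    """)
    return conn


def find_stale_current(conn):
    """Return (moldable_id, resource_id, job_id) for each stale CURRENT row."""
    return conn.execute(FIND_STALE).fetchall()


def remove_stale_current(conn):
    """Delete the stale CURRENT rows and return how many were removed."""
    cursor = conn.execute(DELETE_STALE)
    conn.commit()
    return cursor.rowcount


conn = build_demo_db()
print(find_stale_current(conn))
```

Running only `find_stale_current` first and reviewing its output before calling `remove_stale_current` seems the safer order.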

So perhaps the "EXIT_VALUE_OAREXEC" error path, hit when launching a job, does not switch the CURRENT entry to LOG in assigned_resources?
