FAQ / What do broken_safe and broken_unsafe mean, and how do I fix them?

broken_safe and broken_unsafe are states of an RSF-1 service that has failed either to start up or to shut down correctly.

As a service is started or stopped RSF-1 executes the scripts in the directory /opt/HAC/RSF-1/etc/rc.<service>.d/* where <service> is the service name itself; for example a service named web would have the service directory /opt/HAC/RSF-1/etc/rc.web.d/. The service directory contains three types of scripts:

  • start – prefixed by an S<num>
  • stop – prefixed by a K<num>
  • panic – prefixed by a P<num>

The order in which the scripts are run is dictated by the <num> part, going from low to high. The scripts perform actions to either start or stop a service. Each script should run successfully and complete with a 0 exit code. If something goes wrong while one of these scripts is running, the script exits with a non-zero exit code (exit code definitions are in /opt/HAC/bin/rsf.sh).
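To see the order in which the scripts for a service would be run, something like the following can be used. This is a minimal sketch, not part of RSF-1: the function name list_scripts is made up for illustration, and it assumes that <num> immediately follows the prefix letter and that "low to high" means numeric order (with zero-padded numbers this is the same as plain alphabetical order).

```shell
#!/bin/sh
# Hypothetical helper, not shipped with RSF-1.
# List one type of script (S, K or P) from a service directory, ordered
# numerically by the <num> that follows the prefix letter.
# Usage: list_scripts DIR PREFIX
list_scripts() {
    dir=$1
    prefix=$2
    for path in "$dir"/"$prefix"[0-9]*; do
        [ -e "$path" ] || continue          # the glob matched nothing
        name=${path##*/}                    # strip the directory part
        # Drop the prefix letter so that sort -n sees the number first.
        printf '%s\n' "${name#"$prefix"}"
    done | sort -n | sed "s/^/$prefix/"     # sort, then put the prefix back
}
```

For the web service used as an example above, list_scripts /opt/HAC/RSF-1/etc/rc.web.d S would print the start scripts and list_scripts /opt/HAC/RSF-1/etc/rc.web.d K the stop scripts.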

If an error occurs when running the start or stop scripts, a script can indicate this in its exit code. If the failure occurred when starting a service, then the stop scripts are run to release any shared resources that the failed startup attempt may have reserved.

If the start scripts failed, and the following stop scripts succeeded, the service is marked “broken_safe”. Broken indicates that something is wrong: the service could not be started, and this should be investigated and remedied before trying to start the service on this server again. The safe part indicates that the stop scripts ran properly, so no shared resources are allocated and it is safe to try to start the service on a different server.

If an error occurs when running the stop scripts (e.g. failure to unmount a shared file system, even with a forcible unmount), the service is marked “broken_unsafe”. As before, broken indicates that some investigation is required; this time, unsafe means that shared resources may still be allocated, so it is NOT safe to try to start the service elsewhere in the cluster. If you were to mount and use the file system on another host, data corruption would most likely occur.

It is also possible for the start scripts to indicate that the service should be marked broken_unsafe immediately, without running the stop scripts. This allows for situations in which the start scripts have detected a severe error, and running the stop scripts, or allowing another server to try to start the service, may cause further problems.
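The rules above for a failed startup can be summarised as a small decision function. This is only an illustration of the logic described here, not RSF-1's implementation: the outcome labels ok, fail and unsafe and the state name running are invented for the sketch, and the real exit codes are the ones defined in /opt/HAC/bin/rsf.sh.

```shell
#!/bin/sh
# Illustrative only: models the state a service ends up in after a start
# attempt. The labels below are hypothetical, not RSF-1 exit codes.
#   START_OUTCOME: ok | fail | unsafe  (unsafe = mark broken_unsafe immediately)
#   STOP_OUTCOME:  ok | fail           (only consulted when the start failed)
# Usage: resulting_state START_OUTCOME [STOP_OUTCOME]
resulting_state() {
    start=$1
    stop=${2:-ok}
    case $start in
        ok)
            echo running ;;                 # start scripts all exited 0
        unsafe)
            echo broken_unsafe ;;           # stop scripts are not run at all
        fail)
            # Stop scripts are run to release shared resources.
            if [ "$stop" = ok ]; then
                echo broken_safe            # resources released cleanly
            else
                echo broken_unsafe          # resources may still be held
            fi ;;
        *)
            echo "unknown start outcome: $start" >&2
            return 2 ;;
    esac
}
```

For example, resulting_state fail ok prints broken_safe, while resulting_state fail fail prints broken_unsafe. A stop script failure during an ordinary shutdown leads to broken_unsafe in the same way.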

In either case, the underlying issue causing the “broken” state needs to be resolved. Check the log file /opt/HAC/RSF-1/log/rsfmon.log to discover where the error occurred and what needs to be done. Once the problem has been resolved, RSF-1 needs to be told that the service is now fixed. Do this by first issuing the command (as root):

/opt/HAC/RSF-1/bin/rsfcli -i=0 repair <servicename>

This will mark the service as having been repaired, but place it in manual mode; if any other nodes in the cluster are in automatic mode for the service in question, they will now attempt to start it. To switch the service back into automatic mode on the node where it went into a broken state, issue the command:

/opt/HAC/RSF-1/bin/rsfcli -i=0 auto <servicename>
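The whole recovery sequence could be wrapped in a couple of shell functions, roughly as follows. This is a sketch under stated assumptions: the function names and the RSFCLI and RSF_LOG variables are made up for illustration, the log filter assumes that relevant lines in rsfmon.log mention the service name, and the second function assumes that rsfcli returns a non-zero exit status when a command fails. Only the two rsfcli invocations shown above are used, and they still need to be run as root.

```shell
#!/bin/sh
# Hypothetical wrappers, not shipped with RSF-1. Paths are taken from this FAQ.
RSFCLI=${RSFCLI:-/opt/HAC/RSF-1/bin/rsfcli}
RSF_LOG=${RSF_LOG:-/opt/HAC/RSF-1/log/rsfmon.log}

# Show the most recent log lines that mention the service (default: 20).
# Assumes log lines about a service contain its name.
# Usage: recent_service_log SERVICE [LINES]
recent_service_log() {
    service=$1
    lines=${2:-20}
    grep -F -- "$service" "$RSF_LOG" | tail -n "$lines"
}

# Mark the service as repaired, then switch it back to automatic mode on
# this node. Run this only after the underlying problem has been fixed.
# Assumes rsfcli exits non-zero on failure.
# Usage: repair_and_auto SERVICE
repair_and_auto() {
    service=$1
    "$RSFCLI" -i=0 repair "$service" || return 1
    "$RSFCLI" -i=0 auto "$service"
}
```

With these defined, recent_service_log web 50 shows the last 50 log lines mentioning the web service, and repair_and_auto web runs the two commands in order, stopping if the repair step fails.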

 
