Understanding Service Status

2021-03-30 13:04

A "service instance" is the combination of an HA service and a cluster node. For example, in a 2-node cluster, each HA service has two instances, one for each node in the cluster.

Each service instance has its own status, which consists of three items:

Mode

Each service instance has a "mode" setting which can be set to either:

  • Automatic
    Automatic mode means the service instance will be automatically started when all of the following requirements are satisfied:
    • The service instance is in the stopped state
    • The service instance is not blocked
    • No other instance of this service is in an active state
  • Manual
    Manual mode means the service instance will never be automatically started
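
The automatic-start requirements above can be sketched as a single predicate. This is a minimal illustration, not the cluster software's actual implementation: the Instance class, its field names, and the can_auto_start function are hypothetical, and the set of active states is taken from the "Active States" section of this article.

```python
from dataclasses import dataclass

# Active states as listed later in this article.
ACTIVE_STATES = frozenset({
    "running", "starting", "stopping", "broken_unsafe",
    "panicked", "panicking", "aborting",
})


@dataclass
class Instance:
    """Hypothetical model of one service instance (one service on one node)."""
    node: str
    mode: str       # "automatic" or "manual"
    blocked: bool
    state: str      # e.g. "stopped", "running"


def can_auto_start(candidate, instances):
    """Return True if `candidate` meets every automatic-start requirement."""
    if candidate.mode != "automatic":
        return False  # manual instances are never started automatically
    if candidate.state != "stopped":
        return False
    if candidate.blocked:
        return False
    # No other instance of the same service may be in an active state.
    return not any(
        other.state in ACTIVE_STATES
        for other in instances
        if other is not candidate
    )


node1 = Instance("node1", "automatic", False, "stopped")
node2 = Instance("node2", "automatic", False, "stopping")
print(can_auto_start(node1, [node1, node2]))
```

In this example node1 is not eligible yet, because the instance on node2 is still "stopping", which counts as an active state.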

Blocked

The "blocked" state is similar to the service "mode" except that instead of being set by the user, it is controlled automatically by the cluster's monitoring features. A service instance can be either:

  • Blocked
    • The cluster's monitoring has detected a problem that affects this service instance.
    • This service instance will not start until the problem is resolved, even if the service is in automatic mode.
    • If a service instance becomes blocked when it is already running, the cluster may decide to stop that instance to allow it to be started on another node. This will only happen if there is another service instance in the cluster that is:
      • Unblocked
      • In automatic mode, and
      • Stopped
  • Unblocked
    • The service instance is free to start as long as it is in automatic mode.
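
The condition under which a blocked, running instance may be stopped so that the service can move to another node can be expressed as follows. This is only a sketch under assumptions: ServiceInstance and has_failover_target are hypothetical names, and the function checks the precondition described above only. Whether the cluster actually stops the instance remains its own decision.

```python
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    """Hypothetical model of one service instance."""
    node: str
    mode: str       # "automatic" or "manual"
    blocked: bool
    state: str


def has_failover_target(blocked_instance, instances):
    """Return True if a blocked, running instance could be stopped because
    another instance is unblocked, in automatic mode, and stopped."""
    if not (blocked_instance.blocked and blocked_instance.state == "running"):
        return False  # the rule only applies to a blocked, running instance
    return any(
        not other.blocked
        and other.mode == "automatic"
        and other.state == "stopped"
        for other in instances
        if other is not blocked_instance
    )


running = ServiceInstance("node1", "automatic", True, "running")
standby = ServiceInstance("node2", "manual", False, "stopped")
print(has_failover_target(running, [running, standby]))
```

Here the standby instance is in manual mode, so it does not qualify as a target and the blocked instance on node1 would be left running.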

State

Each service instance has a state. The states a service instance can be in within a ZFS cluster fall into the following categories:

Active States

When a service instance is in an "active" state, it is likely to be holding some service resources (e.g. a ZFS pool is imported or a VIP is plumbed in). Because of this, it is not safe for any other instance of the same service to be started.

For example, if a service is "stopping" on node 1, it cannot yet be started on node 2.

Active states are:

  • running
    • The service is running on this node and only this node. All service resources have been brought online. For ZFS clusters this means the main ZFS pool and any additional pools have been imported, any VIPs have been plumbed in and any configured logical units have been brought online.
  • starting
    • The service is in the process of starting on this node. Service start scripts are currently running - when they complete successfully the service instance will transition to "running".
  • stopping
    • The service is in the process of stopping on this node. Service stop scripts are currently running - when they complete successfully the service instance will transition to "stopped".
  • broken_unsafe
    • The service is in a broken state because service stop or abort scripts failed to run successfully. Some or all service resources are likely to be online so it is not safe for the cluster to start another instance of this service on another node.
    • This can be caused by one of two circumstances:
      • The service failed to stop
      • The service failed to start and began running abort scripts. The abort scripts also failed
  • panicked
    • While the service was in an active state on this node, it was seen in an active state on another node. Panic scripts have been run.
  • panicking
    • While the service was in an active state on this node, it was seen in an active state on another node. Panic scripts are running and when they are finished, the service instance will transition to "panicked".
  • aborting
    • Service start scripts failed to complete successfully. Abort scripts are running (these are the same as service stop scripts). When abort scripts complete successfully the service instance will transition to the "broken_safe" state.

Inactive States

When a service instance is in an "inactive" state, no service resources are online. That means it is safe for another instance of the service to be started elsewhere in the cluster.

Inactive states are:

  • stopped
    • The service is stopped on this node. No service resources are online.
  • broken_safe
    • This state can be the result of either of the following circumstances:
      • The service failed to start on this node but had not yet brought any service resources online. It transitioned directly to broken_safe when it failed.
      • The service failed to start after having brought some resources online. Abort scripts were run to take the resources back offline and those abort scripts finished successfully.
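
The state descriptions above imply a small set of transitions, which can be summarised in a lookup table. This is an illustrative sketch only: the event names (such as "scripts_ok") and the helper functions are hypothetical labels invented for this example, and only the transitions described in this article are included.

```python
ACTIVE_STATES = frozenset({
    "running", "starting", "stopping", "broken_unsafe",
    "panicked", "panicking", "aborting",
})
INACTIVE_STATES = frozenset({"stopped", "broken_safe"})

# (current state, hypothetical event name) -> next state
TRANSITIONS = {
    ("starting", "scripts_ok"): "running",
    ("starting", "failed_resources_online"): "aborting",
    ("starting", "failed_no_resources_online"): "broken_safe",
    ("stopping", "scripts_ok"): "stopped",
    ("stopping", "scripts_failed"): "broken_unsafe",
    ("aborting", "scripts_ok"): "broken_safe",
    ("aborting", "scripts_failed"): "broken_unsafe",
    ("panicking", "scripts_finished"): "panicked",
}


def next_state(state, event):
    """Look up the state that follows `state` when `event` occurs.
    Raises KeyError for transitions this article does not describe."""
    return TRANSITIONS[(state, event)]


def is_safe_to_start_elsewhere(state):
    """Another instance may be started only if this one is inactive."""
    if state in INACTIVE_STATES:
        return True
    if state in ACTIVE_STATES:
        return False
    raise ValueError(f"unknown state: {state!r}")


# A start that fails after bringing resources online, followed by a clean abort.
state = next_state("starting", "failed_resources_online")
state = next_state(state, "scripts_ok")
print(state, is_safe_to_start_elsewhere(state))
```

The table makes one property easy to check: every path that ends in an inactive state passes through scripts that completed successfully, or through a start that failed before any resources were brought online.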