Recovering From Segment Failures

Segment host failures usually cause multiple segment failures: all primary or mirror segment instances on the host are marked as down and nonoperational. If mirroring is not enabled and a segment goes down, the system automatically becomes nonoperational.

A segment instance can fail for several reasons, such as a host failure, network failure, or disk failure. When a segment instance fails, its status is marked as down in the Greenplum Database system catalog, and its mirror is activated in change tracking mode. To bring the failed segment instance back into operation, you must first correct the problem that caused it to fail, and then recover the segment instance in Greenplum Database using the gprecoverseg utility.
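
You can see which instances are marked down by querying the gp_segment_configuration system catalog table from the master host. For example, assuming the standard status values ('u' for up, 'd' for down):

$ psql -d postgres -c "SELECT dbid, content, role, preferred_role, mode, status, hostname FROM gp_segment_configuration WHERE status = 'd';"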

If a segment host is not recoverable and you have lost one or more segment instances with mirroring enabled, you can attempt to recover a segment instance from its mirror. See When a segment host is not recoverable. You can also recreate your Greenplum Database system from backup files. See Backing Up and Restoring Databases.

To recover with mirroring enabled

  1. Ensure you can connect to the segment host from the master host. For example:
    $ ping failed_seg_host_address
  2. Troubleshoot the problem that prevents the master host from connecting to the segment host. For example, the host machine may need to be restarted or replaced.
  3. After the host is online and you can connect to it, run the gprecoverseg utility from the master host to reactivate the failed segment instances and start synchronizing them with the active (current primary) segment instances. For example (see also the optional recovery-file sketch after this procedure):
    $ gprecoverseg
  4. The recovery process brings up the failed segments and identifies the changed files that need to be synchronized. This can take some time; wait for the process to complete.
  5. After gprecoverseg completes, the system goes into Resynchronizing mode and begins copying the changed files. This process runs in the background while the system is online and accepting database requests.
  6. When the resynchronization process completes, the system state is Synchronized. Run the gpstate utility to verify the status of the resynchronization process:
    $ gpstate -m
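
If you want to review which segment instances will be recovered before starting, gprecoverseg can write a sample recovery configuration file with the -o option; after inspecting (and, if necessary, editing) the file, pass it back with the -i option to run the recovery it describes. For example, using an illustrative file path:

$ gprecoverseg -o /home/gpadmin/recover_config
$ gprecoverseg -i /home/gpadmin/recover_config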

To return all segments to their preferred role

When a primary segment instance goes down, the mirror activates and becomes the primary segment. After running gprecoverseg, the currently active segment instance remains the primary and the failed segment becomes the mirror. The segment instances are not returned to the preferred role that they were given at system initialization time. This means the system could be in an unbalanced state if some segment hosts have more active segments than is optimal for top system performance. To check for unbalanced segments and rebalance the system, run:

$ gpstate -e
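
gpstate -e reports segment instances that are not running in their preferred role. You can also read the same information directly from the system catalog; for example, assuming the standard gp_segment_configuration columns, the following query lists instances whose current role differs from their preferred role:

$ psql -d postgres -c "SELECT content, dbid, preferred_role, role, status FROM gp_segment_configuration WHERE role <> preferred_role;"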

All segments must be online and fully synchronized to rebalance the system. Database sessions remain connected during rebalancing, but queries in progress are canceled and rolled back.

  1. Run gpstate -m to ensure all mirrors are Synchronized.
    $ gpstate -m
  2. If any mirrors are in Resynchronizing mode, wait for them to complete.
  3. Run gprecoverseg with the -r option to return the segments to their preferred roles.
    $ gprecoverseg -r
  4. After rebalancing, run gpstate -e to confirm all segments are in their preferred roles.
    $ gpstate -e

To recover from a double fault

In a double fault, both a primary segment and its mirror are down. This can occur if hardware failures on different segment hosts happen simultaneously. Greenplum Database is unavailable if a double fault occurs.

  1. Troubleshoot the problem that caused the double fault and ensure that the segment hosts are operational and are accessible from the master host.
  2. Restart Greenplum Database. The gpstop option -r stops and restarts the system.
    $ gpstop -r
  3. After the system restarts, run gprecoverseg to reactivate the failed segment instances.
    $ gprecoverseg
  4. After gprecoverseg completes, use gpstate to check the status of your mirrors and ensure the segment instances have gone from Resynchronizing mode to Synchronized mode:
    $ gpstate -m 
  5. If you still have segment instances in change tracking mode, you can run gprecoverseg with the -F option to perform a full segment recovery (a catalog check for such instances is sketched after this procedure).
    Warning: A full recovery deletes the data directory of the down segment instance before copying the data from the active (current primary) segment instance. Before performing a full recovery, ensure that the segment failure did not cause data corruption and that any host segment disk issues have been fixed.
    $ gprecoverseg -F
  6. If needed, return segment instances to their preferred role. See To return all segments to their preferred role.
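
As noted in step 5, a full recovery is only needed for segment instances that are still in change tracking mode. You can confirm which instances those are before running gprecoverseg -F; for example, assuming the mode column in gp_segment_configuration uses 'c' for change tracking:

$ psql -d postgres -c "SELECT dbid, content, role, mode, status, hostname FROM gp_segment_configuration WHERE mode = 'c';"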

To recover without mirroring enabled

  1. Ensure you can connect to the segment host from the master host. For example:
    $ ping failed_seg_host_address
  2. Troubleshoot the problem that is preventing the master host from connecting to the segment host. For example, the host machine may need to be restarted.
  3. After the host is online, verify that you can connect to it and restart Greenplum Database. The gpstop option -r stops and restarts the system:
    $ gpstop -r 
  4. Run the gpstate utility to verify that all segment instances are online:
    $ gpstate
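
For a more detailed, per-instance status report, you can also run gpstate with the -s option:

$ gpstate -s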