Updated on 2024-11-29 GMT+08:00

How Do I Restore the FE Service from a Fault?

Symptom

The FE service is unavailable for one of the following reasons: bdbje fails to start, data cannot be synchronized between FE nodes, metadata cannot be written, or no master node is available. To restore the FE service, start a new master node from the metadata in meta_dir, and then add the other FE nodes back one by one.

Procedure

  1. Stop all FE processes and stop all service access to prevent unexpected problems caused by external access during metadata restoration.
  2. Check the metadata on all FE instance nodes and identify the FE node with the latest metadata. This node will be restored first as the master node.

    1. Log in to the node where an FE instance is deployed and check the value of meta_dir in the ${BIGDATA_HOME}/FusionInsight_Doris_x.x.x/x_x_FE/etc/fe.conf file. The value is the metadata storage directory.
    2. On every FE node, check the image/image.xxxx files in the metadata storage directory. A larger number in image.xxxx indicates newer metadata. The FE node with the largest number holds the latest metadata. Use it as the first FE to be restored, that is, the master FE.
    3. Back up the metadata storage directories of all FEs.

      For example, if the metadata storage directory is /srv/BigData/doris_fe/doris-meta, run the following command:

      cp -r /srv/BigData/doris_fe/doris-meta /srv/BigData/doris_fe/doris-meta.bak
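
    The lookup in this step can be scripted. The following is a minimal sketch, not a product tool: the function names read_meta_dir and latest_image_id are hypothetical, and the sketch assumes that meta_dir is set on an uncommented line of fe.conf and that image files are named image.<number>. Run it on each FE node and compare the printed numbers. The node with the largest number holds the latest metadata.

    ```shell
    # Hypothetical helper: print the meta_dir value configured in an fe.conf file.
    # Commented-out lines are ignored; if the key appears twice, the last one wins.
    read_meta_dir() {
      sed -n 's/^[[:space:]]*meta_dir[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
    }

    # Hypothetical helper: print the largest N among the image.N files
    # in <meta_dir>/image. Prints nothing if no such file exists.
    latest_image_id() {
      ls "$1/image" 2>/dev/null |
        sed -n 's/^image\.\([0-9][0-9]*\)$/\1/p' |
        sort -n |
        tail -n 1
    }
    ```

    For example, latest_image_id "$(read_meta_dir <path to fe.conf>)" prints the image number for the local FE node.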

  3. On the node identified in 2 as holding the latest metadata (that is, the master node), add metadata_failure_recovery=true to ${BIGDATA_HOME}/FusionInsight_Doris_x.x.x/x_x_FE/etc/fe.conf. If the ${BIGDATA_HOME}/FusionInsight_Doris_x.x.x/x_x_FE_UPDATE directory exists, add the configuration to fe.conf in x_x_FE_UPDATE.
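
    If you prefer not to edit the file by hand, the parameter can be added with a small script. The following is a minimal sketch: the function name enable_recovery_flag is hypothetical, and sed -i assumes GNU sed as found on typical Linux nodes. The function can be run repeatedly without producing duplicate lines.

    ```shell
    # Hypothetical helper: make sure fe.conf contains metadata_failure_recovery=true.
    # Usage: enable_recovery_flag <path to fe.conf>
    enable_recovery_flag() {
      conf="$1"
      if grep -q '^[[:space:]]*metadata_failure_recovery[[:space:]]*=' "$conf"; then
        # The key already exists: rewrite its value in place (GNU sed assumed).
        sed -i 's/^[[:space:]]*metadata_failure_recovery[[:space:]]*=.*/metadata_failure_recovery=true/' "$conf"
      else
        # The key is missing: append it on a line of its own.
        printf '\nmetadata_failure_recovery=true\n' >> "$conf"
      fi
    }
    ```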
  4. Log in to FusionInsight Manager, choose Cluster > Services > Instances, select the FE instance whose configuration was modified in 3, and choose More > Restart Instance to restart it. Keep all other FE instances stopped.
  5. After the FE instance is started, check its status. Enter the address of the FE web UI (for example, http://192.168.67.27:29980) in the address box of a browser to connect to the FE instance.

    Log in to the FE web UI, click Playground, select default_cluster:information_schema, enter the show frontends command in the command box on the right, and click Execute. If the value in the Alive column of the current FE instance in the Results list is true, the FE is restored.

  6. On FusionInsight Manager, choose Cluster > Services > Instances, select an FE instance that is not the master node and has not been started, and choose More > Delete Instance.

  7. After the FE instance is deleted, click Add Instance to add the FE instance deleted in 6.
  8. Select the FE instance whose configuration has expired and choose More > Restart Instance to restart it. Also delete the metadata_failure_recovery parameter that was added to the fe.conf file in 3 on the node where the FE instance is deployed.
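
    The parameter can also be removed with a script. The following is a minimal sketch: the function name remove_recovery_flag and the .bak suffix are hypothetical choices, and sed -i assumes GNU sed. A copy of the file is kept so that the change can be reverted.

    ```shell
    # Hypothetical helper: delete every metadata_failure_recovery line from fe.conf.
    # The original file is preserved as <path to fe.conf>.bak.
    # Usage: remove_recovery_flag <path to fe.conf>
    remove_recovery_flag() {
      cp -p "$1" "$1.bak" &&
        sed -i '/^[[:space:]]*metadata_failure_recovery[[:space:]]*=/d' "$1"
    }
    ```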

  9. Check whether the cluster is running properly. On the FE web UI, run the following commands to check whether the FE, BE, and DBroker processes are healthy and in the same cluster. If the value of Alive is true for all instances in the Results list, the processes are healthy.

    For example, the following commands are executed in default_cluster:information_schema of Playground on the Doris web UI:

    • Run the following command to check whether all FE processes are healthy:

      show frontends;

    • Run the following command to check whether all BE processes are healthy:

      show backends;

    • Run the following command to check whether all DBroker processes are healthy:

      show broker;
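
The Alive check above can also be automated once the query results are available as text. The following is a minimal sketch: the function name check_alive is hypothetical, and it assumes the results have been saved as tab-separated text with a header row that contains a column named Alive (for example, the format a MySQL-compatible command-line client prints in batch mode). It prints every row whose Alive value is not true.

```shell
# Hypothetical helper: read tab-separated query output on stdin and print
# the rows whose Alive column is not "true".
# Exit status: 0 = all rows alive, 1 = at least one row not alive,
#              2 = no Alive column in the header, or no data rows.
check_alive() {
  awk -F '\t' '
    NR == 1 {
      # Locate the Alive column by name, because its position may vary.
      for (i = 1; i <= NF; i++) if ($i == "Alive") col = i
      if (!col) { bad = 2; exit }
      next
    }
    tolower($col) != "true" { print; bad = 1 }
    END {
      if (NR < 2 && !bad) bad = 2
      exit bad + 0
    }
  '
}
```

For example, check_alive < frontends.tsv prints nothing and returns 0 when every FE listed in the hypothetical file frontends.tsv is alive.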