
Configuring HDFS Cold and Hot Data Migration

Updated on 2022-12-14 GMT+08:00

Scenario

The hot and cold data migration tool migrates HDFS files according to a configured policy. A policy is a set of conditional or unconditional rules. If a file matches the rule set, the tool performs the configured group of operations on that file.

The hot and cold data migration tool supports the following rules and operations:

  • Migration rules:
    • Data is migrated based on the latest access time of the file.
    • Data is migrated based on the file modification time.
    • Data is migrated without conditions.

    Table 1 Rule condition tags

    Condition Tag            Description
    <age operator="lt">      Defines a condition on the file age (modification time).
    <atime operator="gt">    Defines a condition on the file access time.

    NOTE:
    For a manual migration rule, no condition is required.

  • Operations:
    • Set the storage policy to a given data tier.
    • Migrate files to another folder.
    • Configure the number of replicas for a file.
    • Delete a file.
    • Set a node label.

    Table 2 Behavior types

    MARK
      Description: Determines the data access frequency and sets a data storage policy.
      Required parameters:
        <param>
          <name>targettier</name>
          <value>STORAGE_POLICY</value>
        </param>

    MOVE
      Description: Sets the data storage policy or NodeLabel and invokes the HDFS Mover tool.
      Required parameters (you can set either or both):
        <param>
          <name>targettier</name>
          <value>STORAGE_POLICY</value>
        </param>
        <param>
          <name>targetnodelabels</name>
          <value>SOME_EXPRESSION</value>
        </param>

    SET_REPL
      Description: Configures the number of replicas for a file.
      Required parameters:
        <param>
          <name>replcount</name>
          <value>INTEGER</value>
        </param>

    MOVE_TO_FOLDER
      Description: Moves the file to the target folder. If overwrite is set to true, the target path will be overwritten.
      Required parameters (overwrite is optional; if it is not set, the default value false is used):
        <param>
          <name>target</name>
          <value>PATH</value>
        </param>
        <param>
          <name>overwrite</name>
          <value>true/false</value>
        </param>

    DELETE
      Description: Deletes a file.
      Required parameters: N/A

    A combined example of a condition tag and a behavior type appears after this list.
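
Putting a condition tag and a behavior type together, a minimal rule sketch could look as follows; the 2w threshold and the WARM tier are illustrative choices, not values prescribed by this guide:

    <rule>
      <!-- Matches files whose access time is older than two weeks (gt is the default operator) -->
      <atime operator="gt">2w</atime>
      <action>
        <type>MARK</type>
        <params>
          <param>
            <name>targettier</name>
            <value>WARM</value>
          </param>
        </params>
      </action>
    </rule>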

Configuration Description

The migration tool must be invoked periodically. Before running it, configure the following parameters in the hdfs-site.xml file on the client:

Table 3 Parameter description

dfs.auto-data-movement.policy.class
  Description: Specifies the data migration policy class.
    NOTE: Currently, only DefaultDataMovementPolicy is supported.
  Default value: com.xxx.hadoop.hdfs.datamovement.policy.DefaultDataMovementPolicy

dfs.auto.data.mover.id
  Description: Specifies the output file name of the hot and cold data migration policy.
  Default value: current system time in milliseconds

dfs.auto.data.mover.output.dir
  Description: Specifies the HDFS directory for cold and hot data migration, in which the migration tool writes its behavior status files.
  Default value: /system/datamovement
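
For reference, a minimal hdfs-site.xml sketch that sets these parameters might look as follows; the values shown simply restate the defaults from Table 3:

    <configuration>
      <!-- Data migration policy class; currently only DefaultDataMovementPolicy is supported -->
      <property>
        <name>dfs.auto-data-movement.policy.class</name>
        <value>com.xxx.hadoop.hdfs.datamovement.policy.DefaultDataMovementPolicy</value>
      </property>
      <!-- HDFS directory where the tool writes its behavior status files -->
      <property>
        <name>dfs.auto.data.mover.output.dir</name>
        <value>/system/datamovement</value>
      </property>
    </configuration>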

DefaultDataMovementPolicy uses the configuration file default-datamovement-policy.xml, in which users define all rules (based on age or access time) and the operations to perform. This file must be placed in the classpath of the client.

The following is an example of the default-datamovement-policy.xml file:

<policies>
  <policy>
    <fileset>
      <file>
        <name>/opt/data/1.txt</name>
      </file>
      <file>
        <name>/opt/data/*/subpath/</name>
        <excludes>
          <name>/opt/data/some/subpath/sub1</name>
        </excludes>
      </file>
    </fileset>
    <rules>
      <rule>
        <age>2w</age>
        <action>
          <type>MOVE</type>
          <params>
            <param>
              <name>targettier</name>
              <value>HOT</value>
            </param>
          </params>
        </action>
      </rule>
    </rules>
  </policy>
</policies>
NOTE:

Additional attributes can be added to the tags used in policies, rules, and actions. For example, the name attribute can be used to map between a user UI (for example, the Hue UI) and the tool's input XML.

Example: <policy name="Manage_File1">

The tags are described as follows:

Table 4 Tag description

<policy>
  Description: Defines a single policy. The following attributes are supported:
    • idempotent: specifies whether to continue checking the next rule after the current rule is met, when the policy contains multiple rules.
      Example: <policy name="policy2" idempotent="true">
      The default value is true, indicating that the rule and action are idempotent and evaluation continues with the next rule. If the value is false, evaluation stops at the current rule.
    • hours_allowed: specifies whether to execute policy evaluation based on the system time. The value is a comma-separated list of numbers or ranges from 0 to 23, indicating the hours of the day.
      Example: <policy name="policy1" hours_allowed="2-6,13-14">
      If the current system time falls within the configured hours, evaluation continues; otherwise, it is skipped.
      NOTE: In the input XML, only one policy is supported per file. Therefore, all rules in the file must be covered by one policy tag.
  Reusable: Yes

<fileset>
  Description: Defines the group of files or folders for each policy.
  Reusable: No (in the policy tag)

<file>
  Description: Contains one or more <name> tags that define the files and/or folders the policy applies to. The file or folder names support POSIX globs.
  Reusable: Yes (in the fileset tag)

<excludes>
  Description: Defined in the <file> tag; can contain multiple <name> tags. The files or folders named in these <name> tags are excluded from the range configured in the <file> tag. The names support POSIX globs.
  Reusable: No (in the fileset tag)

<rules>
  Description: Contains the rules defined for a policy.
  Reusable: No (in the policy tag)

<rule>
  Description: Defines a single rule.
  Reusable: Yes (in the rules tag)

<age> or <atime>
  Description: Defines the age (modification time) or access time condition for the files defined in <fileset>. The value can be in the [num]y[num]m[num]w[num]d[num]h format, where num is a number and the letters mean:
    • y: year (365 days in a year)
    • m: month (30 days in a month)
    • w: week (7 days in a week)
    • d: day
    • h: hour
  The units can be used independently or combined. For example, 1y2d indicates one year and two days, that is, 367 days. If a number has no unit letter, the default unit is day.
  NOTE: You can set the operator attribute to gt (greater than) or lt (less than) in the <age> and <atime> tags. The default operator is gt.
    Example: <age operator="lt">
  Reusable: No (in the rule tag)

<action>
  Description: Defines the action to execute when the rule is matched.
  Reusable: No (in the rule tag)

<type>
  Description: Defines the action type. The supported action types are MARK, MOVE, SET_REPL, MOVE_TO_FOLDER, and DELETE.
  Reusable: No (in the action tag)

<params>
  Description: Defines the parameters for an action.
  Reusable: No (in the action tag)

<param>
  Description: Defines a name-value parameter using the <name> and <value> tags. If multiple parameters have the same name, the first parameter value is used.
    For MARK and MOVE, the targettier parameter specifies the data storage policy to set when the age rule is met; MOVE additionally supports the targetnodelabels parameter (see Table 2).
    For both MARK and MOVE, the supported targettier values are ALL_SSD, ONE_SSD, HOT, WARM, and COLD.
  Reusable: Yes (in the params tag)
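
As a sketch of the <policy> attributes and the lt operator described above (the policy name, path, and tier are hypothetical), the following policy is evaluated only between 02:00 and 06:00 system time and stops at the first matched rule:

    <policy name="archive_policy" idempotent="false" hours_allowed="2-6">
      <fileset>
        <file>
          <!-- Hypothetical folder covered by this policy -->
          <name>/opt/data/archive/</name>
        </file>
      </fileset>
      <rules>
        <rule>
          <!-- Matches files younger than one year and two days (367 days) -->
          <age operator="lt">1y2d</age>
          <action>
            <type>MARK</type>
            <params>
              <param>
                <name>targettier</name>
                <value>WARM</value>
              </param>
            </params>
          </action>
        </rule>
      </rules>
    </policy>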

For the files or folders under the <file> tag, the FileSystem#globStatus API is used for matching; for other files or folders, the GlobPattern class (used by GlobFilter) is used. For details, see the description of the supported APIs. For example, with globStatus, /opt/hadoop/* matches everything in the /opt/hadoop folder, and /opt/*/hadoop matches all hadoop folders in the subdirectories of the /opt directory.

globStatus matches the glob pattern against each path component separately, whereas the others match the glob pattern against the full path directly.

For details about globStatus, see https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path).
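
For example, a fileset using one of the glob patterns above together with an exclusion might look like the following sketch (the excluded path is hypothetical):

    <fileset>
      <file>
        <!-- Matches everything in the /opt/hadoop folder -->
        <name>/opt/hadoop/*</name>
        <excludes>
          <!-- Hypothetical subfolder to leave untouched -->
          <name>/opt/hadoop/tmp</name>
        </excludes>
      </file>
    </fileset>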

Behavior Operation Example

  • MARK
    <action>
      <type>MARK</type>
      <params>
        <param>
          <name>targettier</name>
          <value>HOT</value>
        </param>
      </params>
    </action>
  • MOVE
    <action>
      <type>MOVE</type>
      <params>
        <param>
          <name>targettier</name>
          <value>HOT</value>
        </param>
        <param>
          <name>targetnodelabels</name>
          <value>SOME_EXPRESSION</value>
        </param>
      </params>
    </action>
  • SET_REPL
    <action>
      <type>SET_REPL</type>
      <params>
        <param>
          <name>replcount</name>
          <value>5</value>
        </param>
      </params>
    </action>
  • MOVE_TO_FOLDER
    <action>
      <type>MOVE_TO_FOLDER</type>
      <params>
        <param>
          <name>target</name>
          <value>path</value>
        </param>
        <param>
          <name>overwrite</name>
          <value>true</value>
        </param>
      </params>
    </action>
    NOTE:

The MOVE_TO_FOLDER operation only changes the file path to the target folder; it does not change the block locations. If you want to move blocks as well, configure a separate MOVE policy.

  • DELETE
    <action>
      <type>DELETE</type>
    </action>
NOTE:
  • When writing an XML file, pay attention to the configuration and sequence of behavior operations. The hot and cold data migration tool executes the rules in the sequence specified in the input XML file.
  • If you want to run only one rule based on atime/age, sort the rules in descending order of time and set the idempotent attribute to false, as shown in the sketch after this list.
  • If the DELETE operation is configured for a file set, no other rules can be configured to run after it.
  • The -fs option can be used to specify the default file system address of the client.
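
As a sketch of that descending-order pattern (the policy name, path, thresholds, and tiers are illustrative), the older-files rule comes first so that, with idempotent set to false, evaluation stops at the first match:

    <policy name="tiering_policy" idempotent="false">
      <fileset>
        <file>
          <!-- Hypothetical folder covered by this policy -->
          <name>/opt/data/logs/</name>
        </file>
      </fileset>
      <rules>
        <!-- Files older than one year are marked COLD; evaluation stops here for them -->
        <rule>
          <age>1y</age>
          <action>
            <type>MARK</type>
            <params>
              <param>
                <name>targettier</name>
                <value>COLD</value>
              </param>
            </params>
          </action>
        </rule>
        <!-- Remaining files older than four weeks are marked WARM -->
        <rule>
          <age>4w</age>
          <action>
            <type>MARK</type>
            <params>
              <param>
                <name>targettier</name>
                <value>WARM</value>
              </param>
            </params>
          </action>
        </rule>
      </rules>
    </policy>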

Audit Logs

The cold and hot data migration tool records the following information in its audit logs:

  • Tool startup status
  • Behavior type, parameter details, and status
  • Tool completion status

To enable audit logging, add the following properties to the <HADOOP_CONF_DIR>/log4j.properties file:

autodatatool.logger=INFO, ADMTRFA
autodatatool.log.file=HDFSAutoDataMovementTool.audit
log4j.logger.com.xxx.hadoop.hdfs.datamovement.HDFSAutoDataMovementTool.audit=${autodatatool.logger}
log4j.additivity.com.xxx.hadoop.hdfs.datamovement.HDFSAutoDataMovementTool.audit=false
log4j.appender.ADMTRFA=org.apache.log4j.RollingFileAppender
log4j.appender.ADMTRFA.File=${hadoop.log.dir}/${autodatatool.log.file}
log4j.appender.ADMTRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.ADMTRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.ADMTRFA.MaxBackupIndex=10
log4j.appender.ADMTRFA.MaxFileSize=64MB
NOTE:

For details, see the <HADOOP_CONF_DIR>/log4j_autodata_movment_template.properties file.
