Configuring HDFS Cold and Hot Data Migration
Scenario
The hot and cold data migration tool migrates HDFS files according to the configured policy. A policy is a set of conditional or unconditional rules. If a file matches the rule set, the tool performs a group of operations on the file.
The hot and cold data migration tool supports the following rules and operations:
- Migration rules:
  - Data is migrated based on the latest access time of the file.
  - Data is migrated based on the file modification time.
  - Data is migrated without conditions.
Table 1 Rule condition tags

Condition Tag | Description
---|---
<age operator="lt"> | Defines a condition on the file age (modification time).
<atime operator="gt"> | Defines a condition on the file access time.

For an unconditional (manual) migration rule, no condition tag is required.
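For example, a conditional rule combines one of these tags with an action. The following is an illustrative fragment in the format used by the policy file described below; the 4w value and COLD tier are chosen only for the example:

```xml
<rule>
  <!-- Matches files whose last access time is older than 4 weeks (gt is the default operator) -->
  <atime operator="gt">4w</atime>
  <action>
    <type>MARK</type>
    <params>
      <param>
        <name>targettier</name>
        <value>COLD</value>
      </param>
    </params>
  </action>
</rule>
```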
- Operations:
  - Set the storage policy to a given data tier.
  - Migrate files to another folder.
  - Configure the number of copies for a file.
  - Delete a file.
  - Set a node label.
Table 2 Behavior types

Behavior Type | Description | Required Parameters
---|---|---
MARK | Determines the data access frequency and sets a data storage policy. | <param><name>targettier</name><value>STORAGE_POLICY</value></param>
MOVE | Sets the data storage policy or NodeLabel and invokes the HDFS Mover tool. | <param><name>targettier</name><value>STORAGE_POLICY</value></param> and/or <param><name>targetnodelabels</name><value>SOME_EXPRESSION</value></param> NOTE: You can set either or both of the parameters.
SET_REPL | Configures the number of copies for a file. | <param><name>replcount</name><value>INTEGER</value></param>
MOVE_TO_FOLDER | Moves the file to the target folder. If overwrite is set to true, the target path is overwritten. | <param><name>target</name><value>PATH</value></param> and, optionally, <param><name>overwrite</name><value>true/false</value></param> NOTE: overwrite is optional. If it is not set, the default value false is used.
DELETE | Deletes a file. | N/A
Configuration Description
The migration tool must be invoked periodically. Configure the following parameters in the hdfs-site.xml file on the client:

Parameter | Description | Default Value
---|---|---
dfs.auto-data-movement.policy.class | Specifies the data migration policy class. NOTE: Currently, only DefaultDataMovementPolicy is supported. | com.xxx.hadoop.hdfs.datamovement.policy.DefaultDataMovementPolicy
dfs.auto.data.mover.id | Specifies the output file name of the hot and cold data migration policy. | Current system time (ms)
dfs.auto.data.mover.output.dir | Specifies the HDFS directory used for cold and hot data migration. The migration tool writes the behavior status file here. | /system/datamovement
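As a sketch, these parameters are set as standard Hadoop properties in hdfs-site.xml (the values shown here are the defaults from the table above):

```xml
<configuration>
  <property>
    <name>dfs.auto-data-movement.policy.class</name>
    <value>com.xxx.hadoop.hdfs.datamovement.policy.DefaultDataMovementPolicy</value>
  </property>
  <property>
    <name>dfs.auto.data.mover.output.dir</name>
    <value>/system/datamovement</value>
  </property>
</configuration>
```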
DefaultDataMovementPolicy uses the configuration file default-datamovement-policy.xml, in which users define all rules (based on age or access time) and the operations to perform. This file must be stored in the classpath of the client.
The following is an example of the default-datamovement-policy.xml file:
```xml
<policies>
  <policy>
    <fileset>
      <file>
        <name>/opt/data/1.txt</name>
      </file>
      <file>
        <name>/opt/data/*/subpath/</name>
        <excludes>
          <name>/opt/data/some/subpath/sub1</name>
        </excludes>
      </file>
    </fileset>
    <rules>
      <rule>
        <age>2w</age>
        <action>
          <type>MOVE</type>
          <params>
            <param>
              <name>targettier</name>
              <value>HOT</value>
            </param>
          </params>
        </action>
      </rule>
    </rules>
  </policy>
</policies>
```
Other attributes can be added to the tags used in policies, rules, and behavior operations. For example, name can be used to manage the mapping between the user UI (for example, Hue UI) and tool input XML.
Example: <policy name="Manage_File1">
The tags are described as follows:
Tag | Description | Reusable or Not
---|---|---
<policy> | Defines a single policy. | Yes
<fileset> | Defines a group of files or folders for each policy. | No (in the policy tag)
<file> | Contains one or more <name> tags that define files and/or folders. The file or folder name supports POSIX globs. | Yes (in the fileset tag)
<excludes> | Defined in the <file> tag; can contain multiple <name> tags. Files or folders listed in its <name> tags are excluded from the range configured in the <file> tag. The file or folder name supports POSIX globs. | No (in the file tag)
<rules> | Defines the set of rules for a policy. | No (in the policy tag)
<rule> | Defines a single rule. | Yes (in the rules tag)
<age> or <atime> | Defines the age (modification time) or access time condition for the files defined in <fileset>. The value can be in the [num]y[num]m[num]w[num]d[num]h format, where num is a number and the units are: y = year (365 days), m = month (30 days), w = week (7 days), d = day, h = hour. The units can be used independently or combined; for example, 1y2d means one year and two days, that is, 367 days. If a number has no unit, the default unit is day. NOTE: You can set the operator attribute to gt (greater than, the default) or lt (less than) in the <age> and <atime> tags. Example: <age operator="lt"> | No (in the rule tag)
<action> | Defines the action to be executed if the rule is matched. | No (in the rule tag)
<type> | Defines the action type: MARK, MOVE, SET_REPL, MOVE_TO_FOLDER, or DELETE. | No (in the action tag)
<params> | Defines the parameters of each action. | No (in the action tag)
<param> | Defines a name-value parameter using the <name> and <value> tags. For MARK and MOVE, the targettier parameter specifies the data storage policy to apply when the rule is met. If multiple parameters have the same name, the first value is used. For both MARK and MOVE, the supported targettier values are ALL_SSD, ONE_SSD, HOT, WARM, and COLD. | Yes (in the params tag)
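To make the age/atime value format concrete, the following is a minimal Python sketch (not part of the tool; the function name is hypothetical) that converts an expression such as 1y2d into hours using the unit definitions above:

```python
import re

# Hours per unit, as defined by the format: y = 365 days, m = 30 days,
# w = 7 days, d = 1 day, h = 1 hour.
UNIT_HOURS = {"y": 365 * 24, "m": 30 * 24, "w": 7 * 24, "d": 24, "h": 1}

def age_to_hours(age: str) -> int:
    """Parse an age expression like '1y2d' or '2w' into total hours.

    A bare number (no unit letter) defaults to days, per the format
    description. This simplified sketch does not validate unit order.
    """
    if age.isdigit():
        return int(age) * 24
    total = 0
    for num, unit in re.findall(r"(\d+)([ymwdh])", age):
        total += int(num) * UNIT_HOURS[unit]
    return total

# "1y2d" is one year and two days, i.e. 367 days
print(age_to_hours("1y2d") // 24)  # 367
```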
For files or folders under the <file> tag, the FileSystem#globStatus API is used; for other files or folders, the GlobPattern class (used by GlobFilter) is used. For details, see the description of the supported APIs. For example, with globStatus, /opt/hadoop/* matches everything in the /opt/hadoop folder, and /opt/*/hadoop matches all hadoop folders in subdirectories of the /opt directory.
For globStatus, the glob pattern is matched against each path component separately; for the other classes, the pattern is matched against the whole path directly.
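The per-component behavior can be illustrated with plain Python (using fnmatch as a stand-in for the Hadoop glob classes; the helper function is hypothetical):

```python
from fnmatch import fnmatchcase

def match_per_component(pattern: str, path: str) -> bool:
    """globStatus-style matching: apply the glob to each path component.

    A single '*' therefore matches exactly one component, so
    '/opt/*/hadoop' matches '/opt/foo/hadoop' but not '/opt/a/b/hadoop'.
    """
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, pat)
               for pat, part in zip(pattern_parts, path_parts))

print(match_per_component("/opt/*/hadoop", "/opt/foo/hadoop"))   # True
print(match_per_component("/opt/*/hadoop", "/opt/a/b/hadoop"))   # False
```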
Behavior Operation Example
- MARK
```xml
<action>
  <type>MARK</type>
  <params>
    <param>
      <name>targettier</name>
      <value>HOT</value>
    </param>
  </params>
</action>
```
- MOVE
```xml
<action>
  <type>MOVE</type>
  <params>
    <param>
      <name>targettier</name>
      <value>HOT</value>
    </param>
    <param>
      <name>targetnodelabels</name>
      <value>SOME_EXPRESSION</value>
    </param>
  </params>
</action>
```
- SET_REPL
```xml
<action>
  <type>SET_REPL</type>
  <params>
    <param>
      <name>replcount</name>
      <value>5</value>
    </param>
  </params>
</action>
```
- MOVE_TO_FOLDER
```xml
<action>
  <type>MOVE_TO_FOLDER</type>
  <params>
    <param>
      <name>target</name>
      <value>path</value>
    </param>
    <param>
      <name>overwrite</name>
      <value>true</value>
    </param>
  </params>
</action>
```
The MOVE_TO_FOLDER operation only changes the file path to the target folder and does not change the block locations. If you also want to move blocks, configure a separate MOVE policy.
- DELETE
```xml
<action>
  <type>DELETE</type>
</action>
```
- When writing the XML file, pay attention to the configuration and order of behavior operations: the hot and cold data migration tool executes the rules in the order specified in the input XML file.
- If you want only one atime/age-based rule to run, sort the rules in descending order of time and set the idempotent attribute to false.
- If a DELETE operation is configured for a file set, no other rules can be configured to run after the DELETE operation.
- The -fs option can be used to specify the default file system address of the client.
Audit Logs
The cold and hot data migration tool records audit logs for the following:
- Tool startup status
- Behavior type, parameter details, and status
- Tool completion status
To enable audit logging, add the following properties to the <HADOOP_CONF_DIR>/log4j.properties file:
```
autodatatool.logger=INFO, ADMTRFA
autodatatool.log.file=HDFSAutoDataMovementTool.audit
log4j.logger.com.xxx.hadoop.hdfs.datamovement.HDFSAutoDataMovementTool.audit=${autodatatool.logger}
log4j.additivity.com.xxx.hadoop.hdfs.datamovement.HDFSAutoDataMovementTool-audit=false
log4j.appender.ADMTRFA=org.apache.log4j.RollingFileAppender
log4j.appender.ADMTRFA.File=${hadoop.log.dir}/${autodatatool.log.file}
log4j.appender.ADMTRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.ADMTRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.ADMTRFA.MaxBackupIndex=10
log4j.appender.ADMTRFA.MaxFileSize=64MB
```
For details, see the <HADOOP_CONF_DIR>/log4j_autodata_movment_template.properties file.