How to Build DataX

System Requirements

Quick Start

Tool Deployment

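  • Method 1: Download the DataX toolkit directly: DataX download page

    After downloading, extract it to a local directory and enter the bin directory to run a sync job: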

    $ cd {YOUR_DATAX_HOME}/bin
    $ python datax.py {YOUR_JOB.json}

    Self-check script: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json

  • Method 2: Download the DataX source code and build it yourself: DataX source code

    1. Download the DataX source code: $ git clone git@github.com:alibaba/DataX.git

    2. Package it with Maven:

      $ cd {DataX_source_code_home}
      $ mvn -U clean package assembly:assembly '-Dmaven.test.skip=true'

      When packaging succeeds, the log shows:

      [INFO] BUILD SUCCESS
      [INFO] -----------------------------------------------------------------
      [INFO] Total time: 08:12 min
      [INFO] Finished at: 2015-12-13T16:26:48+08:00
      [INFO] Final Memory: 133M/960M
      [INFO] -----------------------------------------------------------------

      After a successful build, the DataX package is located at {DataX_source_code_home}/target/datax/datax/ and has the following structure:

      $ cd {DataX_source_code_home}
      $ ls ./target/datax/datax/
      bin conf job lib log log_perf plugin script tmp

    3. Running the skip-unit-tests command mvn package -Dmaven.test.skip=true may fail with Unknown lifecycle phase ".test.skip=true", as shown below:

      [ERROR] Unknown lifecycle phase ".test.skip=true". You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>. Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-clean, clean, post-clean, pre-site, site, post-site, site-deploy. -> [Help 1]
      [ERROR]
      [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
      [ERROR] Re-run Maven using the -X switch to enable full debug logging.
      [ERROR]
      [ERROR] For more information about the errors and possible solutions, please read the following articles:
      [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/LifecyclePhaseNotFoundException

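      This happens because -Dmaven.test.skip=true is not passed to Maven as a single argument (typically seen in Windows PowerShell, which splits it at the first dot, leaving ".test.skip=true" to be read as a lifecycle phase). Wrapping the argument in single quotes fixes it: mvn package '-Dmaven.test.skip=true'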
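
The quoting issue above can be illustrated with a short Python sketch. It is not part of DataX or Maven, and the helper name build_mvn_command is hypothetical. Passing each argument as a separate list element means no shell gets a chance to split -Dmaven.test.skip=true, and shlex.split shows how a POSIX-style shell reads the single-quoted form.

```python
import shlex


def build_mvn_command(skip_tests=True):
    """Return the Maven build command as an argument list (hypothetical helper)."""
    cmd = ["mvn", "-U", "clean", "package", "assembly:assembly"]
    if skip_tests:
        # One list element = one argument, so nothing can split it at the dot.
        cmd.append("-Dmaven.test.skip=true")
    return cmd


# POSIX-style parsing of the quoted command keeps the option in one piece.
quoted = "mvn package '-Dmaven.test.skip=true'"
print(shlex.split(quoted))
print(build_mvn_command())
```

Assuming mvn is on the PATH, the list could be handed to subprocess.run(build_mvn_command(), cwd=source_home, check=True), where source_home stands for {DataX_source_code_home}. On Windows the launcher is usually mvn.cmd, so the first element may need adjusting.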

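To confirm that the package from Method 2 was assembled completely, a small check along these lines may help. It is only a sketch: the helper name missing_entries is hypothetical, and the expected directory names are simply those in the ls output of {DataX_source_code_home}/target/datax/datax/ shown earlier.

```python
import os

# Directory names taken from the `ls ./target/datax/datax/` output in the article.
EXPECTED_DIRS = ["bin", "conf", "job", "lib", "log", "log_perf", "plugin", "script", "tmp"]


def missing_entries(datax_home, expected=EXPECTED_DIRS):
    """Return the expected entries that are absent from datax_home.

    A non-existent datax_home is treated as an empty directory,
    so every expected name is reported as missing.
    """
    present = set(os.listdir(datax_home)) if os.path.isdir(datax_home) else set()
    return [name for name in expected if name not in present]
```

For example, missing_entries("target/datax/datax") run from the source directory returns an empty list when every directory is present. Some of these directories (such as log) may only appear after a first run, so a non-empty result is a hint to investigate rather than proof of a failed build.
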
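Configuration example: read data from a stream and print it to the console

  • Step 1: Create the job configuration file (JSON format)

    You can view a configuration template with the command: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}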

    $ cd {YOUR_DATAX_HOME}/bin
    $ python datax.py -r streamreader -w streamwriter
    DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
    Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
    Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

    Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

    Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
    to run the job.

    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "column": [],
                            "sliceRecordCount": ""
                        }
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {
                            "encoding": "",
                            "print": true
                        }
                    }
                }
            ],
            "setting": {
                "speed": {
                    "channel": ""
                }
            }
        }
    }

    Based on the template, fill in the JSON as follows:

    #stream2stream.json
    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": 10,
                            "column": [
                                {
                                    "type": "long",
                                    "value": "10"
                                },
                                {
                                    "type": "string",
                                    "value": "hello,你好,世界-DataX"
                                }
                            ]
                        }
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {
                            "encoding": "UTF-8",
                            "print": true
                        }
                    }
                }
            ],
            "setting": {
                "speed": {
                    "channel": 5
                }
            }
        }
    }

  • Step 2: Start DataX

    $ cd {YOUR_DATAX_DIR_BIN}
    $ python datax.py ./stream2stream.json

    When the sync finishes, the log shows:

    ...
    2015-12-17 11:20:25.263 [job-0] INFO JobContainer -
    任务启动时刻 : 2015-12-17 11:20:15
    任务结束时刻 : 2015-12-17 11:20:25
    任务总计耗时 : 10s
    任务平均流量 : 205B/s
    记录写入速度 : 5rec/s
    读出记录总数 : 50
    读写失败总数 : 0
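
The figures in this summary can be cross-checked against the job configuration. In the log, 任务总计耗时 is the total elapsed time (10s), 记录写入速度 is the record write speed (5rec/s), 读出记录总数 is the total number of records read (50), and 读写失败总数 is the number of read/write failures (0). The sketch below assumes that each channel emits sliceRecordCount records and that the reported speed is the total divided by the elapsed seconds, rounded down; both assumptions match this log but are not confirmed by the article. The function name expected_summary is hypothetical.

```python
def expected_summary(channels, slice_record_count, elapsed_seconds):
    """Estimate total records and records/second for a streamreader job.

    Assumes every channel produces slice_record_count records and that
    the speed is reported as an integer (floor division).
    """
    total_records = channels * slice_record_count
    records_per_second = total_records // elapsed_seconds
    return total_records, records_per_second


# Values from stream2stream.json (channel 5, sliceRecordCount 10) and the 10s run.
total, speed = expected_summary(channels=5, slice_record_count=10, elapsed_seconds=10)
print(total, speed)  # 50 5
```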

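The job file from Step 1 can also be generated from a script rather than edited by hand, which helps avoid JSON syntax slips. The following is a minimal sketch, not something DataX provides: build_stream_job and write_job are hypothetical helpers, and the #stream2stream.json line from the listing is treated as a file-name label rather than part of the JSON.

```python
import json


def build_stream_job(record_count=10, channels=5):
    """Build the stream-to-stream job from Step 1 as a Python dict."""
    reader = {
        "name": "streamreader",
        "parameter": {
            "sliceRecordCount": record_count,
            "column": [
                {"type": "long", "value": "10"},
                {"type": "string", "value": "hello,你好,世界-DataX"},
            ],
        },
    }
    writer = {
        "name": "streamwriter",
        "parameter": {"encoding": "UTF-8", "print": True},
    }
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": {"channel": channels}},
        }
    }


def write_job(path, job):
    """Write the job as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(job, handle, ensure_ascii=False, indent=4)
```

Calling write_job("stream2stream.json", build_stream_job()) in the bin directory produces a file that can then be passed to python datax.py ./stream2stream.json as in Step 2.
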
https://flepeng.github.io/044-DataX-datax-编译方式/
Author: Lepeng
Published: March 6, 2021