Monday, February 10, 2014

Get Oozie Job List Via Command Line

You may want to list Oozie jobs that match given criteria from the command line, but the commands Oozie provides may not work well for this. For Oozie version 2.0 or above, you can combine the -jobtype option with -filter to get a list (refer to http://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html#Checking_the_Status_of_multiple_Coordinator_Jobs). For older versions, however, we need a workaround.

Luckily, with older versions we can still get a list of Oozie jobs by using curl against the Oozie URLs. For example, the following command gives a list of workflow jobs:

$ curl http://localhost:11000/oozie/v1/jobs

(You can replace localhost above with the hostname of any valid Oozie node.)

The above command returns a JSON string on a single line that records a list of workflow jobs (the line will be extremely long if there are dozens of job records). Due to pagination, this list may be incomplete; you can specify the maximum number of jobs to list (assuming there are no more than 1000 workflows):

$ curl 'http://localhost:11000/oozie/v1/jobs?len=1000'

(I added single quotes (') to prevent the shell from treating the '?' in the URL string as a wildcard.)
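
If there may be more jobs than you want to fetch in one response, the list can also be requested page by page. The sketch below assumes that the offset parameter (1-based) described in the Oozie web services documentation is accepted alongside len; this may not hold for very old versions. page_urls is a hypothetical helper name, and the total would come from the "total" field of a first request.

```shell
# Hypothetical helper: print one jobs URL per page, enough to cover `total` jobs.
# Assumes the server accepts a 1-based `offset` parameter together with `len`.
page_urls() {
    base=$1     # e.g. http://localhost:11000/oozie
    total=$2    # total number of jobs reported by the server
    len=$3      # page size
    offset=1
    while [ "$offset" -le "$total" ]; do
        printf '%s/v1/jobs?offset=%s&len=%s\n' "$base" "$offset" "$len"
        offset=$((offset + len))
    done
}

# Usage (assumes an Oozie server on localhost:11000):
#   page_urls http://localhost:11000/oozie 120 50 | while read -r url; do curl -s "$url"; echo; done
```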

If we only need the Oozie workflow job IDs, use this command:

$ curl 'http://localhost:11000/oozie/v1/jobs?len=1000' | sed 's/id\[/\n/g; s/W\]/W\n/g' | grep "oozie-oozi-W$"

0006030-140210061306533-oozie-oozi-W
0006031-140210061306533-oozie-oozi-W
0006032-140210061306533-oozie-oozi-W
0006033-140210061306533-oozie-oozi-W
0006034-140210061306533-oozie-oozi-W
...
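
The sed expressions above depend on the exact layout of the JSON output. As an alternative sketch, the IDs can be matched directly by their shape. This assumes the IDs look like the sample output above (digits, a dash, digits, then -oozie-oozi-W) and that your grep supports -o, as GNU and BSD grep do. extract_workflow_ids is a hypothetical helper name.

```shell
# Hypothetical helper: print each distinct workflow job ID found on stdin.
# Assumes IDs have the form <digits>-<digits>-oozie-oozi-W, as in the sample output.
extract_workflow_ids() {
    grep -o '[0-9]\{1,\}-[0-9]\{1,\}-oozie-oozi-W' | sort -u
}

# Usage (assumes an Oozie server on localhost:11000):
#   curl -s 'http://localhost:11000/oozie/v1/jobs?len=1000' | extract_workflow_ids
```

sort -u removes duplicates in case the same ID appears in more than one field of the JSON.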


Then we can conveniently process the output in a script, for example to kill all workflows:

$ curl 'http://localhost:11000/oozie/v1/jobs?len=1000' | sed 's/id\[/\n/g; s/W\]/W\n/g' | grep "oozie-oozi-W$" | while read -r job_id; do oozie job -kill "$job_id"; done
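
Since killing every workflow is hard to undo, it may be worth previewing the commands first. The following is a minimal sketch of such a wrapper; kill_jobs and the DRY_RUN variable are hypothetical names, and it assumes the oozie client can already find the server (for example through the OOZIE_URL environment variable).

```shell
# Hypothetical helper: read job IDs from stdin, one per line.
# By default it only prints the kill commands; set DRY_RUN=0 to really run them.
kill_jobs() {
    while read -r job_id; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "oozie job -kill $job_id"
        else
            oozie job -kill "$job_id"
        fi
    done
}

# Usage: pipe any list of job IDs into it, e.g.
#   printf '%s\n' 0006030-140210061306533-oozie-oozi-W | kill_jobs
#   printf '%s\n' 0006030-140210061306533-oozie-oozi-W | DRY_RUN=0 kill_jobs
```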

To list only the RUNNING workflow jobs, use filter=status%3DRUNNING (%3D is the URL-encoded form of '='):

$ curl 'http://localhost:11000/oozie/v1/jobs?len=1000&filter=status%3DRUNNING'
{"total":2, ...

At the beginning of the output line, you can see how many workflow jobs are running. After that, the JSON string shows the detailed information for each job.
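
If only the count is needed, it can be pulled out of the JSON without reading the whole line. This is a rough sketch that assumes the response contains a single "total" field written as "total":N, as in the output above; job_total is a hypothetical helper name.

```shell
# Hypothetical helper: print the number in the "total" field of the JSON on stdin.
# Assumes the field appears as "total":N (optionally with spaces after the colon).
job_total() {
    sed -n 's/.*"total": *\([0-9][0-9]*\).*/\1/p'
}

# Usage (assumes an Oozie server on localhost:11000):
#   curl -s 'http://localhost:11000/oozie/v1/jobs?len=1000&filter=status%3DRUNNING' | job_total
```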

If you want to get other types of jobs, you can add jobtype= to the URL. E.g., the following command gives the list of coordinator job IDs:

$ curl 'http://localhost:11000/oozie/v1/jobs?jobtype=coord&len=2000' | sed 's/id\[/\n/g; s/C\]/C\n/g' | grep "oozie-oozi-C$"

(Note that coordinator job IDs end with 'C'.)