Base of Batch Processes

For Those Who Are Not Big Players

@joker1007 (Repro inc. CTO)

self.inspect

My gems, or My contributions


We're seriously hiring now :rocket:

Batch processes are getting more and more complex.

Even if we are not an IT giant, this is inevitable.

Main purposes of batch processes

  • aggregate logs
  • settlement
  • data deletion
  • backup

etc, etc

And some batch processes depend on others

Requirements of Complicated Batch Processing

  • Define and visualize job dependencies
    • Fork and merge job route
    • namely DAG
  • Concurrent execution
  • Control concurrency
  • Retry any job
  • Reusable jobnets

A batch process is a DAG

DAG = Directed Acyclic Graph.

Directed acyclic graph - Wikipedia

library tsort (Ruby 2.3.0)
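
As a small illustration of why a DAG fits batch jobs, the sketch below uses Ruby's standard `tsort` library to turn a dependency table into a valid execution order. The job names and the `dependencies` hash are made up for this example; this is not how Rukawa itself is implemented.

```ruby
require "tsort"

# Hypothetical job names; each key lists the jobs it depends on.
dependencies = {
  job1: [],
  job2: [:job1],
  job3: [:job1],
  job4: [:job2, :job3],
}

# TSort.tsort takes two callables: one that yields every node, and one
# that yields the children (here: the dependencies) of a given node.
each_node  = ->(&block) { dependencies.each_key(&block) }
each_child = ->(node, &block) { dependencies.fetch(node).each(&block) }

# Dependencies always come before the jobs that need them.
order = TSort.tsort(each_node, each_child)
puts order.inspect
```

If the table contained a cycle, `TSort.tsort` would raise `TSort::Cyclic`, which is exactly the "acyclic" part of the requirement.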

Rake is sometimes painful

  • Hard to control concurrent execution
  • Hard to understand complicated job dependencies
  • Cannot resume jobs freely
  • Hard to ignore dependency even when necessary

To solve these problems, I developed Rukawa

Why not Luigi, Airflow, or Azkaban?

Because we're Rubyists :trollface:

And Rails applications can use it seamlessly.

Sample Job

class SampleJob < Rukawa::Job
  def run
    sleep rand(5)
    ExecuteLog.store[self.class] = Time.now
  end
end

class Job1 < SampleJob
  set_description "Job1 description body"
end
class Job2 < SampleJob
  def run
    raise "job2 error"
  end
end
class Job3 < SampleJob
end
class Job4 < SampleJob
  set_dependency_type :one_success
end

Sample JobNet

class SampleJobNet < Rukawa::JobNet
  class << self
    def dependencies
      {
        Job1 => [],
        Job2 => [Job1], Job3 => [Job1],
        Job4 => [Job2, Job3],
      }
    end
  end
end

Job implementations are kept separate from job dependencies

Users only need to inherit from the base class and implement run

DEMO

Features of Rukawa

  • Visualize dependency (Graphviz)
  • Change dependency type
    • all_success, one_success, all_failed, and ...
    • inspired by Airflow
  • Define resource_count (like Semaphore)
  • Visualize results (Graphviz and colored node)
  • Variables from cli options
  • ActiveJob Integration
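
To make the dependency types concrete, here is a minimal sketch of how such rules could be evaluated. It is a hypothetical evaluator, not Rukawa's code: the `rules` table, the `runnable` lambda, and the state names `:finished` and `:error` are all assumptions for illustration.

```ruby
# Hypothetical evaluator: decides whether a job may run, given the final
# states (:finished or :error) of the jobs it depends on.
rules = {
  all_success: ->(states) { states.all? { |s| s == :finished } },
  one_success: ->(states) { states.any? { |s| s == :finished } },
  all_failed:  ->(states) { states.all? { |s| s == :error } },
}

runnable = ->(type, parent_states) { rules.fetch(type).call(parent_states) }

# In the sample JobNet, Job4 depends on Job2 (which raises) and Job3.
parents = [:error, :finished]
with_all = runnable.call(:all_success, parents) # => false
with_one = runnable.call(:one_success, parents) # => true
```

Under these assumptions, Job4 would be skipped with the default-style `all_success` rule but would still run with `one_success`, which matches the intent of `set_dependency_type :one_success` in the sample.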

What Rukawa focuses on

  • Creating DAG
  • Simple Ruby Class Interface

What Rukawa does not focus on

  • Implementing a job queue
  • Implementing concurrency control
  • Distributed execution on multiple nodes
    • Rukawa currently runs as a single process
  • GUI or Web UI
  • Cron-like scheduler

Concurrent execution

I don't want to implement the foundation of concurrent execution myself, because it is very hard. That technology is beyond an ordinary human being.

Use concurrent-ruby

Dataflow

Join several Futures, then continue processing.

require "concurrent"

a = Concurrent.dataflow { 1 }
b = Concurrent.dataflow { 2 }
# c runs only after a and b have resolved; their values are passed in.
c = Concurrent.dataflow(a, b) { |av, bv| av + bv }

I use dataflow as a simple job queue.
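
The idea of wiring a job DAG onto "wait for upstream, then run" units can be sketched without the gem. The example below is a stdlib-only stand-in using plain threads, not concurrent-ruby's implementation and not Rukawa's; the job names and the `dependencies` hash are hypothetical.

```ruby
# One Thread per job; each job first waits on (joins) the threads of the
# jobs it depends on, which mimics chaining dataflows.
dependencies = {
  job1: [],
  job2: [:job1],
  job3: [:job1],
  job4: [:job2, :job3],
}

log = Queue.new # thread-safe record of completion order
threads = {}

# The insertion order above is already a valid topological order, so every
# dependency thread exists before the job that waits on it.
dependencies.each do |name, deps|
  parents = deps.map { |d| threads.fetch(d) }
  threads[name] = Thread.new do
    parents.each(&:join) # block until all upstream jobs are done
    log << name          # stand-in for the real work of the job
    name
  end
end

threads.each_value(&:join)
finished = Array.new(log.size) { log.pop }
```

`job2` and `job3` may finish in either order, but `job1` always comes first and `job4` always comes last.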

Execute on ThreadPool

pool = Concurrent::FixedThreadPool.new(5)

Concurrent.dataflow_with(pool, *depend_dataflows) do |*results|
  # do something
end
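
For readers unfamiliar with fixed-size pools, the following stdlib-only sketch shows the property that matters here: at most `pool_size` tasks run at once, and the rest wait in a queue. It is an illustration of the concept, not the internals of `Concurrent::FixedThreadPool`; all names in it are made up.

```ruby
pool_size = 2
tasks = Queue.new
lock = Mutex.new
running = 0
peak = 0
results = []

6.times do |i|
  tasks << lambda do
    lock.synchronize do
      running += 1
      peak = [peak, running].max
    end
    sleep 0.01 # stand-in for real work
    lock.synchronize do
      running -= 1
      results << i * 2
    end
  end
end
tasks.close # once drained, pop returns nil and the workers exit

workers = Array.new(pool_size) do
  Thread.new do
    while (task = tasks.pop)
      task.call
    end
  end
end
workers.each(&:join)
```

After the workers finish, `peak` never exceeds `pool_size`, even though six tasks were submitted.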

Hand the hard work off to concurrent-ruby

My workload gets lighter :smile:

Distributed execution

  • It is very hard to develop properly
  • It requires defining how to use a datastore outside of Ruby
  • I don't focus on it

We have ActiveJob

  • Many implementations already exist
  • I only wrote a simple wrapper around ActiveJob
  • Rukawa does only a few things
    • Define dependency
    • Kick ActiveJob
    • Track job status

Pragmatic attitude

  • Use Rundeck as the scheduler
  • I don't use Ruby for large-scale distributed computation
    • Hadoop, Spark, BigQuery, Redshift
  • What I really need is to kick off other job frameworks
    • So Ruby's GIL is not a serious performance problem

It is important to build for yourself a compact tool that does what you really need,

and to rely on the ecosystem as much as possible,

in order to make effective use of limited resources.

Workflow engine

Please give Rukawa a try
