Base of Batch Processes
For Not-So-Big Players
@joker1007 (Repro inc. CTO)
self.inspect
My gems and my contributions
We're seriously hiring now
Batch processes are getting more and more complex.
Even if we are not an IT giant, this is inevitable.
Main purposes of batch processing
- aggregate logs
- settlement
- data deletion
- backup
etc, etc
And some batch processes depend on others
Requirements of Complicated Batches
- Define, visualize dependency of jobs
- Fork and merge job route
- namely DAG
- Concurrent execution
- Control concurrency
- Retry any jobs
- Re-usable jobnet
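As an illustration of the "control concurrency" requirement, a concurrency cap can be sketched with a counting semaphore built only from Ruby's stdlib. This Semaphore class is a hypothetical sketch for explanation, not part of any of the tools discussed here:

```ruby
# Hypothetical sketch: a counting semaphore built from stdlib
# SizedQueue, limiting how many jobs run at the same time.
class Semaphore
  def initialize(count)
    @slots = SizedQueue.new(count)
  end

  # Blocks while all slots are taken, runs the block, then frees a slot.
  def acquire
    @slots.push(true)
    yield
  ensure
    @slots.pop
  end
end

sem   = Semaphore.new(2)
mutex = Mutex.new
count = 0
peak  = 0

threads = 6.times.map do
  Thread.new do
    sem.acquire do
      mutex.synchronize { count += 1; peak = [peak, count].max }
      sleep 0.02 # simulate job work
      mutex.synchronize { count -= 1 }
    end
  end
end
threads.each(&:join)

peak # never exceeds 2
```

Six jobs contend for two slots, so at most two run concurrently at any moment.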
Rake is sometimes painful
- Hard to control concurrent execution
- Hard to understand complicated job dependencies
- Cannot resume jobs freely
- Hard to ignore dependencies even when necessary
To solve these problems, I developed Rukawa
Why not Luigi, Airflow, or Azkaban?
Because we're Rubyists.
And Rails applications can use this seamlessly.
Sample Job
class SampleJob < Rukawa::Job
  def run
    sleep rand(5)
    ExecuteLog.store[self.class] = Time.now
  end
end

class Job1 < SampleJob
  set_description "Job1 description body"
end

class Job2 < SampleJob
  def run
    raise "job2 error"
  end
end

class Job3 < SampleJob
end

class Job4 < SampleJob
  set_dependency_type :one_success
end
Sample JobNet
class SampleJobNet < Rukawa::JobNet
  class << self
    def dependencies
      {
        Job1 => [],
        Job2 => [Job1], Job3 => [Job1],
        Job4 => [Job2, Job3],
      }
    end
  end
end
Separates the actual job implementation from the job dependencies
Users only need to inherit the base class and implement run
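As a hypothetical sketch (not Rukawa's actual internals): a dependencies hash like the one above already describes a DAG, so a valid execution order can be derived from it with a depth-first topological sort. The :job1..:job4 symbols stand in for the job classes:

```ruby
# A dependencies hash in the same shape as SampleJobNet#dependencies.
deps = {
  job1: [],
  job2: [:job1], job3: [:job1],
  job4: [:job2, :job3],
}

# Depth-first topological sort: visit each job's dependencies
# before appending the job itself to the execution order.
order = []
visit = lambda do |job|
  next if order.include?(job)
  deps[job].each { |dep| visit.call(dep) }
  order << job
end
deps.keys.each { |job| visit.call(job) }

order # => [:job1, :job2, :job3, :job4]
```

Since Ruby hashes preserve insertion order, the result here is deterministic.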
Features of Rukawa
- Visualize dependency (Graphviz)
- Change dependency type
- all_success, one_success, all_failed, and ...
- inspired by Airflow
- Define resource_count (like a semaphore)
- Visualize results (Graphviz and colored node)
- Variables from CLI options
- ActiveJob Integration
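The dependency types can be understood as predicates over the parent jobs' results. The following is a hypothetical sketch of that idea, not Rukawa's actual code; the runnable? helper and the :success/:failed symbols are illustrative assumptions:

```ruby
# Hypothetical sketch: given the results of a job's parents,
# decide whether the job may run under each dependency type.
def runnable?(dependency_type, parent_results)
  case dependency_type
  when :all_success then parent_results.all? { |r| r == :success }
  when :one_success then parent_results.any? { |r| r == :success }
  when :all_failed  then parent_results.all? { |r| r == :failed }
  end
end

runnable?(:all_success, [:success, :failed])  # => false
runnable?(:one_success, [:failed, :success])  # => true
runnable?(:all_failed,  [:failed, :failed])   # => true
```

This is why Job4 above, with set_dependency_type :one_success, can still run even though Job2 raises an error, as long as Job3 succeeds.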
What Rukawa focuses on
- Creating DAG
- Simple Ruby Class Interface
What Rukawa does not focus on
- Implementing a job queue
- Implementing concurrency control
- Distributed execution on multiple nodes
- Rukawa is currently a single process
- No GUI, no Web UI
- No cron-like scheduler
Concurrent execution
I don't want to implement the foundations of concurrent execution myself.
Because it is very hard.
It is beyond the reach of ordinary human beings.
Dataflow
Join some Futures, and continue processing.
a = Concurrent.dataflow { 1 }
b = Concurrent.dataflow { 2 }
c = Concurrent.dataflow(a, b) { |av, bv| av + bv }
c.value # => 3
I use dataflow as simple job queue.
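The dataflow semantics can be mimicked with plain stdlib threads, which may help to see what concurrent-ruby is doing here. This is a simplified sketch, not the real Concurrent.dataflow (it spawns one unbounded Thread per node and has no executor support):

```ruby
# Simplified sketch of dataflow semantics using only stdlib threads.
# A node runs its block once every input has produced a value;
# Thread#value blocks until the thread finishes, giving us the join.
def dataflow(*inputs, &block)
  Thread.new do
    block.call(*inputs.map(&:value))
  end
end

a = dataflow { 1 }
b = dataflow { 2 }
c = dataflow(a, b) { |av, bv| av + bv }

c.value # => 3
```

The real library adds what this sketch lacks: futures that run on a configurable thread pool instead of one raw thread each.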
Execute on ThreadPool
pool = Concurrent::FixedThreadPool.new(5)
Concurrent.dataflow_with(pool, *depend_dataflows) do |*results|
  # do something
end
Throw the hard work to concurrent-ruby
My work becomes light
Distributed execution
- It is very hard to develop seriously
- It requires defining a datastore outside of Ruby
- So I don't focus on it
We have ActiveJob
- Many implementations already exist
- I only wrote a simple wrapper around ActiveJob
- Rukawa does only a few things
- Define dependency
- Kick ActiveJob
- Track job status
Pragmatic attitude
- Use Rundeck as a scheduler
- I don't use Ruby for large-scale distributed computation
- Hadoop, Spark, BigQuery, Redshift
- What I really need is to kick other job frameworks
- Ruby's GIL is not a serious performance problem
It is important to build a compact tool that does exactly what you need,
and to rely on the ecosystem as much as possible,
in order to make effective use of limited resources.
The workflow engine
Please give Rukawa a try
Star