Batch Processing How-To
Or “The Single-Threaded Batch Processing Paradigm”
Stefan Rufer, Netcetera
Matthias Markwalder, SIX Card Solutions 6840
Speakers
> Stefan Rufer
– Studied business IT at the University of Applied Sciences in Bern
– Senior Software Engineer at Netcetera
– Main interest: Server-side application development using JEE
> Matthias Markwalder
– Graduated from ETH Zurich
– Senior Developer + Framework Responsible at SIX Card Solutions
Why are we here?
AGENDA
> What do we do
> Sharing our experience
What do we do
> Credit / debit card transaction processing
> Backoffice batch processing application 24x7x365
> 1.7 million card transactions a day
> Volume will double by end of 2010: be ready…
> Migrated from Forté UDS to JEE
How do we do it
> Transactional integrity at any time
> Custom batch processing framework (not Spring Batch)
> 1 controller builds the jobs
> 35 workers process the steps of jobs
(or as many as you want and your system can take)
> 1 application server (12 cores)
Batch Processing Basics
> It's simple, but parallel:
– Read file(s)
– Process a bit
– Write file(s)
> Terminology from Spring Batch
AGENDA
> What do we do
> Sharing our experience
> Wrap up + Q&A
Bake an omelet
> 200 g flour, 3 eggs, 2 dl milk, 2 dl water, ½ tablespoon salt
> Stir well, wait 30 min
> Stir again
> Put a little butter in a heated pan
> Add 1 dl dough
> Bake until slightly brown, flip over, bake again half as long
> Put cheese / marmalade / applesauce / ... on top, fold
Jobs run in parallel
Motivation
> Load balancing
Example
> Complete yesterday's reports while doing today's business
How to achieve
> Use a batch scheduling application that controls your entire processing.
Load limitations
Motivation
> Load balancing
Example
> Generate 70 reports, but max 20 in parallel
How to achieve
> Limit the number of workers one job can use
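A minimal sketch of such a per-job cap, assuming a hypothetical single-threaded controller that tracks how many workers each job currently occupies. The class and method names are illustrative and are not taken from the authors' framework.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical controller-side bookkeeping: caps the workers one job may occupy. */
class JobThrottle {
    private final int maxWorkersPerJob;
    private final Map<String, Integer> active = new HashMap<String, Integer>();

    JobThrottle(int maxWorkersPerJob) {
        this.maxWorkersPerJob = maxWorkersPerJob;
    }

    /** Returns true and reserves a worker slot if the job is below its cap. */
    boolean tryStartStep(String jobId) {
        int running = activeSteps(jobId);
        if (running >= maxWorkersPerJob) {
            return false; // the step stays queued until a slot frees up
        }
        active.put(jobId, running + 1);
        return true;
    }

    /** Frees the slot when a worker reports that a step has finished. */
    void stepFinished(String jobId) {
        int running = activeSteps(jobId);
        if (running > 0) {
            active.put(jobId, running - 1);
        }
    }

    /** Number of steps of the given job that currently occupy a worker. */
    int activeSteps(String jobId) {
        Integer n = active.get(jobId);
        return n == null ? 0 : n;
    }
}
```

With a cap of 20, a job of 70 report steps gets 20 workers at once; the remaining steps start as earlier ones finish, leaving other workers free for other jobs.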
Decouple controller + workers
Motivation
> Scalability
Example
Motivation
> Avoid structuring steps in code
Example
> Collect data, afterwards write a file.
How to achieve
> Sequential execution
> Fail on exception (rollback entire step)
Motivation
> Minimize work left
Example
> Process 30'000 transactions in 3 steps.
How to achieve
> Parallel execution
> Continue on exception (still rollback entire step)
Motivation
> Speedup
Example
> A file of 200'000 credit card authorisations and transactions has to be read into the database.
How to achieve
> Cut the input file into pieces of 10'000 lines each.
– btw: perl, sort are unbeaten for this...
> Process each piece in a parallel step.
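The cutting step can be illustrated with a small in-memory sketch. It assumes the lines are already loaded into a list, which is only reasonable for illustration; as the slide notes, external tools are the better choice for real input files. `LineChunker` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that cuts a list of input lines into fixed-size pieces. */
class LineChunker {
    /** Splits lines into consecutive pieces of at most chunkSize lines each. */
    static List<List<String>> chunk(List<String> lines, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> pieces = new ArrayList<List<String>>();
        for (int from = 0; from < lines.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, lines.size());
            // Copy the slice so each piece can be handed to its own step.
            pieces.add(new ArrayList<String>(lines.subList(from, to)));
        }
        return pieces;
    }
}
```

For 200'000 lines and a chunk size of 10'000 this yields 20 pieces, each of which could become one parallel step.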
Parallelize processing
Motivation
> Speedup
Example
> Summarize accounting data and store the result in the database again.
How to achieve
> Group data in chunks of 10'000 and process each chunk in a parallel step.
> Choose grouping criteria carefully:
– No overlapping data areas
Parallelize processing – how to group
Motivation
> Structuring your data in parallelizable chunks
> Load balancing
Example
> Parallelize processing by client as data is distinct by design.
How to achieve
> Group by client
> Group by keys: Ranges or ids
– Ranges (1..5) can grow very large
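Grouping by client could look roughly like the sketch below. The `Tx` type and its fields are assumptions made for illustration; the point is only that every transaction lands in exactly one group, so the groups never overlap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClientGrouping {
    /** Hypothetical record: a transaction that belongs to exactly one client. */
    static class Tx {
        final String clientId;
        final long amount;

        Tx(String clientId, long amount) {
            this.clientId = clientId;
            this.amount = amount;
        }
    }

    /** Groups transactions by client so that no two groups share any data. */
    static Map<String, List<Tx>> byClient(List<Tx> txs) {
        Map<String, List<Tx>> groups = new LinkedHashMap<String, List<Tx>>();
        for (Tx tx : txs) {
            List<Tx> group = groups.get(tx.clientId);
            if (group == null) {
                group = new ArrayList<Tx>();
                groups.put(tx.clientId, group);
            }
            group.add(tx);
        }
        return groups;
    }
}
```

Each map entry can then be processed in its own parallel step. Group sizes follow the data, so one very large client produces one very large step.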
Parallelize writing
Motivation
> Transactional integrity while writing files.
> Easy recovery while writing files.
Example
> Collect data for the payment file.
How to achieve
> Collect data in parallel and write to a staging table.
> Staging table content very close to target file format.
Different processes write in parallel
Motivation
> Don't lock out each other
Example
> Account information changes while account balance grows.
How to achieve
> No optimistic locking
> Modify deltas on sums and counters
> Keep distinct fields for different parallel jobs
Avoid insert and update in same table and step
Motivation
> Speedup
> Avoid DB locks
Example
> Summary rows in same table as the raw data.
How to achieve
Let the database work for you
Motivation
> Simple code
> Speedup
Example
> Sorting or joining arrays in memory.
How to achieve
> Code review.
Read long, write short
Motivation
> Keep lock contention on database minimal
> Keep transactional DB overhead minimal
Example
> Fully process the whole batch of 1'000 records before starting to write to the DB.
How to achieve
> 1 (one) "writing" database transaction per step.
interface IModifyingStepRunner {
    void prepareData();
    void writeData();
}
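One way a framework might drive this interface is sketched below. `WriteTransaction` and `StepExecutor` are hypothetical names introduced for illustration; only `IModifyingStepRunner` and its two methods come from the slide.

```java
/** Step contract as shown on the slide. */
interface IModifyingStepRunner {
    void prepareData(); // read and compute; no writing transaction is open yet
    void writeData();   // all database writes of the step
}

/** Hypothetical transaction handle; a real one would wrap the database connection. */
interface WriteTransaction {
    void begin();
    void commit();
    void rollback();
}

/** Hypothetical executor: long read phase first, then one short writing transaction. */
class StepExecutor {
    static void run(IModifyingStepRunner runner, WriteTransaction tx) {
        runner.prepareData(); // "read long": may take a while, holds no write locks
        tx.begin();           // "write short": locks are held only from here on
        try {
            runner.writeData();
            tx.commit();
        } catch (Throwable err) {
            tx.rollback();
            throw err; // only unchecked throwables are possible here
        }
    }
}
```

The design choice is that the expensive work happens before `begin()`, so the time during which rows are locked is limited to the actual writes.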
This omelet did not taste like grandma's!
> Even when you follow the recipe, there are hidden corners
Don't forget to catch Error
Motivation
> Application integrity delegated to DB
Example
> OutOfMemoryError caused half of a batch to be committed. Fatal, as a rerun cannot fix the inconsistency.
How to fix
try {
    result = action.doInTransaction(status);
} catch (Throwable err) {
    transactionManager.rollback(status);
    throw err;
}
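A self-contained sketch of the same idea, using hypothetical stand-in interfaces (`TxAction`, `TxManager`) instead of the real transaction API that the slide fragment refers to. It shows that catching `Throwable` also covers `Error` subclasses such as `OutOfMemoryError`, which a `catch (Exception e)` would miss.

```java
/** Hypothetical stand-in for the action executed inside the transaction. */
interface TxAction<T> {
    T doInTransaction();
}

/** Hypothetical stand-in for the transaction manager. */
interface TxManager {
    void commit();
    void rollback();
}

class SafeTransactionTemplate {
    /** Commits on success; rolls back on any Throwable, including Error. */
    static <T> T execute(TxManager transactionManager, TxAction<T> action) {
        T result;
        try {
            result = action.doInTransaction();
        } catch (Throwable err) { // also catches OutOfMemoryError, not only Exception
            transactionManager.rollback();
            throw err; // rethrown unchanged so the step is reported as failed
        }
        transactionManager.commit();
        return result;
    }
}
```

Rethrowing a caught `Throwable` without a `throws` clause compiles here because the action declares no checked exceptions.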
Use BufferedReader / BufferedWriter
Motivation
> Speedup (file reading time cut in half)
Example
> Forgot to use BufferedReader in file reading framework.
How to fix
> Code review.
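For reference, a minimal sketch of buffered line reading with the standard library. The class name and the choice of UTF-8 are assumptions; the slide does not say how the framework opens its files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BufferedLineCounter {
    /** Counts lines through a BufferedReader so the file is read in large blocks. */
    static long countLines(Path file) throws IOException {
        // Wrapping the raw stream reader in a BufferedReader is the step that is easy to forget.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```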
Use 1 thread only
Motivation
> Simplicity for the programmer
> Safety (no concurrent access)
Example
> Singleton, synchronized blocks, static variables, stateful step runners – we had it all...
How to achieve
Cache wisely
Motivation
> Speedup
> Limit memory use
Example
> Tax rates do not change during a processing day: cache them long.
> Customer data will be reused when processing transactions of the same customer: cache it short.
How to achieve
> Cache per worker
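A per-worker cache with a long-lived and a short-lived part might be sketched as follows. All names and the LRU eviction policy are assumptions; because each worker is single-threaded, the sketch needs no synchronization.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-worker cache; used by one thread only, so no locking is needed. */
class WorkerCache {
    // Long-lived: tax rates stay fixed for the whole processing day.
    private final Map<String, Double> taxRates = new HashMap<String, Double>();

    // Short-lived: keeps only the most recently used customers (LRU eviction).
    private final Map<String, String> customers;

    WorkerCache(final int maxCustomers) {
        // accessOrder = true makes iteration order follow the most recent access.
        this.customers = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxCustomers; // bounds the memory used per worker
            }
        };
    }

    void putTaxRate(String code, double rate) { taxRates.put(code, rate); }

    Double taxRate(String code) { return taxRates.get(code); }

    void putCustomer(String id, String data) { customers.put(id, data); }

    String customer(String id) { return customers.get(id); }
}
```

Consecutive transactions of the same customer hit the small customer cache, while a customer that is no longer being processed is evicted quickly.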
Support JDBC batch operations
Motivation
> Speedup
Example
List<Booking> bookings = new ArrayList<Booking>();
...
bookingDao.update(bookings);
How to achieve
> Enhance your database layer with a built-in JDBC batch facility.
> Execute batch after 1000 items added.
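What such a facility might do internally can be sketched with the standard JDBC calls `addBatch()` and `executeBatch()`. The `Booking` fields, the table layout, and the class name are assumptions for illustration; the slide does not show the authors' database layer.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BookingBatchWriter {
    static final int BATCH_SIZE = 1000;

    /** Hypothetical booking row with a key and one updatable column. */
    static class Booking {
        final long id;
        final long amount;

        Booking(long id, long amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    /**
     * Queues one UPDATE per booking and sends them in batches of BATCH_SIZE.
     * Expects a statement such as "UPDATE booking SET amount = ? WHERE id = ?"
     * (assumed schema). Returns the number of executeBatch() round trips.
     */
    static int update(PreparedStatement stmt, List<Booking> bookings) throws SQLException {
        int roundTrips = 0;
        int pending = 0;
        for (Booking b : bookings) {
            stmt.setLong(1, b.amount);
            stmt.setLong(2, b.id);
            stmt.addBatch();
            if (++pending == BATCH_SIZE) {
                stmt.executeBatch(); // one round trip for 1000 statements
                roundTrips++;
                pending = 0;
            }
        }
        if (pending > 0) {
            stmt.executeBatch(); // flush the remainder
            roundTrips++;
        }
        return roundTrips;
    }
}
```

Updating 2'500 bookings this way takes three round trips to the database instead of 2'500.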
Structured patching
Motivation
> Risk management
> Stay agile in production
Example
> Bug found, fixed and unit tested. Deploy to production asap.
How to achieve
> Eclipse wizard to create a patch (all files involved to fix a bug)
Never, ever, update primary keys
Motivation
> Good database design
> Speedup
Example
> Homemade library always wrote entire row to database.
How to fix
> Only write changed fields (dirty flags).
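Dirty flags can be sketched as a row wrapper that remembers which columns were set and builds an UPDATE for those columns only. The class, table, and column names are hypothetical; the authors' library is not shown on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row wrapper that remembers which columns were changed. */
class DirtyRow {
    private final String table;
    private final String keyColumn;
    private final Object keyValue;
    private final Map<String, Object> changed = new LinkedHashMap<String, Object>();

    DirtyRow(String table, String keyColumn, Object keyValue) {
        this.table = table;
        this.keyColumn = keyColumn;
        this.keyValue = keyValue;
    }

    /** Records a changed column; the primary key itself can never be set. */
    void set(String column, Object value) {
        if (column.equals(keyColumn)) {
            throw new IllegalArgumentException("primary key must not be updated");
        }
        changed.put(column, value);
    }

    /** Builds an UPDATE covering only the changed columns, or null if nothing changed. */
    String updateSql() {
        if (changed.isEmpty()) {
            return null;
        }
        StringBuilder sql = new StringBuilder("UPDATE ").append(table).append(" SET ");
        String separator = "";
        for (String column : changed.keySet()) {
            sql.append(separator).append(column).append(" = ?");
            separator = ", ";
        }
        return sql.append(" WHERE ").append(keyColumn).append(" = ?").toString();
    }

    /** Parameter values in the same order as the placeholders in updateSql(). */
    List<Object> parameters() {
        List<Object> params = new ArrayList<Object>(changed.values());
        params.add(keyValue);
        return params;
    }
}
```

Because the key column only ever appears in the WHERE clause, the generated statement cannot rewrite the primary key, and unchanged columns are not sent to the database.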
AGENDA
> What do we do
> Sharing our experience
Future
> Scalability is an issue with a single database server.
– Partitioning options used, but not to the end.
– Will Moore's law save us again?
If you remember just three things...
Java batch processing works and is cool :-)
Trade-offs:
> Do not stockpile the work, start.
> Single threaded, many JVMs.
> Designing for scalability, stability needs experts.
Stefan Rufer stefan.rufer@netcetera.ch
Netcetera AG www.netcetera.ch
Matthias Markwalder matthias.markwalder@six-group.com
Links / References
> http://en.wikipedia.org/wiki/Batch_processing
> http://static.springframework.org/spring-batch/
> http://www.bmc.com/products/offering/control-m.html
> http://www.javaspecialists.eu/
And to really learn how to bake fine omelets, buy a book:
> http://de.wikipedia.org/wiki/Marianne_Kaltenbach
Other batch processing frameworks (public only)
> http://www.bmap4j.org/
> http://freshmeat.net/projects/jppf
> http://hadoop.apache.org/