Benefits of “Row Recycling”

15. March 2021 by sasa

TLDR

Use updates; don't delete rows just to re-insert them with small changes.

The Crime Scene

In the shadows of the night, a batch job starts its task at 2 am to clean data. A table with millions of entries and more columns than are healthy for any developer is queried for new entries from the last working day. Each cleaning step works on chunks of 1 to 10 rows. From here on, a chunk of rows can develop in three different ways: the number of rows in the chunk increases, stays the same, or decreases. In any case, the content is slightly modified. This batch job takes about 2 to 3 hours to process all chunks sequentially.

The Culprit

To process all chunks, all changed rows of the last working day are fetched and each chunk runs through the following steps (a code sketch follows the list):

  1. Start Transaction
  2. Get all rows of a chunk
  3. Delete all rows of the chunk from the database
  4. Modify the row content and row count
  5. Insert the modified rows of the chunk into the database
  6. End Transaction
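
A minimal JDBC sketch of this delete-modify-insert loop, assuming a hypothetical table chunk_data(chunk_id, row_no, payload) and leaving the actual modification of the rows to the caller (all names are illustrative, not taken from the original job):

import java.sql.*;
import java.util.*;

public class DeleteModifyInsert {

    // 'newPayloads' holds the already modified row contents of one chunk
    // (the 'payload' column stands in for the many real columns).
    static void processChunk(Connection con, long chunkId, List<String> newPayloads) throws SQLException {
        con.setAutoCommit(false);                       // 1. start transaction
        try {
            // 2./4. fetching and modifying the rows happened in the caller
            // 3. delete all rows of the chunk
            try (PreparedStatement del = con.prepareStatement(
                    "DELETE FROM chunk_data WHERE chunk_id = ?")) {
                del.setLong(1, chunkId);
                del.executeUpdate();
            }
            // 5. insert the modified rows of the chunk
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO chunk_data (chunk_id, row_no, payload) VALUES (?, ?, ?)")) {
                for (int i = 0; i < newPayloads.size(); i++) {
                    ins.setLong(1, chunkId);
                    ins.setInt(2, i);
                    ins.setString(3, newPayloads.get(i));
                    ins.addBatch();
                }
                ins.executeBatch();
            }
            con.commit();                               // 6. end transaction
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}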

The delete-modify-insert pattern works well up to a certain table size but can get slow as soon as the content of the table grows out of hand. An update could be faster than delete-modify-insert.

Old developer's wisdom: make measurements, especially when in doubt.

Delete and insert operations are expensive because the indexes have to be updated. The cost also depends strongly on the table's configuration and the database settings. Some databases handle this very well, but in this case it was the reason for the mediocre performance.
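
In that spirit, a minimal measurement sketch against the same hypothetical chunk_data table; the absolute numbers only make sense for the concrete database, schema and indexes at hand, and a real measurement would of course repeat this many times:

import java.sql.*;

public class UpdateVsDeleteInsert {

    // Measures a single statement round-trip in microseconds.
    static long timeUs(PreparedStatement ps) throws SQLException {
        long start = System.nanoTime();
        ps.executeUpdate();
        return (System.nanoTime() - start) / 1_000;
    }

    // Assumes the rows (chunk_id = 1, row_no = 1) and (chunk_id = 2, row_no = 1) already exist.
    static void measure(Connection con) throws SQLException {
        try (PreparedStatement upd = con.prepareStatement(
                 "UPDATE chunk_data SET payload = ? WHERE chunk_id = ? AND row_no = ?");
             PreparedStatement del = con.prepareStatement(
                 "DELETE FROM chunk_data WHERE chunk_id = ? AND row_no = ?");
             PreparedStatement ins = con.prepareStatement(
                 "INSERT INTO chunk_data (chunk_id, row_no, payload) VALUES (?, ?, ?)")) {

            upd.setString(1, "changed"); upd.setLong(2, 1L); upd.setInt(3, 1);
            del.setLong(1, 2L); del.setInt(2, 1);
            ins.setLong(1, 2L); ins.setInt(2, 1); ins.setString(3, "changed");

            System.out.println("update        : " + timeUs(upd) + " us");
            System.out.println("delete+insert : " + (timeUs(del) + timeUs(ins)) + " us");
        }
    }
}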

One Solution: Row Recycling

To optimize chunk processing, keep the rows of a chunk in the database until it is clear what the new chunk looks like in terms of row count and content. Three scenarios are possible (a sketch follows the list):

  1. number of rows stays the same
  2. number of rows decreases
  3. number of rows increases
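
Before modelling the speedup, here is a minimal sketch of such a recycling step, again against the hypothetical chunk_data table from above: the first min(m, n) rows are reused with an UPDATE, and only the difference in row count is deleted or inserted.

import java.sql.*;
import java.util.*;

public class RowRecycling {

    // 'oldRowNos' are the n existing row numbers of the chunk,
    // 'newPayloads' are the m modified row contents (m may differ from n).
    static void recycleChunk(Connection con, long chunkId,
                             List<Integer> oldRowNos, List<String> newPayloads) throws SQLException {
        int n = oldRowNos.size(), m = newPayloads.size(), common = Math.min(n, m);
        con.setAutoCommit(false);
        try {
            // All three cases share this part: recycle min(m, n) rows with an UPDATE.
            try (PreparedStatement upd = con.prepareStatement(
                    "UPDATE chunk_data SET payload = ? WHERE chunk_id = ? AND row_no = ?")) {
                for (int i = 0; i < common; i++) {
                    upd.setString(1, newPayloads.get(i));
                    upd.setLong(2, chunkId);
                    upd.setInt(3, oldRowNos.get(i));
                    upd.addBatch();
                }
                upd.executeBatch();
            }
            if (m < n) {        // row count decreases: delete the (n - m) surplus rows
                try (PreparedStatement del = con.prepareStatement(
                        "DELETE FROM chunk_data WHERE chunk_id = ? AND row_no = ?")) {
                    for (int i = common; i < n; i++) {
                        del.setLong(1, chunkId);
                        del.setInt(2, oldRowNos.get(i));
                        del.addBatch();
                    }
                    del.executeBatch();
                }
            } else if (m > n) { // row count increases: insert the (m - n) extra rows
                int nextRowNo = oldRowNos.isEmpty() ? 0 : Collections.max(oldRowNos) + 1;
                try (PreparedStatement ins = con.prepareStatement(
                        "INSERT INTO chunk_data (chunk_id, row_no, payload) VALUES (?, ?, ?)")) {
                    for (int i = common; i < m; i++) {
                        ins.setLong(1, chunkId);
                        ins.setInt(2, nextRowNo++);
                        ins.setString(3, newPayloads.get(i));
                        ins.addBatch();
                    }
                    ins.executeBatch();
                }
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}

The expensive delete/insert work is now proportional to the difference in row count instead of the full chunk size, which is exactly what the following model captures.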

In all three scenarios, the row content is only slightly modified. With that knowledge, it is possible to create a model to understand the possible performance benefits. Let's get our hands dirty. The property of interest is the speedup $S$ we could gain from recycling rows instead of using the delete-modify-insert scheme. The speedup is defined as the time $T_{o}$ the old program (old: delete-modify-insert) takes divided by the time $T_{n}$ the new program (new: let's call it recycle) takes.

$S = \frac{T_{o}}{T_{n}}$

The time $T_{o}$ a delete-modify-insert takes consists of the time $t_d$ it takes to delete one of the $n$ old rows and the time $t_i$ it takes to insert one of the $m$ new rows.

Wait a minute: Where is the time to fetch, modify and send the data?

There are several reasons to leave those considerations out. The first and most obvious reason is that the "recycling speedup" I want to show looks much clearer without them ;-). Another reason is to keep the model simple for this post. The most substantial reason is that the latency-bound operations can be neglected because the database and the batch process run on the same machine. It is easy to add more and more reality/complexity to the model, but that would make a really long post with dozens of parameters. Simplicity is good because one can use the tools at hand (e.g. Mathematica, …) to analyze certain aspects of reality without developing a professional simulator, which would be fun though.

Now let's get the equations going and define $T_{o}$:

$T_{o} = n\cdot t_d + m\cdot{} t_i$

The situation for $T_{n}$ is a little more complex but still manageable, because the three cases from above (row count increases, stays the same, or decreases) are used. The time to update a row is $t_u$.

$T_{n} = \begin{cases} (n-m)\cdot t_d + m\cdot t_u, & \text{if } m<n \\ m\cdot t_u, & \text{if } m=n \\ (m-n)\cdot t_i + n\cdot t_u, & \text{if } m>n \end{cases}$

I simplified the equation (some voodoo) with one assumption: the time it takes to delete a row is the same as the time it takes to insert a row. This is a strong assumption, but it was true for that "crime". Imagine the length and width of the database table needed to get such performance behaviour.

$T_n = |n-m|\cdot{}t_d + \text{min($m,n$)}\cdot{}t_u$

Let's put everything into the cooking pot and determine the speedup (using $t_i = t_d$ from the assumption above).

$S = \frac{T_{o}}{T_{n}} = \frac{n\cdot t_d + m\cdot{} t_d}{|n-m|\cdot{}t_d + \text{min($m,n$)}\cdot{}t_u} = \frac{n+m}{|n-m| + \text{min($m,n$)}\frac{t_u}{t_d}}$
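
As a quick sanity check, the formula can also be evaluated directly; a tiny helper with purely illustrative numbers (the comments show the computed values):

public class Speedup {

    // S = (n + m) / (|n - m| + min(n, m) * (t_u / t_d))
    static double speedup(double n, double m, double updateToDeleteRatio) {
        return (n + m) / (Math.abs(n - m) + Math.min(n, m) * updateToDeleteRatio);
    }

    public static void main(String[] args) {
        System.out.println(speedup(5, 5, 1.0));  // 2.0  -> n = m and t_u = t_d
        System.out.println(speedup(10, 1, 1.0)); // 1.1  -> very unequal row counts
        System.out.println(speedup(5, 5, 0.1));  // 20.0 -> on the diagonal, update 10x faster
        System.out.println(speedup(5, 5, 2.0));  // 1.0  -> update 2x slower than delete
    }
}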

To understand a little better what this means, let's take another perspective on this equation with surface plots. The x-axis is $n$, the y-axis is $m$, and the quotient $\frac{t_u}{t_d}$ is set to 1.0 (update and delete time are equal). The z-axis represents the speedup $S$. Here is the Mathematica line:

Plot3D[(x+y)/(Abs[x-y]+Min[x,y]*1), {x, 0, 10}, {y, 0, 10}]
The plot is symmetric about the first bisector ($n = m$) and has a maximum speedup of about 2.

Now this looks good. In the worst case, row recycling is as fast as plain delete-insert without recycling. Things start to look even better if the time for an update is smaller than for a delete/insert. Let's assume an update is 10 times faster than a delete.

Plot3D[(x+y)/(Abs[x-y]+Min[x,y]*0.1), {x, 0, 10}, {y, 0, 10}]
The same as earlier but the maximum speedup is around 15.

Now this shark fin is a good result. Let's push things one order of magnitude and assume an update is 100 times faster than a delete/insert.

Plot3D[(x+y)/(Abs[x-y]+Min[x,y]*0.01), {x, 0, 10}, {y, 0, 10}]
Wow, the result is that one has to increase the bounding box 😉.

OK, OK, but what happens if an update is 10 times slower than a delete/insert?

Plot3D[(x+y)/(Abs[x-y]+Min[x,y]*10), {x, 0, 10}, {y, 0, 10}]
The surface is again symmetric about the first bisector, and in the worst case we are about 10 times slower than delete-insert.

This does not look very good, but every coin has two sides. It is somewhat expected that the update-centric approach is slower if an update takes 10 times as long as a delete or insert. Heads up, this is not the end. Something interesting happens if the time for an update is twice the time of a delete/insert.

Plot3D[(x+y)/(Abs[x-y]+Min[x,y]*2.0), {x, 0, 10}, {y, 0, 10}]
It's a plane 🤨.

This is very interesting, since there is no speedup at all: with $t_u = 2\cdot t_d$ the denominator becomes $|n-m| + 2\cdot\text{min}(m,n) = n + m$, which cancels the numerator exactly. This shows that it is possible for an optimization to yield no speedup at all. Imagine the confusion when analysing such behaviour after a performance optimization.

All in all, it is imperative to understand and prove performance bottlenecks before conducting an optimization like "row recycling". The delete-insert method has benefits if an update takes far too long, but the update is clearly beneficial for this use case, where the update time is equal to the delete/insert time.

There are more things to explain around this use case, more things to discuss and more things to analyse. If you like, let me know.

Posted in: General Tagged: database, optimization, performance modelling

Ready, Steady, GOOOOO

11. August 2015 by sasa

Traffic jams and rush hours reduce my free time, which I could spend on more meaningful activities. Usually I use the commute to come down from work and to switch into private mode. Unfortunately, I do not need a full hour to arrive in the private world, so I waste a lot of time in the car. Even the best podcasts are sometimes of no use to bridge the time. In order to reduce the commute time from my office to my home, I tried to figure out experimentally which departure time is the best. Besides the required sampling rate and a long observation period, I am lazy and I forget to take notes. After a week the experiments felt fuzzy and this way of solving the problem did not satisfy the little nerdiness I carry with me, but I got the best-case and worst-case timings:

Travel distance:                50 km
Best case (moderate speed):     35 minutes
Best case (pedal to the metal): 30 minutes
Worst case:                     60 minutes

What do humans in the 21st century use to estimate travel time? Google Maps! OK, that's the way to go. The nice thing is that Google Maps is able to take the current traffic situation into account, so I can let the computer do the nasty work of taking notes.

[Screenshot from 2015-08-11, 22:06]

Within 80 lines of Java code (including empty lines ;-), the Google Maps API is queried every minute and the result is stored in a file. After about 3.5 weeks of data acquisition, a bash script extracts and groups the estimated travel times for each day of the week into a single file. One data point consists of a time and estimated-travel-time pair: "HH-MM-SS : estimated travel time". Up to 4 data points per minute made the visualization messy, so a Python script computes the mean value for each minute. Finally, D3.js is used to visualize the data, as you can see if you click on one of the following weekdays (a sketch of the polling code follows).

Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
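
The original 80 lines are not shown here, but a minimal sketch of the polling part could look like the following. It assumes the Google Maps Directions API web service, placeholder coordinates and API key, and modern java.net.http, so it is certainly not the 2015 code; parsing is reduced to appending the raw JSON response:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.*;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TravelTimeLogger {

    // Placeholder values: real origin/destination coordinates and API key go here.
    private static final String URL =
        "https://maps.googleapis.com/maps/api/directions/json"
        + "?origin=49.0,8.4&destination=49.4,8.7&departure_time=now&key=YOUR_API_KEY";

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        Path log = Path.of("traveltime.log");
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Query the estimated travel time once per minute and append "HH-MM-SS : <json>" to a file.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(URL)).GET().build();
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                String stamp = LocalTime.now().format(DateTimeFormatter.ofPattern("HH-mm-ss"));
                String line = stamp + " : " + body.replace('\n', ' ') + System.lineSeparator();
                Files.writeString(log, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (Exception e) {
                e.printStackTrace();   // keep polling even if a single request fails
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}

One data point per minute then lands in traveltime.log in the "HH-MM-SS : …" format described above; extracting the estimated duration from the JSON is left to the downstream scripts.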

Between 1 am and 5 am Google does not deliver travel time estimations, or the more probable explanation is that Google knows about a stargate nearby which opens from 1 am to 5 am. Besides the things Google knows, the interesting parts for me are the rush-hour peaks around 5 pm. On Monday there is surprisingly no peak; maybe the measurement period of about 3.5 weeks was too short. On Tuesday and Wednesday the peaks are after 5 pm, and on Thursday and Friday the peaks are shifted towards 4 pm. Surprisingly, a 12 o'clock peak occurs on Saturday. A colleague called it "the Saturday shopping rush", which matches pretty well with my own observations of shopping on a Saturday. Sunday is no surprise at all.

My conclusion: find the Stargate! Go to work early and leave early. If you want to stay in bed for a few moments, consider a few moments more.

Supplement: The experimental observations indicate that during school vacations the rush hour is significantly shorter.

Posted in: General Tagged: D3, google maps api, optimization, stargate, travel time
