{"id":7,"date":"2015-03-11T00:08:47","date_gmt":"2015-03-10T23:08:47","guid":{"rendered":"http:\/\/alexander.vondrous.de\/?p=7"},"modified":"2015-09-30T10:03:05","modified_gmt":"2015-09-30T09:03:05","slug":"1d-vs-2d-domain-decomposition-for-parallel-execution-on-regular-grids","status":"publish","type":"post","link":"https:\/\/alexander.vondrous.de\/?p=7","title":{"rendered":"1D vs 2D domain decomposition for parallel execution on regular grids"},"content":{"rendered":"<p style=\"text-align: justify;\">This post is a brief\u00a0summary of the paper on parallel computing with regular grids (<a href=\"http:\/\/dx.doi.org\/10.1177\/1094342013490972\">link<\/a>).<\/p>\n<p style=\"text-align: justify;\">It describes the advantage of 3D domain decomposition over 1D or 2D domain decomposition for distributed memory computing with a stencil code\u00a0because of the better surface to volume ratio (computation to communication ratio).<\/p>\n<p style=\"text-align: justify;\">Many simulations are performed on a regular 3D grid. Seismic wave propagation, fluid flow or grain growth are a view examples, which are computed on regular grids. It is often referred to stencil codes because a stencil is applied to compute the next state in time for each grid cell.\u00a01D and 2D domains are favorable, because of small stencils and small memory foot prints, which leads to a fast computation and less cache misses. Unfortunately, many phenomena need to be investigated in\u00a03D. Lets stay in the \u00a02D world till the end\u00a0because it is possible explain the effects on speedup and efficiency already in 2D.<\/p>\n<p style=\"text-align: justify;\">One very accessible example for a stencil code on a 2D grid is <a title=\"Conway's Game of Life\" href=\"http:\/\/en.wikipedia.org\/wiki\/Conway%27s_Game_of_Life\" target=\"_blank\">Conway&#8217;s Game of Life<\/a>, where a cell is either alive 1 or dead 0. Four simple rules determine if a cell dies, starts to live ;-) or stays in the same state based on a 9-point stencil as depicted\u00a0in the following figure.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/9-point-stencil.png\"><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-182 aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/9-point-stencil.png\" alt=\"9-point-stencil\" width=\"71\" height=\"71\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/9-point-stencil.png 183w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/9-point-stencil-150x150.png 150w\" sizes=\"(max-width: 71px) 100vw, 71px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">In this example I have one 2D grid for the current\u00a0time step $n$ and one for the next time step $n+1$. The field with time step n contains a initial state, which is randomly filled with alive (1) and dead (0) cells. To compute the state of all cells in the grid for time step\u00a0$n+1$, the 9-point stencil has to be applied for all cells in in time step $n$. The following figure depicts the two fields (time step $n$ and $n+1$) and shows the application of the 9-point stencil on a cell. Already processed cells in time step $n+1$ are colored in blue.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/gol-stencil3.png\"><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-271 aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/gol-stencil3.png\" alt=\"gol-stencil3\" width=\"403\" height=\"230\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/gol-stencil3.png 920w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/gol-stencil3-300x172.png 300w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/gol-stencil3-500x286.png 500w\" sizes=\"(max-width: 403px) 100vw, 403px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Now lets take a look at a 2D grid consisting of $3200\\times3200$ cells to get into domain\u00a0decomposition for\u00a0<a title=\"Conway's Game of Life\" href=\"http:\/\/en.wikipedia.org\/wiki\/Conway%27s_Game_of_Life\" target=\"_blank\">Conway&#8217;s Game of Life<\/a>\u00a09-point stencil example.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x3200.png\"><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-199 size-medium aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x3200-300x234.png\" alt=\"\" width=\"300\" height=\"234\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x3200-300x234.png 300w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x3200-500x390.png 500w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x3200.png 648w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">I use 16 CPU cores to speedup the computation, such that I cut\u00a0the domain along the $x$ axis into 16 subdomains as you can see in the following figure. One domain (red) is picked out to show the size of the\u00a0domains.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x200-red.png\"><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-210 aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x200-red-300x142.png\" alt=\"\" width=\"404\" height=\"191\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x200-red-300x142.png 300w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x200-red-500x237.png 500w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/3200x200-red.png 963w\" sizes=\"(max-width: 404px) 100vw, 404px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">In order to compute one subdomain independently on one processor, it is necessary to introduce one additional line of cells at the cutting edge. In other words, a boundary layer or ghost layer has to be introduced. The following two figures depict the state of the uncut domain and the state after cutting into subdomains with additional ghost layers.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-before22.png\"><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-274 aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-before22-1024x595.png\" alt=\"sub-domain-before2\" width=\"485\" height=\"282\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-before22-1024x595.png 1024w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-before22-300x174.png 300w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-before22-500x291.png 500w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-before22.png 1053w\" sizes=\"(max-width: 485px) 100vw, 485px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-after2.png\"><img decoding=\"async\" loading=\"lazy\" class=\" size-large wp-image-272 aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-after2-1024x584.png\" alt=\"sub-domain-after2\" width=\"500\" height=\"285\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-after2-1024x584.png 1024w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-after2-300x171.png 300w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-after2-500x285.png 500w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/sub-domain-after2.png 1291w\" sizes=\"(max-width: 500px) 100vw, 500px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">The final part is to update the ghost cells after each time step, which requires communication. Now communication is the dark side of computing. Its the reason for the caches on the CPUs. If you want to utilize\u00a0a cluster of over 10.000 CPU cores, communication over a network is necessary, such that the boundary exchange or boundary updates are performance critical.<\/p>\n<p style=\"text-align: justify;\">To estimate the performance of a $3200\\times 3200$ cells domain\u00a0lets assume the computation of one 9-point stencil $t_s$\u00a0requires 1 time unit and the communication time$t_c$\u00a0of one cell requires also 1 time unit, then the runtime $T_{s}$\u00a0would look something like this for the sequential case with 1000 time steps.<\/p>\n<p style=\"text-align: justify;\">$T_{s}=3200^2 \\cdot t_s \\cdot 1000$.<\/p>\n<p style=\"text-align: justify;\">The parallel runtime with 16 cores\u00a0$T_p$ is<\/p>\n<p style=\"text-align: justify;\">$T_{p}=(3200 \\cdot 200 \\cdot t_s + 3200 \\cdot 2 \\cdot t_c)\\cdot 1000$.<\/p>\n<p style=\"text-align: justify;\">The speedup $S$ would \u00a0be<\/p>\n<p style=\"text-align: justify;\">$S=\\frac{T_s}{T_p}=\\frac{3200^2 \\cdot t_s \\cdot 1000}{(3200 \\cdot 200 \\cdot t_s + 3200 \\cdot 2 \\cdot t_c)\\cdot 1000}=\\frac{3200\\cdot t_s}{200\\cdot t_s + 2 \\cdot t_c}=\\frac{3200}{202}\\approx 15.84$,<\/p>\n<p style=\"text-align: justify;\">which sounds not bad. Now lets increase the cores to 128.<\/p>\n<p style=\"text-align: justify;\">$S=\\frac{T_s}{T_p}=\\frac{3200^2 \\cdot t_s \\cdot 1000}{(3200 \\cdot 25 \\cdot t_s + 3200 \\cdot 2 \\cdot t_c)\\cdot 1000}=\\frac{3200\\cdot t_s}{25\\cdot t_s + 2 \\cdot t_c}=\\frac{3200}{27}\\approx 118.52$<\/p>\n<p style=\"text-align: justify;\">A Speedup of about 119 is also nice with 128 CPU cores. Lets go to the end of the line with 3200 CPU cores.<\/p>\n<p style=\"text-align: justify;\">$S=\\frac{T_s}{T_p}=\\frac{3200^2 \\cdot t_s \\cdot 1000}{(3200 \\cdot 1 \\cdot t_s + 3200 \\cdot 2 \\cdot t_c)\\cdot 1000}=\\frac{3200\\cdot t_s}{1 \\cdot t_s + 2 \\cdot t_c}=\\frac{3200}{3}\\approx 1066.67$<\/p>\n<p style=\"text-align: justify;\">This now looks a little weird. 3200 Cores create a speedup of about 1067. What happened with the other cores. Are they sitting idle? Yes, communication is the bottleneck.\u00a0This example and the performance model are simple\u00a0and\u00a0some assumptions are made, such that I strongly advice to make measurements before you conduct development. If your Measurement show the same or similar scaling behavior, you should consider multi dimensional decomposition as one option. One step to optimize would be to hide communication as much as possible but to reduce the communication to a minimum\u00a0tackles\u00a0the problem\u00a0directly.<\/p>\n<p style=\"text-align: justify;\">To reduce communication, the cutting surface (red cells from the last figure) have to be reduced by cutting the domain in squares\u00a0instead of salami slices as shown in the following picture.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/800x800-red.png\"><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-232 aligncenter\" src=\"http:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/800x800-red.png\" alt=\"800x800-red\" width=\"368\" height=\"175\" srcset=\"https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/800x800-red.png 963w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/800x800-red-300x143.png 300w, https:\/\/alexander.vondrous.de\/wp-content\/uploads\/2015\/03\/800x800-red-500x238.png 500w\" sizes=\"(max-width: 368px) 100vw, 368px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">The main difference between the square in the last picture and the salami slices in the earlier picture is not the number of cells inside the domain, its the cutting surface. The square has a cutting line\u00a0of\u00a0$800\\times 4 = 3200$ and the salami slice has $3200 \\times 2 = 6400$ cells, which is the double amount. This will not affect the speedup much for small numbers of CPU cores but for large numbers.<\/p>\n<p style=\"text-align: justify;\">Lets see what happens for 3200 cores.<\/p>\n<p style=\"text-align: justify;\">$S=\\frac{T_s}{T_p}=\\frac{3200^2 \\cdot t_s \\cdot 1000}{(3200 \\cdot t_s +\\frac{3200}{\\sqrt{3200}} \\cdot 4 \\cdot t_c)\\cdot 1000}=\\frac{3200\\cdot t_s}{1 \\cdot t_s + \\frac{4}{\\sqrt{3200}}\\cdot t_c}=\\frac{3200}{1.07}\\approx 2988.67$<\/p>\n<p style=\"text-align: justify;\">Now this looks much nicer with a speedup of about 2989 instead of 1067 with 3200 CPU cores.<\/p>\n<p style=\"text-align: justify;\">The scaling behavior between\u00a01D decomposition 2D decomposition is significantly\u00a0different. If you have a 3D volume, the decomposition dimension matters even more.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post is a brief\u00a0summary of the paper on parallel computing with regular grids (link). It describes the advantage of 3D domain decomposition over 1D or 2D domain decomposition for distributed memory computing with a stencil code\u00a0because of the better surface to volume ratio (computation to communication ratio). Many simulations are performed on a regular &#8230; <span class=\"more\"><a class=\"more-link\" href=\"https:\/\/alexander.vondrous.de\/?p=7\">[Read more&#8230;]<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[9,3,4,6,7,2,8],"_links":{"self":[{"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=\/wp\/v2\/posts\/7"}],"collection":[{"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7"}],"version-history":[{"count":75,"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=\/wp\/v2\/posts\/7\/revisions"}],"predecessor-version":[{"id":371,"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=\/wp\/v2\/posts\/7\/revisions\/371"}],"wp:attachment":[{"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/alexander.vondrous.de\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}