How do you achieve data parallelism?

High-Performance Business Intelligence

David Loshin, in Business Intelligence (Second Edition), 2013

Data Parallelism

Data parallelism is a different kind of parallelism that, instead of relying on process or task concurrency, is related to both the flow and the structure of the information. An analogy might revisit the automobile factory from our example in the previous section. There we looked at how the construction of an automobile could be transformed into a pipelined process. Here, because the construction of cars along one assembly line has no relation to the construction of the same kinds of cars along any other assembly line, there is no reason why we can’t duplicate the same assembly line multiple times; two assembly lines will result in twice as many cars being produced in the same amount of time as a single assembly line.

For data parallelism, the goal is to scale the throughput of processing based on the ability to decompose the data set into concurrent processing streams, all performing the same set of operations. For example, a customer address standardization process iteratively grabs an address and attempts to transform it into a standard form. This task is adaptable to data parallelism and can be sped up by a factor of 4 by instantiating four address standardization processes and streaming one-fourth of the address records through each instantiation (Figure 14.3). Data parallelism is a more finely grained parallelism in that we achieve our performance improvement by applying the same small set of tasks iteratively over multiple streams of data.

Figure 14.3. Duplicating the assembly line provides linear scalability—a feature of data parallelism.
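To make the four-way split concrete, here is a minimal Python sketch of the idea (not from the chapter); standardize_address is a hypothetical stand-in for a real standardization routine, and each worker process runs the same code on its own share of the records.

    from multiprocessing import Pool

    def standardize_address(record):
        # Hypothetical stand-in for a real standardization routine:
        # uppercase the record and collapse extra whitespace.
        return " ".join(record.upper().split())

    def standardize_chunk(chunk):
        # Every worker performs the same set of operations on its own stream.
        return [standardize_address(r) for r in chunk]

    if __name__ == "__main__":
        records = ["123  main st", "45 Oak Ave.", "9 Elm  Road", "77 Pine Blvd"] * 1000
        n_workers = 4
        # Decompose the data set into one stream of records per worker.
        chunks = [records[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            results = pool.map(standardize_chunk, chunks)
        standardized = [r for chunk in results for r in chunk]
        print(len(standardized), standardized[0])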


URL: //www.sciencedirect.com/science/article/pii/B9780123858894000144

Deep Learning and Its Parallelization

X. Li, ... W. Zheng, in Big Data, 2016

4.3.3 Deep Learning Based on Multi-GPUs

The data in deep learning can be divided into two types: parameters and input/output data. Parameters in CNNs include the learning rate, convolutional parameters (eg, filter numbers, kernel size, and stride), pooling parameters (kernel size, stride), biases, etc. Input data includes the raw data (eg, images and speech) received from the input layer, and output data keeps the intermediate output of each layer, such as the convolutional and pooling layers. The key to training large-scale CNN models with multiple GPUs is how to divide tasks between different GPUs. There are three ways to train these large models with multiple GPUs: data parallelism, model parallelism, and data-model parallelism.

Data parallelism

Data parallelism can be easily implemented, and it is thus the most widely used implementation strategy on multi-GPU systems.

Data parallelism means that each GPU uses the same model to train on a different data subset. In data parallelism, there is no synchronization between GPUs in forward computing, because each GPU has a full copy of the model, including the deep network structure and parameters. But the parameter gradients computed on different GPUs must be synchronized during backpropagation (BP) (Fig. 18).

Fig. 18. The illustration of data parallelism mode.
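A minimal sketch of this scheme, simulated with NumPy on one machine rather than real GPUs: each replica keeps a full copy of the parameters, computes gradients on its own mini-batch with no communication, and the gradients are averaged at the synchronization point before an identical update. The linear model and squared loss are illustrative assumptions, not part of the chapter.

    import numpy as np

    def local_gradient(w, X, y):
        # Stand-in for the forward and backward pass that each "GPU" runs
        # independently: gradient of a squared loss for a linear model.
        return 2.0 * X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    w = rng.normal(size=4)                          # full parameter copy on every replica
    batches = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(3)]

    for step in range(100):
        # Forward/backward on each replica: no synchronization needed here.
        grads = [local_gradient(w, X, y) for X, y in batches]
        # Synchronization point: average the gradients (an all-reduce on real GPUs).
        g = np.mean(grads, axis=0)
        w -= 0.01 * g                               # identical update keeps the copies in sync
    print(np.round(w, 3))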

Model parallelism

Model parallelism means that each computational node is responsible for part of the model while training on the same data samples.

The model is divided into several pieces and each computing node, such as a GPU, is responsible for one piece (Fig. 19). Communication happens between computational nodes when the input of a neuron comes from the output of another computational node. The performance of model parallelism is often worse than that of data parallelism, because the communication expenses of model parallelism are much higher than those of data parallelism.

Fig. 19. The illustration of model parallelism mode.
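A toy sketch of the same idea, again with NumPy: a two-layer model is split so that one node owns the first layer and another owns the second, and the activations produced by the first piece must be communicated to the node holding the second piece. The layer sizes and activation function are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(1)

    # The model is divided into pieces; each piece lives on a different node.
    W0 = rng.normal(size=(8, 16))       # piece held by "node 0"
    W1 = rng.normal(size=(16, 4))       # piece held by "node 1"

    def forward(x):
        # Node 0 computes its part of the model...
        h = np.maximum(x @ W0, 0.0)     # ReLU hidden layer
        # ...and its output must be communicated to node 1,
        # which computes the remaining piece.
        return h @ W1

    x = rng.normal(size=(32, 8))        # the same data samples visit both pieces
    print(forward(x).shape)             # (32, 4)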

Data-model parallelism

Several restrictions exist in both data parallelism and model parallelism. For data parallelism, we have to reduce the learning rate to keep the training process smooth if there are too many computational nodes. For model parallelism, the performance of the network will decrease dramatically because of communication expense if we have too many nodes.

Model parallelism achieves good performance for layers with a large number of weights, while data parallelism is efficient for layers with a large amount of computation (many neuron activations) but relatively few weights. In CNNs, the convolutional layers contain about 90% of the computation and 5% of the parameters, while the fully connected layers contain 95% of the parameters and 5%-10% of the computation. Therefore, we can parallelize CNNs in data-model mode by using data parallelism for the convolutional layers and model parallelism for the fully connected layers (Fig. 20).

Fig. 20. The illustration of data-model parallelism mode.
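A compact sketch of the hybrid scheme under the same toy assumptions: the "convolutional" stage is replicated on every worker, each of which processes its own shard of the batch (data parallelism), while the fully connected weight matrix is split column-wise across workers, each producing one slice of the output (model parallelism). A real CNN would use convolutions and GPUs; this only shows how the two decompositions combine.

    import numpy as np

    rng = np.random.default_rng(2)
    n_workers = 2

    W_conv = rng.normal(size=(8, 8))                      # replicated on every worker
    W_fc_parts = np.split(rng.normal(size=(8, 6)), n_workers, axis=1)   # model split

    batch = rng.normal(size=(16, 8))
    shards = np.split(batch, n_workers)                   # data parallelism: split the batch

    # Data-parallel stage: every worker applies the same replicated weights
    # to its own shard of the input.
    feat_shards = [np.maximum(s @ W_conv, 0.0) for s in shards]
    features = np.concatenate(feat_shards)                # gather before the FC stage

    # Model-parallel stage: each worker holds only part of the FC weights and
    # computes the corresponding slice of the output for the whole batch.
    out_slices = [features @ W_part for W_part in W_fc_parts]
    output = np.concatenate(out_slices, axis=1)
    print(output.shape)                                   # (16, 6)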

Example system of multi-GPUs

Facebook designed a parallel framework using four NVIDIA TITAN GPUs with 6 GB of RAM on a single server, combining data parallelism and model parallelism. The ImageNet 2012 dataset can be trained in 5 days [28].

The Commodity Off-The-Shelf High Performance Computing (COTS HPC) system was designed by Google to train large-scale deep networks with more than 1 billion parameters. COTS HPC consists of GPU servers with InfiniBand interconnections, and the communication between different GPUs is controlled by the Message Passing Interface (MPI). The training of a large deep net with more than 1 billion parameters was completed in 3 days on COTS HPC [29]. The same experiment had been done with DistBelief, but COTS HPC provides a much cheaper and faster way of doing it.


URL: //www.sciencedirect.com/science/article/pii/B9780128053942000040

Paradigms for Developing Cloud Applications

Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012

Data parallelism versus task parallelism

Data parallelism is a way of performing parallel execution of an application on multiple processors. It focuses on distributing data across different nodes in the parallel execution environment and enabling simultaneous sub-computations on these distributed data across the different compute nodes. This is typically achieved in SIMD mode (Single Instruction, Multiple Data mode) and can either have a single controller controlling the parallel data operations or multiple threads working in the same way on the individual compute nodes (SPMD).

In contrast, task parallelism focuses on distributing parallel execution threads across parallel computing nodes. These threads may execute the same or different code. They exchange messages either through shared memory or explicit communication messages, as per the parallel algorithm. In the most general case, each thread of a task-parallel system can be doing a completely different task while coordinating with the others to solve a specific problem. In the simplest case, all threads execute the same program and use their node IDs to vary their task responsibilities. The most common task-parallel algorithms follow the master-worker model, where there is a single master and multiple workers. The master distributes the computation to the different workers based on scheduling rules and other task-allocation strategies.
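As an illustration (not from the chapter), here is a minimal master-worker sketch in Python: the master pushes work items onto a shared queue and each worker pulls whatever task comes next; the "square" and "negate" tasks are hypothetical placeholders.

    import multiprocessing as mp

    def worker(tasks, results):
        # Each worker repeatedly takes whatever task the master has queued.
        while True:
            item = tasks.get()
            if item is None:                 # sentinel: no more work
                break
            kind, payload = item
            if kind == "square":
                results.put(payload * payload)
            elif kind == "negate":
                results.put(-payload)

    if __name__ == "__main__":
        tasks, results = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(3)]
        for w in workers:
            w.start()
        # The master distributes (possibly different) tasks based on its own rules.
        for i in range(10):
            tasks.put(("square" if i % 2 == 0 else "negate", i))
        for _ in workers:
            tasks.put(None)
        out = [results.get() for _ in range(10)]
        for w in workers:
            w.join()
        print(sorted(out))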

MapReduce falls under the category of data parallel SPMD architectures.
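A minimal illustration of why: the same map function runs over every partition of the input (SPMD), and the per-partition results are then reduced. Python's multiprocessing stands in here for a real MapReduce runtime.

    from multiprocessing import Pool
    from collections import Counter
    from functools import reduce

    def map_partition(lines):
        # The same map code runs on every partition of the data (SPMD).
        return Counter(word for line in lines for word in line.split())

    if __name__ == "__main__":
        corpus = ["the quick brown fox", "the lazy dog", "the fox jumps"] * 100
        partitions = [corpus[i::4] for i in range(4)]
        with Pool(4) as pool:
            partials = pool.map(map_partition, partitions)
        # Reduce phase: merge the per-partition word counts.
        totals = reduce(lambda a, b: a + b, partials)
        print(totals.most_common(3))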


URL: //www.sciencedirect.com/science/article/pii/B9781597497251000056

Portable Explicit Vectorization Intrinsics

Paulo Souza, ... Philippe Thierry, in High Performance Parallelism Pearls, 2015

Why vectorization?

SIMD parallelism enhances the performance of computationally intensive applications that execute the same operation on distinct elements in a dataset. SIMD parallelism is typically accomplished by vectorization where successive instances of a scalar operation, which operates on a single pair of operands, are transformed into a vector instruction that operates on multiple pairs of operands at once, as exemplified in Figure 24.1.

Figure 24.1. Scalar loop with 40 iterations vectorized to 10 iterations, assuming a processor with vector register that can hold 4 scalar elements.

Today, most commodity processors are based on hardware architectures that feature SIMD vector instructions. Intel MMX/SSE/AVX/AVX-512, IBM Power AltiVec and Cell SPU, and ARM NEON are examples of instruction sets enabling loop vectorization.

This wide range of hardware implementations makes maintenance of explicit manual vectorization almost impossible across multiple architectures and vendors. The differences can be attributed to multiple factors like:

SIMD width: Intel’s Streaming SIMD Extensions (SSE) supports vectorization of four single-precision or two double-precision floating-point values. More recently, the Advanced Vector Extensions (AVX) introduced 256-bit registers, doubling the SSE vector length to either eight packed single-precision or four packed double-precision floating-point values. Intel’s Initial Many-Core Instructions (IMCI) on the Intel® Xeon Phi™ coprocessor use 512-bit vector registers (16 packed single-precision or 8 packed double-precision values), a width also present in the AVX-512 instruction set. AltiVec is also a SIMD instruction set for integer and floating-point vector computations. It features 128-bit registers that allow computations with four packed 32-bit integers or single-precision floating-point values (IBM Power6 or later) and two packed 64-bit double-precision floating-point values (IBM Power7 or later). It is important to note that although the register width and the HPC functionality of interest (ADD, MUL, etc.) are basically similar between SSE and AltiVec, the translation between the associated intrinsics can be troublesome. NEON (or Advanced SIMD) is the counterpart SIMD instruction set devised for ARM-based architectures. Depending on the version, it can support 8-bit to 64-bit integers and 32-bit single-precision floating point. Its latest versions can process 128 bits (four single-precision floating-point values) in the same execution cycle (for each architecture, see “For more information” at the end of this chapter).

Extended instructions: AVX2 added fused multiply-add support and nondestructive three-operand instructions that preserve source operand values. Together with IMCI, it also introduced comprehensive support for scatter/gather vector operations.

Register masking: AVX also greatly extended register masking functionality. This can be used to vectorize divergent code paths and implement conditional reduction primitives.

A good programming model for explicit vectorization must allow developers to write a single source code using all of the above features, regardless of the underlying hardware architecture.

Some compilers have an auto-vectorizer that can be good at recognizing common patterns found in loops and issuing the proper vector instructions. But auto-vectorization often produces suboptimal vector code due to factors like complex loop structure, lack of information about data dependencies, improper data alignment, pipeline synchronization, etc. Hence, compilers typically provide additional ways for developers to guide how the compiler vectorizes. The supported ways to improve code generation of vector instructions range from high-level language extensions for array notation, SIMD-enabled functions, SIMD pragmas/directives, and standards like OpenMP 4.0, down to lower-level coding approaches such as SIMD vector intrinsics and inline assembly code.

In some cases, applications may contain very specific and delimited regions of code where the majority of the compute time is spent and squeezing every bit of performance is a critical matter. Under these circumstances, vector SIMD intrinsics may provide performance gains closer to the theoretically attainable performance of the architecture. Vector intrinsics have been greatly improved by hardware vendors, which provide APIs allowing intrinsic “functions” to be inserted in the source code. However, most of the current API implementations available are far from being cross-platform compatible.


URL: //www.sciencedirect.com/science/article/pii/B9780128038192000082

Parallel Computing, Graphics Processing Unit (GPU) and New Hardware for Deep Learning in Computational Intelligence Research

M. Madiajagan MS, PhD, S. Sridhar Raj BTech, MTech, in Deep Learning and Parallel Computing Environment for Bioengineering Systems, 2019

1.2.4.2 Data Parallelism

The idea of data parallelism was popularized by Jeff Dean in the style of parameter averaging. We have three copies of the same model: we deploy the same model A over three different nodes, and a different subset of the data is fed to each of the three identical models. The values of the parameters are sent to the parameter server and, after all the parameters have been collected, they are averaged. Using the parameter server, the parameter vector omega is kept synchronized. The neural networks can be trained in parallel in two ways, i.e., synchronously (by waiting for one complete iteration and then updating the value of omega) and asynchronously (by sending possibly outdated parameters without waiting for a complete iteration). The amount of time taken by both methods is about the same, so the choice of method is not a big issue here. See Fig. 1.6.

Fig. 1.6. Data parallelism.
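A sketch of the synchronous variant of this parameter-averaging scheme, simulated with NumPy: three replicas each take one local gradient step on their own data subset, starting from the server's parameters, and the parameter server then averages the returned parameters. The least-squares toy objective is an assumption for illustration.

    import numpy as np

    def local_step(w, X, y, lr=0.05):
        # One local gradient step on this replica's data subset (least-squares loss).
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        return w - lr * grad

    rng = np.random.default_rng(3)
    true_w = np.array([1.0, -2.0, 0.5])
    subsets = []
    for _ in range(3):                              # one data subset per replica
        X = rng.normal(size=(64, 3))
        subsets.append((X, X @ true_w + 0.01 * rng.normal(size=64)))

    w_server = np.zeros(3)                          # parameters held by the server
    for it in range(200):
        # Each replica starts from the server's parameters (synchronous mode)...
        replicas = [local_step(w_server.copy(), X, y) for X, y in subsets]
        # ...and the parameter server averages the collected parameters.
        w_server = np.mean(replicas, axis=0)
    print(np.round(w_server, 2))                    # close to [ 1. -2.  0.5]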

To overcome the problems in data parallelism, task-level parallelism has been introduced. Independent computation tasks are processed in parallel by using conditional statements in GPUs. Task-level parallelism can act without the help of data parallelism only to a certain extent, beyond which the GPU needs data parallelism for better efficiency. But task-level parallelism gives more flexibility and computation acceleration to the GPUs.


URL: //www.sciencedirect.com/science/article/pii/B9780128167182000087

Manual Parallelization Versus State-of-the-Art Parallelization Techniques

Aleksandar Vitorović, ... Veljko M. Milutinović, in Advances in Computers, 2014

3.1 Levels of Parallelism

Domain decomposition or “data parallelism” implies partitioning data among processes (or parallel computing nodes), such that a single portion of data is assigned to a single process. The portions of data are of approximately equal size. If the portions require rather different amounts of time to be processed, the performance is limited by the speed of the slowest process. In that case, the problem can be mitigated by partitioning the data into a larger number of smaller portions: a process takes another portion once it finishes with the previous one, so a faster process is assigned more portions. The single-program-multiple-data (SPMD) paradigm is an example of data parallelism, as the processes share the same code but operate on different data. Another example is the parallelization of a loop with no loop-carried dependences, in which the processes execute the same loop body, but for different loop indices and, consequently, for different data.
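A small Python sketch of the many-small-portions remedy: the data are split into many more chunks than there are processes, and the pool hands a new chunk to whichever worker becomes free, so a faster worker ends up processing more portions. The per-chunk work is a placeholder.

    import multiprocessing as mp

    def process_chunk(chunk):
        # Placeholder for per-portion work of varying cost.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(100_000))
        # Partition the data into many small portions rather than one per process.
        chunk_size = 1_000
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        with mp.Pool(4) as pool:
            # imap_unordered hands out a new chunk to whichever worker is free.
            partials = list(pool.imap_unordered(process_chunk, chunks))
        print(sum(partials))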

In functional decomposition or “task parallelism,” processes are assigned pieces of code. Each piece of code works on the same data and is assigned to exactly one process. An example of task parallelism is computing the average and standard deviation on the same data. These two tasks can be executed by separate processes. Another example is parallelization of a loop with an if–then–else construct inside, such that different code is executed in different iterations.
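The average/standard-deviation example can be sketched directly: two different pieces of code run as separate processes on the same data (a minimal illustration, not from the chapter).

    import statistics
    from concurrent.futures import ProcessPoolExecutor

    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=2) as ex:
            # Functional decomposition: two different tasks on the same data.
            mean_future = ex.submit(statistics.mean, data)
            stdev_future = ex.submit(statistics.pstdev, data)
            print(mean_future.result(), stdev_future.result())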

In the development of application software for parallel machines, the parallelism can be specified and extracted either implicitly or explicitly by applying an appropriate parallel programming model.


URL: //www.sciencedirect.com/science/article/pii/B9780124202320000052

Designing and Supporting Scalable Data Analytics

Domenico Talia, ... Fabrizio Marozzo, in Data Analysis in the Cloud, 2016

4.3.3.6 Input Sweeping

Input sweeping is a pattern that exploits data parallelism: a set of input data items is analyzed independently to produce the same number of output data items. It is similar to the parameter-sweeping pattern, with the difference that in this case the sweeping is done on the input data rather than on a tool parameter. An example of the input sweeping pattern is represented in the following figure:

In this example, 10 training sets are processed in parallel by 10 instances of J48, to produce the same number of data mining models. Data arrays are used to represent both input data and output models, while a tool array is used to represent the J48 tools. The following JS4Cloud script corresponds to the example shown above:

var nMod = 10;

var MRef = Data.define("Model", nMod);

for(var i = 0; i<nMod; i++)

 J48({dataset:TsRef[i], model:MRef[i], confidence:0.1});

It is assumed that TsRef is a reference to an array of training sets created in a previous step. The for loop creates 10 instances of J48, where the ith instance takes as input TsRef[i] to produce MRef[i].

Another example of input sweeping pattern is represented in the following figure:

In this case there are 15 instances of a Predictor. Each Predictor takes as input one unclassified dataset and one model, and generates concurrently one classified dataset. The following JS4Cloud script corresponds to this example:

var nData = 5, nMod = 3;

var CRef = Data.define("ClassD", [nData, nMod]);

for(var i = 0; i<nData; i++)

 for(var j = 0; j<nMod; j++)

 Predictor({dataset:DRef[i], model:MRef[j], classDataset:CRef[i][j]});

Here it is assumed that DRef is a reference to an array of unlabeled datasets and MRef is a reference to an array of models, both created in previous steps. The double for loop creates a bidimensional array of classified datasets, denoted CRef, where CRef[i][j] is the classified dataset generated by a Predictor instance from DRef[i] using MRef[j]. Also in this case, since the tools are independent of each other, they can be executed in parallel by the runtime.


URL: //www.sciencedirect.com/science/article/pii/B9780128028810000044

Parallel program development

Peter S. Pacheco, Matthew Malensek, in An Introduction to Parallel Programming (Second Edition), 2022

7.5 Summary

In this chapter, we've looked at serial and parallel solutions to two very different problems: the n-body problem and sorting using sample sort. In each case we began by studying the problem and looking at serial algorithms. We continued by using Foster's methodology for devising a parallel solution, and then, using the designs developed with Foster's methodology, we implemented parallel solutions using Pthreads, OpenMP, MPI, and CUDA.

In developing the reduced MPI solution to the n-body problem, we determined that the “obvious” solution would be difficult to implement correctly, and it would require a huge amount of communication. We therefore turned to an alternative “ring pass” algorithm, which proved to be much easier to implement and is probably more scalable.

We considered implementing a CUDA n-body solver that used some of the same ideas we used in developing the reduced Pthreads and OpenMP solvers. However, we found that the amount of memory required would be too great. So we looked at an alternative approach to improving the performance of the basic CUDA n-body solver: we reduced the number of accesses to global memory by loading from global memory into fast, on-chip shared memory. Then, when we carried out the calculations, we accessed shared memory rather than global memory. We accomplished this by dividing the arrays into “virtual” tiles. Then each thread block loaded a tile, did all the required calculations involving the particles in that tile, and, when it was done with the particles in the current tile, proceeded to the next tile. Effectively, then, we used shared memory as a programmer-managed cache.

We looked at two serial implementations of sample sort. In the first, we chose the sample using a random number generator, sorted the sample, and chose “splitters” to be equally spaced elements of the sample. Then the elements of the list were mapped to their destination “buckets” and each bucket was sorted.
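A short Python sketch of this first serial scheme (the sample size and bucket count are arbitrary choices for illustration): draw a random sample, sort it, take equally spaced elements as splitters, map each element to its bucket, and sort the buckets.

    import bisect
    import random

    def sample_sort(a, n_buckets=4, sample_size=32):
        # Choose the sample with a random number generator and sort it.
        sample = sorted(random.choices(a, k=sample_size))
        # Choose splitters to be equally spaced elements of the sorted sample.
        step = sample_size // n_buckets
        splitters = [sample[i * step] for i in range(1, n_buckets)]
        # Map each element of the list to its destination bucket...
        buckets = [[] for _ in range(n_buckets)]
        for x in a:
            buckets[bisect.bisect_left(splitters, x)].append(x)
        # ...and sort each bucket.
        return [x for b in buckets for x in sorted(b)]

    data = [random.randint(0, 999) for _ in range(1000)]
    assert sample_sort(data) == sorted(data)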

The second serial implementation used a deterministic scheme to choose the sample, and the contents of the buckets were determined using exclusive prefix sums. An inclusive prefix sum of the elements of a list {a_0, a_1, …, a_{n−1}} is the new list

a_0, a_0 + a_1, …, a_0 + a_1 + ⋯ + a_{n−1}.

An exclusive prefix sum starts with 0. So it is

0, a_0, a_0 + a_1, …, a_0 + a_1 + ⋯ + a_{n−2}.
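In Python, for example, the two scans can be computed as follows (itertools.accumulate gives the inclusive scan; shifting it and prepending 0 gives the exclusive scan):

    from itertools import accumulate

    a = [3, 1, 4, 1, 5]
    inclusive = list(accumulate(a))      # [3, 4, 8, 9, 14]
    exclusive = [0] + inclusive[:-1]     # [0, 3, 4, 8, 9]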

We parallelized both implementations using data parallelism. For MIMD systems, the “sublists” were contiguous blocks of n/p elements of the original list and the “buckets” were contiguous blocks of elements of the final list—which, in general, did not have exactly n/p elements. For CUDA, we used the same terminology, but the definitions were different. In the first implementation, we used a single thread block, so the sublists had n/(threads per block) elements and the buckets had approximately n/(threads per block) elements. In the second implementation, we used multiple thread blocks, and the sublists had n/(number of thread blocks) elements and the buckets had approximately this many elements.

All of the implementations used similar ideas. In general, the first implementations did the sampling and the identification of splitters using a single process/thread on the MIMD systems and the host processor with CUDA. The building of data structures, the mapping of elements to buckets, and the final sorting were done in parallel. The second implementations used parallelism at all stages, except that the OpenMP and Pthreads implementations used a serial sorting algorithm to sort the sample.

In closing, we looked briefly at the problem of deciding which API to use. Our first consideration was whether the problem was suitable for parallelization using CUDA: serial programs that use more or less the same operations on different data items are usually well suited to a CUDA parallelization. The second consideration is whether to use shared memory or distributed memory. To decide this, we should look at the memory requirements of the application and the amount of communication among the processes/threads. If the memory requirements are great or the distributed-memory version can work mainly with cache, then a distributed-memory program is likely to be much faster. On the other hand, if there is considerable communication, a shared-memory program will probably be faster.

When you're choosing between OpenMP and Pthreads, if there's an existing serial program and it can be parallelized by the insertion of OpenMP directives, then OpenMP is probably the clear choice. However, if complex thread synchronization is needed—for example, read-write locks or thread signaling—then Pthreads will be easier to use.

7.5.1 MPI

In the course of developing these programs, we learned about several additional MPI collective communication functions. In the first MPI implementation of sample sort, we recalled a generalization of MPI_Gather, MPI_Gatherv, which allows us to gather different numbers of elements from one process onto another. It gathers the contents of each process' send buffer into the receive buffer on the process with rank root in the communicator. However, unlike MPI_Gather, the number of elements sent by each process may differ from one process to another. So the recvcounts argument tells us the number of elements each process is sending, and the displs argument tells us where each process' first element goes in the receive buffer.

We also used two MPI implementations of “all-to-all scatter-gather.” These combine the features of both scatter and gather: each process scatters its own collection to all the processes in the communicator, and each process gathers data from all the processes. The first implementation is MPI_Alltoall, which sends the same amount of data from each process to each process. The second implementation is MPI_Alltoallv, which can send different amounts of data to different processes.

We also learned about the nonblocking send function MPI_Isend. We used this to implement a memory-efficient send-receive. Its first six arguments are the same as the arguments to MPI_Send. When it's called, it starts a send, but when it returns, the send will not, in general, be completed. The output argument, the request, is an opaque object. It's used by the program and the MPI system to identify the communication. When we want to finish the communication, we can call the function MPI_Wait. When this function is called, it will block until the call associated with the request (in our case, the call to MPI_Isend) has completed. The status argument will return information on the call. In our setting we didn't use it, so we just passed in MPI_STATUS_IGNORE.

To determine whether the process with which we were communicating had sent its data, we used the function MPI_Probe. This blocks until it is notified that there is a message from the given source process with the given tag in the given communicator. The status argument will return the usual status fields. In particular, we can use the function MPI_Get_count to determine how big the message is, and we used this to be sure sufficient storage was allocated for the receive buffer for the message. If we're unable to allocate sufficient storage for the receive buffer, we can call MPI_Abort. This is supposed to terminate all the processes in the specified communicator and return an error code to the invoking environment, although most implementations terminate all the processes in MPI_COMM_WORLD.


URL: //www.sciencedirect.com/science/article/pii/B9780128046050000142

Parallel Algorithms

Thomas Sterling, ... Maciej Brodowicz, in High Performance Computing, 2018

9.7 Permutation: Cannon's Algorithm

Among algorithms that rely upon a data parallelism approach, where the same algorithm is applied to different data to extract concurrency, a certain subclass of problems relies upon permutation routing operations to perform all-to-all operations iteratively. This type of parallel algorithm is very frequently used in applications requiring a linear algebra transpose operation or some type of matrix–matrix multiplication. In this section, one such example is explored: Cannon's algorithm for dense matrix–matrix multiplication [5].

In computational linear algebra, algorithms involving matrix operations are frequently divided into two classes: sparse and dense. Sparse matrices refer to those matrices that are dominated by zeros and generally employ some type of compression algorithm so that the zero entries are neither stored nor operated on. Dense matrices are those which are dominated by nonzero entries. Cannon's algorithm is a matrix–matrix multiplication algorithm for distributed memory parallelism designed for dense matrices, and relies heavily on permutation routing.

Matrix–matrix multiplication for two N × N matrices A and B is summarized in Eq. (9.9)

(9.9) C_{ij} = ∑_{k=0}^{N−1} A_{ik} B_{kj}

where the subscripts indicate the row and column index of the matrix entry. To create a parallel algorithm for Eq. (9.9), a good place to start is a block algorithm that distributes subblocks of A, B, and C among processes, where each subblock is of size N/√P × N/√P and P is the number of processes. This is illustrated in Fig. 9.17.

Figure 9.17. The global N × N matrices A and B are partitioned into P subblocks so that each subblock is of size N/√P × N/√P. In this illustration, P = 16. Each process holds only one subblock.

For example, computing the subblock C_{11} of the matrix–matrix product of A×B requires computing several serial matrix–matrix products, each of size N/√P × N/√P, as illustrated in Fig. 9.18.

Figure 9.18. To compute the C_{11} subblock of the matrix–matrix product of A×B, several matrix–matrix products of the highlighted subblocks must be computed. However, one block is assigned to each process, and only subblocks A_{11} and B_{11} are local to the process where C_{11} resides. All other subblocks must be communicated.

For this block partitioning approach, matrix–matrix multiplication becomes a matter of orchestrating the communication and computation of the various serial subblock matrix–matrix products. This is the heart of Cannon's algorithm.

Initially the subblocks are mapped to each process, as illustrated in Fig. 9.19.

Figure 9.19. The subblocks are each mapped to a process for distributed memory parallelism. The process number is indicated in the upper left-hand corner in this illustration.

To set up Cannon's algorithm, the A subblocks are shifted to the left while the B subblocks are shifted up, as illustrated in Figs. 9.20 and 9.21.

Figure 9.20. The A subblocks are permuted to the left to set up Cannon's algorithm.

Figure 9.21. The B subblocks are permuted up to set up Cannon's algorithm.

The memory layout after the set-up permutations is shown in Fig. 9.22.

Figure 9.22. The layout of the matrix subblocks after performing the permutations illustrated in Figs. 9.20 and 9.21. This completes the setup of Cannon's algorithm.

Cannon's algorithm consists of moving matrix subblocks so that for each iteration k from 0 to 3 the matrix subblocks A_{i,(i+j+k)} and B_{(i+j+k),j} are located on the same process as C_{ij}. For each iteration, the partial sum in Eq. (9.10) is accumulated into C_{ij}:

(9.10) C_{ij} += A_{i,(i+j+k)} B_{(i+j+k),j}

where each subblock matrix–matrix multiplication uses Eq. (9.9) to compute the matrix–matrix product. The index sums in Eq. (9.10), i + j + k, are taken modulo √P (4 in this example). Thus if (i + j + k) = 6, the index in the matrix would become 2.

For k = 0, Cannon's algorithm has already been set up. For example, in Fig. 9.22 matrix C_{31} is located on the same process as matrices A_{30} and B_{01}. For every subsequent iteration of k, the A matrices have to be shifted left once and the B matrices have to be shifted up once to satisfy the condition of Eq. (9.10) and compute the partial sum. This is illustrated in Fig. 9.23.

Figure 9.23. For each subsequent iteration of k, the B matrices are shifted up and the A matrices are shifted to the left to fulfill the condition for Eq. (9.10).

After √P iterations of k, the matrix–matrix product has been computed. The resulting matrices for each of the k iterations for the example are shown in Fig. 9.24. Cannon's algorithm is summarized in Fig. 9.25.

Figure 9.24. The distribution of the subblock matrices for each iteration of Cannon's algorithm for the example presented in Fig. 9.18.

Figure 9.25. Summary of Cannon's algorithm for dense matrix–matrix multiplication.
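A compact NumPy simulation may help fix the indexing: the subblocks are stored in a √P × √P grid of Python lists, plain list rotations stand in for the permutation routing between processes, and each grid position accumulates the partial sums of Eq. (9.10). This is a sketch of the communication pattern only, not a distributed implementation.

    import numpy as np

    def cannon_matmul(A, B, p=4):
        # p x p process grid (P = p*p processes); each holds one (N/p) x (N/p) subblock.
        n = A.shape[0] // p
        Ab = [[A[i*n:(i+1)*n, j*n:(j+1)*n].copy() for j in range(p)] for i in range(p)]
        Bb = [[B[i*n:(i+1)*n, j*n:(j+1)*n].copy() for j in range(p)] for i in range(p)]
        Cb = [[np.zeros((n, n)) for _ in range(p)] for _ in range(p)]

        # Set-up permutations: shift row i of A left by i and column j of B up by j,
        # so that A[i][(i+j) mod p] and B[(i+j) mod p][j] start out co-located with C[i][j].
        Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]
        Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]

        for k in range(p):
            # Local computation: every process multiplies its resident subblocks.
            for i in range(p):
                for j in range(p):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            # Permutation routing: shift A subblocks left once and B subblocks up once.
            Ab = [row[1:] + row[:1] for row in Ab]
            Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]

        return np.block(Cb)

    A = np.random.rand(8, 8)
    B = np.random.rand(8, 8)
    assert np.allclose(cannon_matmul(A, B, p=4), A @ B)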


URL: //www.sciencedirect.com/science/article/pii/B9780124201583000095

Modern Architectures

Bertil Schmidt, ... Moritz Schlarb, in Parallel Programming, 2018

AoS and SoA

In order to exploit the power of SIMD parallelism it is often necessary to modify the layout of the employed data structures. In this subsection we study two different ways to store a sequence of records where each record consists of a (fixed) number of elements:

AoS (Array of Structures) simply stores the records consecutively in a single array.

SoA (Structure of Arrays) uses one array per dimension. Each array only stores the values of the associated element dimension.

As a case study we use a collection of n real-valued 3D vectors (i.e. each vector has x, y, and z coordinates) to compare the SIMD-friendliness of AoS and SoA. A definition of the corresponding AoS would be:

    auto xyz = new float[3*n];

A definition of the corresponding SoA would look like:

    auto x = new float[n];
    auto y = new float[n];
    auto z = new float[n];

Fig. 3.12 illustrates the memory layout of AoS and SoA for a collection of 3D vectors.

Figure 3.12. Comparison of the AoS and the SoA memory layout of a collection of eight 3D vectors: (x0,y0,z0),…,(x7,y7,z7).

We now want to normalize each vector; i.e., we want to map each vector v_i = (x_i, y_i, z_i) to

(3.3) v̂_i = v_i / ‖v_i‖ = (x_i/ρ_i, y_i/ρ_i, z_i/ρ_i), where ρ_i = √(x_i² + y_i² + z_i²).

Normalization of vectors is a common operation in computer graphics and computational geometry. Using n 3D vectors stored in the AoS data layout in the array xyz as defined above, this can be performed sequentially in a straightforward non-vectorized way by the following function plain_aos_norm.

    void plain_aos_norm(float * xyz, uint64_t length) {

        for (uint64_t i = 0; i < 3*length; i += 3) {
            const float x = xyz[i+0];
            const float y = xyz[i+1];
            const float z = xyz[i+2];

            float irho = 1.0f/std::sqrt(x*x+y*y+z*z);

            xyz[i+0] *= irho;
            xyz[i+1] *= irho;
            xyz[i+2] *= irho;
        }
    }

Unfortunately, vectorization of 3D vector normalization based on the AoS format would be relatively inefficient because of the following reasons:

1.

Vector registers would not be fully occupied; e.g., for a 128-bit register and single-precision floating-point numbers, a single vector would only occupy three of the four available vector lanes.

2.

Summing the squares (for the computation of irho in plain_aos_norm) would require operations between neighboring (horizontal) lanes resulting in only a single value for the inverse square root calculation.

3.

Scaling to longer vector registers becomes increasingly inefficient.

By contrast, SIMD parallelization is more efficient when the 3D vectors are stored in SoA format. The following function avx_soa_norm assumes that the n vectors are stored in the SoA data layout using the three arrays x, y, and z, and implements normalization using AVX2 registers.

    void avx_soa_norm(float * x, float * y, float * z,
                       uint64_t length) {

        for (uint64_t i = 0; i < length; i += 8) {

            // aligned loads
            __m256 X = _mm256_load_ps(x+i);
            __m256 Y = _mm256_load_ps(y+i);
            __m256 Z = _mm256_load_ps(z+i);

            // R <- X*X+Y*Y+Z*Z
             __m256 R = _mm256_fmadd_ps(X, X,
                        _mm256_fmadd_ps(Y, Y,
                        _mm256_mul_ps  (Z, Z)));

            // R <- 1/sqrt(R)
            R = _mm256_rsqrt_ps(R);

            // aligned stores
            _mm256_store_ps(x+i, _mm256_mul_ps(X, R));
            _mm256_store_ps(y+i, _mm256_mul_ps(Y, R));
            _mm256_store_ps(z+i, _mm256_mul_ps(Z, R));
        }
    }

During each loop iteration, eight vectors are normalized simultaneously. This is made possible by the SoA layout, where each lane of the arrays x[], y[], and z[] stores the corresponding coordinate of a vector, as illustrated in Fig. 3.12, leading to an efficient SIMD implementation.

However, certain applications still prefer to arrange their geometric data in the compact AoS format since other operations might benefit from more densely packed vectors. Nevertheless, we still would like to take advantage of the efficient SoA-based SIMD code. In this case, a possible solution using 256-bit registers could work as follows:

1.

Transpose eight consecutive 3D vectors stored in AoS format into SoA format using three 256-bit registers.

2.

Perform the vectorized SIMD computation using the SoA format.

3.

Transpose the result from SoA back to AoS format.

The transposition of data between AoS and SoA requires a permutation of values. A possible implementation is shown in Fig. 3.13. In order to implement the illustrated data rearrangement from AoS to SoA using AVX2, we will take advantage of the following three intrinsics:

Figure 3.13. Transposition of eight 3D vectors: (x0,y0,z0),…,(x7,y7,z7) stored in AoS format into SoA format using 256-bit registers. Indicated register names correspond to the variables used in Listing 3.3. The upper part loads vector elements into the registers M03, M14, and M25. The lower part performs five shuffle operations to store the vectors into the 256-bit registers X, Y, and Z. The indicated number of each shuffle operation corresponds to the order of shuffle intrinsics used in Listing 3.3.

__m256 _mm256_shuffle_ps(__m256 m1, __m256 m2, const int sel): selects elements from m1 and m2 according to four 2-bit values (i.e. four numbers between 0 and 3) stored in sel to be placed in the output vector. The first two elements of the output vectors are selected from the first four elements of m1 according to the first two bit-pairs of sel. The third and fourth elements are selected from the first four elements of m2 according to the third and fourth bit-pair of sel. The elements five to eight in the output vector are selected in a similar way but choosing from the elements five to eight in m1 and m2 instead. For example, shuffle operation “2” in Fig. 3.13 can be implemented by:

YZ = _mm256_shuffle_ps(M03, M14, _MM_SHUFFLE(1,0,2,1));

whereby the values 1 and 2 select elements from M03 (i.e. y0, z0 from the lower half and y4, z4 from the upper half) and the values 0 and 1 from M14 (i.e. y1, z1 from the lower half and y5, z5 from the upper half). Those are then combined in YZ to form the vector (y0,z0,y1,z1,y4,z4,y5,z5).

__m256 _mm256_castps128_ps256(__m128 a): typecasts a 128-bit vector into a 256-bit vector. The lower half of the output vector contains the source vector values and the upper half is undefined.

__m256 _mm256_insertf128_ps(__m256 a, __m128 b, int offset): inserts a 128-bit vector into a 256-bit vector according to an offset. For example, loading two 128-bit vectors M[0] and M[3], storing the elements x0,y0,z0,x1 and x4,y4,z4,x5, into the single 256-bit AVX register M03 can be accomplished by:

   M03 = _mm256_castps128_ps256(M[0]);
   M03 = _mm256_insertf128_ps(M03 ,M[3],1);

The code for vectorized normalization of an array of 3D vectors stored in AoS format using transposition into SoA format is shown in Listing 3.3. Our solution transposes eight subsequent 3D vectors at a time. The corresponding function avx_aos_norm consists of three stages: AoS2SoA, SoA computation, and SoA2AoS. The AoS2SoA stage starts by loading pairs of four subsequent vector elements into three 256-bit registers using intrinsics _mm256_castps128_ps256 and _mm256_insertf128_ps as described above and as illustrated in the upper part of Fig. 3.13. Subsequently, we apply five _mm256_shuffle_ps operations to implement the necessary shuffling of vector elements as illustrated in the lower part of Fig. 3.13. The efficiently vectorized SoA computation can then proceed in the way we have studied earlier. Since the result is stored in SoA, it needs to be transposed back to AoS format. This is implemented by the six corresponding shuffling operations in the SoA2AoS part of the function.

Actual execution of our AVX program and the corresponding plain AoS program on an Intel i7-6800K CPU using n = 2^28 vectors produces the following runtimes:

    # elapsed time (plain_aos_normalize): 0.718698s
    # elapsed time (avx_aos_normalize): 0.327667s

We can see that despite the transposition overhead, the vectorized implementation can still achieve a speedup of around 2.2.


URL: //www.sciencedirect.com/science/article/pii/B9780128498903000034

How do you achieve parallelism in computing?

Fundamentals of Parallel Computer Architecture: the classes of parallel computer architectures include multi-core computing, among others. A multi-core processor is a computer processor integrated circuit with two or more separate processing cores, each of which executes program instructions in parallel.

How is parallelism achieved in parallel programming?

Data parallelism is a way of performing parallel execution of an application on multiple processors. It focuses on distributing data across different nodes in the parallel execution environment and enabling simultaneous sub-computations on these distributed data across the different compute nodes.

What is data parallelism with example?

Data parallelism means concurrent execution of the same task on multiple computing cores. Let's take an example: summing the contents of an array of size N. On a single-core system, one thread would simply sum the elements [0] … [N − 1].
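Continuing that example on a multi-core system, a minimal Python sketch splits the array into one slice per core, sums the slices concurrently, and adds the partial sums (a simplified illustration, not part of the quoted source):

    from multiprocessing import Pool

    def partial_sum(chunk):
        return sum(chunk)                # the same task runs on every core

    if __name__ == "__main__":
        N = 1_000_000
        data = list(range(N))
        cores = 4
        slices = [data[i::cores] for i in range(cores)]
        with Pool(cores) as pool:
            print(sum(pool.map(partial_sum, slices)))   # equals sum(data)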

What is data parallelism how is it achieved in Map Reduce?

The MapReduce programming model was created for processing data that requires “data parallelism,” the ability to compute multiple independent operations in any order (King). In parallel processing, commutative operations are operations where the order of execution does not affect the result.
