Parallel Computing CS149 ASST 1 note

This series will cover my assignment solutions and some notes on the Stanford CS149 lectures. I truly recommend this course to anyone who wants to learn parallel computing.

Program 1 (run on Apple M3)#

important things#

  • Coherent execution: the property of a program where the same instruction sequence applies to many data elements
  • Coherent execution IS NECESSARY for SIMD processing resources to be used efficiently
  • Coherent execution IS NOT NECESSARY for efficient parallelization across different cores, since each core has the capability to fetch/decode different instructions from its thread's instruction stream

task#

  • Make the multi-threaded Mandelbrot set produce the same result as the serial method.
  • Add code to the workerThreadStart function to accomplish this.
  • Only workerThreadStart needs to change.

notice#

Because the image height is 1200 (we could change that), if we split the picture into height / numThreads rows per thread and the division is not exact, some rows are never rendered and the output is wrong. For example, 1200 / 16 = 75 works out evenly, but 1200 / 7 does not, so three rows would be dropped unless someone picks up the remainder.

done#

2 threads#

We split the picture generation into 2 parts. The first thread handles the top half of the picture, and the other draws the bottom half.

Result:

  • Serial: 316.964ms
  • 2-Threads: 163.130ms

Almost two times faster

void workerThreadStart(WorkerArgs * const args) {
    // NOTE: in a program that uses two threads, thread 0 could compute the top
    // half of the image and thread 1 could compute the bottom half.
    printf("[thread %d] working\n", args->threadId);
    if (!args->threadId) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         0, args->height / 2,
                         args->maxIterations, args->output);
    } else {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         args->height / 2, args->height / 2,
                         args->maxIterations, args->output);
    }
}

more threads using the split method#

This idea just divides the picture horizontally into strips, one per thread (matching the thread count).
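A minimal sketch of what this looks like, assuming the WorkerArgs fields from the starter code (threadId, numThreads, width, height, and so on) — not necessarily my exact code:

```cpp
void workerThreadStart(WorkerArgs * const args) {
    // Contiguous split: thread t renders rows [startRow, startRow + numRows).
    unsigned int rowsPerThread = args->height / args->numThreads;
    unsigned int startRow = args->threadId * rowsPerThread;
    // The last thread picks up the remainder when height is not divisible
    // by the thread count (the issue mentioned in the notice above).
    unsigned int numRows = (args->threadId == args->numThreads - 1)
                               ? (args->height - startRow)
                               : rowsPerThread;

    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height,
                     startRow, numRows,
                     args->maxIterations, args->output);
}
```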

Result:

  • 8-Threads: 3.53x speedup
  • 16-Threads: 5.13x speedup

Actually this doesn't give me 8 times the speed, only around 4-5 times.

some bad things#

When we run view 1 with an odd number of threads, the speedup is smaller than with fewer threads.

e.g. 3 threads get a 1.58x speedup while 2 threads get 1.94x. This is likely a load-imbalance problem: in view 1 the expensive pixels are concentrated in the middle rows, so with 3 threads the middle thread does most of the work while the other two finish early and wait.

Program 2 (run on Ryzen7 6800H)#

I must say this is a very interesting assignment. Since we are only about two lectures in, real SIMD instructions are not easy to pick up yet, but CS149intrin helps. Although it only simulates vector lanes with fixed-length arrays, the framework reports vector utilization, which helps build SIMD intuition in an approachable way.

task#

  • using CS149’s “fake vector intrinsics” defined in CS149intrin.h.
  • vectorize clampedExpVector & arraySumVector so they can run on a machine with SIMD vector instructions.

done#

Vectorize Clamped EXP#

Thanks to the serial implementation, my job is just to translate the serial version into the "CS149 SIMD" version. You can see my solution in asst1/prog2_vecintrin/main.cpp
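For reference, here is a minimal sketch of one way that translation can look, built only from the CS149intrin helpers (_cs149_vgt_int, _cs149_vgt_float, _cs149_cntbits, and friends). It is not necessarily identical to my code in the repo, and it handles a tail where N is not a multiple of VECTOR_WIDTH:

```cpp
void clampedExpVector(float* values, int* exponents, float* output, int N) {
  __cs149_vec_float x, result, clamp;
  __cs149_vec_int count, zero, one;
  __cs149_mask maskAll, maskActive, maskClamp;

  for (int i = 0; i < N; i += VECTOR_WIDTH) {
    // Mask off the tail lanes when fewer than VECTOR_WIDTH elements remain.
    int lanes = (N - i < VECTOR_WIDTH) ? (N - i) : VECTOR_WIDTH;
    maskAll = _cs149_init_ones(lanes);

    _cs149_vload_float(x, values + i, maskAll);       // x = values[i..]
    _cs149_vload_int(count, exponents + i, maskAll);  // count = exponents[i..]
    _cs149_vset_float(result, 1.f, maskAll);          // result = 1 covers y == 0
    _cs149_vset_float(clamp, 9.999999f, maskAll);
    _cs149_vset_int(zero, 0, maskAll);
    _cs149_vset_int(one, 1, maskAll);

    // Keep multiplying only in lanes whose exponent is still > 0.
    _cs149_vgt_int(maskActive, count, zero, maskAll);
    while (_cs149_cntbits(maskActive) > 0) {
      _cs149_vmult_float(result, result, x, maskActive); // result *= x
      _cs149_vsub_int(count, count, one, maskActive);    // count--
      _cs149_vgt_int(maskActive, count, zero, maskAll);
    }

    // Clamp lanes where result > 9.999999f, then store.
    _cs149_vgt_float(maskClamp, result, clamp, maskAll);
    _cs149_vset_float(result, 9.999999f, maskClamp);
    _cs149_vstore_float(output + i, result, maskAll);
  }
}
```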

Vectorize Array Sum#

I like the hint ❤️: "You may find the hadd and interleave operations useful." That helped me a lot.

This is the part of the code where I use hadd and interleave.

  • hadd
    • Adds up adjacent pairs of elements, so [0 1 2 3] -> [0+1 0+1 2+3 2+3]
  • interleave
    • Performs an even-odd interleaving where all even-indexed elements move to front half of the array and odd-indexed to the back half, so [0 1 2 3 4 5 6 7] -> [0 2 4 6 1 3 5 7]

So if I want to sum [1 2 3 4], I just do this:

  • [1 2 3 4] -> [1+2 1+2 3+4 3+4] -> [3 3 7 7]
  • [3 3 7 7] -> [3 7 3 7]
  • [3 7 3 7] -> [3+7 3+7 3+7 3+7] -> result
// Assumes N is a multiple of VECTOR_WIDTH, as the assignment guarantees.
__cs149_vec_float sum, temp;
__cs149_mask maskAll = _cs149_init_ones();
_cs149_vset_float(sum, 0.f, maskAll);

// O( N / VECTOR_WIDTH ): accumulate VECTOR_WIDTH partial sums
for (int i = 0; i < N; i += VECTOR_WIDTH) {
    _cs149_vload_float(temp, values + i, maskAll);
    _cs149_vadd_float(sum, sum, temp, maskAll);
}
// O( log2(VECTOR_WIDTH) ): reduce the partials with hadd + interleave
for (int i = 1; i < VECTOR_WIDTH; i *= 2) {
    _cs149_hadd_float(sum, sum);
    _cs149_interleave_float(sum, sum);
}
float result[VECTOR_WIDTH];
_cs149_vstore_float(result, sum, maskAll);
return result[0];

Program 3 (Ryzen7 6800H)#

Part 1#

In mandelbrot_ispc_task, we can see that every task is assigned a contiguous strip of the picture.

// taskIndex is an ISPC built-in
uniform int ystart = taskIndex * rowsPerTask;
uniform int yend = ystart + rowsPerTask;

Below is the speedup for different task counts on view-1.

| number | multi-task ispc | ispc |
| --- | --- | --- |
| 2 | 16.87x | 8.89x |
| 4 | 21.67x | 7.76x |
| 8 | 32.71x | 9.02x |
| 16 | 51.41x | 7.93x |
| 32 | 71.71x ~ 100.x | 5.87x |
| 200 | 72.00x ~ 91.x | 5.88x |
| 800 | 70.x ~ 110.x | 5.83x |

Below is the speedup for different task counts on view-2.

| number | multi-task ispc | ispc |
| --- | --- | --- |
| 2 | 7.89x | 4.92x |
| 4 | 17.05x | 7.54x |
| 8 | 18.47x | 4.94x |
| 16 | 33.86x | 4.91x |
| 32 | 46.36x ~ 68.x | 4.92x |
| 200 | 59.41x ~ 60.x | 4.92x |
| 800 | 57.x ~ 94.x | 5.43x |

The result is very interesting: I even got 105.52x once when launching 32 tasks. But the plain ispc numbers are very unstable, sometimes reaching a 9.x speedup and sometimes only 5.x. And past 32 tasks it doesn't seem to get much faster.

I must say that in view-2 the 32-task version is slower than the 200-task version. I only listed the fastest upper bound I observed, and that happens rarely; if I ran more rounds and averaged them, the 200-task version would probably come out a little faster. The 800-task version actually seems even faster on average (60.x speedups appear much more often).

The Expected 8-wide AVX2 performance#

In theory, 8-wide AVX2 can do 8 times as much work per operation as serial code, so we should see an 8x speedup, right? But we get 5.x ~ 9.x, and most runs are below 8x. Why?

Coherent execution IS NECESSARY for SIMD. View-2 is much slower than view-1. If we look at where coherence breaks down, the most obvious place is the white/black boundary: view-2 has far more white-black boundary pixels, which is not ideal for SIMD. On the other hand, when we launch more tasks each task covers height / numTasks rows, so each task works on a smaller piece of the image, which may give it a more coherent chunk to process.

Part 2 determine how many tasks are best#

Actually I can't give a definitive answer. The Ryzen7 6800H has 8 cores running at most 16 threads simultaneously, but 32 tasks is clearly faster than 16 tasks, and for view-2, 800 tasks seems better on average. Profiling with perf stat shows that the 'thread version' (prog1_mandelbrot_thread) has higher CPU utilization, yet the ispc version using tasks is faster and beats the sequential code by over 32x. By the way, the 32-task run has lower CPU utilization than the 'thread version'.

After watching lecture 4, I think I get it: the machine only offers 16 hardware threads, but launching 32 tasks means each hardware thread gets two tasks that run back to back, which is more conducive to keeping the pipeline busy.

Program 4 (Ryzen7 6800H)#

Part 1 Build and Run#

We can see the speedup below.

| multi-task ispc | ispc |
| --- | --- |
| 55.24x | 4.97x |

Part 2 Best or Worst Case#

Best Case#

Since all the values are the same, every lane executes exactly the same SIMD instructions, while the sequential version still has to grind through the full iteration count for every single element, one at a time.
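For example, the initialization can be as simple as this (2.998f is just my illustrative pick of a value near 3 that needs many Newton iterations; any value that is identical across all lanes gives perfect coherence):

```cpp
// Best case (illustrative): identical values, each needing many iterations,
// so all 8 AVX2 lanes stay active for the whole loop.
for (unsigned int i = 0; i < N; i++) {
    values[i] = 2.998f;
}
```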

| multi-task ispc | ispc |
| --- | --- |
| 81.87x | 6.77x |

Worst Case#

The SIMD here is AVX2, which gives 8-wide vector instructions. So we want to break the 8-wide coherence so that the vector lanes can't do useful work together, while keeping the sequential version fast. Obviously the fastest input for sqrt is 1, which needs the fewest iterations. So we assign one slow-to-compute value in every group of 8 elements, and set the rest to the value the sequential version handles fastest.

I tried values from 2.5f to 2.8f and finally chose 2.68f, because we shouldn't pick a value too close to 3 or the sequential version's work also grows. Our goal is only to make SIMD unhelpful by breaking coherence, so in that sense we could just as well choose 0.1f.

| multi-task ispc | ispc |
| --- | --- |
| 6.98x | 0.94x |

// One "slow" value in every group of 8 breaks the 8-wide coherence;
// the other lanes get the input the serial code handles fastest.
if (i % 8 == 0) {
    values[i] = 2.68f;
} else {
    values[i] = 1.f;
}
// Alternative: a value close to 0 also needs many iterations.
// if (i % 8 == 0) {
//     values[i] = 0.101f;
// } else {
//     values[i] = 1.f;
// }

Part 3 AVX2 version#

Refer to the Intel Intrinsics Guide. The AVX2 version is close to the ispc version, but a little faster: in the worst case it gets a 1.2x speedup while ispc gets 0.9x, and in the good case the AVX2 version can also beat plain ispc.
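For reference, a hedged sketch of what the AVX2 Newton iteration can look like, assuming the same guess update as the serial code (guess = (3·guess − x·guess³) / 2), inputs in the valid (0, 3) range, and N divisible by 8; my actual code may differ in the details:

```cpp
#include <immintrin.h>

// Sketch: AVX2 sqrt via the reciprocal-sqrt Newton recurrence.
// sqrtAVX2 is a hypothetical name, not the assignment's.
void sqrtAVX2(int N, float initialGuess, float values[], float output[]) {
    const __m256 kThreshold = _mm256_set1_ps(0.00001f);
    const __m256 kOne   = _mm256_set1_ps(1.f);
    const __m256 kThree = _mm256_set1_ps(3.f);
    const __m256 kHalf  = _mm256_set1_ps(0.5f);
    const __m256 signMask = _mm256_set1_ps(-0.f);   // for fabs via andnot

    for (int i = 0; i < N; i += 8) {
        __m256 x     = _mm256_loadu_ps(values + i);
        __m256 guess = _mm256_set1_ps(initialGuess);

        // error = |guess*guess*x - 1|
        __m256 error = _mm256_andnot_ps(signMask,
            _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne));

        // Iterate until every lane converges; converged lanes keep their value.
        __m256 active = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        while (_mm256_movemask_ps(active) != 0) {
            // newGuess = (3*guess - x*guess^3) * 0.5
            __m256 g3 = _mm256_mul_ps(_mm256_mul_ps(guess, guess), guess);
            __m256 newGuess = _mm256_mul_ps(
                _mm256_sub_ps(_mm256_mul_ps(kThree, guess), _mm256_mul_ps(x, g3)),
                kHalf);
            // Only update lanes that are still active.
            guess = _mm256_blendv_ps(guess, newGuess, active);
            error = _mm256_andnot_ps(signMask,
                _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne));
            active = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        }
        _mm256_storeu_ps(output + i, _mm256_mul_ps(x, guess));
    }
}
```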


From here on, everything is run on the Ryzen7 6800H.


Program 5#

Part 1 speedup#

[saxpy serial]: [29.205] ms [10.204] GB/s [1.370] GFLOPS
[saxpy ispc]: [26.056] ms [11.438] GB/s [1.535] GFLOPS
[saxpy task ispc]: [21.102] ms [14.123] GB/s [1.896] GFLOPS
(1.12x speedup from ISPC to serial)
(1.38x speedup from task ISPC to serial)
(1.23x speedup from use of tasks to just ISPC)

A problem shows up here. The extra credit hints at thinking about how CPU caches work and about memory bandwidth. I don't know the myth machines' configuration, so to understand what is happening I test with just ispc and with just tasks.

I find that plain 'ispc' has many page faults and a lot of idle frontend cycles. All of this makes the program slower.

The multi-task version uses up to 345.2% CPU, far more than my earlier programs (those stayed around 100%, maybe a bit more). But the speedup is still not great.
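A quick back-of-the-envelope check from the numbers above (my own arithmetic, not from the handout): the reported GB/s and GFLOPS figures are consistent with roughly 16 bytes of memory traffic per element (reading x[i] and y[i], plus the write-allocate and write of result[i]) against only 2 flops. At the measured ~14.1 GB/s that traffic already accounts for the reported ~1.9 GFLOPS, so saxpy looks memory-bandwidth bound, and extra tasks mostly add pressure to an already busy memory bus.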

// ispc
Performance counter stats for './saxpy -i':
129.53 msec task-clock:u # 0.992 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,754 page-faults:u # 21.262 K/sec
254,357,756 cycles:u # 1.964 GHz
47,137,032 stalled-cycles-frontend:u # 18.53% frontend cycles idle
680,745,469 instructions:u # 2.68 insn per cycle
# 0.07 stalled cycles per insn
134,833,625 branches:u # 1.041 G/sec
19,039 branch-misses:u # 0.01% of all branches
// task-ispc
Performance counter stats for './saxpy -t':
435.02 msec task-clock:u # 3.452 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,791 page-faults:u # 6.416 K/sec
1,139,191,206 cycles:u # 2.619 GHz
2,627,409 stalled-cycles-frontend:u # 0.23% frontend cycles idle
695,864,736 instructions:u # 0.61 insn per cycle
# 0.00 stalled cycles per insn
134,857,706 branches:u # 310.000 M/sec
22,567 branch-misses:u # 0.02% of all branches

So I also look at the memory misses (the l1_dtlb_misses counter). I analysed the sqrt program too for comparison, and from this we can tell saxpy is hitting a memory bound.

// sqrt
39,405 l1_dtlb_misses:u # 5.129 K/sec
// task-ispc
25,874 l1_dtlb_misses:u # 60.855 K/sec
// ispc
19,639 l1_dtlb_misses:u # 151.536 K/sec

Part 2 try to improve#

My only idea is to pre-load some data into the cache: if 'task 1' is working on '1 ~ 1 + span', we could pull '2 ~ 2 + span' into the cache at the same time. But I don't know how to achieve it 😭.
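For what it's worth, here is a hedged sketch of how that idea can be expressed with GCC/Clang's __builtin_prefetch (saxpyPrefetch and PREFETCH_AHEAD are made-up names and tuning knobs, not from the assignment). Since saxpy looks bandwidth-bound rather than latency-bound, this may well not help much:

```cpp
// Sketch: saxpy with software prefetch a few cache lines ahead.
// PREFETCH_AHEAD is a hypothetical tuning knob.
void saxpyPrefetch(int N, float scale, const float* X, const float* Y, float* result) {
    const int PREFETCH_AHEAD = 64;  // elements, i.e. a few cache lines of floats
    for (int i = 0; i < N; i++) {
        if (i + PREFETCH_AHEAD < N) {
            __builtin_prefetch(X + i + PREFETCH_AHEAD, 0);  // 0 = prefetch for read
            __builtin_prefetch(Y + i + PREFETCH_AHEAD, 0);
        }
        result[i] = scale * X[i] + Y[i];
    }
}
```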

Program 6#

Since I can't access the myth machines, I generate the data myself.

Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
[Total Time]: 4297.427 ms

Part 1 profiling#

Based on the profiling, computeAssignments looks like the bottleneck, and dist is called the largest number of times. We can try to improve the performance of dist, computeAssignments, computeCentroids, and computeCost (initData is not our concern).

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
59.48 4.62 4.62 24 0.19 0.19 computeAssignments(WorkerArgs*)
15.19 5.80 1.18 24 0.05 0.05 computeCost(WorkerArgs*)
14.68 6.94 1.14 1 1.14 1.14 initData(double*, int, int)
8.75 7.62 0.68 24 0.03 0.03 computeCentroids(WorkerArgs*)
1.67 7.75 0.13 3000000 0.00 0.00 dist(double*, double*, int)
0.26 7.77 0.02 2 0.01 0.01 logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
0.00 7.77 0.00 4 0.00 0.00 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
0.00 7.77 0.00 2 0.00 0.00 CycleTimer::secondsPerTick()
0.00 7.77 0.00 1 0.00 6.48 kMeansThread(double*, double*, int*, int, int, int, double)
0.00 7.77 0.00 1 0.00 0.00 initCentroids(double*, int, int)
0.00 7.77 0.00 1 0.00 0.00 readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
0.00 7.77 0.00 1 0.00 0.00 writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

Part 2 SIMD dist#

As a quick first attempt, I wrote a SIMD dist using AVX2. It seems to cut the total time by about a second.

[Total Time]: 3024.816 ms
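For reference, a hedged sketch of what such an AVX2 dist can look like (4 doubles per __m256d, with a scalar tail); it is not necessarily identical to my distAVX2:

```cpp
#include <immintrin.h>
#include <cmath>

// Sketch of an AVX2 Euclidean distance over double data: 4 dimensions per
// iteration, scalar loop for the leftover dimensions.
double distAVX2(double* x, double* y, int nDim) {
    __m256d acc = _mm256_setzero_pd();
    int i = 0;
    for (; i + 4 <= nDim; i += 4) {
        __m256d d = _mm256_sub_pd(_mm256_loadu_pd(x + i), _mm256_loadu_pd(y + i));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(d, d));   // acc += d * d
    }
    double buf[4];
    _mm256_storeu_pd(buf, acc);
    double sum = buf[0] + buf[1] + buf[2] + buf[3];
    for (; i < nDim; i++) {                              // scalar tail
        double d = x[i] - y[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```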

But in the end the SIMD dist (distAVX2) still shows up as the bottleneck. That is simply because there is so much data to process (the function is called 99,000,000 times), even though the time per call is close to zero.

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
63.47 3.45 3.45 99000000 0.00 0.00 distAVX2(double*, double*, int)
17.87 4.42 0.97 1 0.97 0.97 initData(double*, int, int)
12.71 5.11 0.69 24 0.03 0.03 computeCentroids(WorkerArgs*)
4.61 5.36 0.25 24 0.01 0.11 computeAssignments(WorkerArgs*)
0.74 5.40 0.04 24 0.00 0.04 computeCost(WorkerArgs*)
0.37 5.42 0.02 2 0.01 0.01 logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
0.18 5.43 0.01 main
0.09 5.43 0.01 distNoSqrt(double*, double*, int)
0.00 5.43 0.00 4 0.00 0.00 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
0.00 5.43 0.00 2 0.00 0.00 CycleTimer::secondsPerTick()
0.00 5.43 0.00 1 0.00 4.32 kMeansThread(double*, double*, int*, int, int, int, double)
0.00 5.43 0.00 1 0.00 0.00 initCentroids(double*, int, int)
0.00 5.43 0.00 1 0.00 0.00 readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
0.00 5.43 0.00 1 0.00 0.00 writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

Part 3 Multi-Thread#

Just compute computeAssignments with 32 threads; a sketch of the split follows the timings below.

[Total Time]: 2023.298 ms # With SIMD dist
[Total Time]: 3529.115 ms # Without SIMD dist
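Here is a hedged sketch of the split, assuming the WorkerArgs layout from the starter code (data is M×N row-major, clusterCentroids is K×N) and the existing dist helper; my actual computeAssignmentsMutiThread / computeAssignmentsSingle split in the profile below differs a bit:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch: assign each of the M points to its nearest centroid, splitting the
// points into 32 contiguous ranges handled by separate std::threads.
void computeAssignmentsMultiThread(WorkerArgs* const args) {
    constexpr int kNumThreads = 32;
    const int span = (args->M + kNumThreads - 1) / kNumThreads;  // points per thread

    auto assignRange = [args](int start, int end) {
        for (int m = start; m < end; m++) {
            double minDist = 1e30;
            for (int k = 0; k < args->K; k++) {
                double d = dist(&args->data[m * args->N],
                                &args->clusterCentroids[k * args->N], args->N);
                if (d < minDist) {
                    minDist = d;
                    args->clusterAssignments[m] = k;
                }
            }
        }
    };

    std::vector<std::thread> workers;
    for (int t = 0; t < kNumThreads; t++) {
        int start = t * span;
        int end = std::min(args->M, start + span);
        if (start < end) workers.emplace_back(assignRange, start, end);
    }
    for (auto& w : workers) w.join();
}
```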

This is the profiling result.

# With SIMD dist
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
56.69 2.76 2.76 29357400 0.00 0.00 distAVX2(double*, double*, int)
20.95 3.78 1.02 1 1.02 1.02 initData(double*, int, int)
17.46 4.63 0.85 24 0.04 0.04 computeCentroids(WorkerArgs*)
3.08 4.78 0.15 computeAssignmentsSingle(WorkerArgs*, double*, int, int)
0.82 4.82 0.04 24 0.00 0.00 computeAssignmentsMutiThread(WorkerArgs*)
0.62 4.85 0.03 24 0.00 0.10 computeCost(WorkerArgs*)
0.41 4.87 0.02 _init
0.00 4.87 0.00 4 0.00 0.00 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
0.00 4.87 0.00 2 0.00 0.00 logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
0.00 4.87 0.00 2 0.00 0.00 CycleTimer::secondsPerTick()
0.00 4.87 0.00 1 0.00 3.18 kMeansThread(double*, double*, int*, int, int, int, double)
0.00 4.87 0.00 1 0.00 0.00 initCentroids(double*, int, int)
0.00 4.87 0.00 1 0.00 0.00 readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
0.00 4.87 0.00 1 0.00 0.00 writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

It seems like magic happens: the reported number of calls to distAVX2 drops significantly (honestly I don't know why 🥹 — probably because gprof doesn't handle multithreaded programs well, so samples and call counts from the worker threads are largely lost), and the number of iterations stays the same, 24.
