Parallel Computing CS149 ASST 1 note

This series covers my assignment solutions and some notes on the Stanford CS149 lectures. I truly recommend these lectures to anyone who wants to learn parallel computing.

Program 1 (run on Apple M3)

important things

  • Coherent execution: the property of a program where the same instruction sequence applies to many data elements
  • Coherent execution IS NECESSARY for SIMD processing resources to be used efficiently
  • Coherent execution IS NOT NECESSARY for efficient parallelization across different cores, since each core can fetch/decode different instructions from its own thread's instruction stream

task

  • make the Mandelbrot image built with multiple threads produce the same result as the serial method
  • add code to the workerThreadStart function to accomplish this
  • only workerThreadStart needs to change

notice

The image height is 1200 (though it can be changed), so if we split the picture by height / numThreads and the division is not exact, some rows are never rendered and the output is wrong. (The sketch in the "more threads" section below handles this by giving the last thread the leftover rows.)

done

2 threads

We split the image generation into two parts: the first thread renders the top half of the picture and the other renders the bottom half.

Result:

  • Serial: 316.964ms
  • 2-Threads: 163.130ms

Almost twice as fast.

void workerThreadStart(WorkerArgs* const args) {
    // NOTE: in a program that uses two threads, thread 0 could compute the top
    // half of the image and thread 1 could compute the bottom half.
    printf("[thread %d] working\n", args->threadId);
    if (args->threadId == 0) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         0, args->height / 2,                 // rows [0, height/2)
                         args->maxIterations, args->output);
    } else {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         args->height / 2, args->height / 2,  // rows [height/2, height)
                         args->maxIterations, args->output);
    }
}

more threads using the split method

This idea just divides the picture horizontally into as many strips as there are threads; a sketch follows.
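
A minimal sketch of what the general workerThreadStart can look like (reusing mandelbrotSerial and assuming the numThreads field that WorkerArgs provides in the starter code):

void workerThreadStart(WorkerArgs* const args) {
    // each thread renders one horizontal strip of the image
    int rowsPerThread = args->height / args->numThreads;
    int startRow = args->threadId * rowsPerThread;
    // the last thread absorbs the leftover rows when height % numThreads != 0,
    // which fixes the non-divisible case from the notice above
    int numRows = (args->threadId == args->numThreads - 1)
                      ? args->height - startRow
                      : rowsPerThread;
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height,
                     startRow, numRows,
                     args->maxIterations, args->output);
}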

Result:

  • 8-Threads: 3.53x speedup
  • 16-Threads: 5.13x speedup

Still, this doesn't run 8 times faster with 8 threads, only about 4 times.

some bad things

When we render view 1 with an odd number of threads, the speedup is smaller than with fewer threads. E.g., 3 threads get a 1.58x speedup while 2 threads get 1.94x. This looks like load imbalance: in view 1 the expensive pixels are concentrated in the middle band of the image, so with 3 threads the middle thread does most of the work while the other two finish early and sit idle.

Program 2 (run on Ryzen 7 6800H)

I must say this is a very interesting assignment. Since we are only about two lectures in, real SIMD instructions are not easy for us to write yet, but CS149intrin helps. Although it only simulates vectors with fixed-length arrays, the framework reports our vector utilization, which builds a "SIMD eye" in an approachable way.

task

  • use CS149's "fake vector intrinsics" defined in CS149intrin.h
  • vectorize clampedExpVector & arraySumVector so they can run on a machine with SIMD vector instructions

done

Vectorize Clamped EXP

Thanks to the serial implementation, my job is just translating the serial version into the "CS149 SIMD" version. You can see it in asst1/prog2_vecintrin/main.cpp.
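
For reference, a minimal sketch of what that translation can look like, using the operations from CS149intrin.h (a sketch under the assignment's API, not necessarily the exact code in my repo):

void clampedExpVector(float* values, int* exponents, float* output, int N) {
    __cs149_vec_float x, result;
    __cs149_vec_int count;
    __cs149_vec_int zero = _cs149_vset_int(0);
    __cs149_vec_int one  = _cs149_vset_int(1);
    __cs149_vec_float clamp = _cs149_vset_float(9.999999f);
    __cs149_mask maskAll, maskActive, maskClamp;

    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        // only the first N - i lanes are valid in the last (ragged) iteration
        maskAll = _cs149_init_ones(N - i < VECTOR_WIDTH ? N - i : VECTOR_WIDTH);

        _cs149_vload_float(x, values + i, maskAll);
        _cs149_vload_int(count, exponents + i, maskAll);

        _cs149_vset_float(result, 1.f, maskAll);           // result = 1 covers y == 0
        _cs149_vgt_int(maskActive, count, zero, maskAll);  // lanes still multiplying
        while (_cs149_cntbits(maskActive) > 0) {
            _cs149_vmult_float(result, result, x, maskActive); // result *= x
            _cs149_vsub_int(count, count, one, maskActive);    // count--
            _cs149_vgt_int(maskActive, count, zero, maskAll);
        }

        _cs149_vgt_float(maskClamp, result, clamp, maskAll); // lanes where result > 9.999999
        _cs149_vset_float(result, 9.999999f, maskClamp);     // clamp those lanes

        _cs149_vstore_float(output + i, result, maskAll);
    }
}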

Vectorize Array Sum

I like the hint ❤️: "You may find the hadd and interleave operations useful." That helped me a lot.

Below is my arraySumVector, which uses hadd and interleave.

  • hadd
    • Adds up adjacent pairs of elements, so [0 1 2 3] -> [0+1 0+1 2+3 2+3]
  • interleave
    • Performs an even-odd interleaving where all even-indexed elements move to front half of the array and odd-indexed to the back half, so [0 1 2 3 4 5 6 7] -> [0 2 4 6 1 3 5 7]

So to sum the vector [1 2 3 4], just do this:

  • hadd: [1 2 3 4] -> [1+2 1+2 3+4 3+4] = [3 3 7 7]
  • interleave: [3 3 7 7] -> [3 7 3 7]
  • hadd again: [3 7 3 7] -> [3+7 3+7 3+7 3+7] = [10 10 10 10], and every lane holds the result

float arraySumVector(float* values, int N) {
    // N is assumed to be a multiple of VECTOR_WIDTH here
    __cs149_vec_float sum = _cs149_vset_float(0.f);
    __cs149_vec_float temp;
    __cs149_mask maskAll = _cs149_init_ones();

    // O( N / VECTOR_WIDTH ): accumulate VECTOR_WIDTH per-lane partial sums
    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        _cs149_vload_float(temp, values + i, maskAll);
        _cs149_vadd_float(sum, sum, temp, maskAll);
    }

    // O( log2(VECTOR_WIDTH) ): reduce across the lanes with hadd + interleave
    for (int i = 1; i < VECTOR_WIDTH; i *= 2) {
        _cs149_hadd_float(sum, sum);
        _cs149_interleave_float(sum, sum);
    }

    float result[VECTOR_WIDTH];
    _cs149_vstore_float(result, sum, maskAll);
    return result[0];
}

Program 3 (Ryzen 7 6800H)

Part 1

In mandelbrot_ispc_task, we can see that every task is assigned a contiguous strip of the picture.

// taskIndex is an ISPC built-in

uniform int ystart = taskIndex * rowsPerTask;
uniform int yend = ystart + rowsPerTask;

Speedups for different task counts on view 1:

tasks   multi-task ispc    ispc
2       16.87x             8.89x
4       21.67x             7.76x
8       32.71x             9.02x
16      51.41x             7.93x
32      71.71x ~ 100.x     5.87x
200     72.00x ~ 91.x      5.88x
800     70.x ~ 110.x       5.83x

Speedups for different task counts on view 2:

tasks   multi-task ispc    ispc
2       7.89x              4.92x
4       17.05x             7.54x
8       18.47x             4.94x
16      33.86x             4.91x
32      46.36x ~ 68.x      4.92x
200     59.41x ~ 60.x      4.92x
800     57.x ~ 94.x        5.43x

The result is very interesting: I even got 105.52x once when launching 32 tasks. But the plain ispc numbers are very unstable, sometimes going up to a 9.x speedup and sometimes only 5.x. And once I reach 32 tasks, launching more doesn't reliably make things faster.

I must also note that in view 2 the 32-task version is slower than 200 tasks. The table lists the fastest upper bound I saw while running, but those peaks are rare. If I ran more rounds and computed an average, 200 tasks would likely come out a little faster, and 800 tasks seems even faster on average (60.x speedups appear much more often).

The Expected 8-wide AVX2 performance

In theory, 8-wide AVX2 does 8 times the work of serial code per operation, so we should get an 8x speedup, right? But we get 5.x ~ 9.x, and most results are below 8x. Why?

Coherent execution is necessary for SIMD, and view 2 is much slower than view 1. If we look for what breaks coherence, the most obvious spot is the white/black boundary: lanes inside the set run many iterations while lanes outside bail out early, so a vector straddling the boundary diverges. View 2 has much more white/black boundary, which is not ideal for SIMD. But when we launch more tasks, each task gets height / numTasks rows; each task does a smaller share of the job, so more tasks land on internally coherent regions.

Part 2 determine how many tasks are best

Actually I can't give a definitive answer. The Ryzen 7 6800H has 8 cores running at most 16 threads simultaneously, yet 32 tasks is clearly faster than 16 tasks, and in view 2 the 800-task version seems best on average. Profiling with perf stat shows that the "thread version" (prog1_mandelbrot_thread) has higher CPU utilization, but the ispc version using tasks is faster and beats the sequential code by over 32x. Incidentally, the 32-task run has lower CPU utilization than the "thread version".

After watching lecture 4, I get it. The machine can only execute 16 threads' worth of instructions at once, but launching 32 tasks keeps two batches of work queued back to back, which is more conducive to keeping the pipelines busy.

Program 4 (Ryzen 7 6800H)

Part 1 Build and Run

The speedups are below.

multi-task ispc    ispc
55.24x             4.97x

Part 2 Best or Worst Case

Best Case

Since all the values are the same, every lane executes exactly the same instruction sequence, while the sequential version still has to grind through the full iteration count for every element, one element at a time.

multi-task ispc    ispc
81.87x             6.77x

Worst Case

The SIMD used here is AVX2, which is 8-wide. So we want to break the 8-wide vectors so they can't do useful SIMD work, while keeping the sequential version fast. Obviously the fastest input for sqrt is 1.0f, which needs the fewest iterations. So we make one element in every group of 8 the slowest to compute and set the rest to the fastest value: the whole vector then runs as long as its slowest lane, while the sequential version speeds through the cheap elements.

I tried values in the range 2.5f ~ 2.8f and finally chose 2.68f; we shouldn't pick a value too near 3, or the sequential version's work grows too. Our goal is just to make SIMD unhelpful by breaking coherence, so a value like 0.1f works as well.

multi-task ispc    ispc
6.98x              0.94x

The values are initialized like this:

if ( i % 8 == 0 ) {
    values[i] = 2.68f;
} else {
    values[i] = 1.f; 
}

// if ( i % 8 == 0 ) {
//     values[i] = 0.101f;
// } else {
//     values[i] = 1.f; 
// }

Part 3 AVX2 version

Refer to the Intel Intrinsics Guide. The AVX2 version is close to the ispc version but a little faster: in the worst case it gets a 1.2x speedup while ispc gets 0.9x, and in the good case the AVX2 version can also beat plain ispc.
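
For reference, a minimal sketch of what such an AVX2 kernel can look like, following the Newton iteration from the assignment's sqrtSerial (it assumes N is a multiple of 8; not necessarily my exact submission):

#include <immintrin.h>

void sqrtAVX2(int N, float initialGuess, float values[], float output[]) {
    const __m256 kThreshold = _mm256_set1_ps(0.00001f);
    const __m256 kThree     = _mm256_set1_ps(3.f);
    const __m256 kHalf      = _mm256_set1_ps(0.5f);
    const __m256 kOne       = _mm256_set1_ps(1.f);
    // clearing the sign bit gives us fabs for packed floats
    const __m256 kAbsMask   = _mm256_castsi256_ps(_mm256_set1_epi32(0x7fffffff));

    for (int i = 0; i < N; i += 8) {
        __m256 x     = _mm256_loadu_ps(values + i);
        __m256 guess = _mm256_set1_ps(initialGuess);

        // error = |guess * guess * x - 1|
        __m256 error = _mm256_and_ps(kAbsMask,
            _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne));

        // iterate while ANY lane's error exceeds the threshold;
        // converged lanes are frozen by the blend below
        __m256 active = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        while (_mm256_movemask_ps(active) != 0) {
            // next = (3*guess - x*guess^3) * 0.5
            __m256 g3   = _mm256_mul_ps(_mm256_mul_ps(guess, guess), guess);
            __m256 next = _mm256_mul_ps(kHalf,
                _mm256_sub_ps(_mm256_mul_ps(kThree, guess), _mm256_mul_ps(x, g3)));
            // update only lanes that have not converged yet
            guess = _mm256_blendv_ps(guess, next, active);

            error  = _mm256_and_ps(kAbsMask,
                _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne));
            active = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        }
        _mm256_storeu_ps(output + i, _mm256_mul_ps(x, guess));
    }
}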


From here on, everything is run on the Ryzen 7 6800H.


Program 5

Part 1 speedup

[saxpy serial]:		[29.205] ms	[10.204] GB/s	[1.370] GFLOPS
[saxpy ispc]:		[26.056] ms	[11.438] GB/s	[1.535] GFLOPS
[saxpy task ispc]:	[21.102] ms	[14.123] GB/s	[1.896] GFLOPS
				(1.12x speedup from ISPC to serial)
				(1.38x speedup from task ISPC to serial)
				(1.23x speedup from use of tasks to just ISPC)

A problem appears here: the extra credit hints at thinking about how CPU caches and memory bandwidth work. I don't know the myth machines' situation, so to work out what happened I test the plain ispc version and the task version separately.

I find that the plain "ispc" run has thousands of page faults and a lot of idle frontend cycles (18.53%). All of this makes the program slower.

The multi-task version drives CPU utilization up to 345.2%, far beyond my earlier programs (which stayed around 100% or a little more). But the speedup is still not good.

// ispc
Performance counter stats for './saxpy -i':

            129.53 msec task-clock:u                     #    0.992 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             2,754      page-faults:u                    #   21.262 K/sec                     
       254,357,756      cycles:u                         #    1.964 GHz                       
        47,137,032      stalled-cycles-frontend:u        #   18.53% frontend cycles idle      
       680,745,469      instructions:u                   #    2.68  insn per cycle            
                                                  #    0.07  stalled cycles per insn   
       134,833,625      branches:u                       #    1.041 G/sec                     
            19,039      branch-misses:u                  #    0.01% of all branches           

// task-ispc
Performance counter stats for './saxpy -t':

            435.02 msec task-clock:u                     #    3.452 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             2,791      page-faults:u                    #    6.416 K/sec                     
     1,139,191,206      cycles:u                         #    2.619 GHz                       
         2,627,409      stalled-cycles-frontend:u        #    0.23% frontend cycles idle      
       695,864,736      instructions:u                   #    0.61  insn per cycle            
                                                  #    0.00  stalled cycles per insn   
       134,857,706      branches:u                       #  310.000 M/sec                     
            22,567      branch-misses:u                  #    0.02% of all branches           

So I go on to analyze memory misses (actually L1 dTLB misses). Comparing against the sqrt program, saxpy's miss rate is an order of magnitude higher, which suggests we are hitting the memory-bandwidth boundary.

// sqrt
39,405      l1_dtlb_misses:u                 #    5.129 K/sec

// task-ispc
25,874      l1_dtlb_misses:u                 #   60.855 K/sec

// ispc
19,639      l1_dtlb_misses:u                 #  151.536 K/sec

Part 2 try to improve

My only idea is to pre-load some data into the cache: if task 1 is doing the "1 ~ 1 + span" work, we could load the "2 ~ 2 + span" data into the cache at the same time. But I didn't know how to achieve it 😭.
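
One way to express this idea is software prefetching: while working on the current elements, ask the cache hierarchy to start pulling in data a fixed distance ahead. Below is a minimal serial sketch with _mm_prefetch (the function name and the prefetch distance are my own assumptions, not the assignment's code). Since saxpy looks bandwidth-bound rather than latency-bound, this may not buy much in practice.

#include <xmmintrin.h>  // _mm_prefetch

void saxpyPrefetch(int N, float scale, const float* X, const float* Y, float* result) {
    const int PREFETCH_DIST = 64;  // elements ahead, ~4 cache lines of floats; needs tuning
    for (int i = 0; i < N; i++) {
        if (i + PREFETCH_DIST < N) {
            // hint the hardware to fetch upcoming inputs into all cache levels
            _mm_prefetch((const char*)(X + i + PREFETCH_DIST), _MM_HINT_T0);
            _mm_prefetch((const char*)(Y + i + PREFETCH_DIST), _MM_HINT_T0);
        }
        result[i] = scale * X[i] + Y[i];
    }
}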

Program 6

Since I can't access the myth machines, I generate my own data.

Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
[Total Time]: 4297.427 ms

Part 1 profiling

Based on profiling, computeAssignments looks like the bottleneck, and dist is called the most times. We can try to improve the performance of dist, computeAssignments, computeCentroids and computeCost (initData is not our concern).

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 59.48      4.62     4.62       24     0.19     0.19  computeAssignments(WorkerArgs*)
 15.19      5.80     1.18       24     0.05     0.05  computeCost(WorkerArgs*)
 14.68      6.94     1.14        1     1.14     1.14  initData(double*, int, int)
  8.75      7.62     0.68       24     0.03     0.03  computeCentroids(WorkerArgs*)
  1.67      7.75     0.13  3000000     0.00     0.00  dist(double*, double*, int)
  0.26      7.77     0.02        2     0.01     0.01  logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
  0.00      7.77     0.00        4     0.00     0.00  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
  0.00      7.77     0.00        2     0.00     0.00  CycleTimer::secondsPerTick()
  0.00      7.77     0.00        1     0.00     6.48  kMeansThread(double*, double*, int*, int, int, int, double)
  0.00      7.77     0.00        1     0.00     0.00  initCentroids(double*, int, int)
  0.00      7.77     0.00        1     0.00     0.00  readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
  0.00      7.77     0.00        1     0.00     0.00  writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

Part 2 SIMD dist

As a simple, quick try, I wrote a SIMD dist using AVX2. It cuts the total time by about 1 second.

[Total Time]: 3024.816 ms
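
A minimal sketch of what such an AVX2 distance can look like, matching the dist(double*, double*, int) signature from the profile (4 doubles per 256-bit vector, with a scalar tail for leftover dimensions):

#include <immintrin.h>
#include <cmath>

double distAVX2(double* a, double* b, int nDim) {
    __m256d acc = _mm256_setzero_pd();
    int i = 0;
    for (; i + 4 <= nDim; i += 4) {
        // d = a[i..i+3] - b[i..i+3]; acc += d * d
        __m256d d = _mm256_sub_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(d, d));
    }
    // horizontal sum of the 4 accumulator lanes
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < nDim; i++) {  // scalar tail when nDim % 4 != 0
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}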

But now the SIMD dist (distAVX2) itself appears to be the bottleneck. This is simply because there is so much data to process: the function is called 99,000,000 times, and the time consumed per call is close to 0s.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 63.47      3.45     3.45 99000000     0.00     0.00  distAVX2(double*, double*, int)
 17.87      4.42     0.97        1     0.97     0.97  initData(double*, int, int)
 12.71      5.11     0.69       24     0.03     0.03  computeCentroids(WorkerArgs*)
  4.61      5.36     0.25       24     0.01     0.11  computeAssignments(WorkerArgs*)
  0.74      5.40     0.04       24     0.00     0.04  computeCost(WorkerArgs*)
  0.37      5.42     0.02        2     0.01     0.01  logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
  0.18      5.43     0.01                             main
  0.09      5.43     0.01                             distNoSqrt(double*, double*, int)
  0.00      5.43     0.00        4     0.00     0.00  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
  0.00      5.43     0.00        2     0.00     0.00  CycleTimer::secondsPerTick()
  0.00      5.43     0.00        1     0.00     4.32  kMeansThread(double*, double*, int*, int, int, int, double)
  0.00      5.43     0.00        1     0.00     0.00  initCentroids(double*, int, int)
  0.00      5.43     0.00        1     0.00     0.00  readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
  0.00      5.43     0.00        1     0.00     0.00  writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

Part 3 Multi-Thread

Just run computeAssignments with 32 threads.

[Total Time]: 2023.298 ms # With SIMD dist
[Total Time]: 3529.115 ms # Without SIMD dist
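
For reference, a minimal sketch of the 32-thread split (the WorkerArgs fields — data, clusterCentroids, clusterAssignments, M, N, K — are assumed from the starter code, and it reuses the distAVX2 above; my actual computeAssignmentsMutiThread may differ):

#include <algorithm>
#include <thread>
#include <vector>

constexpr int kNumThreads = 32;

// assign a contiguous slice [start, end) of the M points to their nearest centroid
void assignSlice(WorkerArgs* args, int start, int end) {
    for (int m = start; m < end; m++) {
        double minDist = 1e30;
        int bestK = -1;
        for (int k = 0; k < args->K; k++) {
            double d = distAVX2(&args->data[m * args->N],
                                &args->clusterCentroids[k * args->N], args->N);
            if (d < minDist) { minDist = d; bestK = k; }
        }
        args->clusterAssignments[m] = bestK;  // no lock needed: each m is written once
    }
}

void computeAssignmentsMultiThread(WorkerArgs* args) {
    std::vector<std::thread> workers;
    int span = (args->M + kNumThreads - 1) / kNumThreads;  // ceil, so no points are dropped
    for (int t = 0; t < kNumThreads; t++) {
        int start = t * span;
        int end = std::min(args->M, start + span);
        if (start < end) workers.emplace_back(assignSlice, args, start, end);
    }
    for (auto& w : workers) w.join();
}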

Here is the profile:

# With SIMD dist
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 56.69      2.76     2.76 29357400     0.00     0.00  distAVX2(double*, double*, int)
 20.95      3.78     1.02        1     1.02     1.02  initData(double*, int, int)
 17.46      4.63     0.85       24     0.04     0.04  computeCentroids(WorkerArgs*)
  3.08      4.78     0.15                             computeAssignmentsSingle(WorkerArgs*, double*, int, int)
  0.82      4.82     0.04       24     0.00     0.00  computeAssignmentsMutiThread(WorkerArgs*)
  0.62      4.85     0.03       24     0.00     0.10  computeCost(WorkerArgs*)
  0.41      4.87     0.02                             _init
  0.00      4.87     0.00        4     0.00     0.00  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
  0.00      4.87     0.00        2     0.00     0.00  logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
  0.00      4.87     0.00        2     0.00     0.00  CycleTimer::secondsPerTick()
  0.00      4.87     0.00        1     0.00     3.18  kMeansThread(double*, double*, int*, int, int, int, double)
  0.00      4.87     0.00        1     0.00     0.00  initCentroids(double*, int, int)
  0.00      4.87     0.00        1     0.00     0.00  readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
  0.00      4.87     0.00        1     0.00     0.00  writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

It seems like magic happened: the reported number of calls to distAVX2 drops significantly. (Honestly, I don't know why 🥹 — a plausible cause is that gprof's call-count instrumentation doesn't reliably record calls made from worker threads, so they go uncounted.) And the number of k-means iterations is unchanged: 24.