Parallel Computing CS149 ASST 1 note

This series will cover my assignment solutions and some notes on the Stanford CS149 lectures. I truly recommend this course to anyone who wants to learn parallel computing.

Program 1 (run on Apple M3)#

important things#

  • Coherent execution: the property of a program where the same instruction sequence applies to many data elements
  • Coherent execution IS NECESSARY for SIMD processing resources to be used efficiently
  • Coherent execution IS NOT NECESSARY for efficient parallelization across different cores, since each core has the capability to fetch/decode different instructions from its thread's instruction stream

task#

  • Make the multi-threaded Mandelbrot set produce the same result as the serial method.
  • Add code to the workerThreadStart function to accomplish this.
  • Only workerThreadStart needs to change.

notice#

Because the image height is 1200 (we could change that), if we split the picture into height / numThreads rows per thread and the division is not exact, some rows are never rendered and the output is wrong. For example, 1200 / 16 = 75 works out evenly, but 1200 / 7 does not, so three rows would be dropped unless someone picks up the remainder.

done#

2 threads#

We split the picture generation into 2 parts. The first thread handles the top half of the picture, and the other draws the bottom half.

Result:

  • Serial: 316.964ms
  • 2-Threads: 163.130ms

Almost two times faster

void workerThreadStart(WorkerArgs * const args) {
    // NOTE: in a program that uses two threads, thread 0 could compute the top
    // half of the image and thread 1 could compute the bottom half.
    printf("[thread %d] working\n", args->threadId);
    if (!args->threadId) {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         0, args->height / 2,
                         args->maxIterations, args->output);
    } else {
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height,
                         args->height / 2, args->height / 2,
                         args->maxIterations, args->output);
    }
}

more threads using the split method#

This idea just divides the picture horizontally into strips, one per thread (matching the thread count).
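A minimal sketch of what this looks like, assuming the WorkerArgs fields from the starter code (threadId, numThreads, width, height, and so on) — not necessarily my exact code:

```cpp
void workerThreadStart(WorkerArgs * const args) {
    // Contiguous split: thread t renders rows [startRow, startRow + numRows).
    unsigned int rowsPerThread = args->height / args->numThreads;
    unsigned int startRow = args->threadId * rowsPerThread;
    // The last thread picks up the remainder when height is not divisible
    // by the thread count (the issue mentioned in the notice above).
    unsigned int numRows = (args->threadId == args->numThreads - 1)
                               ? (args->height - startRow)
                               : rowsPerThread;

    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height,
                     startRow, numRows,
                     args->maxIterations, args->output);
}
```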

Result:

  • 8-Threads: 3.53x speedup
  • 16-Threads: 5.13x speedup

Actually this doesn't give me 8 times the speed, only around 4-5 times.

some bad things#

When we run view 1 with an odd number of threads, the speedup is smaller than with fewer threads.

e.g. 3 threads get a 1.58x speedup while 2 threads get 1.94x. This is likely a load-imbalance problem: in view 1 the expensive pixels are concentrated in the middle rows, so with 3 threads the middle thread does most of the work while the other two finish early and wait.

Program 2 (run on Ryzen7 6800H)#

I must say this is a very interesting assignment. Since we are only about two lectures in, real SIMD instructions are not easy to pick up yet, but CS149intrin helps. Although it only simulates vector lanes with fixed-length arrays, the framework reports vector utilization, which helps build SIMD intuition in an approachable way.

task#

  • using CS149’s “fake vector intrinsics” defined in CS149intrin.h.
  • vectorize clampedExpVector & arraySumVector so they can run on a machine with SIMD vector instructions.

done#

Vectorize Clamped EXP#

Thanks to the serial implementation, my job is just to translate the serial version into the "CS149 SIMD" version. You can see my solution in asst1/prog2_vecintrin/main.cpp
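For reference, here is a minimal sketch of one way that translation can look, built only from the CS149intrin helpers (_cs149_vgt_int, _cs149_vgt_float, _cs149_cntbits, and friends). It is not necessarily identical to my code in the repo, and it handles a tail where N is not a multiple of VECTOR_WIDTH:

```cpp
void clampedExpVector(float* values, int* exponents, float* output, int N) {
  __cs149_vec_float x, result, clamp;
  __cs149_vec_int count, zero, one;
  __cs149_mask maskAll, maskActive, maskClamp;

  for (int i = 0; i < N; i += VECTOR_WIDTH) {
    // Mask off the tail lanes when fewer than VECTOR_WIDTH elements remain.
    int lanes = (N - i < VECTOR_WIDTH) ? (N - i) : VECTOR_WIDTH;
    maskAll = _cs149_init_ones(lanes);

    _cs149_vload_float(x, values + i, maskAll);       // x = values[i..]
    _cs149_vload_int(count, exponents + i, maskAll);  // count = exponents[i..]
    _cs149_vset_float(result, 1.f, maskAll);          // result = 1 covers y == 0
    _cs149_vset_float(clamp, 9.999999f, maskAll);
    _cs149_vset_int(zero, 0, maskAll);
    _cs149_vset_int(one, 1, maskAll);

    // Keep multiplying only in lanes whose exponent is still > 0.
    _cs149_vgt_int(maskActive, count, zero, maskAll);
    while (_cs149_cntbits(maskActive) > 0) {
      _cs149_vmult_float(result, result, x, maskActive); // result *= x
      _cs149_vsub_int(count, count, one, maskActive);    // count--
      _cs149_vgt_int(maskActive, count, zero, maskAll);
    }

    // Clamp lanes where result > 9.999999f, then store.
    _cs149_vgt_float(maskClamp, result, clamp, maskAll);
    _cs149_vset_float(result, 9.999999f, maskClamp);
    _cs149_vstore_float(output + i, result, maskAll);
  }
}
```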

Vectorize Array Sum#

I like the hint ❤️: "You may find the hadd and interleave operations useful." That helped me a lot.

This is the part of the code where I use hadd and interleave.

  • hadd
    • Adds up adjacent pairs of elements, so [0 1 2 3] -> [0+1 0+1 2+3 2+3]
  • interleave
    • Performs an even-odd interleaving where all even-indexed elements move to front half of the array and odd-indexed to the back half, so [0 1 2 3 4 5 6 7] -> [0 2 4 6 1 3 5 7]

So if I want to sum [1 2 3 4], I just do this:

  • [1 2 3 4] -> [1+2 1+2 3+4 3+4] -> [3 3 7 7]
  • [3 3 7 7] -> [3 7 3 7]
  • [3 7 3 7] -> [3+7 3+7 3+7 3+7] -> result
// Assumes N is a multiple of VECTOR_WIDTH, as the assignment guarantees.
__cs149_vec_float sum, temp;
__cs149_mask maskAll = _cs149_init_ones();
_cs149_vset_float(sum, 0.f, maskAll);

// O( N / VECTOR_WIDTH ): accumulate VECTOR_WIDTH partial sums
for (int i = 0; i < N; i += VECTOR_WIDTH) {
    _cs149_vload_float(temp, values + i, maskAll);
    _cs149_vadd_float(sum, sum, temp, maskAll);
}
// O( log2(VECTOR_WIDTH) ): reduce the partials with hadd + interleave
for (int i = 1; i < VECTOR_WIDTH; i *= 2) {
    _cs149_hadd_float(sum, sum);
    _cs149_interleave_float(sum, sum);
}
float result[VECTOR_WIDTH];
_cs149_vstore_float(result, sum, maskAll);
return result[0];

Program 3 (Ryzen7 6800H)#

Part 1#

In mandelbrot_ispc_task, we can see that every task is assigned a contiguous strip of the picture.

// taskIndex is an ISPC built-in
uniform int ystart = taskIndex * rowsPerTask;
uniform int yend = ystart + rowsPerTask;

Below is the speedup for different task counts on view-1.

| number | multi-task ispc | ispc |
| --- | --- | --- |
| 2 | 16.87x | 8.89x |
| 4 | 21.67x | 7.76x |
| 8 | 32.71x | 9.02x |
| 16 | 51.41x | 7.93x |
| 32 | 71.71x ~ 100.x | 5.87x |
| 200 | 72.00x ~ 91.x | 5.88x |
| 800 | 70.x ~ 110.x | 5.83x |

Below is the speedup for different task counts on view-2.

| number | multi-task ispc | ispc |
| --- | --- | --- |
| 2 | 7.89x | 4.92x |
| 4 | 17.05x | 7.54x |
| 8 | 18.47x | 4.94x |
| 16 | 33.86x | 4.91x |
| 32 | 46.36x ~ 68.x | 4.92x |
| 200 | 59.41x ~ 60.x | 4.92x |
| 800 | 57.x ~ 94.x | 5.43x |

The result is very interesting: I even got 105.52x once when launching 32 tasks. But the plain ispc numbers are very unstable, sometimes reaching a 9.x speedup and sometimes only 5.x. And past 32 tasks it doesn't seem to get much faster.

I must say that in view-2 the 32-task version is slower than the 200-task version. I only listed the fastest upper bound I observed, and that happens rarely; if I ran more rounds and averaged them, the 200-task version would probably come out a little faster. The 800-task version actually seems even faster on average (60.x speedups appear much more often).

The Expected 8-wide AVX2 performance#

In theory, 8-wide AVX2 can do 8 times as much work per operation as serial code, so we should see an 8x speedup, right? But we get 5.x ~ 9.x, and most runs are below 8x. Why?

Coherent execution IS NECESSARY for SIMD. View-2 is much slower than view-1. If we look at where coherence breaks down, the most obvious place is the white/black boundary: view-2 has far more white-black boundary pixels, which is not ideal for SIMD. On the other hand, when we launch more tasks each task covers height / numTasks rows, so each task works on a smaller piece of the image, which may give it a more coherent chunk to process.

Part 2 determine how many tasks are best#

Actually I can't give a definitive answer. The Ryzen7 6800H has 8 cores running at most 16 threads simultaneously, but 32 tasks is clearly faster than 16 tasks, and for view-2, 800 tasks seems better on average. Profiling with perf stat shows that the 'thread version' (prog1_mandelbrot_thread) has higher CPU utilization, yet the ispc version using tasks is faster and beats the sequential code by over 32x. By the way, the 32-task run has lower CPU utilization than the 'thread version'.

After watching lecture 4, I think I get it: the machine only offers 16 hardware threads, but launching 32 tasks means each hardware thread gets two tasks that run back to back, which is more conducive to keeping the pipeline busy.

Program 4 (Ryzen7 6800H)#

Part 1 Build and Run#

We can see the speedup below.

| multi-task ispc | ispc |
| --- | --- |
| 55.24x | 4.97x |

Part 2 Best or Worst Case#

Best Case#

Since all the values are the same, every lane executes exactly the same SIMD instructions, while the sequential version still has to grind through the full iteration count for every single element, one at a time.
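For example, the initialization can be as simple as this (2.998f is just my illustrative pick of a value near 3 that needs many Newton iterations; any value that is identical across all lanes gives perfect coherence):

```cpp
// Best case (illustrative): identical values, each needing many iterations,
// so all 8 AVX2 lanes stay active for the whole loop.
for (unsigned int i = 0; i < N; i++) {
    values[i] = 2.998f;
}
```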

| multi-task ispc | ispc |
| --- | --- |
| 81.87x | 6.77x |

Worst Case#

The SIMD here is AVX2, which gives 8-wide vector instructions. So we want to break the 8-wide coherence so that the vector lanes can't do useful work together, while keeping the sequential version fast. Obviously the fastest input for sqrt is 1, which needs the fewest iterations. So we assign one slow-to-compute value in every group of 8 elements, and set the rest to the value the sequential version handles fastest.

I tried values from 2.5f to 2.8f and finally chose 2.68f, because we shouldn't pick a value too close to 3 or the sequential version's work also grows. Our goal is only to make SIMD unhelpful by breaking coherence, so in that sense we could just as well choose 0.1f.

| multi-task ispc | ispc |
| --- | --- |
| 6.98x | 0.94x |

// One "slow" value in every group of 8 breaks the 8-wide coherence;
// the other lanes get the input the serial code handles fastest.
if (i % 8 == 0) {
    values[i] = 2.68f;
} else {
    values[i] = 1.f;
}
// Alternative: a value close to 0 also needs many iterations.
// if (i % 8 == 0) {
//     values[i] = 0.101f;
// } else {
//     values[i] = 1.f;
// }

Part 3 AVX2 version#

Refer to the Intel Intrinsics Guide. The AVX2 version is close to the ispc version, but a little faster: in the worst case it gets a 1.2x speedup while ispc gets 0.9x, and in the good case the AVX2 version can also beat plain ispc.
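For reference, a hedged sketch of what the AVX2 Newton iteration can look like, assuming the same guess update as the serial code (guess = (3·guess − x·guess³) / 2), inputs in the valid (0, 3) range, and N divisible by 8; my actual code may differ in the details:

```cpp
#include <immintrin.h>

// Sketch: AVX2 sqrt via the reciprocal-sqrt Newton recurrence.
// sqrtAVX2 is a hypothetical name, not the assignment's.
void sqrtAVX2(int N, float initialGuess, float values[], float output[]) {
    const __m256 kThreshold = _mm256_set1_ps(0.00001f);
    const __m256 kOne   = _mm256_set1_ps(1.f);
    const __m256 kThree = _mm256_set1_ps(3.f);
    const __m256 kHalf  = _mm256_set1_ps(0.5f);
    const __m256 signMask = _mm256_set1_ps(-0.f);   // for fabs via andnot

    for (int i = 0; i < N; i += 8) {
        __m256 x     = _mm256_loadu_ps(values + i);
        __m256 guess = _mm256_set1_ps(initialGuess);

        // error = |guess*guess*x - 1|
        __m256 error = _mm256_andnot_ps(signMask,
            _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne));

        // Iterate until every lane converges; converged lanes keep their value.
        __m256 active = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        while (_mm256_movemask_ps(active) != 0) {
            // newGuess = (3*guess - x*guess^3) * 0.5
            __m256 g3 = _mm256_mul_ps(_mm256_mul_ps(guess, guess), guess);
            __m256 newGuess = _mm256_mul_ps(
                _mm256_sub_ps(_mm256_mul_ps(kThree, guess), _mm256_mul_ps(x, g3)),
                kHalf);
            // Only update lanes that are still active.
            guess = _mm256_blendv_ps(guess, newGuess, active);
            error = _mm256_andnot_ps(signMask,
                _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne));
            active = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        }
        _mm256_storeu_ps(output + i, _mm256_mul_ps(x, guess));
    }
}
```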


From here on, everything is run on the Ryzen7 6800H.


Program 5#

Part 1 speedup#

[saxpy serial]: [29.205] ms [10.204] GB/s [1.370] GFLOPS
[saxpy ispc]: [26.056] ms [11.438] GB/s [1.535] GFLOPS
[saxpy task ispc]: [21.102] ms [14.123] GB/s [1.896] GFLOPS
(1.12x speedup from ISPC to serial)
(1.38x speedup from task ISPC to serial)
(1.23x speedup from use of tasks to just ISPC)

A problem shows up here. The extra credit hints at thinking about how CPU caches work and about memory bandwidth. I don't know the myth machines' configuration, so to understand what is happening I test with just ispc and with just tasks.

I find that plain 'ispc' has many page faults and a lot of idle frontend cycles. All of this makes the program slower.

The multi-task version uses up to 345.2% CPU, far more than my earlier programs (those stayed around 100%, maybe a bit more). But the speedup is still not great.
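A quick back-of-the-envelope check from the numbers above (my own arithmetic, not from the handout): the reported GB/s and GFLOPS figures are consistent with roughly 16 bytes of memory traffic per element (reading x[i] and y[i], plus the write-allocate and write of result[i]) against only 2 flops. At the measured ~14.1 GB/s that traffic already accounts for the reported ~1.9 GFLOPS, so saxpy looks memory-bandwidth bound, and extra tasks mostly add pressure to an already busy memory bus.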

// ispc
Performance counter stats for './saxpy -i':
129.53 msec task-clock:u # 0.992 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,754 page-faults:u # 21.262 K/sec
254,357,756 cycles:u # 1.964 GHz
47,137,032 stalled-cycles-frontend:u # 18.53% frontend cycles idle
680,745,469 instructions:u # 2.68 insn per cycle
# 0.07 stalled cycles per insn
134,833,625 branches:u # 1.041 G/sec
19,039 branch-misses:u # 0.01% of all branches
// task-ispc
Performance counter stats for './saxpy -t':
435.02 msec task-clock:u # 3.452 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,791 page-faults:u # 6.416 K/sec
1,139,191,206 cycles:u # 2.619 GHz
2,627,409 stalled-cycles-frontend:u # 0.23% frontend cycles idle
695,864,736 instructions:u # 0.61 insn per cycle
# 0.00 stalled cycles per insn
134,857,706 branches:u # 310.000 M/sec
22,567 branch-misses:u # 0.02% of all branches

So I also look at the memory misses (the l1_dtlb_misses counter). I analysed the sqrt program too for comparison, and from this we can tell saxpy is hitting a memory bound.

// sqrt
39,405 l1_dtlb_misses:u # 5.129 K/sec
// task-ispc
25,874 l1_dtlb_misses:u # 60.855 K/sec
// ispc
19,639 l1_dtlb_misses:u # 151.536 K/sec

Part 2 try to improve#

My only idea is to pre-load some data into the cache: if 'task 1' is working on '1 ~ 1 + span', we could pull '2 ~ 2 + span' into the cache at the same time. But I don't know how to achieve it 😭.
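For what it's worth, here is a hedged sketch of how that idea can be expressed with GCC/Clang's __builtin_prefetch (saxpyPrefetch and PREFETCH_AHEAD are made-up names and tuning knobs, not from the assignment). Since saxpy looks bandwidth-bound rather than latency-bound, this may well not help much:

```cpp
// Sketch: saxpy with software prefetch a few cache lines ahead.
// PREFETCH_AHEAD is a hypothetical tuning knob.
void saxpyPrefetch(int N, float scale, const float* X, const float* Y, float* result) {
    const int PREFETCH_AHEAD = 64;  // elements, i.e. a few cache lines of floats
    for (int i = 0; i < N; i++) {
        if (i + PREFETCH_AHEAD < N) {
            __builtin_prefetch(X + i + PREFETCH_AHEAD, 0);  // 0 = prefetch for read
            __builtin_prefetch(Y + i + PREFETCH_AHEAD, 0);
        }
        result[i] = scale * X[i] + Y[i];
    }
}
```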

Program 6#

Since I can't access the myth machines, I generate the data myself.

Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
[Total Time]: 4297.427 ms

Part 1 profiling#

Based on the profiling, computeAssignments looks like the bottleneck, and dist is called the largest number of times. We can try to improve the performance of dist, computeAssignments, computeCentroids, and computeCost (initData is not our concern).

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
59.48 4.62 4.62 24 0.19 0.19 computeAssignments(WorkerArgs*)
15.19 5.80 1.18 24 0.05 0.05 computeCost(WorkerArgs*)
14.68 6.94 1.14 1 1.14 1.14 initData(double*, int, int)
8.75 7.62 0.68 24 0.03 0.03 computeCentroids(WorkerArgs*)
1.67 7.75 0.13 3000000 0.00 0.00 dist(double*, double*, int)
0.26 7.77 0.02 2 0.01 0.01 logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
0.00 7.77 0.00 4 0.00 0.00 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
0.00 7.77 0.00 2 0.00 0.00 CycleTimer::secondsPerTick()
0.00 7.77 0.00 1 0.00 6.48 kMeansThread(double*, double*, int*, int, int, int, double)
0.00 7.77 0.00 1 0.00 0.00 initCentroids(double*, int, int)
0.00 7.77 0.00 1 0.00 0.00 readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
0.00 7.77 0.00 1 0.00 0.00 writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

Part 2 SIMD dist#

As a quick first attempt, I wrote a SIMD dist using AVX2. It seems to cut the total time by about a second.

[Total Time]: 3024.816 ms
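For reference, a hedged sketch of what such an AVX2 dist can look like (4 doubles per __m256d, with a scalar tail); it is not necessarily identical to my distAVX2:

```cpp
#include <immintrin.h>
#include <cmath>

// Sketch of an AVX2 Euclidean distance over double data: 4 dimensions per
// iteration, scalar loop for the leftover dimensions.
double distAVX2(double* x, double* y, int nDim) {
    __m256d acc = _mm256_setzero_pd();
    int i = 0;
    for (; i + 4 <= nDim; i += 4) {
        __m256d d = _mm256_sub_pd(_mm256_loadu_pd(x + i), _mm256_loadu_pd(y + i));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(d, d));   // acc += d * d
    }
    double buf[4];
    _mm256_storeu_pd(buf, acc);
    double sum = buf[0] + buf[1] + buf[2] + buf[3];
    for (; i < nDim; i++) {                              // scalar tail
        double d = x[i] - y[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```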

But in the end the SIMD dist (distAVX2) still shows up as the bottleneck. That is simply because there is so much data to process (the function is called 99,000,000 times), even though the time per call is close to zero.

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
63.47 3.45 3.45 99000000 0.00 0.00 distAVX2(double*, double*, int)
17.87 4.42 0.97 1 0.97 0.97 initData(double*, int, int)
12.71 5.11 0.69 24 0.03 0.03 computeCentroids(WorkerArgs*)
4.61 5.36 0.25 24 0.01 0.11 computeAssignments(WorkerArgs*)
0.74 5.40 0.04 24 0.00 0.04 computeCost(WorkerArgs*)
0.37 5.42 0.02 2 0.01 0.01 logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
0.18 5.43 0.01 main
0.09 5.43 0.01 distNoSqrt(double*, double*, int)
0.00 5.43 0.00 4 0.00 0.00 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
0.00 5.43 0.00 2 0.00 0.00 CycleTimer::secondsPerTick()
0.00 5.43 0.00 1 0.00 4.32 kMeansThread(double*, double*, int*, int, int, int, double)
0.00 5.43 0.00 1 0.00 0.00 initCentroids(double*, int, int)
0.00 5.43 0.00 1 0.00 0.00 readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
0.00 5.43 0.00 1 0.00 0.00 writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

Part 3 Multi-Thread#

Just compute computeAssignments with 32 threads; a sketch of the split follows the timings below.

[Total Time]: 2023.298 ms # With SIMD dist
[Total Time]: 3529.115 ms # Without SIMD dist
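Here is a hedged sketch of the split, assuming the WorkerArgs layout from the starter code (data is M×N row-major, clusterCentroids is K×N) and the existing dist helper; my actual computeAssignmentsMutiThread / computeAssignmentsSingle split in the profile below differs a bit:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch: assign each of the M points to its nearest centroid, splitting the
// points into 32 contiguous ranges handled by separate std::threads.
void computeAssignmentsMultiThread(WorkerArgs* const args) {
    constexpr int kNumThreads = 32;
    const int span = (args->M + kNumThreads - 1) / kNumThreads;  // points per thread

    auto assignRange = [args](int start, int end) {
        for (int m = start; m < end; m++) {
            double minDist = 1e30;
            for (int k = 0; k < args->K; k++) {
                double d = dist(&args->data[m * args->N],
                                &args->clusterCentroids[k * args->N], args->N);
                if (d < minDist) {
                    minDist = d;
                    args->clusterAssignments[m] = k;
                }
            }
        }
    };

    std::vector<std::thread> workers;
    for (int t = 0; t < kNumThreads; t++) {
        int start = t * span;
        int end = std::min(args->M, start + span);
        if (start < end) workers.emplace_back(assignRange, start, end);
    }
    for (auto& w : workers) w.join();
}
```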

This is the profiling result.

# With SIMD dist
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
56.69 2.76 2.76 29357400 0.00 0.00 distAVX2(double*, double*, int)
20.95 3.78 1.02 1 1.02 1.02 initData(double*, int, int)
17.46 4.63 0.85 24 0.04 0.04 computeCentroids(WorkerArgs*)
3.08 4.78 0.15 computeAssignmentsSingle(WorkerArgs*, double*, int, int)
0.82 4.82 0.04 24 0.00 0.00 computeAssignmentsMutiThread(WorkerArgs*)
0.62 4.85 0.03 24 0.00 0.10 computeCost(WorkerArgs*)
0.41 4.87 0.02 _init
0.00 4.87 0.00 4 0.00 0.00 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)
0.00 4.87 0.00 2 0.00 0.00 logToFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, double*, int*, double*, int, int, int)
0.00 4.87 0.00 2 0.00 0.00 CycleTimer::secondsPerTick()
0.00 4.87 0.00 1 0.00 3.18 kMeansThread(double*, double*, int*, int, int, int, double)
0.00 4.87 0.00 1 0.00 0.00 initCentroids(double*, int, int)
0.00 4.87 0.00 1 0.00 0.00 readData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double**, double**, int**, int*, int*, int*, double*)
0.00 4.87 0.00 1 0.00 0.00 writeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double*, double*, int*, int*, int*, int*, double*)

It seems like magic happens: the reported number of calls to distAVX2 drops significantly (honestly I don't know why 🥹 — probably because gprof doesn't handle multithreaded programs well, so samples and call counts from the worker threads are largely lost), and the number of iterations stays the same, 24.
