by Ghassen Fatnassi
Just as the title says, this is my own write-up for Lab 1 of CS149 ⇒ don’t take it as ground truth!!
You can look at this as the answer sheet; for the problem set itself you should follow the ReadMe here.
Please note that the lab is intended to be done on a quad-core 4.2 GHz Intel Core i7,
but since I don’t have that, I’ll be running it on my own machine, an AMD Ryzen 3 3200U.
Another chance for blind exploration, haha.
Basic Multi-Threaded Generation & Analysis
This part is pretty straightforward code-wise: all you have to do is read the code and realize the only things you need to change are the args for each thread, making the start row and the number of rows handled by each thread equal.
The important pieces are easy to figure out once you take the image and cut it into num_threads separate parts horizontally. It can be done vertically too, but then you’d have to tweak the original code a bit.
I also added a timer on each worker (thread), as you can see, since I needed to check whether some threads had more load than others. Turns out they do, and there’s a very good reason why; we’ll discuss it in its respective part.
The results?
This induced a lot of thought in me: what is the optimal way to measure? I had two choices: either take the average over many iterations and call it a day (I was thinking that maybe the other processes on my computer were interfering and not giving my program the control it needs to fully harness those hyper-threads), or take the minimum and call it a day. Both approaches are, imo, viable, and both have their shortcomings (the compiler makes optimizations either way).
I made this graph with Claude after giving it the textual output of my shell. The results varied a lot. The optimal setting should be when num_threads equals the number of CPU hyper-threads, which is 4 in my case (my CPU has 2 cores with 2 threads per core ⇒ 4 hyper-threads total), and the graph supports this hypothesis. But this graph is actually cherry-picked; most of the other graphs I captured don’t show this behavior:
As you can see, the results aren’t very reproducible (primarily because my hardware is running other programs at the same time).
If there’s anything I learned from this first part of the lab, it’s that benchmarking hardware and parallel software is definitely not easy 🙃
Going back to the problem of threads doing unbalanced work: you can definitely see it if we time each thread.
Here’s the result:
4 threads:
(took 0.02 on thread 0)
(took 0.02 on thread 3)
(took 0.23 on thread 1) **Most Loaded thread**
(took 0.08 on thread 2)
8 threads:
(took 0.00 on thread 0)
(took 0.00 on thread 6)
(took 0.04 on thread 2)
(took 0.04 on thread 1)
(took 0.22 on thread 4) **Most Loaded thread**
(took 0.06 on thread 3)
⇒ we can see most of the load sitting on thread 1 in the first example and on thread 4 in the second. This stems from the fact that not all pixels in the Mandelbrot image are equal compute-wise: the parts of the image assigned to certain threads are dense in pixels that take a lot of iterations to converge (which are basically the brightest pixels).
Balanced Multi-Threaded Generation & Analysis
With an unbalanced assignment of load between threads, I reached 2x ≤ S ≤ 3x, with S being the speedup.
If we wanna get to 6x~8x speedups, we gotta up our game!
Now the question is: how do we make these threads do approximately equal work?
The lab guide says I need to find a way to change the pixel-to-thread mapping to maximize pixels/ms across all threads.
Some ideas I had while writing this:
void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int totalRows,
int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int endRow = startRow + totalRows;
for (int j = startRow; j < endRow; ++j)
{
for (int i = 0; i < width; ++i)
{
float x = x0 + i * dx;
float y = y0 + j * dy;
int index1 = (j * width + i);
output[index1] = mandel(x, y, maxIterations);
// the view is symmetric about the horizontal midline, so each
// computed row is mirrored to its counterpart in the bottom half
int index2 = ((height - j - 1) * width + i);
output[index2] = output[index1];
}
}
}
void workerThreadStart(WorkerArgs *const args)
{
// each thread atomically grabs the next CHUNK_SIZE rows of the top
// half of the image until none are left (the bottom half is mirrored)
const int CHUNK_SIZE = 16;
std::atomic<int> &nextRowIndex = *args->nextRow;
while (true)
{
int startRow = nextRowIndex.fetch_add(CHUNK_SIZE, std::memory_order_relaxed);
if (startRow >= args->height / 2)
{
break;
}
int rowsToProcess = std::min(CHUNK_SIZE, static_cast<int>(args->height / 2 - startRow));
mandelbrotSerial(
args->x0, args->y0,
args->x1, args->y1,
args->width, args->height,
startRow, rowsToProcess,
args->maxIterations,
args->output);
}
}
When running the code with more threads than there are hardware threads, the code doesn’t run any faster; on the contrary, it gets slower because of the extra overhead. Since we only have x cores, we can hope for at most an x-fold speedup [we’re not leveraging any SIMD instructions even though this code could be written that way; we’re only leveraging the raw fetch/decode power of the CPU]. Even that is a stretch, since the program is not perfectly parallel (not all pixels are equal in load) ⇒ we won’t see a better speedup with 8/16 threads on a quad-core (hyper-threaded) machine than with 4 threads.
Let's analyze this using Amdahl's Law. Say the image has M×N pixels and one row of M pixels must run serially; the serial fraction is then f = M/(M×N) = 1/N, so the theoretical maximum speedup is S ≤ 1/f = N (assuming the rest is perfectly parallel: no overhead, no bandwidth limit, ...). In the extreme where only a single pixel is serial, f = 1/(M×N) and the bound becomes the total number of pixels, assuming perfect parallelization.
In this program we’ll be implementing a[i]^exp[i] using SIMD, but there’s a catch: we won’t do it on real SIMD; we’ll be using CS149’s “fake vector intrinsics” to first get used to this style of programming (and, later, ISPC) without jumping directly into hard bugs. Thanks CS149 for the effort; lessgo.
Exploration:
So when I build and run the program using make and ./myexp without any modifications, this is what I get:
CLAMPED EXPONENT (required)
Wrong calculation at value[0]!
value = 2.360751 2.132397 2.646590 0.340891 0.111099 0.909588 0.459138 2.808919 1.542847 -0.433590 -0.934798 -0.451074 -0.373284 -0.480838 2.995698 1.051730
exp = 6 5 5 2 1 7 9 6 6 6 8 9 0 3 5 2
output = 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
gold = 9.999999 9.999999 9.999999 0.116207 0.111099 0.515126 0.000907 9.999999 9.999999 0.006645 0.583097 -0.000773 1.000000 -0.111172 9.999999 1.106135
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 0
Vector Utilization: -nan%
Utilized Vector Lanes: 0
Total Vector Lanes: 0
************************ Result Verification *************************
@@@ Failed!!!
ARRAY SUM (bonus)
Expected 14.686064, got 0.000000
.@@@ Failed!!!
I expected this, since many functions are declared but missing their implementations, so they won’t change their input.
I’m supposed to look at absVector() and find a way to use the CS149 intrinsics to implement SIMD versions of clampedExpSerial() and arraySumSerial().
absVector():
void absVector(float *values, float *output, int N)
{
__cs149_vec_float x;
__cs149_vec_float result;
__cs149_vec_float zero = _cs149_vset_float(0.f);
__cs149_mask maskAll, maskIsNegative, maskIsNotNegative;
// Note: Take a careful look at this loop indexing. This example
// code is not guaranteed to work when (N % VECTOR_WIDTH) != 0.
// Why is that the case?
for (int i = 0; i < N; i += VECTOR_WIDTH)
{
maskAll = _cs149_init_ones();                              // all lanes on
maskIsNegative = _cs149_init_ones(0);                      // all lanes off
_cs149_vload_float(x, values + i, maskAll);                // x = values[i]
_cs149_vlt_float(maskIsNegative, x, zero, maskAll);        // if (x < 0) {
_cs149_vsub_float(result, zero, x, maskIsNegative);        //   output[i] = -x;
maskIsNotNegative = _cs149_mask_not(maskIsNegative);       // } else {
_cs149_vload_float(result, values + i, maskIsNotNegative); //   output[i] = x; }
_cs149_vstore_float(output + i, result, maskAll);          // write back
}
}
The comments pretty much explain what each instruction does; it gets clearer after reading through their definitions in the code.
The comment says this function is not guaranteed to work if N mod VECTOR_WIDTH ≠ 0, and that’s perfectly reasonable: the last partial vector reads and writes past the end of the arrays. Here’s what I get when I launch absVector() with N=17 and VECTOR_WIDTH=4:
You have written to out of bound value!
Wrong calculation at value[17]!
clampedExpSerial():
void clampedExpSerial(float *values, int *exponents, float *output, int N)
{
for (int i = 0; i < N; i++)
{
float x = values[i];
int y = exponents[i];
if (y == 0)
{
output[i] = 1.f;
}
else
{
float result = x;
int count = y - 1;
while (count > 0)
{
result *= x;
count--;
}
if (result > 9.999999f)
{
result = 9.999999f;
}
output[i] = result;
}
}
}
The function is rather simple; the exponentiation is done naively here. It can be made a lot faster with binary exponentiation, turning O(e) multiplies into O(log e) for exponent e.
I know it’s not part of CS149, but I always like optimizing things when I see them! Here’s the optimized version:
void clampedBinaryExpSerial(float *values, int *exponents, float *output, int N)
{
for (int i = 0; i < N; i++)
{
float x = values[i];
int y = exponents[i];
float res = 1;
while (y > 0)
{
if (y & 1)
{
res = res * x;
if (res > 9.999999f)
{
res = 9.999999f;
break;
}
}
y >>= 1;
x *= x;
}
output[i] = res;
}
}
Enough stalling; on to the real deal. In the exploitation part I’ll discuss the approaches I used to make this function go BRRR using what I’ve learned so far in CS149, SIMD in particular.
Exploitation:
clampedExpSerial():
One thing I realized is that the number of unique masks I’m going to need is not hard to find: it’s (in most cases) exactly the number of if-clauses in my serial code. After much thought and debugging (it really took a lot of time), this approach turned out fine. I’m really proud of it, since it’s not only vectorized but also uses binary exponentiation ⇒ roughly O(log(max exponent)) vector steps per chunk, times N/VECTOR_WIDTH chunks.
void clampedBinaryExpVector(float *values, int *exponents, float *output, int N)
{
// CS149 STUDENTS TODO: Implement your vectorized version of
// clampedExpSerial() here.
__cs149_vec_int y, zero, two_v;
__cs149_vec_float x, nine, res;
__cs149_mask all1, if1, if2, if3;
two_v = _cs149_vset_int(2);
nine = _cs149_vset_float(9.999999f);
zero = _cs149_vset_int(0);
for (int i = 0; i < N; i += VECTOR_WIDTH)
{
int limit = VECTOR_WIDTH;
if (i + VECTOR_WIDTH > N)
{
limit = N - i;
}
all1 = _cs149_init_ones(limit);
res = _cs149_vset_float(1.f);
_cs149_vload_float(x, values + i, all1);
_cs149_vload_int(y, exponents + i, all1);
_cs149_vgt_int(if1, y, zero, all1);
while (_cs149_cntbits(if1)) // while any yi > 0
{
// creating if2 using if1
__cs149_vec_int times_2, by_2;
_cs149_vdiv_int(by_2, y, two_v, all1);
_cs149_vmult_int(times_2, by_2, two_v, all1);
_cs149_vgt_int(if2, y, times_2, if1);
if2 = _cs149_mask_and(if2, if1);
//
// calculating res and clamping it
_cs149_vmult_float(res, res, x, if2); // resi=resi*xi for all the i where (yi&1)==1
_cs149_vgt_float(if3, res, nine, all1);
if3 = _cs149_mask_and(if3, if2);
_cs149_vmove_float(res, nine, if3);
//
_cs149_vdiv_int(y, y, two_v, all1); // yi>>=1 for all i;
_cs149_vmult_float(x, x, x, all1); // x*=x
_cs149_vgt_int(if1, y, zero, if1); // refresh the active mask: lanes with yi > 0 stay live
if1 = _cs149_mask_and(if1, all1);
}
_cs149_vstore_float(output + i, res, all1);
}
}
This code works for any input length and any VECTOR_WIDTH, since I mask out the out-of-bounds lanes in the final chunk.
Here’s the result:
**CLAMPED EXPONENT (required)**
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 286
Vector Utilization: 82.3%
Utilized Vector Lanes: 706
Total Vector Lanes: 858
************************ Result Verification *************************
**Passed!!!**
Here’s the SIMD utilization as a function of vector width:
The main factor changing here is the final chunk of the array, which may not fill the vector completely and leaves some lanes idle ⇒ a loss in vector utilization. So the more granular, the better; that’s why we have a very high value at width 1, and if we set VECTOR_WIDTH >> N, utilization easily drops to around N/VECTOR_WIDTH, which is very low.
Another reason active lanes drop as we increase VECTOR_WIDTH: the wider the vector, the higher the probability that one element takes much longer than the others, creating load imbalance across the lanes (they all wait on the slowest one).
If this were real SIMD and we could time our operations, we would see vector utilization decreasing while the speedup still increases as we widen the vector, up to a certain threshold that is a function of the data (values and exponents) as well as of VECTOR_WIDTH.
arraySumSerial():
sqrt:
This program is just to double-check our understanding of all the previous concepts (the CS149 instructor’s words).
It’s about a fast Newton-based sqrt applied to 20 million floats between 0 and 3.
Exploration:
Exploitation:
saxpy:
In this program we’ll get introduced to saxpy and look at the limitations it runs into on a CPU.
Exploration:
Exploitation:
K-Means Faster:
I’m not a Stanford student, so I couldn’t access the data for this part, so I’m skipping it!