by Ghassen Fatnassi

Introduction

Just as the title says, this is my own write-up for Lab 1 of CS149 ⇒ don't take it as ground truth!!

You can look at this as the answer sheet; for the problem set itself, you should follow the ReadMe here.

Please note that the lab is intended to be done on a quad-core 4.2 GHz Intel Core i7, but since I don't have that, I'll be running it on my own machine, an AMD Ryzen 3 3200U. Another chance for blind exploration, haha.

Program 1: Parallel Fractal Generation Using Threads

Basic Multi-Threaded Generation & Analysis

This part is pretty straightforward code-wise: all you have to do is read the code and realize that the only things you need to change are the args for each thread, making the start row and the number of rows handled by each thread equal.

The important pieces here are easy to figure out once you take the image and cut it into num_threads separate parts horizontally. It can be done vertically too, but then you'd have to make some tweaks to the original code.
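The horizontal cut described above can be sketched as a small helper (this is not the starter code's API, just an illustration): split `height` rows into `numThreads` contiguous blocks, spreading any remainder over the first few threads.

```cpp
#include <utility>
#include <vector>

// Sketch: compute {startRow, numRows} for each thread when cutting the
// image into numThreads horizontal strips. The first (height % numThreads)
// threads get one extra row so no rows are dropped.
std::vector<std::pair<int, int>> splitRows(int height, int numThreads) {
    std::vector<std::pair<int, int>> ranges;
    int base = height / numThreads;
    int extra = height % numThreads;
    int start = 0;
    for (int t = 0; t < numThreads; ++t) {
        int rows = base + (t < extra ? 1 : 0);
        ranges.push_back({start, rows});
        start += rows;
    }
    return ranges;
}
```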

I also added a timer on each worker (thread), as you see, since I needed to check whether some threads had more load than others. It turns out they do, and there is a very good reason why, too; we'll discuss it in its respective part.
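The per-worker timer is just a pair of clock reads around the thread's work. A minimal sketch, where `doWork` is a hypothetical stand-in for the thread's share of the Mandelbrot computation:

```cpp
#include <chrono>
#include <cstdio>

// Sketch: wrap a worker's payload in steady_clock reads and report
// the elapsed wall-clock time for that thread in milliseconds.
double timeWorker(int threadId, void (*doWork)(int)) {
    auto start = std::chrono::steady_clock::now();
    doWork(threadId);
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::printf("thread %d: %.3f ms\n", threadId, ms);
    return ms;
}
```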

The results?

[Image: afinal.png]

This induced a lot of thought in me: it was a question of what the optimal way to measure is. I had two choices: either take the average over many iterations and call it a day (since I was thinking that maybe the other processes on my computer were interfering and not giving my program the freedom and control it needs to fully harness those hyper-threads), or take the minimum and call it a day. Both approaches are, imo, viable, since both have their shortcomings (and the compiler makes optimizations either way).
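The two reporting strategies above can be sketched in one helper: run the kernel several times and report both the minimum and the mean wall-clock time (the kernel here is just a placeholder for whatever is being benchmarked):

```cpp
#include <algorithm>
#include <chrono>
#include <utility>
#include <vector>

// Sketch: time `kernel` over `trials` runs and return {min, average} in
// milliseconds. Min filters out OS-noise spikes; average shows typical cost.
template <typename F>
std::pair<double, double> benchmarkMinAvg(F kernel, int trials) {
    std::vector<double> times;
    for (int t = 0; t < trials; ++t) {
        auto start = std::chrono::steady_clock::now();
        kernel();
        auto end = std::chrono::steady_clock::now();
        times.push_back(
            std::chrono::duration<double, std::milli>(end - start).count());
    }
    double minT = *std::min_element(times.begin(), times.end());
    double sum = 0.0;
    for (double t : times) sum += t;
    return {minT, sum / trials};
}
```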

I made this graph with Claude after giving it the textual output of my shell. The results were highly variable. It's expected that the optimal configuration is when num_threads in the code equals the number of CPU hyper-threads, which means 4 here (my CPU has 2 cores and 2 threads per core ⇒ 4 hyper-threads in total), and the graph supports this hypothesis. But this graph is actually cherry-picked; most other graphs I captured don't show this behavior:

[Image: Screenshot from 2024-12-26 16-55-16.png]

[Image: Screenshot from 2024-12-26 18-53-25.png]

As you can see, the results aren't very reproducible (primarily because my hardware is also running other programs).

If there's anything I learned from this first part of the lab, it's that benchmarking hardware and parallel software is definitely not easy 🙃

Going back to the problem of threads doing unbalanced amounts of work: you can definitely see it if we time each thread.

Here's the result:

⇒ We can see most of the load falling on thread 1 in the first example and on thread 4 in the second. This stems from the fact that not all pixels in the Mandelbrot image are equal compute-wise: some parts of the image assigned to certain threads are dense in pixels that take a long time to converge (which are basically the brightest pixels).

Balanced Multi-Threaded Generation & Analysis

With an unbalanced assignment of load between threads, I reached 2x ≤ S ≤ 3x, with S being the speedup (serial time divided by parallel time).

If we want to get to 6x~8x speedups, we gotta up our game!

Now the question is : How do we make these threads do approximately equal work ?

The lab guide says I need to find a way to change the pixel-to-thread mapping to ensure the maximum pixels/ms across all threads.

Some ideas I'm having while writing this:

// Computes rows [startRow, startRow + totalRows) and, exploiting the
// Mandelbrot image's symmetry about the horizontal axis, also fills in
// the mirrored row (height - j - 1) for free. Callers must therefore
// only hand out rows from the top half of the image.
void mandelbrotSerial(
    float x0, float y0, float x1, float y1,
    int width, int height,
    int startRow, int totalRows,
    int maxIterations,
    int output[])
{
  float dx = (x1 - x0) / width;
  float dy = (y1 - y0) / height;

  int endRow = startRow + totalRows;

  for (int j = startRow; j < endRow; ++j)
  {
    for (int i = 0; i < width; ++i)
    {
      float x = x0 + i * dx;
      float y = y0 + j * dy;

      int index1 = (j * width + i);
      output[index1] = mandel(x, y, maxIterations);
      // Mirror the computed value into the symmetric bottom-half row.
      int index2 = ((height - j - 1) * width + i);
      output[index2] = output[index1];
    }
  }
}
void workerThreadStart(WorkerArgs *const args)
{
    // Dynamic (work-queue) assignment: every thread repeatedly grabs the
    // next chunk of rows from a shared atomic counter, so faster threads
    // naturally end up processing more chunks.
    const int CHUNK_SIZE = 16;
    std::atomic<int> &nextRowIndex = *args->nextRow;

    while (true)
    {
        int startRow = nextRowIndex.fetch_add(CHUNK_SIZE, std::memory_order_relaxed);
        // Only the top half is handed out; mandelbrotSerial mirrors each
        // computed row into the bottom half.
        if (startRow >= args->height / 2)
        {
            break;
        }

        int rowsToProcess = std::min(CHUNK_SIZE, args->height / 2 - startRow);

        mandelbrotSerial(
            args->x0, args->y0,
            args->x1, args->y1,
            args->width, args->height,
            startRow, rowsToProcess,
            args->maxIterations,
            args->output);
    }
}
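For completeness, here's a self-contained sketch of the driver side of this pattern (not the lab's actual `WorkerArgs` plumbing): all workers share one atomic row counter, and `processChunk` stands in for the per-chunk Mandelbrot call.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sketch: spawn numThreads workers that pull CHUNK-sized row ranges from
// a shared atomic counter until totalRows rows have been handed out.
void runWorkers(int numThreads, int totalRows, int chunkSize,
                void (*processChunk)(int startRow, int numRows)) {
    std::atomic<int> nextRow{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            while (true) {
                int start = nextRow.fetch_add(chunkSize,
                                              std::memory_order_relaxed);
                if (start >= totalRows) break;
                int rows = std::min(chunkSize, totalRows - start);
                processChunk(start, rows);
            }
        });
    }
    for (auto &w : workers) w.join();
}
```

Because the counter is the only shared state, there are no locks on the hot path, and load balancing falls out for free: a thread stuck on an expensive region simply grabs fewer chunks.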

Program 2: Vectorizing Code Using SIMD Intrinsics

In this program we'll be implementing a[i]^exp[i] using SIMD, but there's a catch: we won't do it with real SIMD. We'll be doing it using CS149's "fake vector intrinsics" to first get used to thinking in vectors (before ISPC) instead of jumping directly into hard bugs. Thanks, CS149, for the effort; lessgo.
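The core idea behind those fake intrinsics is masked lanes: all lanes in a vector run the exponent loop together, but a per-lane mask decides which lanes still update. Here's a plain-C++ sketch of that idea (this is my own emulation, not the CS149 intrinsics API; `VECTOR_WIDTH = 4` and an input length divisible by it are assumptions):

```cpp
#include <array>

constexpr int VECTOR_WIDTH = 4; // assumed lane count

// Sketch: compute output[i] = values[i]^exponents[i], clamped to
// 9.999999f, processing VECTOR_WIDTH lanes at a time with a mask.
// n is assumed to be a multiple of VECTOR_WIDTH.
void clampedExpVector(const float *values, const int *exponents,
                      float *output, int n) {
    for (int i = 0; i < n; i += VECTOR_WIDTH) {
        std::array<float, VECTOR_WIDTH> result;
        std::array<int, VECTOR_WIDTH> count;
        std::array<bool, VECTOR_WIDTH> active; // the "mask"
        for (int l = 0; l < VECTOR_WIDTH; ++l) {
            result[l] = 1.0f;
            count[l] = exponents[i + l];
            active[l] = count[l] > 0; // exponent 0 keeps result = 1
        }
        bool anyActive = true;
        while (anyActive) { // loop as long as some lane has work left
            anyActive = false;
            for (int l = 0; l < VECTOR_WIDTH; ++l) {
                if (!active[l]) continue; // masked-off lane: no update
                result[l] *= values[i + l];
                if (--count[l] == 0) active[l] = false;
                else anyActive = true;
            }
        }
        for (int l = 0; l < VECTOR_WIDTH; ++l)
            output[i + l] = result[l] > 9.999999f ? 9.999999f : result[l];
    }
}
```

Note the cost model this exposes: the whole vector iterates until the *largest* exponent in the group finishes, which is exactly why vector utilization drops when exponents within a group diverge.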

Exploration:

Exploitation:

Program 3: Parallel Fractal Generation Using ISPC:

Program 4: Iterative sqrt:

This program is just to double-check our understanding of all the previous concepts (the CS149 instructor's words).

It's about a fast Newton-based sqrt applied to 20 million floats between 0 and 3.
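A sketch of the Newton scheme involved: iterate a guess g toward 1/sqrt(a), then return a·g. The starting guess and tolerance here are my assumptions, not necessarily the lab's exact constants; the key property is that inputs near 3 need many more iterations than inputs near 1, which is what makes the workload uneven.

```cpp
#include <cmath>

// Sketch: Newton iteration converging g -> 1/sqrt(a) via
//   g' = (3g - a*g^3) / 2
// then sqrt(a) = a * g. Valid for a in (0, 3) with initial guess 1.0.
float newtonSqrt(float a) {
    float g = 1.0f;
    float err = std::fabs(g * g * a - 1.0f);
    while (err > 0.00001f) {
        g = (3.0f * g - a * g * g * g) * 0.5f;
        err = std::fabs(g * g * a - 1.0f);
    }
    return a * g;
}
```

The division-free update is the point: it's cheap per iteration and vectorizes cleanly, but the iteration *count* varies per input, which is the data-dependent behavior this program explores.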

Exploration:

Exploitation:

Program 5: BLAS saxpy:

In this program we'll get better acquainted with saxpy and check its limitations on the CPU.
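For reference, saxpy ("single-precision a·x plus y") is this one-liner loop. It does one multiply-add per element against two loads and one store, so it's memory-bandwidth bound, which is the CPU limitation this program exposes:

```cpp
// saxpy: result[i] = scale * x[i] + y[i]. Very little arithmetic per byte
// moved, so throughput is capped by memory bandwidth, not compute.
void saxpy(int n, float scale, const float *x, const float *y,
           float *result) {
    for (int i = 0; i < n; ++i)
        result[i] = scale * x[i] + y[i];
}
```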

Exploration:

Exploitation:

Program 6: Making K-Means Faster:

I'm not a Stanford student, so I couldn't access the data for this part; I'm skipping it!

LAB1 Done, Awesome Nuggets CS149