by Ghassen Fatnassi
Just as the title says, this is my own write-up for Lab 1 of CS149 ⇒ don’t take it as ground truth!!
You can look at this as the answer sheet; for the problem set itself you should follow the ReadMe here.
Please note that the lab is intended to be done on a quad-core 4.2 GHz Intel Core i7,
but since I don’t have that, I’ll be running it on my own machine, an AMD Ryzen 3 3200U.
Another chance for blind exploration, haha.
Basic Multi-Threaded Generation & Analysis
This part is pretty straightforward code-wise: all you have to do is read the code and realize the only things you need to change are the args for each thread, making the start row and the number of rows handled by each thread equal.
The important pieces are easy to figure out once you take the image and cut it into num_threads separate parts horizontally. It can be done vertically too, but then you’d have to tweak the original code a bit.
I also added a timer on each worker (thread), as you can see, since I needed to check whether some threads had more load than others. Turns out they do, and there’s a very good reason why; we’ll discuss it in its respective part.
The results?
This induced a lot of thought in me: what is the optimal way to measure? I had two choices: either take the average over many iterations and call it a day (I was thinking that maybe the other processes on my computer were interfering and not giving my program the control it needs to fully harness those hyper-threads), or take the minimum and call it a day. Both approaches are, imo, viable, and both have their shortcomings (the compiler makes optimizations either way).
I made this graph with Claude after giving it the textual output of my shell. The results varied a lot. The optimal setting should be when num_threads equals the number of CPU hyper-threads, which is 4 in my case (my CPU has 2 cores with 2 threads per core ⇒ 4 hyper-threads total), and the graph supports this hypothesis. But this graph is actually cherry-picked; most of the other graphs I captured don’t show this behavior:
As you can see, the results aren’t very reproducible (primarily because my hardware is running other programs at the same time).
If there’s anything I learned from this first part of the lab, it’s that benchmarking hardware and parallel software is definitely not easy 🙃
Going back to the problem of threads doing unbalanced work: you can definitely see it if we time each thread.
Here’s the result:
4 threads:
(took 0.02 on thread 0)
(took 0.02 on thread 3)
(took 0.23 on thread 1) **Most Loaded thread**
(took 0.08 on thread 2)
8 threads:
(took 0.00 on thread 0)
(took 0.00 on thread 6)
(took 0.04 on thread 2)
(took 0.04 on thread 1)
(took 0.22 on thread 4) **Most Loaded thread**
(took 0.06 on thread 3)
⇒ we can see most of the load sitting on thread 1 in the first example and on thread 4 in the second. This stems from the fact that not all pixels in the Mandelbrot image are equal compute-wise: the parts of the image assigned to certain threads are dense in pixels that take a lot of iterations to converge (which are basically the brightest pixels).
Balanced Multi-Threaded Generation & Analysis
With an unbalanced assignment of load between threads, I reached 2x ≤ S ≤ 3x, with S being the speedup.
If we wanna get to 6x~8x speedups, we gotta up our game!
Now the question is: how do we make these threads do approximately equal work?
The lab guide says I need to find a way to change the pixel-to-thread mapping to maximize pixels/ms across all threads.
Some ideas I had while writing this:
void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int totalRows,
int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;
int endRow = startRow + totalRows;
for (int j = startRow; j < endRow; ++j)
{
for (int i = 0; i < width; ++i)
{
float x = x0 + i * dx;
float y = y0 + j * dy;
int index1 = (j * width + i);
output[index1] = mandel(x, y, maxIterations);
// the view is symmetric about the horizontal midline, so each
// computed row is mirrored to its counterpart in the bottom half
int index2 = ((height - j - 1) * width + i);
output[index2] = output[index1];
}
}
}
void workerThreadStart(WorkerArgs *const args)
{
// each thread atomically grabs the next CHUNK_SIZE rows of the top
// half of the image until none are left (the bottom half is mirrored)
const int CHUNK_SIZE = 16;
std::atomic<int> &nextRowIndex = *args->nextRow;
while (true)
{
int startRow = nextRowIndex.fetch_add(CHUNK_SIZE, std::memory_order_relaxed);
if (startRow >= args->height / 2)
{
break;
}
int rowsToProcess = std::min(CHUNK_SIZE, static_cast<int>(args->height / 2 - startRow));
mandelbrotSerial(
args->x0, args->y0,
args->x1, args->y1,
args->width, args->height,
startRow, rowsToProcess,
args->maxIterations,
args->output);
}
}
When running the code with more threads than there are hardware threads, the code doesn’t run any faster; on the contrary, it gets slower because of the extra overhead. Since we only have x cores, we can hope for at most an x-fold speedup [we’re not leveraging any SIMD instructions even though this code could be written that way; we’re only leveraging the raw fetch/decode power of the CPU]. Even that is a stretch, since the program is not perfectly parallel (not all pixels are equal in load) ⇒ we won’t see a better speedup with 8/16 threads on a quad-core (hyper-threaded) machine than with 4 threads.
Let's analyze this using Amdahl's Law. Say the image has M×N pixels and one row of M pixels must run serially; the serial fraction is then f = M/(M×N) = 1/N, so the theoretical maximum speedup is S ≤ 1/f = N (assuming the rest is perfectly parallel: no overhead, no bandwidth limit, ...). In the extreme where only a single pixel is serial, f = 1/(M×N) and the bound becomes the total number of pixels, assuming perfect parallelization.
In this program we’ll be implementing a[i]^exp[i] using SIMD, but there’s a catch: we won’t do it on real SIMD; we’ll be using CS149’s “fake vector intrinsics” to first get used to this style of programming (and, later, ISPC) without jumping directly into hard bugs. Thanks CS149 for the effort; lessgo.
Exploration:
So when I build and run the program using make and ./myexp without any modifications, this is what I get:
CLAMPED EXPONENT (required)
Wrong calculation at value[0]!
value = 2.360751 2.132397 2.646590 0.340891 0.111099 0.909588 0.459138 2.808919 1.542847 -0.433590 -0.934798 -0.451074 -0.373284 -0.480838 2.995698 1.051730
exp = 6 5 5 2 1 7 9 6 6 6 8 9 0 3 5 2
output = 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
gold = 9.999999 9.999999 9.999999 0.116207 0.111099 0.515126 0.000907 9.999999 9.999999 0.006645 0.583097 -0.000773 1.000000 -0.111172 9.999999 1.106135
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 0
Vector Utilization: -nan%
Utilized Vector Lanes: 0
Total Vector Lanes: 0
************************ Result Verification *************************
@@@ Failed!!!
ARRAY SUM (bonus)
Expected 14.686064, got 0.000000
.@@@ Failed!!!
I expected this, since many functions are declared but missing their implementations, so they won’t change their input.
I’m supposed to look at absVector() and find a way to use the CS149 intrinsics to implement SIMD versions of clampedExpSerial() and arraySumSerial().
absVector():
void absVector(float *values, float *output, int N)
{
__cs149_vec_float x;
__cs149_vec_float result;
__cs149_vec_float zero = _cs149_vset_float(0.f);
__cs149_mask maskAll, maskIsNegative, maskIsNotNegative;
// Note: Take a careful look at this loop indexing. This example
// code is not guaranteed to work when (N % VECTOR_WIDTH) != 0.
// Why is that the case?
for (int i = 0; i < N; i += VECTOR_WIDTH)
{
maskAll = _cs149_init_ones();                              // all lanes on
maskIsNegative = _cs149_init_ones(0);                      // all lanes off
_cs149_vload_float(x, values + i, maskAll);                // x = values[i]
_cs149_vlt_float(maskIsNegative, x, zero, maskAll);        // if (x < 0) {
_cs149_vsub_float(result, zero, x, maskIsNegative);        //   output[i] = -x;
maskIsNotNegative = _cs149_mask_not(maskIsNegative);       // } else {
_cs149_vload_float(result, values + i, maskIsNotNegative); //   output[i] = x; }
_cs149_vstore_float(output + i, result, maskAll);          // write back
}
}
The comments pretty much explain what each instruction does; it gets clearer after reading through their definitions in the code.
The comment says this function is not guaranteed to work if N mod VECTOR_WIDTH ≠ 0, and that’s perfectly reasonable: the last partial vector reads and writes past the end of the arrays. Here’s what I get when I launch absVector() with N=17 and VECTOR_WIDTH=4:
You have written to out of bound value!
Wrong calculation at value[17]!
clampedExpSerial():
void clampedExpSerial(float *values, int *exponents, float *output, int N)
{
for (int i = 0; i < N; i++)
{
float x = values[i];
int y = exponents[i];
if (y == 0)
{
output[i] = 1.f;
}
else
{
float result = x;
int count = y - 1;
while (count > 0)
{
result *= x;
count--;
}
if (result > 9.999999f)
{
result = 9.999999f;
}
output[i] = result;
}
}
}
The function is rather simple; the exponentiation is done naively here. It can be made a lot faster with binary exponentiation, turning O(e) multiplies into O(log e) for exponent e.
I know it’s not part of CS149, but I always like optimizing things when I see them! Here’s the optimized version:
void clampedBinaryExpSerial(float *values, int *exponents, float *output, int N)
{
for (int i = 0; i < N; i++)
{
float x = values[i];
int y = exponents[i];
float res = 1;
while (y > 0)
{
if (y & 1)
{
res = res * x;
if (res > 9.999999f)
{
res = 9.999999f;
break;
}
}
y >>= 1;
x *= x;
}
output[i] = res;
}
}
Enough stalling; on to the real deal. In the exploitation part I’ll discuss the approaches I used to make this function go BRRR using what I’ve learned so far in CS149, SIMD in particular.
Exploitation:
clampedExpSerial():
One thing I realized is that the number of unique masks I’m going to need is not hard to find: it’s (in most cases) exactly the number of if-clauses in my serial code. After much thought and debugging (it really took a lot of time), this approach turned out fine. I’m really proud of it, since it’s not only vectorized but also uses binary exponentiation ⇒ roughly O(log(max exponent)) vector steps per chunk, times N/VECTOR_WIDTH chunks.
void clampedBinaryExpVector(float *values, int *exponents, float *output, int N)
{
// CS149 STUDENTS TODO: Implement your vectorized version of
// clampedExpSerial() here.
__cs149_vec_int y, zero, two_v;
__cs149_vec_float x, nine, res;
__cs149_mask all1, if1, if2, if3;
two_v = _cs149_vset_int(2);
nine = _cs149_vset_float(9.999999f);
zero = _cs149_vset_int(0);
for (int i = 0; i < N; i += VECTOR_WIDTH)
{
int limit = VECTOR_WIDTH;
if (i + VECTOR_WIDTH > N)
{
limit = N - i;
}
all1 = _cs149_init_ones(limit);
res = _cs149_vset_float(1.f);
_cs149_vload_float(x, values + i, all1);
_cs149_vload_int(y, exponents + i, all1);
_cs149_vgt_int(if1, y, zero, all1);
while (_cs149_cntbits(if1)) // while any yi > 0
{
// creating if2 using if1
__cs149_vec_int times_2, by_2;
_cs149_vdiv_int(by_2, y, two_v, all1);
_cs149_vmult_int(times_2, by_2, two_v, all1);
_cs149_vgt_int(if2, y, times_2, if1);
if2 = _cs149_mask_and(if2, if1);
//
// calculating res and clamping it
_cs149_vmult_float(res, res, x, if2); // resi=resi*xi for all the i where (yi&1)==1
_cs149_vgt_float(if3, res, nine, all1);
if3 = _cs149_mask_and(if3, if2);
_cs149_vmove_float(res, nine, if3);
//
_cs149_vdiv_int(y, y, two_v, all1); // yi>>=1 for all i;
_cs149_vmult_float(x, x, x, all1); // x*=x
_cs149_vgt_int(if1, y, zero, if1); // refresh the active mask: lanes with yi > 0 stay live
if1 = _cs149_mask_and(if1, all1);
}
_cs149_vstore_float(output + i, res, all1);
}
}
This code works for any input length and any VECTOR_WIDTH, since I mask out the out-of-bounds lanes in the final chunk.
Here’s the result:
**CLAMPED EXPONENT (required)**
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 286
Vector Utilization: 82.3%
Utilized Vector Lanes: 706
Total Vector Lanes: 858
************************ Result Verification *************************
**Passed!!!**
Here’s the SIMD utilization as a function of vector width:
The main factor changing here is the final chunk of the array, which may not fill the vector completely and leaves some lanes idle ⇒ a loss in vector utilization. So the more granular, the better; that’s why we have a very high value at width 1, and if we set VECTOR_WIDTH >> N, utilization easily drops to around N/VECTOR_WIDTH, which is very low.
Another reason active lanes drop as we increase VECTOR_WIDTH: the wider the vector, the higher the probability that one element takes much longer than the others, creating load imbalance across the lanes (they all wait on the slowest one).
If this were real SIMD and we could time our operations, we would see vector utilization decreasing while the speedup still increases as we widen the vector, up to a certain threshold that is a function of the data (values and exponents) as well as of VECTOR_WIDTH.
arraySumSerial():
sqrt:
This program is just to double-check our understanding of all the previous concepts (the CS149 instructor’s words).
It’s about a fast Newton-based sqrt applied to 20 million floats between 0 and 3.
Exploration:
Exploitation:
saxpy:
In this program we’ll get introduced to saxpy and look at the limitations it runs into on a CPU.
Exploration:
Exploitation:
K-Means Faster:
I’m not a Stanford student, so I couldn’t access the data for this part, so I’m skipping it!