void mv_multiply_add(int m, int n, float alpha, float beta, float* y, float* A, float* x) {\n",
"\t// y = beta * y + alpha * A * x, with A stored row-major as an m-by-n array\n",
"\tfor (int i = 0; i < m; i++) {\n",
"\t\tfloat acc = 0.0f; // product accumulator\n",
"\t\tfor (int j = 0; j < n; j++) {\n",
"\t\t\tacc += A[i*n+j] * x[j];\n",
"\t\t}\n",
"\t\ty[i] = beta * y[i] + alpha * acc;\n",
"\t}\n",
"}\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"void mm_multiply_add(int m, int n, int p, float alpha, float beta, float* Y, float* A, float* X) {\n",
"\t// Y = beta * Y + alpha * A * X, with A (m-by-n), X (n-by-p) and Y (m-by-p) stored row-major\n",
"\tfor (int k = 0; k < p; k++) { // one matrix-vector update per column of X\n",
"\t\tfor (int i = 0; i < m; i++) {\n",
"\t\t\tfloat acc = 0.0f; // product accumulator\n",
"\t\t\tfor (int j = 0; j < n; j++) {\n",
"\t\t\t\tacc += A[i*n+j] * X[j*p+k];\n",
"\t\t\t}\n",
"\t\t\tY[i*p+k] = beta * Y[i*p+k] + alpha * acc;\n",
"\t\t}\n",
"\t}\n",
"}\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
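"The extra outer loop (over $k \\in \\{0, \\ldots, p-1\\}$) repeats the matrix-vector update once per column of $X$, which multiplies the operation count by $p$. These additional iterations are independent of one another, so an efficient implementation can run them in parallel.\n",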
"\n",
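"As a sketch of why this holds, assume row-major storage with $A \\in \\mathbb{R}^{m \\times n}$, $X \\in \\mathbb{R}^{n \\times p}$ and $Y \\in \\mathbb{R}^{m \\times p}$ (shapes inferred from the index expressions in the code rather than stated explicitly). Each output entry is then updated as\n",
"\n",
"$$Y_{ik} \\leftarrow \\beta\\, Y_{ik} + \\alpha \\sum_{j=0}^{n-1} A_{ij}\\, X_{jk}, \\qquad 0 \\le i < m, \\quad 0 \\le k < p.$$\n",
"\n",
"For a fixed $k$ this is the matrix-vector update applied to column $k$ of $X$ and $Y$. The right-hand side for $Y_{ik}$ reads no entry of $Y$ other than $Y_{ik}$ itself, so the $m \\cdot p$ accumulations do not depend on one another.\n",
"\n",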
"