Different results when using more than one thread
The following code performs a transposed matrix-vector multiplication, where the matrix is sparse and stored in CSR format. The result differs depending on the number of threads; I think the reason is concurrent memory access and addition.
Is there a way to use multithreading but keep the result the same as with a single thread?
#pragma omp parallel for num_threads(m_numthreads)
for (int i = 0; i < matrix.get_rowptr()->size() - 1; ++i)
{
    for (int j = matrix.get_rowptr()->operator[](i); j < matrix.get_rowptr()->operator[](i + 1); ++j)
    {
        result[matrix.get_columnindex()->operator[](j)] += matrix.get_value()->operator[](j) * vector1[i];
    }
}
c++ vector openmp
edited Nov 22 '18 at 18:35 by Toby Speight
asked Nov 21 '18 at 20:10 by vydesaster
Nobody can tell, without knowing the types of result and matrix at least, and whether there's any aliasing of the two.
– Toby Speight, Nov 22 '18 at 9:19

result and vector1 are STL double vectors. Matrix is a class which holds the three vectors that are needed to describe a sparse matrix in CSR format.
– vydesaster, Nov 22 '18 at 18:12

A Minimal, Complete, and Verifiable example would make the question much clearer - could you edit so that others can reproduce and diagnose the issue you're seeing?
– Toby Speight, Nov 22 '18 at 18:18

BTW, if you're comparing int against std::size_t like that, it suggests you have nowhere near enough compiler warnings enabled.
– Toby Speight, Nov 22 '18 at 18:22
2 Answers
The increment operation should indeed be made atomic to make sure two threads are not updating the value at the same time, which would be a race condition. This will make the code correct, but since atomic increment is slower than standard increment, it may make the parallel code slower than the serial one overall.
Performance may also be affected by false sharing: if the vector size is not much greater than the number of threads, two threads will often attempt to increment values that belong to the same cache line, and much time is spent synchronizing the cache between CPUs.
#pragma omp parallel for num_threads(m_numthreads)
for (int i = 0; i < matrix.get_rowptr()->size() - 1; ++i)
    for (int j = matrix.get_rowptr()->operator[](i); j < matrix.get_rowptr()->operator[](i + 1); ++j)
    {
        #pragma omp atomic
        result[matrix.get_columnindex()->operator[](j)] += matrix.get_value()->operator[](j) * vector1[i];
    }
This being said, for a given element in the result vector, the order in which the various products are added to make that element will not be the same in the serial and parallel cases. Rounding errors will add differently and one should not expect the serial and parallel results to be the same, but rather the difference between these results to be small relative to the precision of the float or double format.
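As a minimal standalone illustration of why the order matters (not taken from the original answer): floating-point addition is not associative, so regrouping the same terms can change the rounded result.
#include <iomanip>
#include <iostream>

int main()
{
    double x = (0.1 + 0.2) + 0.3;   // summed left to right
    double y = 0.1 + (0.2 + 0.3);   // same terms, different grouping
    // Under IEEE 754 doubles the two groupings round differently,
    // so the comparison prints "false".
    std::cout << std::setprecision(17) << x << '\n'
              << y << '\n'
              << std::boolalpha << (x == y) << '\n';
}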
answered Nov 22 '18 at 9:07 by Brice

You should probably point out that there will be very little (possibly negative) speed increase from parallelisation when the entire body of the loop is in an atomic block.
– Toby Speight, Nov 22 '18 at 18:21

It's probably more efficient to make the elements of result be some suitable std::atomic<> type instead. If not, at least bring the computation of the lhs and rhs ahead of the atomic section, so only the += is affected.
– Toby Speight, Nov 22 '18 at 18:32

I have already stated that since atomic increment is slower than standard increment, it may make the parallel code slower than the serial one. Beyond that, I do not know for sure in which direction the balance between one fast op at a time and several slow ones in parallel will tip. This depends on too many factors, and there may still be a small performance improvement.
– Brice, Nov 23 '18 at 8:42

I have tried the atomic pragma and it worked fine. But, as stated, the performance decreased. I am totally fine with that, though; at least I am getting reproducible results.
– vydesaster, Nov 23 '18 at 19:20
Without a reproducible example, I've had to guess at the context of this code:
#include <vector>

class SparseMatrix
{
    std::vector<double> values = { 5, 8, 3, 6, };
    std::vector<std::size_t> rows = { 0, 0, 2, 3, 4, };
    std::vector<std::size_t> cols = { 1, 2, 1, 1, };

public:
    const std::vector<std::size_t> *get_rowptr() const { return &rows; }
    const std::vector<std::size_t> *get_columnindex() const { return &cols; }
    const std::vector<double> *get_value() const { return &values; }
};

#include <array>
#include <iostream>

int main()
{
    SparseMatrix matrix;
    std::array<double, 4> result{};
    std::array<double, 4> vector1{ 1, 2, 3, 4 };

#pragma omp parallel for
    for (int i = 0; i < matrix.get_rowptr()->size() - 1; ++i)
    {
        for (int j = matrix.get_rowptr()->operator[](i); j < matrix.get_rowptr()->operator[](i + 1); ++j)
        {
            result[matrix.get_columnindex()->operator[](j)] += matrix.get_value()->operator[](j) * vector1[i];
        }
    }

    for (auto const& i : result)
        std::cout << i << " ";
    std::cout << '\n';
}
With some suitable variables we can simplify the code so we see what's going on:
auto const& rows = *matrix.get_rowptr();
auto const& cols = *matrix.get_columnindex();
auto const& values = *matrix.get_value();

#pragma omp parallel for
for (std::size_t i = 0; i < rows.size() - 1; ++i)
{
    for (std::size_t j = rows[i]; j < rows[i+1]; ++j)
    {
        result[cols[j]] += values[j] * vector1[i];
    }
}
Now we can see in the body of the loop that we're assigning to a result entry that other threads may also be writing to. We need to serialize the access to result[cols[j]] so that only one thread at a time executes the +=. We can do that by marking the operation indivisible, using the OpenMP atomic directive:
#pragma omp parallel for
for (std::size_t i = 0; i < rows.size() - 1; ++i)
{
    for (std::size_t j = rows[i]; j < rows[i+1]; ++j)
    {
        auto& dest = result[cols[j]];
        auto const term = values[j] * vector1[i];
#pragma omp atomic
        dest += term;
    }
}
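As a rough sketch of the std::atomic<> alternative suggested in the comments on the other answer (an assumption on my part, not code from the original post): with a C++20 compiler, std::atomic<double>::fetch_add lets each element of result serialize its own updates, here using the same toy CSR data as above.
#include <atomic>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // Same toy CSR matrix and input vector as in the example above.
    std::vector<double>      values  = { 5, 8, 3, 6 };
    std::vector<std::size_t> rows    = { 0, 0, 2, 3, 4 };
    std::vector<std::size_t> cols    = { 1, 2, 1, 1 };
    std::vector<double>      vector1 = { 1, 2, 3, 4 };

    // Each element is an atomic double; concurrent updates to the same
    // index are serialized without a critical section or atomic pragma.
    std::vector<std::atomic<double>> result(4);

#pragma omp parallel for
    for (std::size_t i = 0; i < rows.size() - 1; ++i)
    {
        for (std::size_t j = rows[i]; j < rows[i + 1]; ++j)
        {
            // fetch_add on atomic<double> requires C++20.
            result[cols[j]].fetch_add(values[j] * vector1[i],
                                      std::memory_order_relaxed);
        }
    }

    for (auto const& x : result)
        std::cout << x.load() << " ";
    std::cout << '\n';    // expected: 0 43 16 0
}
Note that, like the atomic pragma, this only removes the data race; the summation order still depends on the thread schedule, so the last bits of the result may differ from the serial run.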
answered Nov 30 '18 at 8:42 by Toby Speight

Thank you very much, Toby. You nailed it. Sorry I was not able to post the code over the last few days.
– vydesaster, Dec 4 '18 at 20:13