Different results when using more than one thread

The following code performs a transpose matrix-vector multiplication where the matrix is sparse and stored in CSR format. Depending on the number of threads, the result is different. I think the reason is the concurrent memory access and addition.
Is there a way to use multithreading but keep the result the same as for single threading?

#pragma omp parallel for num_threads(m_numthreads)
for (int i = 0; i < matrix.get_rowptr()->size() - 1; ++i)
{
    for (int j = matrix.get_rowptr()->operator[](i); j < matrix.get_rowptr()->operator[](i + 1); ++j)
    {
        result[matrix.get_columnindex()->operator[](j)] += matrix.get_value()->operator[](j) * vector1[i];
    }
}

Tags: c++, vector, openmp

asked Nov 21 '18 at 20:10 by vydesaster · edited Nov 22 '18 at 18:35 by Toby Speight

  • Nobody can tell, without knowing the types of result and matrix at least, and whether there's any aliasing of the two.

    – Toby Speight
    Nov 22 '18 at 9:19

  • result and vector1 are stl double vectors. Matrix is a class which holds the three vectors that are needed to describe a sparse matrix in CSR format.

    – vydesaster
    Nov 22 '18 at 18:12

  • A Minimal, Complete, and Verifiable example would make the question much clearer - could you edit so that others can reproduce and diagnose the issue you're seeing?

    – Toby Speight
    Nov 22 '18 at 18:18

  • BTW, if you're comparing int against std::size_t like that, it suggests you have nowhere near enough compiler warnings enabled.

    – Toby Speight
    Nov 22 '18 at 18:22

2 Answers

The increment operation should indeed be made atomic to make sure two threads are not updating the value at the same time, which would be a race condition. This will make the code correct, but since atomic increment is slower than standard increment, it may make the parallel code slower than the serial one overall.

Performance may also be affected by false-sharing: if the vector size is not much greater than the number of threads, it will often happen that two threads attempt to increment values belonging to the same cache line and much time is spent synchronizing the cache between CPUs.



#pragma omp parallel for num_threads(m_numthreads)
for (int i = 0; i < matrix.get_rowptr()->size() - 1; ++i)
    for (int j = matrix.get_rowptr()->operator[](i); j < matrix.get_rowptr()->operator[](i + 1); ++j)
    {
        #pragma omp atomic
        result[matrix.get_columnindex()->operator[](j)] += matrix.get_value()->operator[](j) * vector1[i];
    }


That being said, for a given element of the result vector, the order in which the various products are added to form that element will not be the same in the serial and parallel cases. Rounding errors therefore accumulate differently: one should not expect the serial and parallel results to be bit-for-bit identical, only that their difference is small relative to the precision of the float or double format.
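
If reproducibility for a fixed thread count is the goal, an alternative to per-element atomics is to give every thread a private copy of the output and let OpenMP sum the copies when the loop finishes. The sketch below is not part of the original code; it assumes a compiler with OpenMP 4.5 array-section reductions and uses invented names (row_ptr, col_idx, values) for the CSR arrays. The per-thread partial sums are still combined in a different order than the serial loop, so the result matches the serial one only up to rounding, but it no longer depends on how the threads interleave.

#include <cstddef>
#include <vector>

// Hypothetical free function: result += A^T * vector1 with A in CSR form.
// row_ptr, col_idx and values are assumed names, not the question's API.
void transpose_spmv_reduced(const std::vector<std::size_t>& row_ptr,
                            const std::vector<std::size_t>& col_idx,
                            const std::vector<double>& values,
                            const std::vector<double>& vector1,
                            std::vector<double>& result)
{
    double* r = result.data();
    const std::size_t n = result.size();
    const std::size_t nrows = row_ptr.size() - 1;

    // reduction(+ : r[:n]) gives each thread a zero-initialised private
    // copy of r[0..n); the copies are added into r at the end of the loop,
    // so no atomics are needed inside it.
    #pragma omp parallel for reduction(+ : r[:n])
    for (std::size_t i = 0; i < nrows; ++i)
    {
        for (std::size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        {
            r[col_idx[j]] += values[j] * vector1[i];
        }
    }
}

Each private copy costs n doubles per thread, so this trades memory for speed and pays off when the output vector is small compared to the work done per thread.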






  • You should probably point out that there will be very little (possibly negative) speed increase from parallelisation when the entire body of the loop is in an atomic block.

    – Toby Speight
    Nov 22 '18 at 18:21

  • It's probably more efficient to make the elements of result be some suitable std::atomic<> type instead (one possible shape of that is sketched after these comments). If not, at least bring the computation of lhs and rhs ahead of the atomic section, so only the += is affected.

    – Toby Speight
    Nov 22 '18 at 18:32

  • I have already stated that since atomic increment is slower than standard increment, it may make the parallel code slower than the serial one. Then I do not know for sure in which direction the balance between one fast op at a time and several slow ones in parallel will point. This depends on too many factors and there may still be a small performance improvement.

    – Brice
    Nov 23 '18 at 8:42

  • I have tried the atomic pragma and it worked fine. But as stated, the performance decreased. I am totally fine with that; at least I am getting reproducible results.

    – vydesaster
    Nov 23 '18 at 19:20
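
For concreteness, here is one shape the std::atomic<> suggestion from the comments could take. This is only a sketch with invented names, not code from the question or the answer: before C++20, std::atomic<double> has no fetch_add, so a compare-exchange loop stands in for the atomic +=.

#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical helper: atomically add 'term' to one output element.
// Pre-C++20 there is no fetch_add for floating-point atomics, so loop on
// compare_exchange_weak until the update goes through.
inline void atomic_add(std::atomic<double>& target, double term)
{
    double old = target.load();
    while (!target.compare_exchange_weak(old, old + term))
    {
        // 'old' now holds the value another thread stored; retry with it.
    }
}

int main()
{
    const std::size_t n = 4;

    // std::atomic is not copyable, and default construction is only
    // guaranteed to initialise the value from C++20 on, so store zeroes
    // explicitly.
    std::vector<std::atomic<double>> result(n);
    for (auto& x : result)
        x.store(0.0);

    // Inside the parallel loop one would then write something like:
    //     atomic_add(result[cols[j]], values[j] * vector1[i]);
}

As with #pragma omp atomic, this fixes the race but not the summation order, so results can still differ from the serial run by rounding.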
Without a reproducible example, I've had to guess at the context of this code:



#include <vector>

class SparseMatrix
{
    std::vector<double> values = { 5, 8, 3, 6, };
    std::vector<std::size_t> rows = { 0, 0, 2, 3, 4, };
    std::vector<std::size_t> cols = { 1, 2, 1, 1, };

public:
    const std::vector<std::size_t> *get_rowptr() const { return &rows; }
    const std::vector<std::size_t> *get_columnindex() const { return &cols; }
    const std::vector<double> *get_value() const { return &values; }
};

#include <array>
#include <iostream>
int main()
{
    SparseMatrix matrix;

    std::array<double, 4> result{};
    std::array<double, 4> vector1{ 1, 2, 3, 4 };

    #pragma omp parallel for
    for (int i = 0; i < matrix.get_rowptr()->size() - 1; ++i)
    {
        for (int j = matrix.get_rowptr()->operator[](i); j < matrix.get_rowptr()->operator[](i + 1); ++j)
        {
            result[matrix.get_columnindex()->operator[](j)] += matrix.get_value()->operator[](j) * vector1[i];
        }
    }

    for (auto const& i: result)
        std::cout << i << " ";
    std::cout << '\n';
}


With some suitable variables we can simplify the code so we see what's going on:



auto const& rows = *matrix.get_rowptr();
auto const& cols = *matrix.get_columnindex();
auto const& values = *matrix.get_value();

#pragma omp parallel for
for (std::size_t i = 0; i < rows.size() - 1; ++i)
{
    for (std::size_t j = rows[i]; j < rows[i+1]; ++j)
    {
        result[cols[j]] += values[j] * vector1[i];
    }
}


Now we can see in the body of the loop that we're assigning to a result entry that other threads may also be writing to. We need to serialize access to result[cols[j]] so that only one thread at a time executes the +=. We can do that by marking the operation as indivisible, using the OpenMP atomic directive:



#pragma omp parallel for
for (std::size_t i = 0; i < rows.size() - 1; ++i)
{
    for (std::size_t j = rows[i]; j < rows[i+1]; ++j)
    {
        auto& dest = result[cols[j]];
        auto const term = values[j] * vector1[i];
        #pragma omp atomic
        dest += term;
    }
}
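
The atomic makes the code race-free, but the order in which products reach each result entry still varies with the thread count, and the atomic sits on the hot path. A different approach, not shown in this answer, is to build the transpose once up front; each thread then owns a disjoint range of the output, needs no synchronization, and adds its products in the same order as the serial loop, so the result does not depend on the number of threads. A sketch under assumed names (CsrMatrix, transpose and transpose_spmv are inventions for illustration, not the question's classes):

#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::vector<std::size_t> row_ptr;   // size nrows + 1
    std::vector<std::size_t> col_idx;   // size nnz
    std::vector<double>      values;    // size nnz
};

// Serial one-off: CSR of A^T (i.e. CSC of A), entries kept in row order of A.
CsrMatrix transpose(const CsrMatrix& a, std::size_t ncols)
{
    CsrMatrix t;
    const std::size_t nnz = a.values.size();
    t.row_ptr.assign(ncols + 1, 0);
    t.col_idx.resize(nnz);
    t.values.resize(nnz);

    // Count the entries in each column of A, then prefix-sum into offsets.
    for (std::size_t k = 0; k < nnz; ++k)
        ++t.row_ptr[a.col_idx[k] + 1];
    for (std::size_t c = 0; c < ncols; ++c)
        t.row_ptr[c + 1] += t.row_ptr[c];

    // Scatter the values into the transposed layout.
    std::vector<std::size_t> next(t.row_ptr.begin(), t.row_ptr.end() - 1);
    for (std::size_t i = 0; i + 1 < a.row_ptr.size(); ++i)
        for (std::size_t j = a.row_ptr[i]; j < a.row_ptr[i + 1]; ++j)
        {
            const std::size_t pos = next[a.col_idx[j]]++;
            t.col_idx[pos] = i;          // the row in A becomes the column in A^T
            t.values[pos]  = a.values[j];
        }
    return t;
}

// y = A^T * x; y must have ncols elements and is overwritten.
void transpose_spmv(const CsrMatrix& a, std::size_t ncols,
                    const std::vector<double>& x, std::vector<double>& y)
{
    const CsrMatrix t = transpose(a, ncols);

    // Every thread writes a disjoint range of y, so no atomics are needed
    // and the result is the same for any thread count.
    #pragma omp parallel for
    for (std::size_t r = 0; r < ncols; ++r)
    {
        double sum = 0.0;
        for (std::size_t j = t.row_ptr[r]; j < t.row_ptr[r + 1]; ++j)
            sum += t.values[j] * x[t.col_idx[j]];
        y[r] = sum;
    }
}

Building the transpose is an O(nnz) serial pass; if the same matrix is multiplied many times, that cost is paid once and amortises quickly.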





  • Thank you very much, Toby. You nailed it. Sorry I was not able to post the code in the last days.

    – vydesaster
    Dec 4 '18 at 20:13











