What is the meaning of "non-temporal" memory accesses in x86?











This is a somewhat low-level question. In x86 assembly there are two SSE instructions:

MOVDQA xmmi, m128

and

MOVNTDQA xmmi, m128

The IA-32 Software Developer's Manual says that the NT in MOVNTDQA stands for Non-Temporal, and that otherwise it's the same as MOVDQA.

My question is, what does Non-Temporal mean?










x86 sse assembly






asked Aug 31 '08 at 20:18 by Nathan Fellman, edited May 2 '12 at 11:10 by kristianp








  • 4




    Note that SSE4.1 MOVNTDQA xmmi, m128 is an NT load, while all the other NT instructions are stores, except for prefetchnta. The accepted answer here only seems to be talking about stores. This is what I've been able to turn up about NT loads. TL:DR: hopefully the CPU does something useful with the NT hint to minimize cache pollution, but they don't override the strongly-ordered semantics of "normal" WB memory, so they do have to use the cache.
    – Peter Cordes
    Nov 23 '16 at 7:25






  • 3




    Update: NT loads may not do anything useful except on UCSW memory regions on most CPUs (e.g. Intel SnB family). NT/streaming stores definitely work on normal memory, though.
    – Peter Cordes
    Jul 8 '17 at 1:18






  • 1




    @Peter: You mean USWC memory right? I've never heard of UCSW or USWC memory before. Googling the wrong acronym wasn't helpful :-)
    – Andrew Bainbridge
    Aug 15 '17 at 11:29














  • 4




    Note that SSE4.1 MOVNTDQA xmmi, m128 is an NT load, while all the other NT instructions are stores, except for prefetchnta. The accepted answer here only seems to be talking about stores. This is what I've been able to turn up about NT loads. TL:DR: hopefully the CPU does something useful with the NT hint to minimize cache pollution, but they don't override the strongly-ordered semantics of "normal" WB memory, so they do have to use the cache.
    – Peter Cordes
    Nov 23 '16 at 7:25






  • 3




    Update: NT loads may not do anything useful except on UCSW memory regions on most CPUs (e.g. Intel SnB family). NT/streaming stores definitely work on normal memory, though.
    – Peter Cordes
    Jul 8 '17 at 1:18






  • 1




    @Peter: You mean USWC memory right? I've never heard of UCSW or USWC memory before. Googling the wrong acronym wasn't helpful :-)
    – Andrew Bainbridge
    Aug 15 '17 at 11:29








3 Answers
Accepted answer (115 votes), answered Aug 31 '08 at 20:50 by Espo:
Non-Temporal SSE instructions (MOVNTI, MOVNTQ, etc.) don't follow the normal cache-coherency rules. Therefore non-temporal stores must be followed by an SFENCE instruction in order for their results to be seen by other processors in a timely fashion.

When data is produced and not (immediately) consumed again, the fact that memory store operations read a full cache line first and then modify the cached data is detrimental to performance. This operation pushes data out of the caches which might be needed again in favor of data which will not be used soon. This is especially true for large data structures, like matrices, which are filled and then used later. Before the last element of the matrix is filled, the sheer size evicts the first elements, making caching of the writes ineffective.

For this and similar situations, processors provide support for non-temporal write operations. Non-temporal in this context means the data will not be reused soon, so there is no reason to cache it. These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory.

Source: http://lwn.net/Articles/255364/
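To make the SFENCE point concrete, here is a minimal sketch (mine, not part of the answer above): a buffer is filled through the SSE2 intrinsic for MOVNTDQ, and the run of non-temporal stores is closed with SFENCE so the weakly-ordered writes become visible to other cores before anything published afterwards. The function name and the alignment/size assumptions are illustrative only.

    #include <emmintrin.h>   /* SSE2: _mm_set1_epi8, _mm_stream_si128 */
    #include <xmmintrin.h>   /* SSE:  _mm_sfence */
    #include <stddef.h>

    /* Assumes dst is 16-byte aligned and bytes is a multiple of 16. */
    void fill_streaming(void *dst, unsigned char value, size_t bytes)
    {
        __m128i v = _mm_set1_epi8((char)value);
        __m128i *p = (__m128i *)dst;

        for (size_t i = 0; i < bytes / 16; ++i)
            _mm_stream_si128(p + i, v);   /* MOVNTDQ: write-combining store straight to memory, no cache-line fill */

        _mm_sfence();   /* order the NT stores before later stores, e.g. a "buffer ready" flag */
    }

A typical use is a buffer that will next be consumed by another core or by a device, not re-read by the thread that filled it.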






  • 11




    Nice answer, I would just like to point out that on the kind of processor with NT instructions, even with a non-non-temporal instruction (i.e. a normal instruction), the cache line is not "read and then modified". For a normal instruction writing to a line that is not in the cache, a line is reserved in the cache and a mask indicates what parts of the line are up-to-date. This webpage calls it "no stall on store": ptlsim.org/Documentation/html/node30.html . I couldn't find more precise references; I only heard about this from guys whose job is to implement processor simulators.
    – Pascal Cuoq
    May 4 '10 at 20:03






  • 2




    Actually ptlsim.org is a web site about a cycle-accurate processor simulator, exactly the same kind of thing the guys who told me about "no stall on store" are doing. I'd better mention them too in case they ever see this comment: unisim.org
    – Pascal Cuoq
    May 4 '10 at 20:06










  • From the answers and comments here stackoverflow.com/questions/44864033/… it seems SFENCE may not be needed, at least in the same thread. Could you also look?
    – Serge Rogatch
    Jul 2 '17 at 10:32










  • @SergeRogatch it depends on what scenario you are talking about, but yes there are scenarios where sfence is required for NT stores, whereas it is never required just for normal stores. NT stores aren't ordered with respect to other stores (NT or not), as seen by other threads, without an sfence. For reads from the same thread that did the stores, however, you never need sfence: a given thread will always see its own stores in program order, regardless of whether they are NT stores or not.
    – BeeOnRope
    Nov 8 at 18:59


















Answer (32 votes), answered Sep 1 '08 at 16:03 by Pramod:
Espo is pretty much bang on target. Just wanted to add my two cents:

The "non-temporal" phrase means lacking temporal locality. Caches exploit two kinds of locality, spatial and temporal, and by using a non-temporal instruction you're signaling to the processor that you don't expect the data item to be used in the near future.

I am a little skeptical about hand-coded assembly that uses the cache-control instructions. In my experience these things lead to more evil bugs than any effective performance increases.
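On the hand-coded-assembly concern: the cache-control instructions are also reachable from plain C through compiler intrinsics, which keeps register allocation and addressing in the compiler's hands. Below is a hedged sketch of that middle ground, where an output array is written once and, by assumption, not read again soon, so it is streamed past the cache with MOVNTPS. The function name, the 16-byte alignment, and n being a multiple of 4 are assumptions of the sketch, not anything stated in the answer.

    #include <xmmintrin.h>   /* SSE: _mm_set1_ps, _mm_load_ps, _mm_mul_ps, _mm_stream_ps, _mm_sfence */
    #include <stddef.h>

    void scale_streaming(float *out, const float *in, float factor, size_t n)
    {
        __m128 f = _mm_set1_ps(factor);
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(in + i);            /* normal (cached) aligned load */
            _mm_stream_ps(out + i, _mm_mul_ps(v, f));  /* MOVNTPS: non-temporal store */
        }
        _mm_sfence();   /* make the streamed results visible to other cores in order */
    }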






  • Question about "hand-coded assembly that uses the cache control instructions": I know you explicitly said "hand-coded", but what about something like a JavaVM? Is this a better use case? The JavaVM/compiler has analyzed the static and dynamic behavior of the program and uses these non-temporal instructions.
    – Pat
    Dec 1 '15 at 18:21






  • 1




    Exploiting known locality properties (or lack thereof) of your problem domain, algorithm or application shouldn't be shunned. Avoiding cache pollution is indeed a very attractive and effective optimisation task. Also, why the aversion toward assembly? There are vast amounts of opportunities for gains available which a compiler cannot possibly capitalise on
    – awdz9nld
    Dec 21 '15 at 23:44








  • 3




    It's definitely true that a knowledgeable low-level programmer can outperform a compiler for small kernels. This is great for publishing papers and blog posts, and I've done both. They're also good didactic tools, and help understand what's "really" going on. In my experience though, in practice, where you have a real system with many programmers working on it and correctness and maintainability are important, the benefit of low-level coding is almost always outweighed by the risks.
    – Pramod
    Dec 22 '15 at 22:23








  • 1




    @Pramod that same argument easily generalises to optimisation in general and is not really in scope of the discussion -- clearly that trade-off has already been considered or otherwise been deemed irrelevant given the fact that we are already talking about non-temporal instructions
    – awdz9nld
    Oct 31 '16 at 13:59




















Answer (2 votes), answered Nov 7 at 23:59 by chus, edited Nov 8 at 23:12:
According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture,
"Programming with Intel Streaming SIMD Extensions (Intel SSE)" chapter:



Caching of Temporal vs. Non-Temporal Data




Data referenced by a program can be temporal (data will be used again) or non-temporal (data will be referenced once and not reused in the immediate future). For example, program code is generally temporal, whereas, multimedia data, such as the display list in a 3-D graphics application, is often non-temporal. To make efficient use of the processor’s caches, it is generally desirable to cache temporal data and not cache non-temporal data. Overloading the processor’s caches with non-temporal data is sometimes referred to as "polluting the caches". The SSE and SSE2 cacheability control instructions enable a program to write non-temporal data to memory in a manner that minimizes pollution of caches.




Description of non-temporal load and store instructions.
Source: Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2: Instruction Set Reference



LOAD (MOVNTDQA—Load Double Quadword Non-Temporal Aligned Hint)




Loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type [...]



[...] the processor does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy.




Note that, as Peter Cordes comments, the NT load hint is not useful on normal WB (write-back) memory on current processors: the hint is ignored (probably because there are no NT-aware HW prefetchers) and the full strongly-ordered load semantics apply. prefetchnta can be used instead as a pollution-reducing load from WB memory.
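A minimal sketch of that prefetchnta alternative for a read-once pass over ordinary WB memory (the function name and the prefetch distance are illustrative choices, not from the answer or the manual):

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_once(const uint32_t *data, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; ++i) {
            /* prefetchnta a few cache lines ahead; on Intel CPUs this typically
             * bypasses L2 and limits how much of L3 can be polluted, rather than
             * skipping the cache entirely. */
            if (i + 64 < n)
                _mm_prefetch((const char *)&data[i + 64], _MM_HINT_NTA);
            sum += data[i];
        }
        return sum;
    }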



STORE (MOVNTDQ—Store Packed Integers Using Non-Temporal Hint)




Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory.



[...] the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy.




Using the terminology defined in Cache Write Policies and Performance, they can be considered write-around (no-write-allocate, no-fetch-on-write-miss).

Finally, it may be interesting to review John McCalpin's notes about non-temporal stores.
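For completeness, MOVNTDQA itself is exposed in C as the SSE4.1 intrinsic _mm_stream_load_si128. A hedged sketch follows; it assumes src points at a WC-mapped region (for example video memory mapped by a driver; obtaining that mapping is outside the sketch), 16-byte alignment, and compilation with SSE4.1 enabled (e.g. -msse4.1). The names are illustrative.

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
    #include <emmintrin.h>   /* SSE2:   _mm_store_si128 */
    #include <stddef.h>

    /* Copy chunks16 16-byte blocks out of a WC-mapped source into ordinary memory. */
    void copy_from_wc(void *dst, const void *src, size_t chunks16)
    {
        __m128i *d = (__m128i *)dst;
        __m128i *s = (__m128i *)src;   /* cast away const: older headers take __m128i* */

        for (size_t i = 0; i < chunks16; ++i) {
            __m128i x = _mm_stream_load_si128(s + i);  /* MOVNTDQA load from WC memory */
            _mm_store_si128(d + i, x);                 /* normal store into WB memory */
        }
    }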






  • 1




    SSE4.1 MOVNTDQA only does anything special on WC (uncacheable Write-Combining) memory regions, e.g. video RAM. It's not at all useful on normal WB (write-back) memory on current HW, the NT hint is ignored and the full strongly-ordered load semantics apply. prefetchnta can be useful, though, as a pollution-reducing load from WB memory. Do current x86 architectures support non-temporal loads (from "normal" memory)?.
    – Peter Cordes
    Nov 8 at 8:11






  • 1




    That's correct, NT stores work fine on WB memory, and are weakly-ordered, and are usually a good choice for writing large regions of memory. But NT loads aren't. The x86 manual on paper allows for the NT hint to do something for loads from WB memory, but in current CPUs it does nothing. (Probably because there are no NT-aware HW prefetchers.)
    – Peter Cordes
    Nov 8 at 11:02










  • I have added that relevant info to the answer. Thank you very much.
    – chus
    Nov 8 at 11:26










