Cache replacement policy
Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?
- FIFO
- LRU
- Random
It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.
caching cpu-architecture fifo cpu-cache lru
add a comment |
Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?
- FIFO
- LRU
- Random
It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.
caching cpu-architecture fifo cpu-cache lru
add a comment |
Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?
- FIFO
- LRU
- Random
It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.
caching cpu-architecture fifo cpu-cache lru
Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?
- FIFO
- LRU
- Random
It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.
caching cpu-architecture fifo cpu-cache lru
caching cpu-architecture fifo cpu-cache lru
edited Nov 21 '18 at 5:16
Peter Cordes
131k18199336
131k18199336
asked Nov 21 '18 at 4:37
learnerlearner
93
93
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.
New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.
For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.
For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.
To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta
by restricting it to populating only 1 or 2 ways of any set in L3 cache.)
Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.
It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.
I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.
Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?
Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.
(I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53405340%2fcache-replacement-policy%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.
New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.
For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.
For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.
To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta
by restricting it to populating only 1 or 2 ways of any set in L3 cache.)
Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.
It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.
I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.
Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?
Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.
(I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)
add a comment |
Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.
New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.
For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.
For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.
To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta
by restricting it to populating only 1 or 2 ways of any set in L3 cache.)
Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.
It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.
I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.
Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?
Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.
(I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)
add a comment |
Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.
New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.
For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.
For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.
To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta
by restricting it to populating only 1 or 2 ways of any set in L3 cache.)
Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.
It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.
I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.
Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?
Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.
(I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)
Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.
New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.
For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.
For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.
To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta
by restricting it to populating only 1 or 2 ways of any set in L3 cache.)
Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.
It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.
I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.
Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?
Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.
(I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)
edited Nov 21 '18 at 5:41
answered Nov 21 '18 at 5:35
Peter CordesPeter Cordes
131k18199336
131k18199336
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53405340%2fcache-replacement-policy%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown