Cache replacement policy












1















Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?




  1. FIFO

  2. LRU

  3. Random


It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.










share|improve this question





























    1















    Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?




    1. FIFO

    2. LRU

    3. Random


    It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.










    share|improve this question



























      1












      1








      1








      Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?




      1. FIFO

      2. LRU

      3. Random


      It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.










      share|improve this question
















      Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?




      1. FIFO

      2. LRU

      3. Random


      It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.







      caching cpu-architecture fifo cpu-cache lru






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 21 '18 at 5:16









      Peter Cordes

      131k18199336




      131k18199336










      asked Nov 21 '18 at 4:37









      learnerlearner

      93




      93
























          1 Answer
          1






          active

          oldest

          votes


















          1














          Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.



          New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.



          For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.



          For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.





          To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta by restricting it to populating only 1 or 2 ways of any set in L3 cache.)





          Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.



          It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
          But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.



          I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.



          Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?



          Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.



          (I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)






          share|improve this answer

























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53405340%2fcache-replacement-policy%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.



            New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.



            For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.



            For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.





            To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta by restricting it to populating only 1 or 2 ways of any set in L3 cache.)





            Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.



            It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
            But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.



            I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.



            Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?



            Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.



            (I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)






            share|improve this answer






























              1














              Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.



              New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.



              For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.



              For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.





              To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta by restricting it to populating only 1 or 2 ways of any set in L3 cache.)





              Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.



              It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
              But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.



              I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.



              Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?



              Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.



              (I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)






              share|improve this answer




























                1












                1








                1







                Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.



                New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.



                For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.



                For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.





                To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta by restricting it to populating only 1 or 2 ways of any set in L3 cache.)





                Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.



                It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
                But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.



                I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.



                Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?



                Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.



                (I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)






                share|improve this answer















                Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.



                New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.



                For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.



                For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.





                To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta by restricting it to populating only 1 or 2 ways of any set in L3 cache.)





                Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.



                It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result.
                But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.



                I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.



                Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?



                Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.



                (I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Nov 21 '18 at 5:41

























                answered Nov 21 '18 at 5:35









                Peter CordesPeter Cordes

                131k18199336




                131k18199336
































                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53405340%2fcache-replacement-policy%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Guess what letter conforming each word

                    Run scheduled task as local user group (not BUILTIN)

                    Port of Spain