Is there a limit on the number of hugepage entries that can be stored in the TLB
I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor to have several 1G hugepages (36) by changing the GRUB command line and rebooting, and when launching the VMs I made sure the hugepages were passed through to them. After launching 8 VMs (each with two 1G hugepages) and running network throughput tests between them, I found that the throughput was drastically lower than when running without hugepages. That got me wondering whether it had something to do with the number of hugepages I was using. Is there a limit on the number of 1G hugepages that can be referenced through the TLB, and if so, is it lower than the limit for regular-sized pages? How can I find this information? In this scenario I was using an Ivy Bridge system, and the cpuid command showed something like:
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xf0: 64 byte prefetching
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
Does this mean I can have only 4 1G hugepage mappings in the TLB at any given time?
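For reference, this is roughly how I set up and verified the 1G hugepages on the host (the exact paths and counts are from my setup; adjust as needed):
# kernel boot parameters added to the GRUB command line (then reboot)
default_hugepagesz=1G hugepagesz=1G hugepages=36
# verify the 1G hugepage pool after boot
grep Huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# check the 1G data TLB size reported by the CPU
cpuid | grep -i '1G pages'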
cpu cpu-architecture tlb huge-pages
asked Nov 7 at 20:21
Sai Malleni
Welcome to Stack Overflow. While your question is set within the scenario of virtualization and involves different CPUs, it is substantially answered by this question: stackoverflow.com/questions/40649655/…. Effectively, yes, the processor's TLB has dedicated space for the different types of entries, with very limited space for huge pages.
– Brian
Nov 7 at 20:29
Yes, you've found a way to create very poor hugepage locality. Most workloads that do a lot of kernel access to memory have more accesses within the same 1G hugepage. (User-space memory on Linux usually uses 2M hugepages, when it uses anonymous hugepages at all.) In Haswell, for example, 2M and 4K TLB entries can go into the 2nd-level TLB victim cache, but apparently 1G entries can't, if 7-cpu.com/cpu/Haswell.html is fully accurate.
– Peter Cordes
Nov 8 at 8:57
1 Answer
Yes, of course. Having no upper limit on the number of TLB entries would require an unbounded amount of physical space in the CPU die.
Every TLB in every architecture has an upper limit on the number of entries it can hold.
For 1 GiB pages on x86 this number is smaller than you probably expected: it is 4.
It was 4 in your Ivy Bridge and it is still 4 in my Kaby Lake, four generations later.
It's worth noting that 4 entries cover 4 GiB of RAM (4 x 1 GiB), which seems enough to handle networking if used properly.
Finally, TLBs are per-core resources: each core has its own set of TLBs.
If you disable SMT (e.g. Intel Hyper-Threading) or assign both threads on a core to the same VM, the VMs won't be competing for TLB entries.
However, each VM can have at most 4*C hugepage entries cached, where C is the number of cores dedicated to that VM.
How fully a VM can exploit these entries depends on how the host OS, the hypervisor and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicated TLB entries in each core).
It's hard (almost impossible?) to use 1 GiB pages transparently; I'm not sure how the hypervisor and the VM are going to use those pages. I'd say you need specific support for that, but I'm not sure.
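For example, with QEMU one common approach is to back guest RAM with a hugetlbfs mount. This is only a sketch (the mount point is made up, and the exact options depend on the QEMU version and how guest memory is defined):
# host: mount a 1 GiB hugetlbfs and launch the guest with its memory backed by it
mount -t hugetlbfs -o pagesize=1G none /mnt/huge1G
qemu-system-x86_64 -m 2048 -mem-prealloc -mem-path /mnt/huge1G ...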
As Peter Cordes noted, 1 GiB pages use a single-level TLB (though in Skylake there is apparently also a second-level TLB with 16 entries for 1 GiB pages).
A miss in the 1 GiB TLB will result in a page walk, so it's very important that all the software involved uses page-aware code.
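If you want to confirm that 1 GiB dTLB misses (and the resulting page walks) are actually what is hurting you, perf can count them. This is a sketch; the exact event names vary by microarchitecture and kernel version, so check perf list first:
# find the TLB-related events your CPU exposes
perf list | grep -i tlb
# count dTLB misses that trigger a page walk, system-wide, for 10 seconds
perf stat -a -e dtlb_load_misses.miss_causes_a_walk,dtlb_store_misses.miss_causes_a_walk sleep 10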
edited Nov 8 at 22:21
BeeOnRope
answered Nov 8 at 9:57
Margaret Bloom
Worth mentioning that at least according to 7-cpu.com/cpu/Haswell.html, the 2nd level TLB victim cache doesn't hold 1G TLB entries in Haswell, so if you have misses they have to come from the page-walker. But Skylake has a 16-entry 2nd-level TLB for 1G pages to back up the 4-entry 1st level TLB. 7-cpu.com/cpu/Skylake.html.
– Peter Cordes
Nov 8 at 11:08
Thanks @PeterCordes, that's nice to know and have in the answer.
– Margaret Bloom
Nov 8 at 11:19