Trends in Memory Systems

[Editor's note: this is a talk by Professor Bruce Jacob on trends in memory systems. Bruce is the author of the book Memory Systems: Cache, DRAM, Disk. The memory subsystem today is extremely complex and is where the bottleneck lies, which is why Professor Jacob's book manages to run to about a thousand pages.]

Trends in Memory Systems

Here’s a talk I just gave today to people from Sun, DARPA, Oracle, nVIDIA, and academia. Since the slides are mostly illustration and not necessarily self-explanatory, here’s a shortened version of the narrative [one paragraph per slide].

Hey there — I was invited to talk to you guys today about memory systems in general, and since it is a diverse group [spanned everything from process technologists to applications developers], I figured I’d go high-level and tell you about what I see are the important open problems and some potential solutions.

Many people in the audience (including the web audience checking the slides) are perhaps unfamiliar with my group. We study the memory system exclusively and have for the past twelve years, so we’ve given the matter a bit of thought … here are the PhD theses that have come out of our group in that time. [they all investigate various aspects of the memory system]

About five years ago, one of my students, Ankush Varma, now at Intel, said the following quote. I thought it was pretty insightful, so I wrote it on the whiteboard in my office and still haven’t erased it. Probably won’t come off now … Anyway, what he meant was that technology trickles down from high performance to consumer-level devices, and today we see things like pipelines, caches, branch prediction, VLIW, and even superscalar in low-cost embedded processors. These technologies all started out as very high-end mechanisms. However, the flip side is that the issues we face trickle upwards. The embedded domain has had to deal with resource constraints (energy use, heat extraction, reliability, cost, physical size, etc.) since its inception, and it’s only recently that these issues have arisen in the high end.

This more or less says what I just said.

One of the primary issues facing us is the capacity problem. The bottom graph shows the improvement in DRAM signaling rates over time. The top graph shows the price we paid for these improvements. When SDRAM was introduced in the mid 1990s you could put 8 DIMMs in a memory channel. When DDR was introduced, it doubled the data rate, and you could put 4 DIMMs in a channel. It is hard to drive a wide bus fast, especially when it is multi-drop, and the easiest way to address the signal integrity issues is to reduce the number of drops on the bus: 8 DIMMs goes to 4. When DDR2 came out, it doubled the signaling rate again, and the number of DIMMs per channel was cut to 2. When DDR3 was first discussed at JEDEC, they seriously considered reducing that to 1, yes 1, DIMM per channel. I think the screaming could be heard in outer space, so they are currently working on reaching the higher speed grades while still retaining 2 DIMMs per channel, a big reason why you haven't seen DDR3-1600 appear yet. The net result is that channel capacity has remained relatively flat over a long period of time (the blue line).
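
To make the arithmetic behind that flat blue line concrete, here is a rough C sketch. The data rates are nominal per-generation figures, and the DIMM densities are arbitrary relative units (each generation roughly doubles density), so the numbers are illustrative rather than exact.

    /* Back-of-envelope model of the flat channel-capacity line: each generation
     * roughly doubles the per-pin data rate and the per-DIMM density, but halves
     * the number of DIMMs per channel to preserve signal integrity, so capacity
     * per channel barely moves.  Data rates are nominal per-generation figures;
     * densities are arbitrary relative units. */
    #include <stdio.h>

    int main(void) {
        struct { const char *gen; int mts; int dimms; int density; } g[] = {
            { "SDRAM",  133, 8, 1 },
            { "DDR",    266, 4, 2 },
            { "DDR2",   533, 2, 4 },
            { "DDR3",  1066, 2, 8 },  /* JEDEC held the line at 2 DIMMs per channel */
        };
        for (int i = 0; i < 4; i++)
            printf("%-5s %5d MT/s  %d DIMMs/channel  channel capacity = %2d units\n",
                   g[i].gen, g[i].mts, g[i].dimms, g[i].dimms * g[i].density);
        return 0;
    }

Doubling density while halving the number of DIMMs per channel cancels out, which is why the capacity line only moves when JEDEC holds the DIMM count steady, as with DDR3.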

The systems guys have been tearing their hair out over that … for instance, in my $2000 desktop system, I would like roughly $600 or $700 of that to go into DRAM. That would buy me several hundred GB of DRAM at today’s prices, yet there is not a desktop on the planet (that I know of) that will allow me to put several hundred GB of DRAM into it. I can afford it, but I can’t shove it into the box. Intel tried to solve the problem by introducing Fully Buffered DIMM, a system architecture that puts a high-speed ASIC onto each DIMM to form the channel. The memory controller talks to the high-speed ASIC, and it talks to the next one down the channel … a daisy chain. The links are narrow and fast (six times faster signaling rate than the DRAM devices), so you can put three times the number of channels into a system, given the same pin/trace count. Since the channels are daisy-chained (not multi-drop), you can have more DIMMs per channel (arbitrarily limited to 8). Result: nearly an order of magnitude more DRAM per system. Problem: the high-speed ASICs have SERDES that burn a few Watts of power even if you’re not reading or writing to the DIMM, and so system-level power went up an order of magnitude. RIP fully buffered DIMM.
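
A rough sketch of the FB-DIMM trade-off, in C, using the figures from the paragraph above: narrow, fast links give about three times the channels for the same pin count, daisy-chaining allows 8 DIMMs per channel, and each DIMM's AMB burns a few watts even when idle. The 2-channel, 2-DIMM-per-channel DDR2 baseline and the 4 W idle figure are assumptions for illustration.

    /* Rough FB-DIMM vs. DDR2 comparison; baseline system and idle-power figure
     * are illustrative assumptions, the ratios come from the text. */
    #include <stdio.h>

    int main(void) {
        const int    ddr2_channels     = 2;
        const int    ddr2_dimms_per_ch = 2;
        const int    fbd_channels      = ddr2_channels * 3;  /* ~3x for same pins */
        const int    fbd_dimms_per_ch  = 8;                  /* daisy-chain limit */
        const double amb_idle_watts    = 4.0;                /* "a few Watts" each */

        const int ddr2_dimms = ddr2_channels * ddr2_dimms_per_ch;
        const int fbd_dimms  = fbd_channels * fbd_dimms_per_ch;

        printf("DDR2:    %2d DIMMs per system\n", ddr2_dimms);
        printf("FB-DIMM: %2d DIMMs per system (~%dx the capacity), plus ~%.0f W of\n"
               "         idle power just for the AMBs\n",
               fbd_dimms, fbd_dimms / ddr2_dimms, fbd_dimms * amb_idle_watts);
        return 0;
    }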

So capacity is ultimately a bandwidth problem … if we could have cheap bandwidth and high-density packaging, we would not have to resort to narrow/fast (and therefore hot) interfaces to the DRAM to get capacity. At the same time, bandwidth is ultimately a power/heat problem, and we do need bandwidth. Numerous studies have shown the rule of thumb is that you need about a GB/s of bandwidth per core … and the industry is relentlessly moving toward increasing numbers of cores on chip. Open problem, must be solved.
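
Here is a small C sketch of the roughly 1 GB/s-per-core rule of thumb. The 8.5 GB/s channel figure is 8 bytes × 1066 MT/s for a 64-bit DDR3-1066 channel; treating that peak as fully usable is an optimistic simplification.

    /* Rule of thumb from the talk: ~1 GB/s of memory bandwidth per core.
     * Channel peak bandwidth below is an assumed 64-bit DDR3-1066 channel. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double gb_per_core    = 1.0;   /* rule of thumb */
        const double gb_per_channel = 8.5;   /* 8 bytes * 1066 MT/s, peak */

        for (int cores = 4; cores <= 64; cores *= 2) {
            double need = cores * gb_per_core;
            printf("%2d cores -> ~%5.1f GB/s -> at least %d channels\n",
                   cores, need, (int)ceil(need / gb_per_channel));
        }
        return 0;
    }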

Another problem: TLB reach doesn’t scale at all (translation lookaside buffer — it’s a little cache inside the CPU that holds translation information, entries from the page table). The way we implement virtual memory, the TLB has to have mapping information in it before we can access data. Even if we have the data in our cache, if the mapping info is not in the TLB, we can’t get at the data. Have to go to the page table first, load the TLB with the info, and then access the cache. Problem: last-level caches are typically on the order of 10MB, while TLBs typically map on the order of 1MB. This is bad and helps to explain why modern machines spend roughly 20% of their time (and thus power) servicing TLB misses.
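
A quick C sketch of the TLB-reach arithmetic: reach is simply the number of TLB entries times the page size. The 256-entry TLB and 10 MB last-level cache below are illustrative values chosen to match the "order of 1 MB vs. order of 10 MB" figures above, not any specific CPU.

    /* TLB reach = entries * page size.  With 4 KB pages, a few hundred entries
     * map only about a megabyte, far less than a ~10 MB last-level cache, so
     * data can sit in the cache yet still cost a page-table walk. */
    #include <stdio.h>

    int main(void) {
        const long tlb_entries = 256;                /* illustrative */
        const long page_bytes  = 4L * 1024;          /* 4 KB pages */
        const long llc_bytes   = 10L * 1024 * 1024;  /* ~10 MB last-level cache */
        const long reach       = tlb_entries * page_bytes;

        printf("TLB reach: %ld KB, LLC: %ld KB -> the TLB maps only %.0f%% of the LLC\n",
               reach / 1024, llc_bytes / 1024, 100.0 * reach / llc_bytes);
        return 0;
    }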

A trend, since it will figure into solutions to our problems: flash is beating up on disk, and PCM is expected to beat up on flash when it arrives. Even if it gets a little hamstrung to reduce manufacturing costs or increase reliability, who cares — non-volatile solid-state memory is a great resource.

Some obvious conclusions based on what I’ve said so far. Well, obvious to me, and clearly speculative. First, we want significantly more main-memory capacity, and we don’t want to give up bandwidth. It’s clear that this can only happen with a system redesign; we need a new memory architecture. Here’s something our group proposed a few years back. Nothing novel; it harkens back to designs from the 1970s and 1980s. The interesting thing is that I see things like this already in production and in research and development right now. It’s pretty clear something like this (hierarchical memory control) will make it into the mainstream soon … the only questions are the details: speeds, widths, concurrency, packaging, signaling technology, etc.
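
For readers without the slide, here is a minimal C sketch of what "hierarchical memory control" means in this context: a master controller on the CPU side merely routes requests to a branch, and a subordinate controller near the DIMMs owns the device-level details. The bit fields and counts are hypothetical, chosen only to make the division of labor visible.

    /* Toy model of hierarchical memory control: the master controller routes a
     * request to a branch; the subordinate controller near the DIMMs handles
     * channel/device-level details.  Bit fields and counts are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    #define BRANCHES            4
    #define CHANNELS_PER_BRANCH 2

    /* Subordinate controller: owns scheduling for the channels on its branch. */
    static void sub_controller_access(int branch, uint64_t addr) {
        int channel = (int)((addr >> 6) & (CHANNELS_PER_BRANCH - 1)); /* 64 B interleave */
        printf("branch %d, channel %d services address 0x%llx\n",
               branch, channel, (unsigned long long)addr);
    }

    /* Master controller: picks a branch and forwards the request. */
    static void master_access(uint64_t addr) {
        int branch = (int)((addr >> 7) & (BRANCHES - 1));
        sub_controller_access(branch, addr);
    }

    int main(void) {
        for (uint64_t addr = 0; addr < 1024; addr += 192)
            master_access(addr);
        return 0;
    }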

Second, non-volatile memory (both flash and PCM) has better scaling characteristics than DRAM, which seems to be nearing its end of life as far as scaling goes. Thus we can expect increasing capacity given the same form factors, over time, from NV memory. Conclusion: let’s make non-volatile memory a first-class citizen in the memory hierarchy, like cache & DRAM. Pictured is a stylized memory hierarchy including cache, DRAM, and disk.

… and here, we’ve split main memory into two pieces: DRAM and FLASH (or PCM or whatever). Let’s have DIMMs made of non-volatile memory, and an NV controller sitting next to the DRAM controller. Access the NV memory via a load/store interface, not via the operating system’s file-system interface (which imposes a huge and unnecessary overhead). NV is denser than DRAM, so it addresses the capacity problem. DRAM acts as a write buffer for NV, addressing the wear-leveling/write-endurance issue. It is also likely that the capacity requirement for DRAM would lessen, given the extra NV memory.
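
A minimal sketch of the load/store interface being argued for, using POSIX mmap of an ordinary file as a stand-in for a directly addressable NV DIMM (a real NV DIMM would simply appear in the physical address space behind the NV controller). The point is that once mapped, access is plain loads and stores rather than per-access trips through the file-system read/write path.

    /* mmap an ordinary file as a stand-in for directly addressable NV memory;
     * after mapping, access is ordinary loads and stores, no read()/write(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("nv_region.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, 4096) != 0) return 1;      /* pretend 4 KB NV region */

        char *nv = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (nv == MAP_FAILED) return 1;

        strcpy(nv, "persistent record");  /* a plain store, no file-system call */
        printf("read back via a load: %s\n", nv);

        munmap(nv, 4096);
        close(fd);
        return 0;
    }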

Last conclusion. Let’s reduce the translation overhead of the TLB … this will require us to redesign the VM (virtual memory) part of the operating system and will require a rethink of the computer architecture as well. There are numerous facilities and ideas to revisit, now that technology parameters are significantly different than they were when these facilities were first investigated. In particular, SASOS (single address space operating systems) I think will make a lot of sense … and they will enable us to move the translation point — notionally, where the TLB sits — from right next to the L1 cache out to the main memory, perhaps even further. We should be putting the file system into the address space (memory-mapped files, persistent objects, etc.) as well. This is do-able for high-end systems and would trickle down to the commodity space if we can show it to be worthwhile.
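
A toy model of moving the translation point outward, as suggested above: the cache is looked up with virtual addresses, so the TLB is off the hit path, and translation happens only on a miss, on the way out to memory. This is a simplified illustration, not a description of any existing CPU; a single address space (SASOS) is part of what makes virtually tagged caches like this clean, since it removes the synonym and homonym problems.

    /* Toy virtually tagged cache: hits need no translation; translation (a
     * stand-in function here) is consulted only on the miss path. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINES      256
    #define LINE_SHIFT 6

    static struct { uint64_t vtag; int valid; } cache[LINES]; /* virtually tagged */

    static uint64_t translate(uint64_t vaddr) {  /* stand-in for a page-table walk */
        return vaddr ^ 0x100000;
    }

    static void access_vaddr(uint64_t vaddr) {
        uint64_t line = vaddr >> LINE_SHIFT;
        int idx = (int)(line % LINES);
        if (cache[idx].valid && cache[idx].vtag == line) {
            printf("0x%llx: cache hit, no translation needed\n",
                   (unsigned long long)vaddr);
        } else {
            uint64_t paddr = translate(vaddr);   /* translation pushed to the miss path */
            cache[idx].vtag  = line;
            cache[idx].valid = 1;
            printf("0x%llx: miss, translated to 0x%llx at the memory side\n",
                   (unsigned long long)vaddr, (unsigned long long)paddr);
        }
    }

    int main(void) {
        access_vaddr(0x2000);
        access_vaddr(0x2040);   /* different line: another miss */
        access_vaddr(0x2000);   /* hit: never touches translation */
        return 0;
    }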

And, relative to the high end … enterprise computers and supercomputers are effectively large embedded systems (for all the reasons listed), so they should be treated as such. For instance, use the same optimization techniques, use low-power DSPs (they use Cray-style architectures, for one thing), etc.

… and that’s it for the meat. Hope you find this useful.


Comments

There are 14 replies to “Trends in Memory Systems”

  1. hello on 2010-02-04 9:43 PM

    Such a long wall of English is exhausting to read. Could someone briefly summarize what it is saying?

  2. 1help1 on 2010-02-04 10:58 PM

    I'd also call for a Chinese abstract…

  3. 1 on 2010-02-05 7:11 AM

    This is indeed a very exciting design; I hope the idea can be realized soon.

  4. 无知者无畏 on 2010-02-05 8:36 AM

    Quite long. The key argument seems to be that, since FLASH capacity is advancing so quickly, today's main memory should be split into two pieces: DRAM and FLASH.

    Here is my interpretation:

    This adds another type of storage to the box; some of the data that currently lives on disk or in DRAM is a very good fit for FLASH.

    Perhaps some programs would no longer need to be read from disk into memory at all: they could live directly in FLASH memory, with only newly allocated data regions placed in DRAM.

  5. 杰夫 on 2010-02-05 10:47 AM

    Let me try to give an abstract:

    The article points out several important shortcomings of today's memory systems:

    1. The capacity and speed of individual DIMMs keep growing rapidly, but the bandwidth between the CPU and memory has not changed, so the number of DIMMs per channel keeps dropping: SDRAM allowed eight, DDR four, DDR2 two, and DDR3 one or two. So even if you have plenty of memory, the bandwidth limitation means you cannot actually use it.

    How, then, do we increase bandwidth? One of Intel's approaches was to add an ASIC, but that costs several extra watts of power. So the bottleneck of today's memory systems is not capacity but bandwidth, and the bandwidth problem is ultimately a power and heat problem.

    2. Today's TLB does not scale well. As memory capacity grows, TLB misses become more and more frequent, to the point that modern computers spend about 20% of their time and power handling TLB misses.

    Proposed solutions:
    For the first problem, add a new kind of memory, Flash, as a peer of DRAM, each playing its own role, accessed through load/store instructions. Today's Flash is accessed through the file system, which carries a high overhead.

    For the second problem, push the TLB back, from next to the L1 cache out toward main memory.

    杰夫's own take:

    The article's analysis of memory's weaknesses is right on target, and the proposed solutions are quite novel, but I am skeptical about how well they would actually work. Also, his proposal is a revolutionary change, and industry does not like revolutionary changes: they touch too many things, cost too much, and carry a high risk of failure, so nobody makes them until there is no other choice. Today's memory can hold out for a few more years, so we are not at that point yet.

  6. 陈怀临 on 2010-02-05 10:54 AM

    “One of Intel's approaches was to add an ASIC”

    In case the younger readers don't know what Professor 杰夫 is talking about: this refers to FB-DIMM. Very expensive...
    In Nehalem-EX systems, FB-DIMM was dropped, and the AMB chip was moved onto the motherboard.

  7. Afantee on 2010-07-23 1:12 PM

    I am not sure I agree about the TLB part: “Problem: last-level caches are typically on the order of 10MB, while TLBs typically map on the order of 1MB. This is bad and helps to explain why modern machines spend roughly 20% of their time …”

    Nehalem supports two page sizes, and a big page can be 4 MB, so the total TLB-mapped memory space is much bigger than the last-level cache.

  8. Afantee on 2010-07-23 1:14 PM

    BTW, the statement about the Nehalem TLB is from another post of yours.
    Hoho, Coder, I like your posts.

  9. coder on 2010-07-24 2:58 AM

    @Afantee

    AMD did multi-level TLBs many years earlier.

  10. multithreaded on 2010-07-29 7:39 PM

    As far as I remember, Nehalem supports two levels of TLB.

  11. multithreaded on 2010-07-29 7:43 PM

    > For the second problem, push the TLB back, from next to the L1 cache out toward main memory.

    Nehalem has three levels of cache: L1, L2, and L3. Therefore, the TLB has been pushed back already.

  12. coder on 2010-07-29 7:54 PM

    @multithreaded
    Yeah. Nehalem has a two-level TLB. The L2 TLB contains only small-page entries.

  13. 删吧 on 2010-08-01 8:45 AM

    @all
    We have different ideas about hierarchical TLBs. You could search for patents on the GPU side to get an idea :)

  14. Multithreaded on 2010-08-01 10:29 AM

    So far I could NOT see any new idea that can make a significant impact on computer systems.

    Please note that an idea which is good for a GPU may not apply well to a general-purpose CPU or to another field such as an NPU.