Design Philosophy of FP2 Network Processor Architect Clarke






New Case for a Programmable Fastpath 

Reposted from:

http://www.eetimes.com/design/programmable-logic/4008959/New-Case-for-a-Programmable-Fastpath 

Michael Clarke is a network processor architect at TiMetra Networks. He holds an MA in Engineering from Cambridge University, UK. 

This article is seven years old yet still holds up today. Alcatel's acquisition of the TiMetra team was an excellent deal; the team accumulated a group of top engineers from Treseq/Bay/Nortel.

The article is dense and rich in content. From it you can see the architectural design philosophy behind the FP1 and FP2 network processors.

Key excerpts follow:

Increased packet rates and the relatively long latencies of modern RAM technologies require that an ever-increasing number of packets be processed simultaneously. 

High packet rates and long RAM access latencies mean that many packets must be processed in parallel. There are two ways to increase the number of packets in flight: add more hardware threads, or add pipeline stages. TiMetra uses the latter.
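As a back-of-the-envelope illustration (my own sketch, not from the article): by Little's law, the number of packets that must be in flight equals the arrival rate times the per-packet latency. The 15 Mpps rate and the 20-80 lookups per packet are the article's figures; the 100 ns memory latency is an assumed round number.

```c
/* Rough sizing of the parallelism an NPU needs. The packet rate and
 * lookup count are the article's figures; the 100 ns table-memory
 * latency is an assumption for illustration. */
#include <stdio.h>

int main(void) {
    double pps        = 15e6;  /* 10GE worst-case arrival rate        */
    double lookups    = 40;    /* mid-range of 20-80 lookups/packet   */
    double latency_ns = 100;   /* assumed average table-read latency  */

    /* Little's law: N = arrival rate x time in system. With the
     * lookups serialized, each packet stalls lookups * latency. */
    double in_flight = pps * lookups * latency_ns * 1e-9;
    printf("packets in flight: %.0f\n", in_flight);  /* -> 60 */
    return 0;
}
```

Sixty-odd packets in flight is far more than one core can hold, which is why the choice between hardware threads and pipeline stages arises at all.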

NPUs facilitate a number of generic lookup types, such as direct lookups, hash lookups, radix trie lookups, and CAM lookups. The size of the tables and the contents of the table entries and CAM keys are entirely defined in firmware. 

Table lookups must be programmable. So the trie algorithms for IPv4 FIB lookup and the like are presumably implemented in microcode rather than in hardware. This differs from Cisco's approach: Cisco's Tree Bitmap is described as a hardware implementation, not NPU microcode.
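For concreteness, here is a minimal sketch of the kind of stride-based trie walk such microcode might implement for an IPv4 longest-prefix match. The 8-bit stride and node layout are my own illustrative assumptions, not TiMetra's actual table format.

```c
#include <stdint.h>

/* Hypothetical 8-bit-stride trie: each node is an array of 256
 * entries; an entry may point at a child node and/or carry the best
 * next hop seen so far. Layout is illustrative only. */
struct trie_entry {
    uint32_t child;    /* index of child node, 0 = none          */
    uint32_t nexthop;  /* next hop valid on this path, 0 = none  */
};

struct trie_entry trie[4096][256];   /* node count sized arbitrarily */

uint32_t lpm_lookup(uint32_t dip) {
    uint32_t node = 0, nh = 0;               /* start at the root      */
    for (int shift = 24; shift >= 0; shift -= 8) {
        struct trie_entry *e = &trie[node][(dip >> shift) & 0xff];
        if (e->nexthop) nh = e->nexthop;     /* remember longest match */
        if (!e->child)  break;               /* no deeper node: done   */
        node = e->child;
    }
    return nh;                               /* 0 = no route           */
}
```

The walk costs at most four table reads per packet, and because the table layout lives entirely in firmware, the stride or node format can change without new silicon.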

To provide sufficient processing power, NPUs use the fact that packets can be processed largely independently of one another. NPUs typically have multiple processor cores and a number of packets are processed in parallel or in a pipelined fashion. A great deal of effort can be focused on designing the individual processor cores, making them as speed, area and power efficient as possible. These optimized designs can then be replicated many times, reducing the overall implementation effort for a large proportion of the silicon area of the NPU. 

Effort is concentrated on optimizing each core for speed, area, and power; the optimized core can then be replicated many times, arranged in parallel or in a pipeline.

Because the performance of a single thread is not as critical in NPUs as in traditional CPUs, using additional silicon area to provide increased processor performance must be weighed against simply using the extra silicon area to add more processor cores. 

Spend more silicon making each core very powerful, or spend the same silicon on more cores? The design leans toward the latter.

Table memories are not necessarily cached because the design will be expected to work at line-rate with any combination of input packets and must take into consideration the worst-case latency to random table entries. 

External tables are looked up at line rate with no cache. This differs from Cisco's approach, which does use caches for table lookups.

Performance is now determined by instruction counts and lookup bandwidth. 

Overall NPU performance is determined by two metrics: the instruction count per packet and the lookup bandwidth.

between 100 and 1000 instructions can be executed for each packet. 

Processing budget: the NPU can execute roughly 100 to 1000 instructions per packet.

Typical fastpath programs make between 20 and 80 lookup table accesses per packet to implement a meaningful feature set.

Each packet gets 20 to 80 table lookups, leaving headroom for feature growth.

Also, programs are heavily tied to the NPU’s specific hardware resources. Hence the requirement to support legacy code is not as strong for NPUs as it is for general purpose CPUs, and this affords the designer some freedom to implement novel architectures.

The NPU is a special-purpose chip: it need not support legacy code or conventional programming languages, and can define its own ISA to maximize processing performance. This differs from Cisco's thinking; Cisco insists on programming in C.

Essentially NPUs execute many bit level compares and moves, interspersed with regular computation of table entry addresses for lookups.  The mix of instructions in a typical application differs between NPUs and CPUs and this must be taken into account when assessing the effectiveness of alternative architectures. 

NPUs perform large numbers of bit-field compare operations, so their instruction sets differ from those of general-purpose CPUs. Cisco, by contrast, uses the standard Tensilica instruction set.
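To see why the instruction mix skews toward bit-field work, consider assembling a 5-tuple ACL/QoS lookup key. This is a plain-C sketch; the offsets assume an untagged IPv4/TCP frame, and the key layout is an illustrative assumption.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative 5-tuple key for an ACL/QoS (e.g. TCAM) lookup. Real
 * NPU microcode does this extract/pack work with specialized
 * bit-field instructions rather than byte copies. */
struct acl_key {
    uint32_t sip, dip;       /* IP source/destination address  */
    uint16_t sport, dport;   /* TCP source/destination port    */
    uint8_t  proto;          /* IP protocol                    */
};

/* Offsets assume an untagged IPv4/TCP frame (14-byte Ethernet
 * header); 802.1q tags or MPLS labels would shift every offset,
 * which the fastpath program must handle. */
void build_key(const uint8_t *pkt, struct acl_key *k) {
    memcpy(&k->sip, pkt + 14 + 12, 4);
    memcpy(&k->dip, pkt + 14 + 16, 4);
    k->proto = pkt[14 + 9];
    uint8_t ihl = (pkt[14] & 0x0f) * 4;       /* IP header length */
    memcpy(&k->sport, pkt + 14 + ihl, 2);
    memcpy(&k->dport, pkt + 14 + ihl + 2, 2);
}
```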

Sufficient bandwidth for lookups must also be provided and the right balance must be struck between processing power and lookup bandwidth.

A balance must be struck between NPU processing power and lookup bandwidth.
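Plugging the article's own numbers together shows what that balance looks like; the ranges are the article's, only the arithmetic is added here.

```c
/* Per-packet instruction and lookup budgets at 10GE line rate.
 * All ranges are quoted from the article; only the arithmetic
 * is added. */
#include <stdio.h>

int main(void) {
    double pps = 15e6;   /* 10GE worst-case packets per second */

    /* 16 cores at 100 MHz is the article's low end, matching its
     * ~100 instructions/packet floor; the ~1000 ceiling reflects
     * power limits, not the raw 64-core, 1 GHz maximum. */
    printf("instr/pkt, low end : %.0f\n", 16 * 100e6 / pps);

    /* 20-80 table accesses per packet sets the lookup bandwidth
     * the memory system must sustain: */
    printf("lookups/s, min: %.1e\n", 20 * pps);   /* 3.0e+08 */
    printf("lookups/s, max: %.1e\n", 80 * pps);   /* 1.2e+09 */
    return 0;
}
```

A design with a generous instruction budget but starved lookup bandwidth (or vice versa) wastes silicon, which is the balance the article refers to.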

NPUs may also be cascaded to provide additional processing performance and lookup bandwidth. 

NPUs can be cascaded to provide more processing power and more lookup bandwidth. This is exactly what IOM2/IOM3 do.

 

New Case for a Programmable Fastpath

Kevin Macaluso and Michael Clarke, TiMetra Networks

5/15/2003 6:14 AM EDT

You’ve heard all the debates about the relative merits of ASICs and Network Processor Units (NPUs). For the past three years, many vendors have run around saying that homegrown ASICs are inflexible and the redesign process is too long, while others have argued that NPUs simply don’t meet the required performance criteria with full feature sets.

Things, however, are starting to change. With packet processing requirements increasing, feature sets up in the air, and ASIC design costs on the rise, it’s much easier to make a case today for a shift away from traditional hard-coded ASIC-based fastpaths toward programmable fastpath architectures.

Service routers are a new generation of routers, delivering significantly more functionality, with all features and services supported at 10 Gbit/s line rates. These routers must be capable of providing additional inspection, classification and encapsulation well beyond that of conventionally routed best-effort traffic. Traffic management in a service router must also support per-service SLAs with PIR and CIR QoS commitments.

Current 0.18- and 0.13-micron custom silicon processes, required to achieve the demanded improvements in performance, provide large amounts of silicon area, but are increasingly time-consuming and expensive to develop and modify. For the NPU designer, the opportunity to trade silicon area for flexibility and programmability has become extremely attractive.

In this article, we’ll take a technical look at the design decisions affecting a custom, yet fully programmable fastpath for service routers. This article does not address merchant silicon based solutions which have, to date, failed to hit the market’s “performance versus flexibility” mark at an acceptable price point. We’ll start the discussion by looking at the packet classification requirements in today’s service router designs. We’ll then lay out and compare traditional hard-coded and programmable fastpath designs.
Today’s Packet Classification Process

Like many other elements in networking design, packet classification has gone through a stark transformation in the past several years. With the amount of traffic increasing and more information being embedded in a packet, designers must turn to deeper packet classification in their fastpath designs.

Today’s packet classification begins with a series of table lookups based on various fields in the packet header. Examples include:
  • Lookups for Layer 2 services, including source and destination MAC addresses
  • Mapping to an egress port within a link aggregation group (LAG)
  • Layer 3 lookups to determine classless inter-domain routing (CIDR) and equal-cost multipath (ECMP) load balancing next hops (a hash-based selection sketch follows this list)
  • Searching access control list (ACL) filters and quality-of-service (QoS) policies based on IP and TCP header fields.
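
As a concrete illustration of the LAG/ECMP selection step referenced above, a member can be chosen by hashing flow-invariant header fields so that one flow always takes one path and is never reordered. The FNV-1a hash and the field choice here are illustrative assumptions, not the article's algorithm.

```c
#include <stdint.h>
#include <string.h>

/* FNV-1a: an illustrative, cheap hash over the flow fields. */
static uint32_t fnv1a(const uint8_t *p, int n) {
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Map a flow onto one of n_members LAG ports (or ECMP next hops).
 * Hashing only flow-invariant fields keeps every packet of a flow
 * on one member, avoiding reordering. */
int select_member(uint32_t sip, uint32_t dip,
                  uint16_t sport, uint16_t dport, int n_members) {
    uint8_t key[12];
    memcpy(key,      &sip,   4);
    memcpy(key + 4,  &dip,   4);
    memcpy(key + 8,  &sport, 2);
    memcpy(key + 10, &dport, 2);
    return (int)(fnv1a(key, sizeof key) % (uint32_t)n_members);
}
```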

Many additional complexities arise, such as the location of fields in the packet varying depending on whether 802.1q tags or MPLS encapsulations are present. IP fragmentation may have to be performed to meet the MTU requirements of the destination interface.

The results of the lookups are combined to determine how the packet should be directed through the switch fabric and its queuing requirements, such as forwarding class and QoS. Implementations tend to prepend this information to the packet in the form of a routing tag. The packet itself may have to be modified in different ways, ranging from simply decrementing the IP TTL field, to encapsulation with a multiprotocol label switching (MPLS) header. Statistics counters need to be maintained and traffic policing functions may have to be performed.
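For the simplest modification mentioned above, decrementing the IP TTL, the per-packet work looks roughly like this sketch. The incremental checksum update is the standard RFC 1141/1624 technique, not anything vendor-specific.

```c
#include <stdint.h>

/* Decrement the IPv4 TTL and patch the header checksum
 * incrementally instead of recomputing it over the whole header.
 * 'iph' points at the first byte of the IPv4 header. */
int decrement_ttl(uint8_t *iph) {
    if (iph[8] <= 1)
        return -1;              /* TTL expired: punt off the fastpath */
    iph[8]--;                   /* TTL is byte 8 of the header        */

    /* TTL sits in the high byte of a checksummed 16-bit word, so
     * decrementing it by 1 means adding 0x0100 to the checksum and
     * folding the end-around carry back in. */
    uint32_t sum = (uint32_t)((iph[10] << 8) | iph[11]) + 0x0100;
    sum = (sum & 0xffff) + (sum >> 16);
    iph[10] = (uint8_t)(sum >> 8);
    iph[11] = (uint8_t)(sum & 0xff);
    return 0;
}
```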
Traditional Implementations

The traditional approach to fastpath implementation involves identifying all of the possible packet classification and manipulation requirements and committing them to silicon. This was done in previous generations of routers because these tasks were more readily defined and ASICs were relatively inexpensive to spin by today’s standards. Additionally, there simply wasn’t any spare area afforded by the silicon processes available at the time to do things differently.

This approach also demands that crucial design decisions are made early in the development cycle. Packet formats, lookup table entries, and content-addressable memory (CAM) lookup keys all have to be defined. Resources such as internal and external RAMs and CAMs are selected and sized based on bandwidth and scalability requirements. A certain amount of configurability is implemented to mitigate against anticipated future requirements, such as providing facilities to turn individual features on or off, or to choose between subtle variations of a feature.

A fundamental deficiency in the traditional approach is that protocols not considered or defined at this stage cannot be supported later on without new silicon. Arising problems, such as unforeseen interoperability issues, may also not be able to be satisfactorily resolved without new silicon. Some support for resizing tables may also be provided but can prove limiting if applications with unconventionally large table requirements are to be supported.

Once the packet and table formats have been defined, accesses to resources are scheduled and the flow of packets through the device is arranged. This is achieved by considering the latency of the memories employed and the interdependencies of the lookups.

Increased packet rates and the relatively long latencies of modern RAM technologies require that an ever-increasing number of packets be processed simultaneously. Straightforward deterministic scheduling of accesses to resources is complicated by non-deterministic lookups such as descending through a hash chain. Arbiters may have to be designed to handle cases where multiple state machines simultaneously require access to a shared resource. Counters and policers require atomic access to table entries. Rather than designing generic mechanisms to address these problems, optimized schemes for each operation are usually employed. Handling the complexity of such a design is difficult.

As with any silicon design, the consequences of errors are significant. Small anomalies in the implementation of features can make those features unusable in production networks. A high-level engineering approach can be employed to achieve a high quality result by rigorously abstracting the definition of the packet and table formats and lookup algorithms from the detailed hardware design.

There are two major benefits to this traditional approach to fastpath design. First, designs can be very efficient in terms of silicon area and power consumption. Second, because the precise arrangement of table accesses is known in advance, it is relatively straightforward to ensure that such a solution will perform at line-rate for a given set of enabled features.
Making the Move to NPUs

In making the move from traditional ASIC-based approaches to programmable NPUs, designers must resolve the deficiencies of traditional approaches and achieve the same performance as a hardwired fastpath, while providing an order of magnitude increase (at least) in flexibility. The size and function of the lookup tables should be fully programmable. Any type of packet manipulation must be possible, including generically replacing large quantities of the packet with data derived from lookups.

Solutions range from embedding FPGA-like regions into an ASIC, to the currently favored solution of NPUs. NPUs comprise a number of true processor cores, using an execution pipeline driven from an instruction set that is highly optimized for networking applications. While NPUs utilize silicon area less efficiently than a hardwired solution, they are not unreasonably inefficient when compared with the area of traditional packet handling logic and associated table memories.

To provide sufficient processing power, NPUs use the fact that packets can be processed largely independently of one another. NPUs typically have multiple processor cores and a number of packets are processed in parallel or in a pipelined fashion. A great deal of effort can be focused on designing the individual processor cores, making them as speed, area and power efficient as possible. These optimized designs can then be replicated many times, reducing the overall implementation effort for a large proportion of the silicon area of the NPU.

Another characteristic that distinguishes NPUs from a general-purpose processor is the inclusion of hardware to assist in passing packets from dedicated packet interfaces to the processor cores with very low processor overhead. The processor cores typically have very low latency and highly flexible access to the packets they are processing.

NPUs facilitate a number of generic lookup types, such as direct lookups, hash lookups, radix trie lookups, and CAM lookups. The size of the tables and the contents of the table entries and CAM keys are entirely defined in firmware. Both internal and external memories may be provided for storing tables. Multiple external interfaces are provided to specific devices such as SDRAMs, SRAMs and CAMs, as well as in the form of generic lookaside interfaces.

Table memories are not necessarily cached because the design will be expected to work at line-rate with any combination of input packets and must take into consideration the worst-case latency to random table entries. For the same reason SRAMs tend to be favored over SDRAMs for storing tables to avoid the bank-to-same-bank access latencies of SDRAMs.

Double-data-rate (DDR) SRAMs may be preferred for tables where the processor is mainly reading, while quad-data-rate (QDR) SRAMs may be preferred for features like statistics counters, where there is an even balance between reads and writes. For ACL and QoS IP and TCP header lookups, or other n-tuple lookups where a number of independent fields are to be classified, ternary CAMs (TCAMs) may be employed.
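The non-deterministic "descending through a hash chain" case mentioned earlier might look like the sketch below; the bucket layout is an assumption for illustration. The chain walk is what makes the number of memory reads per lookup unpredictable.

```c
#include <stdint.h>

/* Illustrative MAC-table entry: a hash picks the first bucket; on a
 * key mismatch the code follows the chain, so a lookup may take one
 * memory read or several. */
struct mac_entry {
    uint64_t mac;    /* 48-bit key in the low bits, 0 = empty  */
    uint32_t port;   /* result: egress port                    */
    uint32_t next;   /* index of the next entry, 0 = end       */
};

#define MAC_BUCKETS 65536
extern struct mac_entry mac_table[];   /* buckets plus overflow area */

int mac_lookup(uint64_t mac) {
    uint32_t idx = (uint32_t)(mac ^ (mac >> 16) ^ (mac >> 32)) % MAC_BUCKETS;
    for (struct mac_entry *e = &mac_table[idx]; ; e = &mac_table[e->next]) {
        if (e->mac == mac) return (int)e->port;   /* hit          */
        if (e->next == 0)  return -1;             /* end of chain */
    }
}
```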
Design Tradeoffs

Trade-offs must still be made in the design of the NPU that will affect its ultimate flexibility and performance. Performance is now determined by instruction counts and lookup bandwidth.

There are many options for differentiation in the design of the processor cores. A 10 Gigabit Ethernet interface has a worst-case packet arrival rate of 15 million packets per second (Mpps). Assuming a processor clock frequency between 100 and 1000 MHz and the number of processor cores being between 16 and 64, and taking power limitations into account, between 100 and 1000 instructions can be executed for each packet.

Off-the-shelf instruction sets could be used or extended. With such a small number of instructions that can be executed for each packet, the overall size of programs is correspondingly small. Also, programs are heavily tied to the NPU’s specific hardware resources. Hence the requirement to support legacy code is not as strong for NPUs as it is for general purpose CPUs, and this affords the designer some freedom to implement novel architectures in an attempt to close the efficiency gap between a hardwired and a processor-based solution.

Essentially NPUs execute many bit level compares and moves, interspersed with regular computation of table entry addresses for lookups. Examples include manipulating fields in tightly packed table entries and selecting packet fields for inclusion in a CAM lookup key.

Increasing the effectiveness of individual instructions and increasing the average number of instructions executed per clock cycle are two ways of improving performance in processors for a given clock frequency, so alternatives such as heavy microcoding and VLIW-style fetching of two instructions per clock cycle may be considered. Solutions akin to those used in DSPs, such as incorporating very specialized instructions for comparing multiple fields simultaneously, may also be considered. The mix of instructions in a typical application differs between NPUs and CPUs and this must be taken into account when assessing the effectiveness of alternative architectures. As broad a range as possible of NPU applications, such as both IP and ATM forwarding, must be considered.

A strength of RISC-based architectures that must be maintained is that as many functional units as possible are kept fully utilized in typical code flows. Because the performance of a single thread is not as critical in NPUs as in traditional CPUs, using additional silicon area to provide increased processor performance must be weighed against simply using the extra silicon area to add more processor cores.

Sufficient bandwidth for lookups must also be provided and the right balance must be struck between processing power and lookup bandwidth. Typical fastpath programs make between 20 and 80 lookup table accesses per packet to implement a meaningful feature set. Multiple requests to shared resources will be outstanding simultaneously. This can be addressed in hardware by enforcing fairness between requesters and in firmware by ensuring that the bandwidth of individual resources is not exceeded.

Service routers generally require more complex algorithms than traditional router applications. Enough overhead must be provided at the hardware design stage for future feature and scalability requirements. NPUs can be scaled by adding special purpose peripheral chips that attach to the NPU using its lookaside bus to provide additional features.
NPUs may also be cascaded to provide additional processing performance and lookup bandwidth.
Tools Start Answering the Call

Critics often say that NPUs are difficult to program. This is because complex design decisions still exist. Previously, hardware designers were solely involved with scheduling accesses to resources such as table memories. Now this is encompassed by the firmware.

The tools used to develop the firmware for NPUs now contribute to the effectiveness of the solution. These tools can present a simplified and single threaded programming model for the processor array. They can help with managing packet and table format definitions throughout the system and with scheduling accesses to resources. Protocol design can be abstracted further from the hardware, in contrast to a traditional fastpath implementation where the protocol design and the hardware design are considered simultaneously.
Queuing/Switch Fabric Access

While the NPU’s packet classification and manipulation functions form the heart of a programmable fastpath, a complete forwarding-plane solution also requires queuing and, in the case of a system with multiple linecards and a switch fabric, interfacing to the switch fabric through a backplane. A traffic management (TM) device that implements the QoS policies determined by the NPU performs the queuing function.

Queuing in a service router is more demanding than in a traditional router. Tens of thousands of queues have to be supported because each service is queued separately. In fact, each service may have multiple queues allocated to it. Each queue has an independent PIR shaper and CIR marker in order to provide SLA-based services.

Such queuing operations involve very specialized atomic operations on large amounts of data and as such are unsuitable for implementation using a processor array. The goal is for the NPU to retain detailed control over how the TM queues its packets, thus the NPU selects destination queues, forwarding classes, drop preferences, and other such parameters.
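The per-queue PIR shaper and CIR marker can be pictured as a dual token bucket, as in the sketch below. This is a conceptual illustration of the mechanism, not the TM device's actual implementation; rates and units are assumptions.

```c
#include <stdint.h>

/* Dual token bucket per queue: the PIR bucket gates whether the
 * queue may transmit at all (shaping); the CIR bucket decides
 * in-profile vs. out-of-profile marking. Illustrative only. */
struct bucket { double tokens, rate, burst; uint64_t last_ns; };

static void refill(struct bucket *b, uint64_t now_ns) {
    b->tokens += b->rate * (double)(now_ns - b->last_ns) * 1e-9;
    if (b->tokens > b->burst) b->tokens = b->burst;
    b->last_ns = now_ns;
}

/* Returns 0 = within CIR (in-profile), 1 = above CIR but within
 * PIR (marked out-of-profile), -1 = above PIR (hold the packet). */
int shape_and_mark(struct bucket *cir, struct bucket *pir,
                   int len, uint64_t now_ns) {
    refill(cir, now_ns);
    refill(pir, now_ns);
    if (pir->tokens < len) return -1;   /* PIR shaper: wait        */
    pir->tokens -= len;
    if (cir->tokens >= len) {
        cir->tokens -= len;
        return 0;                       /* conforming to the CIR   */
    }
    return 1;                           /* out-of-profile: mark it */
}
```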
Benefits of a Programmable Fastpath

There are already real-world examples that demonstrate the benefits of NPU flexibility. Recently, the draft-Martini encapsulation was adopted for virtual private LAN service (VPLS). Moving away from a pre-standard or proprietary encapsulation for transparent LAN services to support this emerging standard would have been impossible in a fully hardwired design. In contrast, NPU-based solutions would require relatively straightforward modifications to header formats and suitable changes to the program flow.

Another benefit of NPU flexibility is that it allows for a phased implementation of complex features that are currently not broad market requirements but may become so in the future. Instead of having to devote resources and intellectual effort to providing a service now, such as hardwired IPv6 routing, a vendor can ship hardware that is guaranteed to be upgradeable in the field via microcode to support IPv6 routing in the future. Time to market for new features is improved, as is flexibility in meeting specific market and customer demand.

Service routers enable providers to consolidate a broad range of revenue-generating services onto a common IP/MPLS infrastructure and support new service offerings that enhance their competitive position in the market. Deploying service routers that use a programmable 10-Gbit/s fastpath benefits service providers in the following ways:

  • Reduces CAPEX by adding features into the fastpath without hardware (I/O module) replacement
  • Reduces operational disruption and delivers significant OPEX savings by eliminating truck rolls for new feature upgrades
  • Enables faster time-to-market for new features and services because developing firmware is months faster than spinning ASICs
  • Offers ATM-like QoS capabilities for SLA-based services over IP/MPLS networks
  • Delivers any service at any speed on any port — contrasting feature-specific I/O modules with varying performance depending on ASIC version and features required

With the availability of both fully programmable 10 Gbit/s NPUs and TMs that support per-service queuing, the new generation of service routers that are now being deployed will help carriers satisfy customer demand for a wide range of cost-effective, SLA-based data services over IP/MPLS.

About the Authors

Kevin Macaluso is the vice president of product management at TiMetra Networks. He holds a BS in Electrical Engineering from Santa Clara University and an MBA from Harvard Business School. He can be reached at kmacaluso@timetra.com.

Michael Clarke is a network processor architect at TiMetra Networks. He holds an MA in Engineering from Cambridge University, UK.

 


Comments

There are 18 replies to “Design Philosophy of FP2 Network Processor Architect Clarke”

  1. 吴辉 on 2010-09-12 8:45 PM

    Increased packet rates and the relatively long latencies of modern RAM technologies require that an ever-increasing number of packets be processed simultaneously.

    High packet rates and long RAM access latencies mean that many packets must be processed in parallel. There are two ways to increase the number of packets in flight: add more hardware threads, or add pipeline stages. TiMetra uses the latter.

    Why did they choose the latter?

  2. southbayer on 2010-09-12 10:36 PM

    I think it's for line-rate processing.

  3. 吴辉 on 2010-09-12 10:57 PM

    These two packet-processing models and whether you can run at line rate are two different things.

  4. droplet on 2010-09-13 1:17 AM

    A pipeline can probably increase programming flexibility.

  5. aaa on 2010-09-13 3:29 AM

    Why do Cisco and ALU do route lookups in hardware or in the NP rather than with a TCAM? Is it because TCAMs are too expensive and burn too much power?

  6. SDH on 2010-09-13 6:59 AM

    Pipeline programming is fairly simple, and I like using it, but tuning the load across the stages is more troublesome; getting it right is not easy.

  7. 阿牛 on 2010-09-13 10:27 AM

    TCAM cost is high. Flexibility may also be one of the factors.

  8. 陈怀临 on 2010-09-13 10:44 AM

    >> There are two ways to increase the number of packets processed in parallel: add more hardware threads, or add pipeline stages.

    Within computer architecture, there are exactly two ways to increase ILP:

    – Superscalar: e.g. multiple integer units and multiple floating-point units. This is where multi-issue comes from; out-of-order execution and the like are all variations or derivatives of it.
    – Pipelining, the classic approach. The precondition is that the stages must be evenly divided, e.g. each stage must complete in one cycle, so the pipeline can flow; otherwise the bottleneck is whichever stage runs slowest. The most annoying technique here is the so-called interlock, among other things.

    These two technical routes are parallel, and yet they can be combined...

    This is also why TLP is such a big deal now (it addresses memory access latency).

    So for a pure L2/L3 switch or router, a pipelined architecture is feasible and reasonable.

    But if a system has to do DPI or all sorts of other things, like the ASR1K, a pure pipeline is questionable...

    Again, there are no beauties in this world; the beauty is in your own heart.

  9. 吴辉 on 2010-09-13 9:16 PM

    Well said, Chief.

  10. 黑猫 on 2010-09-29 8:42 AM

    Viewed from the perspective of multi-CPU systems, there are two kinds of NP: SMP and ASMP.
    1) Run-to-completion is SMP: a symmetric multiprocessing system in which every CPU core has its own L1 cache and runs the same program on the same kind of task. The cores share DRAM and peripherals (holding routing tables, counters, etc.) over a crossbar/bus, so it is also called UMA.
    2) Pipeline stages are ASMP: an asymmetric multiprocessing system in which CPUs in different pipeline stages handle different tasks, with the stages connected by FIFOs.
    The FP2 is ASMP, dividing its 112 cores into 7 pipeline stages.

    http://en.wikipedia.org/wiki/Asymmetric_multiprocessing
    http://en.wikipedia.org/wiki/Symmetric_Multi-Processing

  11. fpeking on 2010-11-21 1:43 AM

    It's not clear how the bitmap is implemented, but radix trees have obvious limitations.

  12. HJ on 2011-06-28 10:08 PM

    Today ALU announced the FP3: 400G.

  13. freshfruit on 2011-06-29 5:57 AM

    ALU ranks third in high-end and core router market share; it makes sense for them to develop their own processor.

  14. 黑猫 on 2011-07-05 12:42 AM

    Alcatel moves fast on NPs and is worth learning from, but it moves too slowly on switch fabrics, presumably because too few people and too little money are invested. I expect that when the FP3 ships in 2012, its 200G/slot will deliver only 100G/slot with one fabric card failed.

    As of today, with one fabric card failed, the actual per-slot throughput across the industry is as follows:
    ASR9K: 90G/slot
    MX960: 80G/slot
    AL7750: 50G/slot (CFM3)
    AL7750: 100G/slot (即将发布的CFM4)
    NE40EX: 100G/slot(2010年已经发布的SF2)

    More information:
    Cisco is 180Gbps/slot today but not full-duplex, so they can get roughly 90G/slot today full-duplex in a redundant fashion. Their announced 24x10GE and 2x100GE blades will be accompanied by a new RSP when they are actually shipping (EOY?), which should boost the ASR9K to 220Gb/slot (440G/s half-duplex).

    The new ASR9922 behemoth will be ~500G/slot full-duplex day one and do so redundantly.

    ALU will not have a 200G/slot full-duplex option when the first FP3 cards ship (1Q 2012), they likely will not have the option until the end of next year.

    Juniper will have a next-gen SCB for the MX960 but who knows when that will be out.

    ALU will eventually have a leg up with the SR12 due to its 10 slots and the ability to do 20x10GE per slot. They had a huge lead on the other vendors with the FP2 but they have been stuck at 50G/slot for a long time now. The CPM4/SFM4 just coming out now (9.0R4) finally puts them ahead of everyone else in shipping a true full-duplex 100G/slot box, just in time to go behind again…
    http://www.lightreading.com/document.asp?doc_id=209408&piddl_msgpage=1#msgs
    http://juniper.cluepon.net/Trio_oversubscription

  15. boblee2000 on 2011-07-14 8:49 PM

    There are two ways to increase the number of packets processed in parallel: add more hardware threads, or use pipeline stages.
    --------------------------------
    From a developer's point of view, the former is preferable: it is more flexible to use and can handle fairly complex application scenarios, but squeezing performance out of it is hard. For relatively simple scenarios the latter may be better: performance ramps up quickly and the processing flow is simpler. I suspect they chose the latter because their target applications are switching/routing-style workloads that lean heavily on table lookups and have simple flows, so the performance advantage shows immediately.

  16. multithreaded on 2011-07-15 12:36 PM

    Another option is to increase the number of pipelines. Whether you increase the number of hardware threads or the number of pipelines, packet ordering is a nasty problem.

  17. who on 2011-10-29 1:52 AM

    A question about analyzing pipeline bottlenecks:
    My current understanding of a pipeline implementation: split the function into modules, one module per function; once the pipeline fills, one result comes out every longest-stage cycle, and this is fairly straightforward to implement in RTL.

    The key question is: how can you determine, at design time, which module will be the bottleneck? And what exactly is an interlock? Is it that stalls between pipeline stages mean a long wait before the next packet can be processed?

  18. AnOnyMous on 2011-10-30 4:58 AM

    “Sufficient bandwidth for lookups must also be provided and the right balance must be struck between processing power and lookup bandwidth.

    A balance must be struck between NPU processing power and lookup bandwidth.”
    The “processing power” here should refer to the capacity to process services, not raw performance. The relationship between lookup BW and capacity cannot be one of “striking a balance”.