Excerpts from “The Pentium Chronicles”





Author’s note: What follows is a reading digest of “The Pentium Chronicles”, one of my personal favorite books. Friends who commented under the 双芯记 post that they couldn’t get hold of the original can browse this; read it alongside the 弯曲 chief’s system design methodology and you should soon be able to ask your boss for a hefty raise. The author was introduced in an earlier post: Intel’s former chief IA-32 architect, made a Fellow in 1997. The book describes in detail the journey from 1990, when the author arrived alone in Oregon to build a design team, through the release of the P6 processor, and his later leadership of Pentium 4 development. Along the way it weaves in plenty of entertaining stories and technical opinions: the NT and Windows 95 teams pounding the table and shouting at each other for hours during a visit to Microsoft, a salute to Carmack, frustration with the Pentium 4 design organization (the marketing team overrode the engineering team and held final technical decision authority), frustration with IPF development, the story behind the FPU bug, and so on. The book quotes George Bernard Shaw, Goethe, and others in many places; clearly Bob has a literary streak as well. One wonders whether he is on Douban.
While I’m at it, a couple of asides about two other books on hardware and chip development:
“The Soul of a New Machine” is also superbly written; it is getting on in years, but hardware system design today is not all that different. For Chinese readers like me, it lives up to its Pulitzer Prize: whole stretches of vocabulary send you to the dictionary.
“The Race for a New Game Machine” is a reasonably interesting book (published in Chinese as 压力下的角逐). The author came from IBM, one level below Fellow, also a big project lead of sorts. The book exposes the gaps between IBM and Intel in processor design on many fronts, such as the infighting between internal product lines. It also has an amusing example: the Cell processor ended up with eight SPEs not because of quantitative analysis, but because after a six-SPE version was done, Sony’s Ken Kutaragi called to say he liked the number eight, so eight SPEs it had to be. The book also touches on two Microsoft architects; I remember reading a paper of theirs at MICRO about how they designed the Xbox 360 processor, how the analysis was done, and why each decision was made. It left me dazzled, thinking how brilliant they were; after reading this book, you realize that paper was basically hot air.

The Pentium Chronicles

FOREWORD

Many of us dreamed of changing the world with great ideas in processor design.

I personally experienced the conceptual phase and the refinement phase as part of a four-person team. There were many questions that we tried to address with our limited ability to formulate a potential design and acquire simulation data. In retrospect, we left many issues unaddressed; our project generated many more questions than we answered. It was a heroic pioneer-style exploration of designing a speculative out-of-order-execution processor.

Bob gives the ultimate principles of great project management: acquire the best minds, organize them so that each of them operates within their comfort zone, protect them from the forces of mediocrity, help them to take calculated risks, and motivate them to tie up the loose ends.

-WEN-MEI HWU

PREFACE

Microprocessor design, on the massive scale on which Intel does it, is like an epic ocean voyage of discovery. The captain and his officers select the crew carefully, so as to represent in depth all the requisite skills. They pick the types, sizes, and numbers of ships. They stock the ships with whatever provisions they think they will need, and then they launch into the unknown. With a seasoned crew, good weather, and a fair amount of luck, most of the voyage will go as expected. But luck is capricious, and the crew must inevitably weather the storms, when everything goes awry at once and they must use all their experience to keep the ships afloat.

My intention is to use real incidents and events to illuminate the general engineering methods; it doesn’t matter who made a particular mistake. What matters are the conditions that led up to a particular design or project error, and how someone might prevent a similar occurrence in future designs. We all make mistakes-that’s one of the book’s major themes. Our designs have to work flawlessly despite us.

I believe that in any well-intentioned, well-run engineering endeavor, choices will be made among multiple plausible options, some of which will turn out to be brilliant, others not. We should always learn from our mistakes, but if we are striving for excellence, we should also consider the triumphs and tribulations of others. I sincerely hope that by exploring lessons from the P6 and Pentium 4 projects, this book provides a way to do that.

I would be guilty of breaking an important rule I learned as an engineering manager: Never let an engineer get away with simply presenting the data. Always insist that he or she lead off with the conclusion to which the data led. The engineer’s job in that presentation is to convince you that he reached the right conclusion. If he or she cannot draw a conclusion, or will not, then either the work or the engineer lacks the maturity to justify making the presentation at all.

P6 was a big project that started with one person (me), grew over several years to 450+ people, and then eventually decreased to zero staff. In between were waves of concurrent activity. For the first year or two, the project was a handful of chip architects conceiving and documenting a general approach that would realize the project’s overall goals. Meanwhile, a design team was being recruited, organized, and trained so that it would be ready to hit the ground running when the architects were ready. Two validation teams, presilicon and postsilicon, were also recruiting, training, and writing tests for the day when they were needed. Other groups were trying to anticipate the requisite manufacturing production challenges and what could be done to ameliorate them. Marketing was preparing their campaign strategies, marketing literature, and promotional materials, and setting up customer visits to validate the architects’ design feature choices and to inform potential customers of what lay ahead on Intel’s product road map. Try plotting all that on a single line.

It’s also clear in retrospect that Fred, in particular, was a masterful corporate diplomat, keeping the existing corporate antibodies from killing or damaging our project, back when we were a bunch of unknowns proposing odd plans and slinging scary-sounding phrases like “out-of-order microarchitectures” and “glueless multiprocessing.” Even early on, Fred gave us the freedom to roam, and trusted us to come back with something worth keeping.

The basic steps that the P6 project would have to accomplish were reasonably clear: conceive a microarchitecture, do the detailed chip design, debug first silicon, and drive the chip into production.

recognition for the dedication, creativity, enthusiasm, and unstoppable elan of a great design team.
But it would be an even greater injustice to say nothing. So with considerable humility and trepidation, but profound gratitude, I offer the following thanks.

Dadi Perlmutter, Mike Fister, Will Swope, and Pat Gelsinger, for the opportunities you gave me and the errors you forgave. In your own way, each of you gave me the maneuvering room to try new things and the support I needed when a few of them went awry.
Frank Binns, thanks for proving that marketing people can do real technical work, and for having a twisted sense of humor-balm for tough days.
Dawn Kulesa, Human Resources representative extraordinaire-for me, you were the human face of what sometimes seemed a cold, impersonal corporate machine. Your initiatives on training and mentoring at Intel were exemplary and helped the team enormously.

George Bernard Shaw wrote, “This is the true joy in life, the being used for a purpose recognized by yourself as a mighty one… the being a force of nature instead of a feverish, selfish little clod of ailments and grievances complaining that the world will not devote itself to making you happy.”

Chapter 1 Introduction

P4
The second day, my boss stuck his head in my office and said, “Your job is to beat the P5 chip by a factor of two on the same process technology. Any questions?” I replied, “Three. What’s a P5? Can you tell me more about Intel’s process technology plans? And where’s the bathroom?”

P6 PROJECT CONTEXT
* Proliferation Thinking
Henry Petroski points out that this flagship/proliferation paradigm is not unique to the microprocessor industry: “All innovative designs can be expected to be somewhat uneconomical in the sense that they require a degree of research, development, demonstration, and conservatism that their technology descendants can take for granted.”

When it became clear that P6’s real importance to Intel lay not so much in its first instantiation (which Intel eventually marketed as the Pentium Pro) as in its “proliferability”, we began to include proliferation thinking in our design decisions. Early in the project, proliferability figured prominently in discussions about the P6’s front-side bus, the interconnect structure by which the CPU and its chip set would communicate.

DEVELOPING BIG IDEAS
The first step in growing an idea is not to forget it when it comes to you.

We tried as much as possible to reuse existing solutions, tools, and methods, but we knew at the outset that existing technology, tools, and project management methods would not suffice to get our project where we wanted it to go. So we purposefully arranged the design team to address special technology challenges in circuits, buses, validation, and multiprocessing.

* Defining Success and Failure
Engineers generally recognize the validity of the saying, “Creativity is a poor substitute for knowing what you’re doing.” (Ignore what Albert Einstein is reputed to have said: “Imagination is more important than knowledge.” That might be valid for a scientist, but as an engineer, I know that I can’t simply imagine my bridge tolerating a certain load.)
Good engineers would much rather use a known-good method to accomplish some task than reinvent everything. In this way, they are free to concentrate their creativity on the parts of the design that really need it, and they reduce overall risk.

Large, successful engineering companies must constantly struggle to balance their corporate learning (as embodied in their codex of Best Known Methods) against the need of each new project to innovate around problems that no one has faced before.

* Senior Wisdom
In most cases, a company will present the “new project” need or opportunity to a few senior engineers who then have the daunting job of thoroughly evaluating project requirements and answering a two-pronged question: What constitutes success for this project, and what constitutes failure? They must identify possible avenues for the project to pursue that will lead to the promised land.
In essence, a few senior people are making choices that will implicitly or explicitly guide the efforts of hundreds of others over the project’s life. It is, therefore, crucial that project leadership be staffed correctly and get this phase right, or it will be extremely hard for the project to recover. Do not begin the project until the right leadership is in place.
Quantity cannot replace quality. Guard these folks when you find them, because you cannot replace them, and their intuitions and insights are essential to getting a project onto the right track and keeping it there through production.

The framework presented is a product of my attempt to impose order and labels on what we actually did, with the benefit of hindsight and the passage of time.
The four major phases I’ve been able to distill are
1. Concept
2. Refinement
3. Realization
4. Production
In the concept phase, senior engineers consider the request or opportunity and try to brainstorm ways to satisfy it.
The refinement phase weeds out the implausible solutions and prioritizes the rest so that the project’s limited engineering effort concentrates on the ideas that are most likely to pan out. Of the initial set of, say, 10 or 20 ideas that exit the concept phase, only two or three are likely to survive the refinement phase. One of these will be designated the plan-of-record and will receive the most attention at the beginning of the realization phase.
Realization is the actual engineering process. The design team takes the best idea that has emerged from the refinement phase (and may even have been the instrument by which refinement occurs) and implements the prototype product.
The last phase of the engineering process-production, driving what you’ve created into solid high volume-is often overlooked by design teams. Design teams must shepherd their creation all the way through volume production, not simply transfer responsibility to some production engineering group at the first sale.

In the chapters that follow, I hope to show that an amazing combination of intelligence, wisdom, stubbornness, and ambition made P6 what it is, and in the computer business, or any business for that matter, excellence is a critical prerequisite to success.

Chapter 2 THE CONCEPT PHASE

OF IMMEDIATE CONCERN
SUCCESS FACTORS
There are no guarantees, but the chance of a successful concept phase is much higher if you start with clear goals, choose the project team wisely, and pay attention to the mechanics.

* Clear Goals
Architects pointed toward an agreed-on goal are far more likely to get a lucrative idea. They must be able to answer these questions above all others: What constitutes project success, and what constitutes failure? The art of all engineering is compromise, trading one thing for another.

Never assume that all team members have the same implicit goals…Write the goals on a giant poster in order of priority and have all team members walk past it to get to their meetings. Projects that don’t know what their goals are have no chance of hitting them.

Our solution was to establish a formal project document, a project-wide plan-of-record, which was our official response to what upper management and marketing were requesting of our new design.
Figure 2.1 The road to the project POR: marketing needs and wants, management vision and direction, and a linear roadmap extrapolation feed the Marketing Requirements Document (MRD); the MRD, combined with technology vision and scars from past projects, passes through “think and argue (a lot)” to produce the Product Implementation Plan (PIP); the PIP in turn drives financial planning, marketing foils, the POR documents, recruitment, space planning, computer purchases, and tool licenses.

* The Right People
The senior leadership of any design project is the single most important predictor of project success, bar none. These are the people the project will count on to keep it on track. They must routinely select a workable path from among a forest of alternatives, most of which eventually turn out to be untenable. Project leadership must constantly check the project’s progress and goals against the competition, monitor the effects of previous decisions, make repairs as necessary, and directly intervene in the actual design when only supreme technical wizardry can transcend the next roadblock.

* P6 Senior Leadership
…Dave has an uncanny gift for identifying (frequently on the basis of scant clues) which of many possible avenues for solving a given problem are most likely to pay off. This gives a design project an enormous mechanical advantage over the possibility space, in that the team can quickly abandon less-promising choices so as to concentrate on the one that will eventually become the POR. Glenn is a genius at creating his way out of whatever corner the project has inadvertently painted itself into, a role he fulfills with aplomb and abundant good humor; he is a joy to work with. Michael was an RCG, but had some industry experience and was so bright and unafraid to speak his mind in any engineering context that he fit into our team immediately. Michael also had the charming and absolutely essential ability to gracefully give suggestions to his boss (me) on how to improve my software without covert snickering or outright laughter in the process. Andy was an idea fountain; mention a problem and he’d instantly recite the seven known ways to solve it and then invent four more spontaneously (at least two of which actually worked). His intellectual energy was extremely valuable to our conceptualizing.
The senior leadership steered an aggressive path, but also purposely worked to keep the project out of trouble. The two junior engineers brought new insights and latest thinking from elsewhere. Their constant questioning of the senior engineers spurred us to higher and better planning throughput and generally kept us from thinking too much inside old boxes.

* Setting the Leadership Tone
Nothing is more corrosive to overall design team morale than perceived disarray in the senior project leadership.

Project leaders must be both resilient and resourceful and they must recognize that they are not going to be given the easy questions – the ones that get answered at lower levels in the project hierarchy. Most or all decisions made at the leadership level will threaten one or more key project goals. There will be no reliable data on the possible avenues for resolution and little time to collect much more. As the project progresses, the choices get tougher, because time is shorter and the degrees of freedom are fewer. All this argues for good judgment up front. The project leaders must reserve enough design margin (but not so much that they are “sandbagging”, which can also kill a project) in the early stages that they have room to maneuver when the surprises appear later.

* Managing Mechanics
The concept phase could also be called the “blue sky” phase because all things seem feasible, the horizons are very far away, and the possibilities are infinite. Many system architects enjoy this phase of a project so much that they are loath to leave it. After all, the crystal castle they’re building in the clouds of their gray matter has only good features, unsullied by compromises, physics, economics, road-map choices, customer preference, and schedule pressure…The project leader must be able to sense when the project’s concept phase is nearing the end of its usefulness and have the fortitude to push the team to the next phase. (Conversely, project leaders must also be able to accurately discern when the project has not yet come up with ideas strong enough, and prevent the project from committing to a plan that cannot produce a successful product.)
Paying attention to some basic mechanics can make it easier to manage this phase and its transitions.

* Physical Context Matters

* The Storage Room Solution
The idea that physical context matters is hardly new.
The converse of this idea is also valuable. If a particular physical context has contributed to the current project state, it makes sense to change that context when the project occasionally stalls and needs a really good new idea. Sometimes, just meeting temporarily in a different room can break the logjam [NUDT experience]

* Beyond the Whiteboard
Our solution to whiteboard limitations was to create a meeting record, or log, that captured our key decisions as we made them. The honor of record keeping fell to a designated “scribe”, whose job was to notice ideas and decisions worth writing down. The scribe would then circulate the meeting minutes among the rest of the concept phase team for error-checking as soon as possible after the meeting. The scribe not only captured decisions made and directions set, but also tried to document roads not taken and why. Brainstorming in the meeting commonly ended with an Aha! moment, in which everyone suddenly realized why an idea we had all thought had some potential would in fact not work. Or we might come upon a related idea that we all felt was a good one but not appropriate for our project. If no one had documented such moments as they occurred, they might have been lost forever.

* “Kooshing” the talkers.
You want doers, not talkers, on a concept phase team. Doers write programs, run experiments, look up half-remembered papers, ask expert friends, and come up with boundary-case questions to help illuminate the issue. Talkers generate endless streams of rhetoric that seem to enumerate every possibility in the neighboring technical space. When you become exasperated and cut them off, however, you will discover that this long list has brought you no closer to resolving anything.
With dozens of complex concepts being invoked or invented, and a few overall project directions still on the table, the possibility space becomes enormous. Each choice you tentatively make generates five new issues, each of which embodies some number of knowns and unknowns. The team will not have much data to guide them in this phase. They must find their way through these choices with a combination of
1, Knowledge (what is in the literature or was proven at last week’s conference)
2, Experience (we did it this way eight years ago, and it worked or didn’t work)
3, Intuition (this way looks promising to me; that way looks scary)

A DATA-DRIVEN CULTURE
An important part of the concept phase is to establish the project’s technical roots. For P6, we had to choose some basic project directions such as 32-bit virtual addressing versus 64-bit, 16-bit performance, front-side bus architecture, and out-of-order microarchitecture versus the P5’s in-order architecture. Somewhat later in the project, we faced similar questions about clock frequency, superpipelining, and multiple processors on a bus.

There is no religious demagogue worse than an engineer arguing from intuition instead of data. If any two concept phase engineers have opposing intuitions, and neither collects data sufficient to convince the other, the debate cannot be resolved and quickly degenerates into a loud, destructive ego contest.
We avoided most such encounters in the P6 by allocating the list of open items to the five participants and then requiring that the item be argued at the next meeting on the basis of data to be collected by then. In this way, we established the project’s data-driven culture. Each person had to think of and then execute an experiment to help resolve a technical issue or unknown in sufficient detail and veracity to convince the rest of us that his conclusion had merit. The lack of time and resources actually worked mostly to our advantage. You might think that there was not enough time to do a really good experiment between Tuesday noon and Thursday 8 A.M., but the short deadlines forced us to think carefully about exactly what was being asked. The thinking was the important part; if we got that part right, then writing up some code or researching the literature was usually simple. And the focus it provided our answer tended to make that answer right, or at least useful.

Of course, insisting on data for all decisions will lead to analysis paralysis. In this case, schedule pressure and proactive leadership are the antidotes. Legendary NASA designer Bill Layman said: “If you force people to take bigger intuitive leaps, and their intuition is sound, then you can get more and more efficient as you force larger and larger leaps. There’s some optimum point where they succeed with ninety percent of their leaps. For the ten percent they don’t make, there’s time to remake the leap or go a more meticulous route to the solution of that particular problem…There is an optimal level of stress.”

* The Right Tool
The important thing was not how pretty the DFA code was; it was what we learned in creating and using it. The lesson for all chip architects at this phase is to get your hands dirty. What you learn now will pay off later when all the decisions become painful trade-offs among must-haves.
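To make the flavor of such a tool concrete, here is a minimal sketch of my own (not the actual DFA; the trace, registers, and latencies are invented) of what a dataflow-limit analyzer does: given a register-level instruction trace, it compares the purely sequential cycle count against the dataflow-limited critical path.

    # Minimal sketch of a dataflow-limit analysis in the spirit of DFA
    # (hypothetical; not the actual Intel tool). Each instruction lists the
    # registers it reads, the register it writes, and a latency in cycles.

    def dataflow_limit(trace):
        """Return (sequential_cycles, dataflow_cycles) for a register trace."""
        sequential = 0
        ready = {}          # register -> cycle at which its value is available
        finish = 0
        for srcs, dst, latency in trace:
            sequential += latency
            start = max((ready.get(r, 0) for r in srcs), default=0)
            done = start + latency
            if dst is not None:
                ready[dst] = done
            finish = max(finish, done)
        return sequential, finish

    # Hypothetical x86-like trace: (source regs, destination reg, latency)
    trace = [
        (["eax"], "ebx", 1),
        (["ebx"], "ecx", 1),
        (["edx"], "esi", 1),   # independent second chain
        (["esi"], "edi", 1),
        (["ecx", "edi"], "eax", 3),
    ]
    seq, limit = dataflow_limit(trace)
    print(f"sequential: {seq} cycles, dataflow limit: {limit} cycles")

On this toy trace the sequential count is 7 cycles against a dataflow limit of 5, which is the kind of gap an out-of-order engine exists to close.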

* The “What Would Happen If” Game
Having a data-driven culture in a design organization is always a good idea, but never more so than when the team’s leaders are learning new technology. When you have spent 10 or 15 years designing computers, as Glenn, Dave, and I had, you will have absorbed a subliminal set of intuitions about what seems reasonable. For example, this size for an L2 cache should mean that the L1 cache is that size; if the CPU is this fast, then the frontside bus must have at least that bandwidth. This kind of experience base can be immensely valuable to a design team, since it focuses the team’s efforts on the few solution sets that might actually work, instead of allowing them to wander in the much larger general possibility space.
But DFA quickly taught us that much of what we had subconsciously learned was wrong; it just didn’t apply to out-of-order engines. Undiscouraged, Dave and I started a little prediction game that went something like this: I would show up at my office and start work. Within an hour, Dave would arrive, sporting an “I know something you don’t” grin, and say, “What do you suppose would happen if you had an out-of-order x86 engine with 300 arithmetic pipelines that could do 200 simultaneous loads but only one store per clock?” Now, Dave didn’t just make up that question on the spot, and he wasn’t asking because he hoped I would know the answer. He was asking because, just last night, he didn’t know, and he stayed up all night experimenting until he found the answer – an answer that apparently delighted him with its unexpected nature, and taught him something worth knowing about this strange out-of-order universe. He could just tell me the answer, but where was the fun in that?
My part of the game was to think through the conditions he specified, and let my intuition provide my best guess (which is exactly what he had done the previous night). Of course, I would cheat a little bit, because I knew that if the answer were obvious, Dave wouldn’t be smiling, so I offered him the most extreme among the possibilities my intuition said were at all feasible. Then, when he told me what he had found, I would realize how far my experience had led me astray, which then motivated me to apply the right amount of corrective mental pressure. Consequently, the next time we played this game, I would get closer on my first try. Occasionally, my intuition even turned out to be right, but many times that meant I had just found a bug in DFA.
This DFA-induced intuition retuning was key to P6’s overall success in the concept phase. Data has a way of making you ask the right questions.
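To give the flavor of those what-would-happen-if experiments (the constraint and numbers below are invented, not Dave’s actual questions), one can bolt a single structural constraint onto the same kind of toy model – say a machine that can retire only one store per clock – and watch how far the ideal estimate moves.

    # Toy "what would happen if" experiment (hypothetical numbers): take a
    # dataflow-limited schedule and add one structural constraint, a single
    # store port, to see how much the ideal estimate changes.

    def schedule(trace, store_ports=None):
        """trace items: (srcs, dst, latency, is_store); returns total cycles."""
        ready = {}
        store_busy_until = 0
        finish = 0
        for srcs, dst, latency, is_store in trace:
            start = max((ready.get(r, 0) for r in srcs), default=0)
            if is_store and store_ports == 1:
                start = max(start, store_busy_until)   # serialize through one port
                store_busy_until = start + 1
            done = start + latency
            if dst is not None:
                ready[dst] = done
            finish = max(finish, done)
        return finish

    stores = [(["eax"], None, 1, True) for _ in range(8)]    # eight independent stores
    print("no port limit   :", schedule(stores))                 # -> 1 cycle
    print("one store/clock :", schedule(stores, store_ports=1))  # -> 8 cycles

Eight independent stores finish in one cycle under the pure dataflow limit but take eight cycles once the single store port is modeled – exactly the sort of gap between intuition and data the game was meant to expose.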

* DFA Frontiers
A subtle benefit of a handcrafted tool such as DFA is that it is written directly by the architects themselves. The act of codifying one’s thinking unfailingly reveals conceptual holes, mental vagueness, and outright errors that would otherwise infest the project for a much longer time, thus driving up the cost of finding and fixing them later. Having the architects write the tool also forces them to work together in a very close way, which pushes them toward a common understanding of the machine they are creating.
We probably got too enamored of this tool, however…Willamette project’s behavioral model…The moral of the story is, don’t fall in love with your tools but rather use or make the right tool for the job.

PERFORMANCE
* Benchmark Selection
It didn’t end up being that simple. The troops had a lot of questions. Performance on which of the dozen important benchmarks? Did we have to get 2* performance or more on all of them? Or did the global average have to be 2*, with some benchmarks substantially better than that, and perhaps a few that were much slower? And if some programs could be less than 2*, how much less could they be? Was it okay for the slowest chip of the P6 generation to be slightly slower than the fastest chip of the previous generation? Were some benchmarks more important than others, and therefore to be weighted more heavily?
We also had to determine what methods we would use to predict P6 performance. Would benchmark comparisons be made against the best of the current generation, or today’s best extrapolated out a few years? In some cases, such as transaction processing, actual measurements are extremely difficult to make, and theoretical predictions are even harder. Yet that type of programming is important in servers, and you cannot just ignore it.
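The book does not spell out how the 2* target was scored across the suite; assuming the conventional choice for aggregating speedups, a weighted geometric mean, the bookkeeping looks roughly like this (benchmark names, weights, and speedups are all made up):

    # One conventional way to roll per-benchmark speedups into a single figure.
    # The weights and speedups here are illustrative, not the P6 suite.
    from math import prod

    speedups = {"int_a": 2.3, "int_b": 2.1, "fp_a": 1.7, "trans": 1.4}
    weights  = {"int_a": 0.4, "int_b": 0.3, "fp_a": 0.2, "trans": 0.1}   # sum to 1.0

    geomean = prod(speedups[b] ** weights[b] for b in speedups)
    print(f"weighted geometric-mean speedup: {geomean:.2f}x")
    print("benchmarks below the 2x bar:",
          [b for b, s in speedups.items() if s < 2.0])

The suite can average out to roughly 2x even while individual benchmarks fall well short of it, which is exactly why the troops’ questions about weighting mattered.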

* Avoiding Myopic Do-Loops
It’s dangerous to use benchmarks in designing new machines, but not as dangerous as not using them, since the best you can say about computer systems designed solely according to qualitative precepts instead of data is that they are slow and unsuccessful. But relying too heavily on a benchmark has its own pitfalls. The most serious is myopia: forgetting that it is, after all, just a benchmark, and therefore represents only one kind of programming and only to a limited degree.

[On benchmark X-Y interaction]
The next night, however, when the same architects run another performance regression (which they do because the project is so well managed), they are not as happy. DFA reports that the new changes have helped 70% of all benchmarks, some a little, some a lot, but the new changes have also slowed the other 30%, and in benchmark Y’s case the slowdown is rather alarming. Now what? The architects have several choices. They could
1, Back out the heroic measures and try to fix benchmark X some other way
2, Leave the heroic measures in and try to fix only the worst of the collateral damage
3, Reconsider their benchmark suite and ask how important benchmark X is versus benchmark Y
4, Intensively analyze benchmark Y to find out why a seemingly innocuous change like the fix for X should have had such a surprising impact on Y (a choice that invariably leads back to choice 1 or 2)
Such do-loops can rapidly become demoralizing, as architects spend all their time pleasing the benchmark and lose sight of the project goal: how to achieve true doubled performance over the previous generation. Moreover, given the difficulty and intellectual immersion the do-loop demands, architects must make superhuman efforts to rise above it and ascertain that project performance is really still on track.
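A minimal sketch of that nightly regression check might look like the following (the benchmarks, cycle counts, and 5% threshold are invented; the real project ran DFA over its whole suite):

    # Sketch of a nightly performance-regression check (hypothetical data).
    baseline = {"X": 100, "Y": 100, "Z": 100}    # cycles per benchmark before the change
    tonight  = {"X":  78, "Y": 131, "Z":  96}    # cycles after the heroic fix for X

    THRESHOLD = 0.05    # flag anything more than 5% slower

    for bench in sorted(baseline):
        delta = (tonight[bench] - baseline[bench]) / baseline[bench]
        tag = "REGRESSION" if delta > THRESHOLD else "ok"
        print(f"benchmark {bench}: {delta:+.1%} {tag}")

Benchmark X improves by 22%, but Y regresses by 31%, and the architects are back in the do-loop described above.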

* Floating-point Chickens and Eggs
Seek a balance between what the technology makes possible, the cost of various design options, and what buyers can afford.

* Legacy Code Performance
* Intelligent Projections
Big design companies such as Intel will usually have several CPU design projects going at any time, plus their associated chip-set developments, and many other related projects, such as new bus standard developments, electrical signaling research efforts, compiler development, performance and system monitoring tools, marketing initiatives, market branding development, and corporate strategic direction setting.
All these must be coordinated. It is hard enough to convey information accurately to a customer about a new design’s schedule, performance, power dissipation, and electrical characteristics, without also having to explain why the design appears to compete with another of your company’s products. Customers have the right to assume (and they will) that their vendor has its act together and is actively coordinating its internal development projects so as to achieve a seamless, comprehensive product line that makes sense to the customer and is something the customer can rely on in making their own product road maps.
This is a challenge in any venue.

* All Design Teams Are Not Equal
Design teams are a collection of all the individual talents and abilities of their members, amplified (or diminished) by interpersonal team dynamics and overall management skill.
I believe a company’s strongest design team must be allowed to do what its leaders believe is necessary to achieve the finest possible results. Other teams in a company may find it helpful to follow their lead, or may themselves be strong enough to strike out in new directions.
In this context, a project’s official performance projections are not just the output of a standard company simulator. Performance projections are project leadership judgments that have four critical bases:
1, A deep knowledge of what is being designed
2, The risk that important as-yet-unresolved issues will be settled favorably
3, Composition of the performance benchmark suite
4, Most important, the particular design team’s culture as modulated by the corporate culture

* Overpromise or Overdeliver?
Thus, what appears to be a simple technical determination – establishing a realistic performance target for a new design – is in fact a deep statement of how a design team sees itself and to what heights that team aspires.

I have two words for those arguments: Get real. If I’m going to surprise a customer, I want it to be accompanied with delight, not dismay…With P6, we chose to underpromise and overdeliver – when we committed to project targets, it was because we had high confidence we could meet or exceed them.
The best you can do is purposely adopt the philosophy that matches a team and its leadership, make it clear to management and other projects what that philosophy is, and remind them of it at all opportunities. And then execute to it.

CUSTOMER VISITS
* Loose Lips

* Microsoft
We were talking to them in 1991 to 1992 about a microprocessor that would be in volume production by 1995, so Intel had to have at least a five-year planning horizon for its microprocessor road maps…Initially, I had hoped that we could compare and contrast our vision for that time frame with various customers’ plans and visions, to the benefit of both. But what we found was that almost no companies even tried to see out beyond two years, let alone four or five, and they were not all that interested in the topic.
Microsoft was an exception. Their developments were as long-lived as Intel’s and like Intel they were very sensitive to the long-term implications of establishing standards and precedents.
…NT Vs. 95 Team…

* Insights from input
* Not So Secret Instructions
* Help from the Software World
By far the shortest planning-time horizons we saw on customer visits were at the software vendors. Those folks behaved at all times as if the world would end 18 months from today and there wasn’t much point in pretending time existed beyond that. Some of them were helpful, anyway. John Carmack of id Software comes to mind. He had some deep, useful insights about when and why a software vendor would ever use instruction subsets such as MMX, SSE, and SSE2. He also knew more than anyone else about the difficulties of writing software that would deliver a compelling gaming experience while remaining compatible with dozens of sound and video cards, on top of choosing between the then-competing OpenGL and DirectX standards.

* The Truth about Hardware and Software
I rapidly discovered that software vendors and hardware vendors are not completely aligned. Hardware vendors traditionally earn their best profits on their latest, fastest, and best hardware, and that is what they normally promote most zealously in marketing programs. A few years later, the magic of cleverness, hard work, big investment, and Moore’s Law enables them to produce something even better, and the cycle repeats.
Software vendors, on the other hand, want to sell as many copies of their new title as possible. If they write the code so that it requires leading-edge hardware, they will drastically limit their immediate market.

ESTABLISHING THE DESIGN TEAM
Toward the end of P6’s concept phase, we felt confident that some parts of the overall design were mature enough for design staff to begin working on them. Integer ALUs, floating-point functional units, the instruction decoder, the frontside bus, and the register alias tables were candidates, since they seemed unlikely to need major reworking before the project’s end. We had less experience with the out-of-order core, including the reservation stations (which began as split integer/floating-point reservation stations but were later combined into one), the reorder buffer, and much of the memory subsystem.
Randy Steck…He also had to convince a significant number of his experienced engineers that managing these new engineers was a good thing.
Randy understood deeply that a design team is no less a team than a group of professional basketball players who must be assigned their roles for maximum aggregate effectiveness. Teams are assemblages of individuals, each of whom has unique physical and intellectual capabilities, as well as individual circumstances in their personal lives that will impact their effectiveness throughout the game, project, or whatever other goal they must accomplish together.
Design team members need the same kind of careful positioning – a reality that technical design managers often forget. They see a baccalaureate degree, coupled with having passed the firm’s hiring sanity check, and take that to mean that someone is at least minimally competent at some set of engineering or programming tasks and can learn more on the job.
But really good teams go way beyond this job assignment level, actively judging each design engineer so as to give her or him the best possible leverage on the project. Some engineers thrive in an assignment with high uncertainty and high pressure; they enjoy the challenge and feeling of being the first to wrestle with concepts that must coalesce if the project is to succeed. These folks tend to gravitate toward architectural and performance analysis assignments. Others like the logic and stability of a design assignment. You tell them exactly what a unit has to do, give them constraints for die space, power dissipation, clock rates, connections to other units, and schedule, and they will apply their ingenuity to solving that N-dimensional problem. Still others live for technical arcana; they get an intellectual thrill out of knowing how many open page table entries any single x86 instruction can possibly touch, and the attendant implication for how the microprocessor must handle page misses in general. These folks are the microcoders.
Place people where they can excel at what they’re gifted at doing and you are on your way to a winning team.

* Roles and Responsibilities
After a few weeks of intense deliberation, we had a preliminary allocation of design engineering heads to known tasks. And we had also turned up some new and interesting issues about the relationship between design and architecture. Who did presilicon validation? Who did microcode development? Where was performance analysis to be performed? What was this new “MP” group and who owned it? Who owned the process to develop the RTL model?

We ended up with an overall organization similar to that in Figure 2.3.

General Manager
    Design Manager: Circuits / Tools / Execution cluster / Memory cluster / Front-end cluster / Bus cluster / OOO cluster / Others
    Architecture Manager: Microarchitecture / Performance / Microcode / Validation / Tools / Bus / Multiprocessing
    Chip Set
    Post-silicon Validation
    Marketing

The GM position has a great deal of authority and responsibility; thousands of engineers report to these folks, and they are the primary voice of a project to upper management, as well as the executives’ main communication channel back to the project.
The P6 design manager had groups devoted to design tools and circuits libraries and expertise, but most of the design group was partitioned into clusters, with each cluster further subdivided into functional units.

* Presilicon Validation
I felt strongly that presilicon validation had to be separate from, but embedded within, the design team. Design engineers often have invaluable knowledge of the corner cases of their unit, the places where they made the most complex trade-offs or where the most confusion was present during design.
Our final answer to organizing presilicon validation was to require that unit owners, the designers who actually wrote the RTL code, write sanity-check regression tests for their own code. Before the designer officially released a new, latest-and-greatest version of his unit’s code for use by other designers or validators, he had to have tested it to at least some minimal level.
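The book does not show the mechanism, but the release gate amounts to something like the sketch below (the unit name and test scripts are hypothetical): run the unit’s sanity regressions and refuse to publish a new RTL drop if any of them fail.

    # Sketch of a "sanity check before release" gate. The real flow was built
    # around Intel's RTL tools; the test names and unit name here are invented.
    import subprocess
    import sys

    SANITY_TESTS = ["tests/basic_dispatch.py", "tests/corner_flush.py"]   # hypothetical

    def release_gate(unit_name):
        """Run the unit's sanity regressions; block the RTL release if any fail."""
        for test in SANITY_TESTS:
            result = subprocess.run([sys.executable, test], capture_output=True)
            if result.returncode != 0:
                print(f"{unit_name}: {test} failed, release blocked")
                return False
        print(f"{unit_name}: all sanity tests passed, ok to release")
        return True

    if __name__ == "__main__":
        release_gate("reorder_buffer")    # hypothetical unit name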

* Wizard Problem Solving
* Making Microcode a Special Case
As the project progresses, various units become silicon-area landlocked or power constrained.

* Cubicle Floorplanning
The design engineers’ physical proximity profoundly affected their mutual communication bandwidth.

* Architects, Engineers, and Schedules
Projects that require hundreds of people and last more than four years are extremely expensive. Corporate executives tend to notice this and consequently exert constant pressure on project leaders to find ways to shrink the overall schedule. Better tools, better methods, more engineers, more vehement tongue-lashings – no means to this end seems unreasonable. I cannot fault the executives for their basic motivations. Long projects are expensive, and the more distant production is, the more far-sighted the architects must be and the higher the risk the project must bear.
The concept phase itself is an obvious point of high leverage on the schedule in at least two ways. In drawing up their plans and specifications for the overall design, architects are essentially writing checks that the design engineers must later cash. Any mistakes the architects make will eventually translate into schedule slips. On any given issue, the time spent getting the design right up front will typically more than offset the time required for a diving save to fix a conceptual error later.
But it is also true that a four-year project may spend a third of its overall schedule leveraging the efforts of only a few people, the architects. Improving their efficiency and shortening this time will have an immediate impact on overall project schedule.
The question is how to speed up the project’s conceptual phase while attaining the same goals as before. One proposal that just won’t stay dead is to have a specialized team of researchers and architects, whose sole job is to anticipate the microarchitecture and feature set that some future chip will require, and have that design ready by the time the design team becomes available.
I hate this idea.
Who are these people who can see even further out than the usual chip architects? They are the same chip architects, who, much to their dismay, have been unceremoniously yanked from another design project. How does that simple reassignment help anything? It certainly cuts the close ties to the design team that might otherwise have been possible. Granted, this estrangement could, at least in principle, help some architect see something promising he might have overlooked, but it’s also highly likely that architects, working alone and without designers to keep them honest, will make some fundamental assumptions about die size, power, circuit speeds, or design effort that will force a restart to the architecture work anyway.
While these architects are off working on basic research for a future chip, who is finishing the existing design? The temptation is always strong to pull the architects themselves off a project once it has taped out, but I think that is a mistake. They can productively spend part of their time on current designs, and part in concept development meetings. Many architects would rather spend their time dreaming up new ideas for future chips than facing daily reminders that not all their previous ideas worked out the way they had hoped (which is a really good reason to keep them engaged with the current design).
Another reason advance development does not work is that these teams tend to be like the United Nations. Everybody wants their interests represented or, better yet, unfairly biased toward them if possible, and researchers quickly find that they have many masters. In a short while, realizing that no single solution will satisfy the logical union of all requests, they begin to examine the politics. If they can’t satisfy all their masters, it is prudent to identify the most important ones and at least make them happy. Even this is problematic, however, because the folks with the power today might not be the ones who have it in five years.
Advance development thus produces imponderables that require leadership to define a useful path. And where will that leadership come from? From one of the masters, of course. Under these conditions, the advance development team can easily spend lots of time employing high-powered engineers and yield essentially nothing, doomed from the start by the lack of clear mission and goals.
Not that all research efforts must have this gloomy end. Advanced development can productively investigate new features that require specialized knowledge of the field, such as security or encryption engines. Requiring chip architects to do this investigating is inefficient, at least with the intense schedule of a chip development. Done properly, this advance work establishes which design points are feasible and how they match the requirements for that feature. With this expert tutelage, the architects can quickly and with low risk incorporate new features into their products.

* Coding, Egos, Subterfuge
The issue was that in the absence of data, opinion and intuition would decide, and nobody wanted to defer to anyone else on that basis.
Why did this subterfuge work so well? All the participants had well above average intelligence; they knew the scam that I was perpetrating. It worked anyway because it allowed all egos to back out of the conflict without visible loss of face. It moved the conflict from “who knows best”, which can be resolved only by anointing one coder above all others, to a mechanical means for homogenizing the code to a mutually acceptable level. Paying attention to this kind of human dynamic has many other applications in a long, large design project.

Chapter 3 THE REFINEMENT PHASE

This is perhaps the main difference between the concept and refinement phases. The concept phase has generated a few ideas that look promising. The refinement phase must now evaluate them against the project goals, including their appeal to potential buyers.
In that respect, the refinement phase is a bridge between the concept phase’s blue-sky, anything-goes brainstorms and the realization phase’s nuts-and-bolts schedule pressure. During this phase, architects are still happily juggling several castles in the air, but they are no longer trying to invent new ones. The point of the refinement phase is to settle on the most promising idea that appears to meet the project goals and make it the project’s formal plan-of-record(POR).
It is not easy to choose from among possibilities that the team has spent months identifying. Senior technical leadership has weighed each proposed direction carefully for its ability to create products that satisfy the major project goals.
There is also the matter of ramping up from the very few concept phase engineers to the team of hundreds who will realize the design after refinement.

OF IMMEDIATE CONCERN

This assignment is not static, however. For many reasons, some parts of the overall design reach maturity before others. Project management must then borrow engineers from the leading units and reassign them to the laggards, because the project is not over until the last unit is finished. This can feel to the leading units’ owners like punishment for doing their jobs well, which is likely to damage morale. Who wants to work over the weekend if it means losing 20% of your engineering troops to those who cannot seem to keep up? Project management must keep a constant focus on the “leading edge of the wave” so as to reward and recognize units that are out front. Units that have fallen behind can then measure the lag and plan ways to catch up.
It is also crucial for project managers to correctly identify why units have fallen behind. The reasons are not always strictly technical. For example, late in the P6 development, the validation manager and I were analyzing the latest validation results and were discussing why one unit seemed to be generating the most design errors when a neighboring unit that we had worried about seemed to be doing fine. We had not expected the unit in question to be so complex as to generate such results. After investigating the engineering-change-order (ECO) trail, the validation bug pattern, and the intrinsic microarchitectural complexity, the answer began to crystallize: The leader of the unit that was doing fine was a conservative, down-to-earth engineer who did not hesitate to ask for help from an architect when he was not sure exactly how his unit should behave under all corner conditions. At the head of the unit in trouble was an engineer determined to make his mark on the project and show that there wasn’t any level of intellectual complexity he couldn’t handle; he never once asked for help. Given enough time, he probably could have succeeded on his own, but there wasn’t enough time. We imported a set of experts into that unit, fixed some things, removed some complexity, and got things back on track.
The concept-to-refinement project transition is a subtle one. It’s not like throwing a switch. One day you’re dreaming up new ideas, and the next you’ve picked one and tossed the rest out. It’s more of a gradual zooming in on the ideas that look most promising so that the ensuing, more intense focus can identify the one idea that will carry the project to victory.

SUCCESS FACTORS

As I described in the previous chapter, quantifying opinions with data is vital to this process. Project leadership must acquire or develop tools and teach the team members how to use them. In this way, a common language and a common data-driven culture evolve, in which to test the team’s ideas and absorb the best ones into the project’s POR.
On the other hand, you cannot resolve everything with data. Some issues might require too much time, programming, risk, or more market data than is reasonably available, so to be successful a project must rely on intuition as well. Multiple perspectives are also key to success in this phase, and during this time validation and product engineering can provide useful insights. Finally, complexity on the scale of P6 requires both planning and careful modeling if the design is to avoid constant rework.

* Handling the Nonquantifiable
Changing a project’s goal in midstream is a delicate matter. On the one hand, there’s a strong argument not to. It’s much simpler to identify the right project targets early and permit no distractions while the project is en route. On the other hand, there’s also a strong argument for being flexible. Design projects are long, with enough time for exigencies to arise and for the team to learn things midway through the project that they did not know at the outset. If you resolutely ignore such discoveries, but your competition finds a way not to, their product may turn out to be considerably more compelling than yours.
How do you decide whether somebody’s midstream brainchild is worth the risk and additional effort? It’s a judgment call that must be made with the cooperation of management (who must sign off on the extra schedule pressure and die size), marketing (who must estimate the size of the additional market made available by the new feature), and, most important, the design team. If the project leadership can sell the new idea to the design team and get them excited about it, that new idea is off to a great start. If the design team feels that the new feature is being imposed on them against their will, its odds of working as envisioned are considerably reduced. Team morale will suffer, as well as their confidence in their leadership.
Design teams like data. If they can see why the new feature will make a more profitable product, their assessment of its implementation difficulty will be noticeably more optimistic. Unfortunately, many such decisions must be made when there really is no data to support the vision.
Adding instruction sets is another area that is hard to resolve with data alone because data is hard to get or only partially useful.
My point is that a design project will always have unquantifiable issues. To resolve them, about the best you can do is go with the intuition of your best experts. Don’t let them off the hook too easily, though. Pure intuition should be a last resort, after the experts have tried to find ways to first quantify that intuition. For P6′s technical issues that resisted quantification, we found that even a failed effort to quantify them was invariably instructive in its own right and well worth attempting. It also helped cement the data-driven culture we wanted.

* Managing New Perspectives
Product engineering must also start to get involved in the refinement phase. In P6, we had been tantalized by a packaging engineer’s offhand comment that “if we wanted it, they could actually stick two silicon dies into the same package and run bond wires between them.”
Architects must constantly fight the urge to put both hands in the cookie jar. If they can resist, they’ll see great benefit in keeping production engineering in the loop, starting with the refinement phase, which is the perfect time to bring them up to speed on general project directions. Do not be surprised if what they say has a measurable impact on the project. As with validation, product engineering knows things that the technical development leadership needs to learn.
For example, having decided that P6 would be capable of multiprocessor operation, the obvious question was how many microprocessors could be on the same bus. Simulations would readily tell us how much bus bandwidth we needed to reasonably support four CPUs on a bus, and accurate simulations would even illustrate the effects of bus protocol overhead. But bus bandwidth cannot be increased without limit because fast buses are expensive. The architects knew how much bandwidth a four-CPU system needed, the circuit designers knew how longer wires on the motherboard would affect the bus clock frequency (which is a first-order determinant of the bus bandwidth), but only the product engineering team knew how large the CPU packaging had to be. They also knew how many layers of signals would be available in the motherboard, the impedance and variability of each, the stub lengths to the CPU sockets, and the relative costs of the alternatives being considered.
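A back-of-the-envelope version of the four-CPUs-on-one-bus question looks like this (every number below is illustrative, not the actual P6 figure):

    # Rough bus-bandwidth estimate for four CPUs sharing one front-side bus.
    # All parameters are illustrative assumptions, not P6's real numbers.
    cpus             = 4
    core_clock_hz    = 200e6     # CPU core clock
    mem_refs_per_clk = 0.3       # loads + stores per clock per CPU
    miss_rate        = 0.02      # fraction of references that go out on the bus
    line_bytes       = 32        # cache line moved per miss

    needed = cpus * core_clock_hz * mem_refs_per_clk * miss_rate * line_bytes
    print(f"traffic from four CPUs : ~{needed / 1e6:.0f} MB/s")

    bus_clock_hz = 66e6          # front-side bus clock
    bus_width    = 8             # bytes transferred per bus clock
    print(f"raw bus bandwidth      : ~{bus_clock_hz * bus_width / 1e6:.0f} MB/s "
          "(before protocol overhead)")

Even rough numbers like these show why the bus clock, width, and protocol overhead, all of which depend on packaging and motherboard details, had to come from product engineering rather than from the architects alone.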

* Planning for Complexity
It is important to plan ahead for the complex issues that will become part of the POR. When we chose to make P6 an MP design, I realized that, as a team, we were not deep in expertise on that topic, so I did some checking around. It was time well spent, because I found that all the CPUs reported in the industry that had attempted this feat had had major problems. Either they were late or they had design errata that prevented initial production from being usable in a multiprocessing mode.
I proposed to the division’s GM that the project should bring in new engineers, whose job would be to get P6′s MP story right on the first try. We recruited and hired a team of five, led by Shreekant (Ticky) Thakkar, an ex-Sequent Computer architect, to make sure that our CPU cache coherence schemes, our system boot-up plan, our frontside bus architecture, and our chip set would work together properly in an MP mode.

BEHAVIORAL MODELS
Prior to P6, Intel did not model its microarchitectures before building them. These earlier efforts began with block diagrams on whiteboards and leftover unit designers from previous projects. Some of the design team would start working on circuits, some would write an RTL description, and the rest would devote themselves to tools, microcode, placement, routing, and physical design.

[Multiflow lessons] The oversight was pretty fundamental and likely something that modeling would have revealed.

For the most part, seat-of-the-pants computer design seemed to be extremely efficient, especially given an experienced team with high odds of making compatible implicit assumptions. But there are limits to how much complexity can or should be tackled with such implicit assumptions. For an in-order, two-way superscalar design such as the original P5, seat-of-the-pants was doable, especially since the P5 design team had just completed what amounted to a one-pipeline version of the same chip, the Intel 486. (I’m not advocating this design approach, mind you; I’m just relaying history and explaining why the P5 design team succeeded despite the lack of performance or behavioral models to guide their decisions.)
We could not have done P6 this way. In a section of the previous chapter (The “What Would Happen If” Game), I described the process we followed in recalibrating our intuitions about out-of-order microarchitectures. Without a performance model to keep us honest and expose where our intuitions were wrong, we could have easily created a P6 with all the complexity and none of the performance advantage we had envisioned.

In the same section, I also said that DFA was a peculiar model. Wielded properly, it could tell you the performance impact of various microarchitectural choices, but because it was arriving at its answers via a path other than mimicking the actual design, there were limits to how deeply it could help you peer into a design’s inner workings.
So, somehow, we had to get the project from a DFA basis to a structured RTL (SRTL) model, because SRTL is the model all downstream design tools use and the one that is eventually taped out. In some form, nearly all design activities of a microprocessor development are aimed at getting the SRTL model right. The question was how to translate our general DFA-inspired ideas for the design into the mountains of detail that comprised an SRTL model.
It seemed to me that we needed a way to describe, at a high level, what any given unit actually did, without having to worry too much about how that unit would be implemented. In other words, we wanted a behavioral model that would be a stepping stone to the SRTL model. This behavioral model would be easy to write, since most of the difficult choices surrounding the detailed design would be elided, yet the model would still account for performance-related interactions among units.
Just to get our ideas onto firmer footing, we started writing a C program to model the essentials of the P6 engine we had conceived. We called this program the grassman, because it wasn’t mature enough even for a strawman. After a few weeks, we had coded the basics, as well as realized that coding would take several months to complete and would not easily translate to SRTL.
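To make the flavor of such a model concrete, here is a minimal sketch in C, in the spirit of (but in no way taken from) that grassman program: micro-ops enter a reorder buffer in program order, finish executing in whatever order their latencies allow, and retire strictly in order. The structure names, sizes, and latencies are all illustrative assumptions; the point is only that a behavioral model can capture this kind of unit-level timing behavior without committing to an implementation.

#include <stdio.h>
#include <stdlib.h>

#define ROB_SIZE 8
#define NUM_UOPS 24

/* One in-flight micro-op: how many cycles of "execution" remain and whether
 * it has completed.  No ports, no bypasses, no wiring - just enough state to
 * show out-of-order completion with in-order retirement. */
typedef struct { int id; int cycles_left; int done; int valid; } rob_entry;

int main(void)
{
    rob_entry rob[ROB_SIZE] = {{0}};
    int head = 0, tail = 0, in_flight = 0;
    int next_uop = 0, retired = 0, cycle = 0;

    srand(1);
    while (retired < NUM_UOPS) {
        /* Allocate: new uops enter the reorder buffer in program order. */
        while (in_flight < ROB_SIZE && next_uop < NUM_UOPS) {
            rob[tail] = (rob_entry){ next_uop++, 1 + rand() % 4, 0, 1 };
            tail = (tail + 1) % ROB_SIZE;
            in_flight++;
        }
        /* Execute: any valid entry may finish this cycle (out of order). */
        for (int i = 0; i < ROB_SIZE; i++)
            if (rob[i].valid && !rob[i].done && --rob[i].cycles_left == 0)
                rob[i].done = 1;
        /* Retire: only the oldest entry may leave, and only when done. */
        while (in_flight > 0 && rob[head].done) {
            printf("cycle %3d: retired uop %2d\n", cycle, rob[head].id);
            rob[head].valid = 0;
            head = (head + 1) % ROB_SIZE;
            in_flight--;
            retired++;
        }
        cycle++;
    }
    printf("%d uops retired in %d cycles\n", retired, cycle);
    return 0;
}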
While reading the SRTL manual one day, I noticed that Intel’s HDL referred to a behavioral capability. I checked around and found that the language designers had anticipated our quick prototyping needs and had made provision for behavioral code. “Just what we need!” I thought, and went about collecting enough RTL coders to begin the P6 behavioral model development. Equally exciting was the prospect that we would not have to translate our behavioral model into SRTL when we completed it. We would write everything behaviorally at first, and then gradually substitute the SRTL versions in whatever order they became available. This process also promised to free the project of the “last unit” effect on performance testing I described earlier.
The only problem was that we were the first to ever try using iHDL’s behavioral mode. Undaunted (no one had ever tried to do an OOO x86 engine, either), we launched into creating the behavioral model with a few designers and five architects. I estimated to my management that we would have it running code within six months-Christmas or bust. I wrote three quick C programs, carefully tailored to need only what would be running by then, and posted them as the immediate targets.
On the second day of the behavioral RTL (BRTL) effort, it was clear we were months off the schedule I had conceived only yesterday. The behavioral mode of iHDL had major shortcomings. Some were built into the language, some were the growth pains of any new tool technology, and some were just the inevitable surprises that are always present at the meeting of reality and ideas. But the main reason my schedule was so far off was that I had built it assuming that 100% of the BRTL effort would go into coding. I had not accounted for so many unresolved microarchitectural issues, some of which were quite substantial. In my defense, I wanted to begin the BRTL in part to force our microarchitectural unknowns to the surface. I had just grossly underestimated how many unknowns there were. I wasn’t even close.
I sidled abashedly into my boss’s office and told him it was already clear that yesterday’s schedule estimate had been wildly off, and please could I have a few dozen really bright design engineers to help me write the BRTL. After an obligatory and well-deserved upbraiding, Fred Pollack and design manager Randy Steck agreed to commit the core of the design team to helping create the behavioral model.
There ensued an interesting symbiotic (and unanticipated) connection between the newly deputized BRTL coders and the microarchitectural focus groups in which they were already working. Looking back, the pattern is much clearer than it seemed when we stumbled into this arrangement. Writing a computer program to model something will always reveal areas of uncertainty and issues with no clear answers. Likewise, writing a model forces subtle problems into the open. A common syndrome when working in abstract conceptual space is to believe that you have a solution for issue A, another for issue B, and so on. When you look at each solution in isolation, it seems feasible, but writing a model forces you to consider solutions A, B, and so on together. Only then do you realize that the workable solutions are mutually exclusive at a deep, irreconcilable level. Writing the behavioral model raised questions, and the focus groups set about resolving them. This virtuous loop continued throughout the behavioral model coding, for approximately 18 months.
We ended up meeting our Christmas goal by sprinting from Thanksgiving on. The core BRTL team worked essentially continuously for six weeks on nights, weekends, and holidays. Two things became clear during this time. First, iHDL’s behavioral mode was only marginally higher in abstraction than the usual structural iHDL. Second, because we were committed to our initial concept of how P6’s OOO engine would work, we would not have enough schedule slack to make any substantial changes to that concept.
In other words, we were stuck. So much for risk-reducing, quick performance models. Even with BRTL, we still had to identify every signal in and out of any given unit with proper unit pipelining and staging. That level of detail implies the same detailed design work required for SRTL, and that process takes time. Luckily (both in the sense of capricious luck and in the sense of luck related to hard work), our initial concepts proved to be workable and the BRTL helped us iron out the details.
When we started the BRTL effort, we had hoped that having the BRTL as an intermediate project milestone would make it easier to get the SRTL going later. Having now complained that the abstraction level of iHDL’s behavioral mode was too low, it’s only fair to add that this made conversion to SRTL vastly simpler. And since the same designers who wrote the behavioral description of their unit were responsible for writing the SRTL, we avoided an entire class of potential translation errors. In essence, BRTL took too long, but SRTL was shortened and was of much higher quality in terms of functional validation.
We purposely limited these new BRTL teams to fewer than ten people each, and through Randy’s customary foresight, we “rolled them up” through the lead architects, not their usual management chains. (In effect, Randy inserted the lead architects into the BRTL developers’ technical management reporting chains, thus ensuring the architects would get the necessary mindshare.) This combination of organizational tactics meant that the architects were firmly and visibly ensconced as intellectual leaders of the various refinement efforts, thus encouraging them to lead their new subteams directly (and quickly).
I had hoped to keep the behavioral model up to date as project emphasis shifted to SRTL, but that wasn’t practical. The BRTL gradually morphed into the SRTL, and asking every designer to maintain both models would have cost too much project effort and time for the return on that investment. We now had the butterfly, but sadly, the caterpillar was gone.
For years after the P6 had gone into production, I would occasionally hear comments from BRTL participants that the “behavioral model wasn’t worth it.” I think they are wrong. They are remembering only the price paid, without carefully considering what was achieved or what the alternative would have been without the behavioral model. The P6′s BRTL was expensive to create, and there is still much room to improve the language and how we used it, but it still taught us that computer design has moved far past the point where people can keep complex microarchitectures in their heads. Behavioral modeling is mandatory. The only question is how best to accomplish it.

MANAGING A CHANGING POR
The only constant in life is change, and this rule is omnipresent in an engineering project. At any moment, a project’s state comprises a vast amount of information, opinions, decisions made, decisions pending, the history of previous changes to the POR, and so on…It is imperative that projects have a means for unequivocally establishing the POR, as well as an organized, well-understood method for changing it.

* The Wrong Way to Plan
Early in the Willamette project, our general manager appointed marketing as the POR’s owner, and marketing did what marketers always do: They called a series of meetings to discuss the issue. Have PowerPoint, will travel.

After due deliberation, which somehow always took exactly two hours no matter how simple or complicated the issue, the court would inform the petitioner that his or her issue had been duly considered and dismiss the hapless party.

All in all, this is not a good way to run a design project. I can think of at least four major problems.
First, a meeting-based product planning process does not gracefully accommodate the reality that engineers travel, and travel a lot. It is true that coordinating meetings with multiple diverse attendees is extremely difficult, and the usual practice is to send a representative if you can’t attend. But some of these decisions required an extensive context that a representative could not be expected to bring.
Second, given that virtually all the proposals being considered for POR change will affect the silicon, it is clearly imperative that the design team consider the proposals for feasibility before anyone decides anything. Forcing a design team leader to make a snap judgment that the team is then expected to implement is dangerous. At the very least, the design representative should be able to request a one-week postponement on any issue so that she can consult the rest of the team. Such a time lag also helps throttle the rate at which project changes are made, and that is a good thing. It is very easy to generate project change requests at rates far higher than the design team can actually absorb, let alone implement.
Third, at least in Willamette, only the attendees were informed of the meeting, and even they were not always informed of its outcome (if there was one). The meeting chair was reputed to have kept a red-cover POR document up to date, but to get such a document, you had to formally request it from document control and then surrender the old one. No one was willing to do this every week.
Fourth, meeting-based product planning does not allow time to deeply consider how a decision will impact the design itself. Changes appear easier, more enticing, more feasible, and less threatening than they really are. Many proposed changes were, in isolation, fairly simple and low-impact. But a substantial number interacted in subtle, entertaining ways with other requested changes or previously accepted features. Only a design team representative fully conversant with all these interactions could be expected to spot such problems, and possibly not even such an expert could.
The irony was that the design team had its own POR, and a means to carefully change it: our engineering change order (ECO) process. Modern microprocessor design teams make hundreds of important decisions daily, day after day, week after week, year after year. Each decision is a careful trade-off between die real estate, power consumption, complexity, performance, schedule, design and project margin, risk, and the user-visible feature set. Each decision helps set the context for future choices – good prior decisions will afford the project good options when new issues arise. Conversely, cutting corners early or naively attempting to incorporate conflicting ideas will eventually lead to a day of reckoning that will at best severely damage the schedule. Our initial POR was the mechanism by which we kept score of project decisions made and those pending, and by which project designers were informed of the same.

* Engineering Change Order
But just because ECOs are not free does not necessarily imply that all ECOs are bad. Nor do ECOs constitute tacit proof that the architects have no discipline. Some ECOs fix errors, others simplify the design, and still others simply reflect midcourse corrections that are reactions to new information about the competition or the customers. Some pushback on ECOs, and more as the project nears its end, is probably healthy for the project.

* The Origin of Change
In Willamette, ECOs often came from the project’s general management and marketing meeting series and were aimed at managing POR changes. These were primarily ECOs for the design team, since the architects were expected to convert feature ideas from the great hall into implementable ideas…Their cubicles (marketing’s, on the P6 project) were only a few feet away from the design team. This proximity is important to nurture the feeling of being on the same side. Having marketing physically removed from the Willamette team led, I believe, to an almost automatic us-versus-them psychology on nearly all issues affecting the design.
Another stream of design changes on both Willamette and P6 stemmed from performance analysis. As the project design progresses, the performance analysis engineers gradually turn their attention from the abstract performance models on which the project design is based, to the RTL itself. Early RTL is useless for performance studies because it simply lacks the design detail to make performance analysis interesting and may not even implement enough of the instruction set to run the benchmark code. But the RTL gets more capable every day, and eventually becomes mature enough for the analyst to draw useful inferences…They will also find some benchmarks that run outside the envelope and will then work with a team of architects and design engineers to track down the performance shortfall’s root cause and propose design changes via ECO to fix it.
Functional validation also generates a stream of high-priority ECOs to fix design errors that are causing incompatibility or wrong answers.
Finally, changes can occur just because the design is so complex. Architects are monitoring the project status, watching for design areas that are generating more than their share of errata – a sign that often means too much complexity has been assigned to that area. Architects must translate marketing POR changes into practical ECOs, watch the competition, and think about ideas that might have barely missed the cut for this chip but that could still be implemented.
So given that there will be ECOs, despite best efforts, best intentions, and unenlightened attitudes to the contrary, the issue becomes how to handle them. Difficult questions and trade-offs abound:
1. When does the project go “under ECO control”?
2. What parts of the design effort require ECOs?
3. Whose signatures are on the ECO sign-off list?
4. How can the process be kept timely, especially for controversial ECOs?
5. How do resolved (accepted or rejected) ECOs become visible to the project?
6. Given that other related projects such as tool development or chip-set design need to know about some ECOs or even have signatory input, who decides on the need to know and what process ensures necessary inclusion and notification?

* When, Where, Who

At this stage, the design itself is not under any ECO control. There is a point at which the amount of intellectual property that ECO control guards is worth the overhead of the ECO mechanism; before that, ECOs are worse than useless. Until the behavioral model has been designed and is returning useful results, the project is better off without ECOs. The program source codes, on the other hand, should still be kept under revision control, just to maintain sanity in a programming environment in which many people are contributing code, checking out and building models, and filing bug reports.

* Communicating Change
Miscommunications are among the biggest time-wasters in a development project.

TIMELY RESOLUTION, NO POCKET VETOES
ECO signatories must typically grapple with submitted ECOs directly, not just hope they will go away.

The Folly of the Preemptive Signature

THE ECO CZAR

* ECO control and the Project POR
Design teams that feel no tangible sense of ownership over what they design could, in principle, create a credible product, if they were exceptionally professional and skillful, but they are much more likely to yield an uninspired, insipid, loser product. Design teams that are emotionally engaged in their work, people who have committed themselves to making the world’s best product no matter what it takes, will turn out a world-class product every time. Conversely, by treating a team like a job shop or guns for hire, project management removes the single best weapon any design project has – pride of ownership by the people in the best position to improve the design.

As for the POR issue, this is exactly what was happening: the project planning process was sending continuous project-change signals to a design team that could not sample them fast enough.

Put your project POR under ECO control. It will not make your general manager cede any authority, and it does not cut marketing out of product planning. It just enforces the proper lines of communication and makes visible what should have been obvious anyway: The design team must be part of the planning process in guiding a product development to a successful conclusion.

* Avoiding Mediocrity?
The people who really make the difference between a design that merely works and one that is truly stellar are those willing to defend the project against mediocrity.
Mediocrity can be imposed from above, by management that simply will not allow enough time to do the project right. It can arise from the design team’s poor execution. It can be preordained if the project’s goals are set too high or too low or the feature set is too timid or aggressive. Mediocrity is the default outcome of any project that does not value people with total commitment to what they are designing. Management must do its part, but only the architects and design engineers who are actively involved in the project can accurately judge where mediocrity lies, and they must be given an opportunity to speak out whenever necessary to avoid its strong pull.

THE BRIDGE FROM ARCHITECTURE TO DESIGN

The essential challenge in bridging a project into the refinement phase is to transfer the core ideas from the heads of the architects into the heads of the design team without serious distortion. This transference involves far more than block diagrams and pipeline drawings; the architects must pass on the overarching philosophy of the approach they have conceived. It is not enough to convey to the design team what you think is needed; it is essential to also impart why you think that.

* Focus Groups
A project’s concept phase must be driven by a very small group of people but, typically, that small group cannot finish the project by themselves because there is too little time and too much to do. The exception is a startup company, in which the entire technical staff might be that small group. In those circumstances, the team must automate the design process wherever possible, give up all semblance of a life outside work, and cut whatever corners look nonfatal, so as to hit the competitive market window.
Intel cannot approach product design that way and does not need to. The schedule pressure still feels brutal, but there are a lot of engineers available. The difficulty lies in teaching them what they need to know to help realize the design.

We deputized about 30 design engineers as junior architects and split them into teams of five to seven. To each group, we assigned one concept architect who was commissioned to solve some subset of important questions. We then assigned the groups to respective functional subsets of the overall design, such as the out-of-order core, the frontside bus, and branch prediction. We found this group-to-subset mapping convenient because we were then assured of having at least one expert in each key chip area.
Having a concept architect in each group was critical because the architect could ensure that the focus group observed the fundamental assumptions built into the overall design.
Careful intergroup coordination was also critical. Focus groups are investigating and resolving open technical issues, and their results must be communicated to any focus groups the resolution would affect.
Finally, focus groups must have some global pacing mechanism because the project cannot tape out until the last unit is ready. Pacing is the fine art of ensuring that no group dives too deeply into an issue while still providing the necessary rigor so that decisions made will not have to be repealed and repaired later in the project, when the cost of change is much higher.

PRODUCT QUALITY

Humans make mistakes. Even well-trained, highly motivated engineers, working at the top of their games, make mistakes – big ones, small ones, funny ones, subtle ones, and bonehead ones.

* Avoid/Find/Survive
I believe that nature has a set of immutable laws wired directly into its fabric, and engineers must observe those laws or suffer the consequences. One of these laws is that no matter how assiduously we work, no matter how motivated and inspired our validation effort, design errors will appear in the final product. This suggests three things:
1. Design errors will appear in a design by default, and we must strive mightily to prevent them.
2. Some errata will elude us and we must find them in the design before they “get out the door”.
3. Some errata will hide so well that they will make it into production and be found by a paying customer.

Similarly, project managers have to know that mistakes will end up in their design and take appropriate measures before, during, and after they have manifested. A defense-in-depth strategy is the best approach to design flaws I have found: avoid, find, survive.

* Design to Avoid Bugs.
Design engineers must constantly juggle many conflicting demands: schedule, performance, power dissipation, features, testing, documentation, training, and hiring. They may intuitively know that if they spent the next two weeks simply thinking long and hard about their design, they would produce a far better design with fewer errors than if they had spent the two weeks coding. But their reality is a project manager who must ensure that the project stays on track and on schedule and whose favorite way to do that is to measure everything. It is hard to quantify the project benefits of meditation, but a manager can tally how many lines of RTL the designer could have generated had she not spent two weeks simply thinking. This translates into subtle pressure to favor immediate schedule deadlines now, and if more design errors creep in as a result, then so be it. Someone can attend to those later. For now, the project is measurably on schedule and tomorrow may never come.
Management must undo this mindset by emphasizing how much less expensive it is to get a design right in the first place than to have to create a test, debug it when it fails, and then fix the design without breaking something else in the process. Managers can also help by being sensitive to the huge array of issues designers face. Balancing schedule pressure is tricky. Too little, and the project might slip irreparably. Too much, and the pending trouble moves off the schedule, where it is visible, and into some other area, where it becomes design errata, eventually becomes visible, and wreaks havoc on the project schedule anyway.

Astute project leaders such as Randy Steck were particularly adept at finding this balance point, especially when accumulated project slippage was pointing toward a formal schedule slip. Few tasks are more unpleasant than having to officially request your boss’s approval on a schedule slip.

Architects have a first-order impact on design errata because they are the ones who imbued the design with its intrinsic complexity. They have ameliorated (or exacerbated) this complexity by clearly (or not clearly) communicating their design to the engineers who are reducing it to practice. Architects write the checks that the design engineers have to cash. If the amount is too high, the whole project goes bankrupt.

One day during the P6 project, the three of us who had already designed substantial machines – Dave Papworth, Glenn Hinton, and I – were comparing notes on where design errata tended to appear. We realized that even though the three of us had worked on far different machines, the errata had followed similar patterns. Datapath design tended to just work, probably because datapath patterns repeat everywhere and so lend themselves to a mechanical treatment at a higher abstraction level.

Control paths are where the system complexity lives. Bugs spawned from control path design errors reside in the microcode flows, the finite-state machines, and all the special exceptions that inevitably spring up in a machine design like thistles in a flower garden. Insidiously, the most complex bugs, which therefore have a higher likelihood of remaining undetected until they inflict real damage, live mostly “between” finite-state machines. Thus, anyone studying an isolated finite-state machine will likely see a clean, self-consistent design. Only analysts well versed in studying several finite-state machines operating simultaneously have any chance of noticing a joint malfunction. And even then it would be an intellectual feat of the first order.
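A toy illustration of that point (assuming nothing about the actual P6 control logic): two bus-arbitration state machines, each following a rule that looks perfectly safe when the machine is reviewed in isolation, never use the bus while the peer is requesting it, yet the pair deadlocks the moment both machines request in the same cycle. The bug lives between the two FSMs, not inside either one.

#include <stdio.h>

typedef enum { IDLE, REQUEST, USE } state_t;

/* One unit's arbitration FSM.  Its rule: request the bus, then wait until
 * the peer is not requesting before using it.  Reviewed alone, the machine
 * is clean and self-consistent - it can never collide with its peer. */
static state_t step(state_t s, int want, int peer_req)
{
    switch (s) {
    case IDLE:    return want ? REQUEST : IDLE;
    case REQUEST: return peer_req ? REQUEST : USE;   /* politely defer */
    case USE:     return IDLE;
    }
    return IDLE;
}

int main(void)
{
    state_t a = IDLE, b = IDLE;

    for (int cycle = 0; cycle < 8; cycle++) {
        int want  = (cycle >= 1);        /* both units want the bus from cycle 1 on */
        int req_a = (a == REQUEST);      /* request lines as sampled this cycle */
        int req_b = (b == REQUEST);
        a = step(a, want, req_b);
        b = step(b, want, req_a);
        printf("cycle %d: A=%d B=%d\n", cycle, (int)a, (int)b);
    }
    /* From cycle 2 on, both machines sit in REQUEST forever: a joint
     * deadlock that no single-FSM review would ever have flagged. */
    return 0;
}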

In light of our discovery, we surmised that by careful architecting, we might be able to rule out a whole class of potential design errata. We began looking for ways to simplify the P6 core’s exception handling. We noticed that we could implement many of these exceptions on top of the branch misprediction mechanism, which was complicated itself, but so intrinsic to the machine’s basic operation that it got a huge amount of exercise and testing.
Our strategy worked, and for the first time in our collective experience, we ended up with a machine that had essentially no important errata associated with traps, faults, and breakpoints. We also found that ruling out a class of design errata this way was by far the most cost-effective strategy for realizing a high-quality machine.
This strategy also makes sense in light of the design’s complexity. Complexity breeds design errata like stagnant pond water breeds mosquitoes. Some bugs on the original P6 chip, for example, required many clock cycles and the complex interactions of six major functional units to manifest. In such cases, it is not reasonable to blame any of the six functional unit designers, and it will probably be unavailing to ask why validation did not catch the error. Such bugs lie squarely with the architects, who need to think through every corner case and make sure their basic design precludes such problems. We should have applied our good idea everywhere in the design, not just to those items that were already on our worry list.

* When Bugs Get in Anyway, Find them Before Production
No amount of management attention to presilicon testing and no degree of designer diligence and dedication will avoid all mistakes. Design and validation errors are inevitable (even if you still harbor some hope that there might be a way to avoid them, however theoretical, it is still better to plan and execute your project as if there weren’t). It falls to the validation crew to find these mistakes before production and to work with the designers to fix them without breaking anything else.
Validation teams have the same kinds of crushing pressure as the design teams, and then some. Validators must perform their task without noticeably changing the tapeout schedule, but until the RTL model is reasonably mature, they are limited to trivial tests. The design team typically has several months’ head start, but from that time to tapeout, validation is supposed to identify every design flaw and verify that their remedies are correct.
By the very nature of validation, that expectation is doomed to nonfulfillment. The validation plan may be thorough, but it is of necessity incomplete. How can validation test everything with finite time and finite simulation cycles? Given the combinatorial-state explosion that today’s complex machines imply, they cannot even get close.
The testing team also learns as it goes along. If a chip unit is behaving in stellar fashion and yielding almost no bugs, while another unit is behaving very badly across a range of tests, the validation team will shift resources to the flaky unit.

IDENTIFYING BUGS
I do not know any comprehensive, useful theory that consistently helps identify any type of bug, but I can offer some rules of thumb.
First, be careful how you measure bugs if you want to know about them. Well-run projects are tracked by data the project managers collect. We managers have a good feel for when the project will tape out, its expected performance, and for how much power the chip will dissipate, because we collect a lot of data on those items and track their trends carefully each week.
(Quietly fix the bug and don’t tell anyone else about it.)

Second, look closely at the microcode.

If you make a change to an engineering design late in the project, that change is at much higher risk of bugginess. The team is tired, the pressure is high, and there is not enough time left to redo the past two or three years of testing and to incorporate the effects of this new change on the design.

TRACKING BUGS
At first glance, it seems pretty clear what validation needs to do: Create a comprehensive list of tests such that a model that passes them all is considered to be of production quality. Then test the RTL models, identify the tests that do not pass, find out why, get the designers to fix the errors, and repeat until finished.
In the real world, however, the validator’s life is often messy. The tools have bugs, the tests can be faulty, and because the RTL model under test is not a stationary target, tests that passed last week might fail this week. One design bug might prevent several tests from passing, and one of those tests might have found a quite different bug. Or a validation test might have assumed correct functioning was of one form, but the design engineer might have assumed something different. Is that a design bug? You cannot tell without more information.
Also, a validator chasing one specific test commonly stumbles across something else quite by accident.

We dubbed any incident found during testing that could have a bug as its root a “sighting”, and we learned to be very dogmatic about these incidents from our experience in other design projects. The rule was that anyone who found such an incident had to report it as a sighting to a general database, along with the specific RTL model, the test that generated the sighting, the general error syndrome, and any information a validation engineer might need to try to reproduce the sighting. See Figures 3.1 and 3.2.
A validation engineer, typically but not always the one who filed the sighting, would then attempt to reproduce it and find its root cause. Were other bugs recently found that might explain this sighting? Have any tool or test deficiencies been found that might be relevant? After checking the easiest answers first, the validator would then begin zeroing in on exactly what wasn’t going according to plan and what might be causing the problem. In many cases, he could collect evidence in a few hours that a real bug in the design was the culprit, and would then officially change the issue’s status from “sighting” to “bug”.
Once an issue attained the status of bug, it was assigned to the most appropriate design engineer – whenever possible, the person who had put it into the design in the first place. This was not a punishment. Rather, the idea was to let the person who owned that part of the design be the person who fixed it and, thus, minimize the odds of the fix-one-break-two problem occurring.
A few hours or days later, depending on the bug’s severity and the designer’s workload, the design engineer would come up with a fix and was expected to build and check a model that embodied it. This sanity checking had a three-pronged goal: (1) determine that the bug was really gone, (2) establish that nothing had broken as a result of the fix, and (3) ensure that the previous hole in the designer’s unit tests was now filled so that this bug could never come back. Once the design engineer’s model passed this sanity check, he or she could mark the official issue status as “fix pending”.
We deliberately withheld the authority to mark the issue as resolved, saving that role for the validation engineer who filed the issue in the first place. That validation engineer had to then demonstrate that the previously failing test now passed, and that the new fixed model also passed some minimal regression tests so as to catch any obvious system-wide cases where the new fix broke something that used to work.
We found this process extremely valuable: it guaranteed that at least two project engineers had deep knowledge of both the bug and its cure. It also removed the temptation for the design engineer to short-circuit the process by simply unilaterally deciding it really was not a bug, or that he “thought he had fixed it”, or any of the thousand other delusions a creative person can rationalize his way into.
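A small sketch of that lifecycle as a state machine, using the status names from the text (sighting, bug, fix pending, resolved); the field names, helper functions, and the check that only the filing validator may close an issue are illustrative assumptions, not a description of Intel's actual tooling.

#include <stdio.h>
#include <string.h>

typedef enum { SIGHTING, BUG, FIX_PENDING, RESOLVED } status_t;

typedef struct {
    int         id;
    status_t    status;
    const char *filed_by;   /* validation engineer who filed the sighting */
    const char *owner;      /* design engineer assigned once it is a bug  */
} issue_t;

/* Validation reproduces the sighting and finds a design root cause. */
static void promote_to_bug(issue_t *it, const char *designer)
{
    if (it->status == SIGHTING) { it->status = BUG; it->owner = designer; }
}

/* The owning designer has a fix that passed his or her own sanity model. */
static void mark_fix_pending(issue_t *it, const char *who)
{
    if (it->status == BUG && strcmp(who, it->owner) == 0)
        it->status = FIX_PENDING;
}

/* Only the validator who filed the issue may close it, and only after the
 * originally failing test and a minimal regression suite both pass. */
static void resolve(issue_t *it, const char *who, int test_passes, int regression_passes)
{
    if (it->status == FIX_PENDING && strcmp(who, it->filed_by) == 0 &&
        test_passes && regression_passes)
        it->status = RESOLVED;
}

int main(void)
{
    issue_t issue = { 671, SIGHTING, "validator_a", NULL };

    promote_to_bug(&issue, "designer_b");
    mark_fix_pending(&issue, "designer_b");
    resolve(&issue, "validator_a", 1, 1);

    printf("issue %d final status: %d (%d == RESOLVED)\n",
           issue.id, (int)issue.status, (int)RESOLVED);
    return 0;
}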

Figure 3.1/3.2 Sighting-to-bug flowchart, in outline:
1. Sighting: an online report is entered into the sighting/bug database and a sighting number is assigned.
2. Reproducible? If not, the sighting is moved to an anomalies list and removed from the sighting/bug database. End.
3. Is the necessary RTL functionality in the model? If not, the sighting is moved to a special list and suspended until the RTL is ready. End.
4. Is this sighting a bug? If not, change the spec, fix the test, or chalk it up to validation pilot error. End.
5. Is this a known bug? If so, annotate the existing bug entry with the new test. End.
6. Otherwise it is a new bug (say, bug 671): enter it into the bug database with the test that reproduces it.
7. <time goes by…> A validator pulls the next bug to be disposed of from the queue and turns up “671”.
8. Reproduce the bug per the test. If the test no longer fails, an RTL change made since the bug was filed may have fixed it; verify this, and watch for the case where the RTL change makes the test pass but the original bug remains. End.
9. If the test still fails, there is still a bug here: (1) track down the root cause and (2) submit it to the appropriate unit owner for a fix.
10. <time goes by…> The unit owner checks the fix into the RTL database, marks the bug as tentatively fixed, and the bug returns to the queue at step 7 for final disposition.

MANAGING VALIDATION
Successfully managing the validation effort takes sublime judgment, which I have been able to distill into four don’ts (since the first step in doing something right is knowing what not to do).
First, don’t use the validation plan as a performance measure. A validation manager finds herself embedded in a design project that values objectivity and measurement. She is asked to provide indicators and metrics by which her superiors can gauge her team’s overall progress. Because the validation plan most closely resembles a list of tests to be run, an obvious metric is to measure how much of the plan the team has accomplished at any given time. The flaw in this thinking is that the validation crew has conceived the plan while carefully considering all the technical corners of the design it must cover. You cannot reasonably expect them to anticipate how the sequence of those tests will jibe with what the design is capable of at any given week, nor with what the design team may actually need that week.
More important, validation teams learn as they go. And the main thing they learn is where their own plan had holes and weaknesses. If a validation team is being managed strictly to the fraction of the plan they have completed, they may become fatally discouraged about adding any more tasks to it.
Second, don’t use the number of bugs found as a performance measure. Late in the project, after most of the RTL has been written and built into a chip model, validation applies their accumulated mass of tests, along with randomly generated code and other methods, to try to find any design errata. The rate at which they find bugs depends on a mix of the design’s innate bugginess, how close to the surface the bugs are, how many computing resources are available to the validation team, and how quickly the team can resolve the list of currently open sightings. Measuring validation’s output strictly in terms of bugs found per week can quickly distort the entire validation process. Coverage matters, too. If validation finds that all their testing has turned up very few bugs in functional unit X, but revealed a veritable bug killing field in functional unit Y, they must be allowed to increase their pressure on Y without completely giving up on X.
Third, don’t use the number of tests passed as a measure of project health. Design projects running at full speed can be intimidating to upper management…They are tempted to ask the crucial question, “Is this project lurching toward a tapeout convergence, or is it exploding right before my eyes?” How can they tell?
One indicator they look at is the rate at which new bugs are manifesting in the design. Managers want the design errata rate to decrease steadily toward the tapeout window and then essentially hit zero per week for several weeks before taping out. That is what they would like, but that is not what happens. What happens is that by the time the RTL has matured enough to run the really tough tests and the validation crew has disposed of the easier bugs, not much time is left before formal tapeout. In fact, as the validation team gets more expert at wielding their tests and debugging the failures, the overall errata rate may well go up in the project’s last few weeks.
To avoid upper management’s wrath, the validation team might choose to accentuate the positive. It is easy to rationalize. After all, the validation plan called for every combination of opcode and addressing mode to be checked, so it is not necessarily duplicitous to report that all those seem to work, instead of concentrating your overall validation firepower on the area yielding the most bugs.
Resist the urge to even think in that direction. Instead, let the project’s indicators tell you the truth and guide whatever actions are appropriate. If some particular part of the design is generating more than its share of bugs, increase the design and validation resources assigned to it. Pay attention to the type as well as the number of bugs; the bug pattern may reveal an important lesson along either of these dimensions. And once having formed as accurate a picture as possible of the project’s health, relay that picture to your management, the bad with the good. (Then take your licking with aplomb. That is why they pay you the big bucks.)
Fourth, don’t forget the embarrassment factor. Imagine a design error such that the RTL model is yielding this result: 2+2=5. There is a class of bugs that would prove beyond reasonable doubt that the design team simply hadn’t ever tested whatever function they are found in. Worse, the implication to the buyer is “If they didn’t test that, what else didn’t they test?” For this reason, every opcode and every addressing mode really should be tested at some point in chip development, and anything else with a high embarrassment factor, whether or not this “coverage” testing is detecting design errata at a high rate.

* Plan to Survive Bugs that Make It to Production
Those who are not engineers know two certainties: death and taxes. Engineers know a third: there are no perfect designs. Every single product ever made has had shortcomings, weaknesses, and outright design errata.
The goal of the first two parts of my avoid/find/survive product-quality algorithm is to prevent design errata in the final product. The focus of the third part is to minimize the severity and impact of the errata that make it into the product. Some bugs are hardy or well disguised enough to complete that journey.
NASA has a well-tested methodology for dealing with unforeseen eventualities. For mission-critical facilities on a spacecraft, for example, NASA provides backups and sometimes backups for backups. But merely providing the additional hardware is not enough. You must also try to anticipate all possible failure modes to make sure that the backup is usable, no matter what has happened.

With microprocessors, especially those with caches and branch-prediction tables, an awful lot of activity can occur on the chip with no outwardly visible sign. By the time it is externally visible that things have gone awry, many millions or even billions of clock cycles may have transpired. You could be chasing a software bug in the test, an operating system bug, a transient electrical issue on the chip or on the bus connected to it, a manufacturing-defect stuck-at fault inside the microprocessor, or a design error. At this instant, as you stand there helpless and befuddled in the debug lab, the scales fall from your eyes, and you see clearly and ruefully that during design you should have provided an entire array of debug and monitoring facilities, with enough flexibility to cover all the internal facilities you wish you could observe right now.
Having learned this lesson on previous machines, we architects imbued the P6 and Pentium 4 microprocessors with a panoply of debug hooks and system monitoring features. If a human debugger has access to the code being executed and wants to see the exact path the processor is taking through that code, all she needs from the microprocessor is an indication of every branch taken. When that sequence of branches diverges from what was expected, she has a pretty good idea of the bug’s general vicinity and can begin to zero in on it.
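As a hedged sketch of how such a branch-trace hook gets used in the debug lab (the trace format and the addresses are invented for illustration), the debugger's job amounts to diffing the observed sequence of taken-branch targets against the sequence a reference model expected and flagging the first divergence:

#include <stdio.h>

/* Expected taken-branch targets from a reference model, and the targets
 * actually reported by a (hypothetical) hardware branch-trace hook.  The
 * addresses are made up; the technique is simply "diff the two traces". */
static const unsigned expected[] = { 0x1000, 0x1040, 0x10c0, 0x1200, 0x1280 };
static const unsigned observed[] = { 0x1000, 0x1040, 0x10c0, 0x1180, 0x1280 };

int main(void)
{
    int n = sizeof(expected) / sizeof(expected[0]);

    for (int i = 0; i < n; i++) {
        if (observed[i] != expected[i]) {
            /* First divergence: the bug lies somewhere between the last
             * agreed-upon branch and this one. */
            printf("trace diverges at branch %d: expected 0x%x, saw 0x%x\n",
                   i, expected[i], observed[i]);
            return 1;
        }
    }
    printf("traces agree for all %d branches\n", n);
    return 0;
}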
There are two subtleties in providing debug hooks. The first is not to use their existence as a crutch to do a poorer job at presilicon validation (a fear upper management often expresses). The second is to take validation of these debug hooks as seriously as you do the chip’s basic functionality. After all, “If you didn’t test it, it doesn’t work” applies to all aspects of the chip, not just the mainstream ISA stuff.

* A Six-Step Plan for High Product Quality
There are no sure-fire recipes for getting a product right. But there are effective tactics and attitudes. Here are six good ones.
1. Incorporate only the minimum necessary complexity in the project in the first place. Do not oversubscribe the team.
2. Make sure the design team and corporate management agree on the right level of product quality. Empty aphorisms like “achieve the highest possible quality”, “no product flaws”, or “20% better than competitor X” are worse than useless because they lead to insidious cases in which the design team is invisibly and actively working at cross-purposes to project management, executive management, or itself.
3. Don’t let the design team think of themselves as the creative artists and the validation team as the janitorial staff that cleans up after them. There is one team and one product, and we all succeed or fail together.
4. Foster a design culture that encourages an emotional attachment by the designers to the product they are designing (not just their part of that product). But engineers must also be able to emotionally distance themselves from their work when it is in the project’s best interests.
5. Make sure the validation effort is comprehensive, well planned, and adequately staffed and funded, with the goal of “continuously measuring the distance from the target,” thus ensuring product quality is converging toward success.
6. Design and religiously adhere to a bug-tracking method that will not let sightings or confirmed bugs fall through the cracks.
The idea that the design culture should encourage emotional attachments, yet the engineers must be able to sometimes turn that emotion off, may seem inconsistent or even mystical, but it is actually quite simple. The emotional attachment just means the engineer cares about what she is designing. She wants to get it right, and she wants the product to succeed. To do a proper design review, however, the designer must check her ego at the door and realize that it is in the best interests of the overall goal – a successful product – that her design undergo some rigorous scrutiny. The commitment to the overarching goal is what will guide the engineer in knowing when to override the emotional attachment.

* The Design Review
For the P6 and Willamette projects, our panel consisted of formal reviewers who were expected to do preparatory reading and study before the event, and 10 to 20 other interested designers and observers. The observers were not necessarily expected to actively contribute, although they sometimes did. Their role was to learn about their neighbors’ unit designs and to reinforce in everyone’s mind that this project took its design reviews seriously.
The reviewers must have enough information to be able to follow the review and contribute to it. The unit designer has to furnish this information, particularly that which establishes the design’s context: its place in the overall system; the function it is expected to fulfill; its constraints in terms of power, performance, die size, and schedule; and early alternatives to the approach that was eventually selected. Depending on the design itself, the designer might also need to provide block diagrams, pipeline and timing diagrams, and protocols, as well as describe finite-state-machine controllers and possibly furnish the actual source code or schematic diagrams.
When design reviews are done properly, the outcome is a list of ideas, objections, concerns, and issues identified by the collective intellect of the review panel. The reviewee then e-mails that list to all attendees and project management, along with plans for addressing the issues and tentative schedules for resolving all open items. The overall project managers incorporate these issues and schedules into the overall project schedule and begin tracking the new open items until they are resolved.
At the review’s outset, the team should designate a scribe to capture the ideas, suggestions, proposals, and any questions the review team asks. These will range from observations to pointed queries to suggestions for improvements or further study.

-When to Do a Review. Some companies or design groups have a long list of reasons to avoid design reviews. “There’s not enough time in the schedule” tops the list and indicates a far deeper problem with project management. If a project does not have enough time to check that it is getting the product right, it will not have enough time to fix the product later (when it discovers that it got the product wrong). Design reviews, properly done, facilitate vital communication among designers and between the design team and project leadership.
A good rule of thumb is to review every important subsystem in the design at least once, when the design has matured enough to have worked out the essential elements but is still flexible enough to allow changes. And at least one joint design review should be held between the product team and any external interfaces required for that product’s success. For microprocessors, this means at least one joint design review between the CPU and the chip-set designers.

ANOTHER ONE RIDES THE BUS
Looking back with 12 years of perfect hindsight…The root cause of the conflict was partly our naivete as a new x86 design group and partly my own lack of understanding about how Intel was organized.

Chapter 4 THE REALIZATION PHASE

It’s always questionable to try to do something too cleverly.
-Albert Einstein

The concept phase produced about three feasible avenues down which the design project could travel successfully. The refinement phase investigated those avenues and identified the most promising. Now, the realization phase had to translate this winning idea into a product.
This phase can seem sudden and more than a little scary. Yesterday, the project was an abstraction – a collection of ideas, concepts, and new terms that a few people kicked around. Today, dozens of bright, experienced design engineers are taking these ideas at face value, studying them intently and internalizing them. Hundreds of people are being organized into small groups to work on respective subunits that seemed a lot less defined yesterday. People are creating T-shirts with drawings and caricatures based on technical terms you conjured out of thin air only a few weeks ago. Dozens of people are using terms you invented, and you wonder if they are using them in the way you intended.
At this time in the P6 project, I remember being overwhelmed by the many people who were taking our architecture ideas seriously. It is one thing to put a brave face on your uncertainty when convincing upper management that they should invest hundreds of millions of dollars in your concept. It is quite another when you see your peers staking their careers on the idea that you and the rest of the concept team have come up with something that has enough integrity to be implementable, is aggressive enough to beat the competition, and is flexible enough to survive the surprises ahead. Either they have bought into your vision, or they have put their trust in you; both prospects induce humility.
Those who are key players in a start-up know what it means to be fully and irretrievably committed to a technical vision and how grateful you are when others make similar sacrifices. Shortly after I joined Multiflow Computer, a VLIW start-up, I was copying some documents related to building a house when the VP of engineering walked in. When he realized that I was that deeply committed to making the company a success, the look on his face communicated volumes. What he knew then was exactly how precarious the corporate finances were and how badly things could go for us if the worst happened. All I knew was that no start-up had a hope of succeeding unless we first burned the lifeboats. My attitude was full speed ahead and do not look back. Samuel Johnson said, “Nothing will ever be attempted, if all possible objections must be first overcome.”
The project’s refinement phase began with the behavioral model coding and the realization phase will convert that to the final structural RTL from which the circuit and layout efforts will work. The finished P6 RTL model had approximately 750,000 lines of source code, clearly a nontrivial software development.

OF IMMEDIATE CONCERN

To get a feeling for what the realization phase is like, consider what would be involved in building a new high-rise. The concept phase of that project has only a few participants: the buyer, the financier, and the architect. The refinement phase has more, as choices are considered and made, various possibilities weighed, and ground is broken.

In a chip development, the concept phase comes up with some overall project alternatives, and the refinement phase narrows those down to two and then selects one. It in essence sets up the project scaffolding needed to support the concurrency in the realization phase. Because of this relationship, it is crucial that the realization phase not be undertaken until the preparation work is complete. If you wait a week longer than necessary to crank up your project into the full-out execution mode of the realization phase, you will at worst have delayed your project by a week. You can often make up that week by inspired management later. But if the project is allowed to begin the realization phase before the project direction has been firmly and confidently set and before all team members have internalized it, you will pay a much higher price than a one week slip. Subtle errors will creep into interfaces, designers will make choices that may not be obviously wrong but still are seriously suboptimal, and everyone on the team will get the wrong subliminal impression that, overall, the project is further along than it really is. This impression can itself cause errors in judgment that validation or management has to notice later on.

SUCCESS FACTORS
When the realization phase begins, the team has settled on one direction and has made some progress toward the RTL model that describes the actual product. The classical output of the realization phase is a prototype, a single instance of the product being developed. For silicon chips, the production engineers actually make several wafers’ worth of chips on that first run, but the principle is the same – build a few, test them, fix what’s broken, and when the time is right, move the project to the production phase.

Whereas errors in the earlier phases could impact schedule, an error in the realization phase may directly affect hundreds of design engineers who are working concurrently. Assiduous project management and communication are key to the realization phase.

* Balanced Decision Making
As a design proceeds, design engineers make dozens of decisions every day about their unit’s implementation. They strive to balance their decisions in such a way that they meet or exceed every major project goal (performance, power dissipation, feature set, die size) within the schedule allotted. The very essence of engineering is the art of compromise, of trading off one thing for another.

At the beginning of the P6 project, we thought the relevant architect should bless all performance-related engineering decisions, but this requirement, though ideal in theory, quickly became impractical. There were simply too many such decisions and not enough architects to weigh in on them all and still stay on schedule. We eventually established a rule, proposed by Sunil Shenoy: If you (the design engineer) believe that the performance impact of the choice you are considering is less than 1% on the designated benchmark regression suite, you are free to make the choice on your own. Higher than a 1% performance hit and you must include the architects in the decision.
This 1% rule generally worked well and certainly salvaged the overall project schedule. On the downside, the model ended up absorbing quite a few <1% performance hits. Dave Papworth likened this to mice nibbling at a block of cheese: Mice do not eat much per bite, but over time the cheese keeps getting smaller.
The performance hits were usually independent of one another, and their combination at the system performance level was typically benign. But occasionally, the combination would be malignant, noticeably dragging down overall performance and drawing the ire of architects and performance analysts. Over time, the cumulative effect of these minor performance losses would cause aggregate system performance to sag imperceptibly lower almost daily until performance engineers would become alarmed enough to notify project management that performance projections were outside the targeted range. We would then essentially stop the project for a week, intensively study the sources of the performance loss, repair those areas, and reanalyze. We repeated the process until performance was back where we expected it to be.
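The arithmetic behind the nibbling is easy to underestimate. A back-of-the-envelope sketch, assuming each sub-1% decision costs an independent, multiplicative slice of performance, shows how quickly the cheese shrinks:

#include <stdio.h>

int main(void)
{
    /* Assume 60 independent design decisions, each costing 0.5% of
     * performance - every one comfortably under the 1% threshold. */
    double perf = 1.0;
    const double hit = 0.005;
    const int decisions = 60;

    for (int i = 1; i <= decisions; i++) {
        perf *= (1.0 - hit);
        if (i % 20 == 0)
            printf("after %2d decisions: %.1f%% of original performance\n",
                   i, 100.0 * perf);
    }
    /* Roughly 74% remains after 60 such "negligible" hits, which is exactly
     * the kind of sag that triggered the stop-and-study weeks described above. */
    return 0;
}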
Why then was the 1% rule important or even desirable? The simple answer is time. There weren’t enough architect-hours to oversee every design decision that might affect performance. Picture a team of 200 designers, each making 10 decisions a day that might affect performance. We used the 1% rule, not because it was perfect, but because the alternative (utter chaos) was unworkable.

* Documentation and Communication
We architects were creatively relentless in our attempts to transfer information. The first document written as a general introduction to the P6 was “The Kinder Gentler Introduction to the P6″, an internal white paper intended to convey the general philosophy of the P6 microarchitecture. Next was “The P6 Microarchitecture Specification”, or MAS, the first of what became a large set of documents detailing the operation of every unit on the chip, including new features, basic operation and pipelining, and x86 corner cases that our proposed design might affect.

* Capturing Unit Decisions
Once the realization phase is underway, each unit group must begin to record its plans and decisions. For the P6 project, these records were the microarchitecture specifications. Each MAS described how the unit would execute the behaviors outlined in the unit’s behavioral specification. Each unit group maintained its own MAS, which grew with the design, and distributed it to all other units. All MASs were written to a common template, so designers from one unit could easily review the MAS from another unit.
Each MAS included:
- Pipeline and block diagram
- Textual description of the theory of operation
- Unit inputs and outputs and protocols governing data transfers
- Corner cases of the design that were especially tricky
- New circuits required for implementation
- Notes on testing and validation, which we required so that design engineers could think about such things during the design, when it is easiest to address them
MASs were begun early in the realization phase, well before all the major design decisions had been collectively rendered. This timing was purposeful – the act of writing this documentation helped identify conceptual holes in the project. We began each MAS as early as we could, but only when we knew enough about the design that it was not likely we would have to tear the MAS document up and start over.

* Integrating Architects and Design Engineers.
Perhaps P6's strongest communication mechanism was not a videotape or document at all, but the architects' participation in the RTL model's debugging. Some engineering teams have the philosophy that the architect's job is finished when the concept phase documentation is complete.

I am sorry, but pipelining people, especially architects, is a monumentally bad idea. The architects conceived the machine's organization and feature sets and invented or borrowed the names for various units and functions. They know how the machine is supposed to work at the deepest possible level of understanding, and in a way that other engineers cannot duplicate later, no matter how smart or experienced they are. The software industry is famous for inadvertently introducing one bug while fixing another one. Exactly the same malady will strike a chip development. It is all too easy to forget some subtle ramification of a basic microarchitecture canon and design yourself into a corner that will remain shrouded in mystery until chip validation, when the only feasible cures are painful to implement.
The architects are the major corporate repository of a critical knowledge category. Every design decision reflected in the chip implementation is the result of considering and rejecting several alternative paths, some of which might have been nearly as good as the one chosen and some of which might have been obviously unworkable. The point is that the alternatives probably looked very appealing until the architect realized some subtle, insidious reason that the particular choice would have been disastrous later on.
If no one retains that crucial information, future proliferation efforts will suffer. Downstream design teams will have to change the design in some way, and if they stumble across one of these Venus flytrap alternatives, their product may be delayed or worse. The original architects are the best ones to tend those exotic plants and instruct others in their care and feeding.
Another reason not to pipeline architects is that architects, like all engineers, must use what they have created to solidly inform their own intuitions about which of their design ideas worked and how well, and about which ideas turned out not to be worth the trouble. Pipelining the architects is equivalent to sending them driving down a highway with their eyes closed. They may steer straight for a short time, but without corrective feedback, they will soon exit the highway in some fashion that will not do anyone any good.

PERFORMANCE AND FEATURE TRADE-OFF
Making sure that the design team had the information needed to make these trade-offs consistently and in an organized fashion was both a technical and a communication problem.

* (Over-)Optimizing Performance
* Perfect A; Mediocre B, C and D
In this type of overoptimizing, the architect perfects one design aspect to the near exclusion of the others. This shortchanges product quality because architects working on idea A are, therefore, not working on ideas B, C, and D, and as the project lengthens, the odds of including B, C, and D go down. And idea A often has an Amdahl's Law ceiling that is easily overlooked in the heat of battle: Idea A may have been conceived as a solution to a pressing, specific performance problem, but any single idea may help that problem only so far, and to improve it further would require much more sweeping changes to the design, thus incurring further development costs and project risks. One must not become so myopically fixated on one project goal that other goals are neglected. B, C, and D will not be achieved as a by-product of achieving A.

* The Technical Purity Trap
A common tendency, especially among inexperienced engineers, is to approach a development project as a sequence of isolated technical challenges. Rookies sometimes think the goal is to solve each challenge in succession, and that once the last problem has been surmounted, the project will have ended successfully. Experienced engineers know better. Subtleties abound, circumstances change, buyers' needs change, technical surprises arise that require design compromises, and schedule pressure only gets worse. Truly great designs are not simply those that post the highest performance scores, regardless of the costs. Great designs are those for which the engineers had a clear vision of their priorities and could make intelligent, informed compromises along the way.

[Mercedes S-class Vs Ford Taurus]
The lesson translates well to computers. Designing the "fastest computer in the world" is a great deal of fun for the design engineers, but it is an engineering joyride reserved for very few. The rest of us must design machines that accomplish their tasks within first-order economic constraints.
In an insidious way, microprocessor vendors who succumb to the allure of trying to build the fastest computer will win in the near term, but they will lose in the long run, a decade or more down the road. The reason is simple: Only a small market – at most a couple of million units a year – will pay a large premium to keep a niche vendor afloat. A user base that small cannot support the design costs of world-class microprocessors, not to mention the cost of state-of-the-art IC processing plants (fabrication plants, or fabs). When that vendor is inexorably driven out of business by these extraordinarily high costs, the mainstream, cost-constrained vendor is still there. And with one or two more turns of the Moore's Law wheel, that mainstream vendor inherits the mantle of "world's fastest" without even having tried for it.
Voltaire is often credited with the saying, "The best is the enemy of the good," which means that myopic striving toward an unreachable perfection may, in fact, yield a worse final result than accepting compromises on the way to a successful product. Everyone wants their product to be the best; achieving that is great for both your career and your bank account. The trap is that taking what appears to be the shortest path to that goal – technical excellence to the exclusion of all else – can easily prevent you from reaching it. In plainer terms, if you do not make money at this game, you do not get to keep playing.

* The Unbreakable Computer

* Performance-Monitoring Facilities
At Multiflow Computer, we had included a set of performance-monitoring facilities directly in the computer itself. With no recourse to logic analyzers, you could get the machine’s diagnostic processor to “scan out” the performance-monitoring information and present it in a variety of useful ways. After having used and loved that facility for several years, Dave Papworth and I resolved to provide something similar in any future designs, especially microprocessors, in which overall visibility is the most restricted.

* Counters and Triggers.
For the P6, we therefore proposed and implemented a set of hardware counters and trigger mechanisms. Our intention was to provide enough flexibility so that the performance analyst could set up the performance counter conditions in many ways to help zero in on whatever microarchitectural corner case was turning out to be the bottleneck in his code. But we could not spend a lot of die area on the facility, and we absolutely wanted to avoid introducing any functional bugs associated with the performance counter apparatus.
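As a rough mental model of what such a facility offers (this is only a conceptual sketch, not the actual P6 counter and trigger design), think of a programmable event counter that can fire a trigger when a programmed condition is met; the event names and threshold below are hypothetical:

```python
# Toy model of a programmable event counter with a trigger. Conceptual sketch
# only; event names and the threshold are hypothetical, not the P6 facility.
class EventCounter:
    def __init__(self, event_name, threshold=None, on_trigger=None):
        self.event_name = event_name   # e.g. "L1_miss" or "branch_mispredict"
        self.count = 0
        self.threshold = threshold     # fire once the count reaches this value
        self.on_trigger = on_trigger   # callback standing in for a hardware trigger

    def observe(self, event_name):
        """Called once per event; counts matches and fires the trigger if armed."""
        if event_name != self.event_name:
            return
        self.count += 1
        if self.threshold is not None and self.count == self.threshold and self.on_trigger:
            self.on_trigger(self)

# A performance analyst chasing a suspected cache bottleneck might count misses
# and trigger (say, to freeze other counters) after the 1000th one.
ctr = EventCounter("L1_miss", threshold=1000,
                   on_trigger=lambda c: print(f"trigger fired after {c.count} misses"))
for _ in range(1000):
    ctr.observe("L1_miss")
```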

* Protecting the Family Jewels
* Testability Hooks
We were not sure we would have an FIB (Focused Ion Beam) tool in the P6 generation, so to help expose bugs hiding behind other bugs and generally give engineering more tools during debugging, we gave the P6 an extensive set of debug/testability hooks. Our intention was that for any feature that was not crucial to the chip's basic functioning, there should be a way to turn it off and run without it.

A truly insidious psychological artifact rears its ugly head in designing performance-monitoring facilities. Designers and validators are very, very busy people. They routinely miss important family events and give up holidays, weekends, and evenings attempting to keep their part of the overall design on schedule. They therefore value their time and energy highly, and ruthlessly triage their to-do lists. When items appear on those lists that are not clearly and directly tied to any of the major project goals, those items inevitably become bottom feeders. Consequently, they get less time during design, which implies that they are probably less well thought out than the mainstream functionality. By then there is less time to validate them, and the validators are always way behind schedule at this point, so these items also get less testing than they deserve.
Moreover, during debugging, performance-monitoring facilities that are not working quite right will not hold up the activities of very many people. That's not to say that debuggers do not depend on their tools. If the testability hooks intended to speed debugging are themselves buggy, the confusion they generate can easily outweigh the value they bring.
The only way I have ever found to ensure that performance monitoring and testability hooks get properly implemented and tested is to anoint a special czar, whose job is to accomplish that.

GRATUITOUS INNOVATION CONSIDERED HARMFUL
Engineers fresh from college are uniformly bright, inquisitive, enthusiastic, and clue-challenged, in the sense that they are somewhat preconditioned toward the wrong things. Perhaps at some point in the college education of every engineer (myself included), someone put us in a deep hypnotic trance while a voice chanted, "You are a creative individual. No matter what someone else has designed, you can do it better, and you will be wildly rewarded for it." Or maybe new engineers just lack the experience to know what has been done before and can be successfully reused, versus what is no longer appropriate and must be redesigned. Whatever the reason, almost all new engineers tend to err on the side of designing everything themselves, from scratch, unless the schedule or an attentive boss stops them from doing so.
This disease, which I call “gratuitous innovation,” stems from confusion in a designer’s mind as to why he is being paid. New engineers think they are paid to bring new ideas and infuse stodgy design teams with fresh thinking, and they do contribute a great deal. But many of them lose sight of an important bottom line: They are paid to help create profitable products, period. From a corporate perspective, creating a wildly profitable product with little new innovation is a wonderful idea because it minimizes both risk and investment.
Engineers who understand that the goal is a profitable product, not self-serving new-patent lists or gratuitous innovation, will spend much more time dwelling on the real problems the company and product face. To be sure, in some cases, new ideas will bring about a much better final product than simply tweaking what has already been done, but my experience is that unless you restrain the engineers somehow, they will migrate en masse to the innovation buffet.
It can be fun to get wooden plaques bearing the seal of the U.S. Patent Office and your name, but it is much more fun, and ultimately much more lucrative for all concerned, to concentrate on the product and its needs. Real innovation is what attracts many of us to engineering in the first place. Never confuse it with the gratuitous kind, which only adds risk to the overall endeavor.

VALIDATION AND MODEL HEALTH

* A Thankless Job
As the RTL model develops, upper management can collect statistics such as the number of new RTL lines written per week and track that running total against the original estimates of how many lines are needed to fully implement the committed functionality.

But when the same upper management focus turns to presilicon validation, difficulties abound. The validation plan shows all the tests that must be run successfully before taping out, and there is a running total of all tests that have run successfully, but neither is terribly helpful. You cannot simply measure the difference between them, nor can you simply extrapolate from the improvement trend.
When the validation plan is conceived at the project’s beginning, its designers try to account for all that is known at that time by asking questions, such as
- Which units will be new, and which will be inherited unchanged from a previous design?
- What is each new unit’s intrinsic degree of difficulty?
- What is the most effective ratio of handwritten torture tests versus the number of less-efficient but much more voluminous random-test cycles on each unit, and on the chip as a whole?
- What surprises are likely to arise during RTL development, and what fraction of the overall validation effort should be held in reserve against such eventualities?
- How long will each bug sighting take to analyze, and how long will it take to resolve?
- What role will new validation techniques play in the new chip (formal verification, for example)?

Perhaps more to the point, they were unhappy that the unaccomplished fraction of the validation plan was not shrinking on any acceptable trendline; in fact, the fraction of the plan that had been accomplished appeared to be shrinking, not growing, because the validation team was alertly adding new testing to the plan as they learned more about the design, and the plan was growing faster than the list of now-running tests. In effect, for a while it looked as though the validation effort was falling behind by 1.1 days for every day that went by.
After a week of unproductive meetings on the topic, management asked us to conceive a metric that we would be willing to work toward, one that would show constant (if not linear) progress toward the quality metric required for the chip to tape out.

* Choosing a Metric
We proposed a "health of the model" (HOTM) metric that took into account what seemed to me, Bob Bentley, and his team to be the five important indicators of model development, and we weighted them as seemed appropriate:
1. Regression results. How successful were the most recent regression runs?
2. Time to debug. How many different failure modes were present, and how long did it take to analyze them?
3. Forward progress. To what extent was previously untried functionality tested in the latest model?
4. High-priority bugs. Number of open bugs of high or fatal severity.
5. Age of open bugs. Are bugs languishing?
We then began tracking and reporting this HOTM metric for the rest of the project.
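A back-of-the-envelope sketch of how such a weighted score might be rolled up follows; the five indicators are the ones listed above, but the weights and the sample ratings are hypothetical, since the book does not give the actual numbers:

```python
# A minimal health-of-the-model (HOTM) roll-up. The five indicators come from
# the text; the weights and the sample ratings are hypothetical, chosen only to
# show how a weighted, partly subjective score could be combined each week.
WEIGHTS = {
    "regression_results": 0.30,   # how successful were the latest regression runs?
    "time_to_debug":      0.20,   # how long did failures take to analyze?
    "forward_progress":   0.20,   # how much previously untried functionality was exercised?
    "high_priority_bugs": 0.20,   # open bugs of high or fatal severity
    "age_of_open_bugs":   0.10,   # are bugs languishing?
}

def hotm(scores):
    """scores: indicator name -> 0..100 rating (some ratings are judgment calls)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

weekly = {"regression_results": 82, "time_to_debug": 70, "forward_progress": 65,
          "high_priority_bugs": 55, "age_of_open_bugs": 75}
print(f"HOTM this week: {hotm(weekly):.1f} (tapeout target: 95)")
```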

The fair amount of subjectivity in these indicators was intentional. We recognized the strong tendency to "get what you measure," and we did not want the HOTM metric to distort the validation team's priorities until we had accumulated enough experience with it to know whether it was leading us in the right direction. Because we were the ones who had conceived the validation plan, we knew it was a very valuable, yet necessarily limited, document. Despite our best efforts to be comprehensive and farsighted, if history was any guide, we would discover that some parts of the validation plan placed too much emphasis on parts of the design that turned out not to need it, while other parts would turn out to be the most problematic and require much more validation effort than we had expected. We did not want to find ourselves unable to respond appropriately to such exigencies on the sole basis of some document we ourselves had written, knowing only what we knew two or three years earlier.
Another reason for the metric's substantial subjectivity had to do with managing your manager. I believe there is a generally well-placed but occasionally extremely dangerous penchant within Intel to insist on quantifiable data only. It is simply corporate culture that if someone asks, "Is this chip's performance projection on track?" the preferred answer has the form, "A best-fit linear extrapolation of the last 6 weeks of data indicates a probability of 0.74 with a standard deviation of 0.61," not, "The indicators say we're marginally on track, but my instincts and experience say there's a problem here."
Management wanted to know when the chip would be ready to tape out, and they did not want to hear that the chip would tape out when the design team, the project leaders, and the validation team all agreed it was ready. They wanted a mechanical way of checking that all the pieces were trending toward success. Then they could confidently reserve the right to "shoot the engineers" and unilaterally declare tapeout.

* Health and the Tapeout Target
The problem is that judging the health of a chip design database is not so easy. You can pick a metric at the project’s beginning and then measure against it every week, but the metric itself is subject to revision as the weeks roll on, and it is not easy to go back, add something to the metric, and extract the necessary data from the old records. Basically, whatever metric you pick at the beginning is what you are stuck with throughout. You can revise the weightings, but even that is problematical.
The RTL model achieves its full functionality only toward the end of the development cycle, and the validation team can do no-holds-barred testing only on that mature model. When a bug is found and fixed, you should assume that the fix may break something that used to work, but you cannot repeat every test ever run on the project. Again, judgment is required to know how much additional testing is appropriate, given particular design errata.
We picked a score of 95 as our tapeout target, knowing that upper management could eventually hold this score against us. As big, expensive chip development projects near completion, a kind of tapeout frenzy tends to break out in the design team as well as across the management chain. On the plus side, it inspires all concerned to do whatever it takes to finish the job. On the minus side, it encourages management to sometimes discount the opinions of technical people, especially opinions that they do not want to hear, such as, "This chip is still too buggy. It needs at least three more weeks of testing." It is a management truism that a "shoot the engineers" phase of any design project is necessary, because without it engineers will continue polishing and tweaking well past the point of diminishing returns. By picking a fairly lofty target, we hoped we were placing it sufficiently out of reach so that we would never face the problem of having management wave our indicator at us and say, "You said this score was good enough to tape out, so you have no right to make it any higher now. Tape this thing out."

* Metrics Doldrums
Our HOTM metric did not behave as intended. We had hoped that if we weighted the five indicators properly, the overall score would start low and climb linearly week by week toward the final score that would suggest our new RTL was tapeout ready. What actually happened was that the overall score did start low and climb for a while, but then it stubbornly parked for many weeks at an intermediate value that seemed much too low for anyone to accept as being of tapeout quality.
After several weeks of watching the HOTM score languish, I began upping the pressure on my validation manager, Bob Bentley. Bob patiently but firmly reminded me of all the pieces built into the metric and showed me how the RTL status and new changes were affecting each one. That made sense in isolation, but we had created this metric so that we could feel comforted as the model's quality visibly climbed toward acceptability, and now that it wasn't climbing, my comfort level was dropping precipitously.
Finally, at one of these meetings Bob said (not so patiently), "Okay. You are pushing on the wrong thing here by pressuring validation. We don't put the bugs in. We just find them and report them." He was absolutely right. I turned my attention to the design team, its coding methods, individual testing, and code-release criteria, and we made many changes that immediately began paying off in terms of overall design quality.
The HOTM metric never did linearly climb to 95, but it started moving after that, so, in retrospect, I think it was a good exercise. Intel has since revised the HOTM many times to make it more useful as a project-guidance mechanism.

COORDINATING WITH OTHER PROJECTS
The first and most important is that customers like to think that a big company like Intel has put a lot of corporate thought into creating and coordinating its various product offerings. They expect that the overall road map will be coherent and will allow them to design systems that will form their own customers' road maps later on. If Intel puts multiple products into the same product space, it confuses the customers…

Comparing two chip developments gives rise to several first-order issues. One is performance estimation. Another is the methodology: the simulators used, how they work, and their sources of possible inaccuracy. Design teams will have different beliefs about what is "best," and it is a virtual certainty that what one team considers an absolute requirement in a tool or design methodology will be rejected as anathema by another. And never underestimate the unpredictability of human psychology, which can easily subdue any rational technical decision.

* Performance Estimation

* The Overshooting Scheme
With many years of computer design experience, Dave has come to believe that the task of conceiving, refining, and realizing a computer microarchitecture is a process of judiciously overshooting in selective areas of the design, in the resigned but practical expectation that various unwelcome realities will later intrude. These surprises will take many forms, but the one common element is that they will almost never be in your favor. That clever new branch-prediction scheme you are so proud of will turn out to be a very poor fit to some newly emerged benchmarks. Your register-renaming mechanism, which looked so promising in isolated testing, will turn out to require much more die area than you had hoped, and the circuit folks will be engaged in hand-to-hand combat to make it meet the required clock speed.
Given that surprises will occur and will not be in your favor, your overall plan had better be able to accommodate any required change. Dave's overdesign approach assumes that you will eventually be forced to back off on implementation aggressiveness, or you will realize that the design just does not work as well as you had hoped, either in isolation or in combination with other parts. He proposes that you not approach such thinking as if it were a contingency plan; such eventualities are almost a certainty, given the complexity and schedule pressures of contemporary designs.
In essence, Dave’s theory is that if the absolute drop-dead project performance goal is 1.3x some existing target, then your early product concept microarchitecture ought to be capable of some much higher number like 1.8x, to give you the necessary cushion to keep your project on track.
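The arithmetic behind the cushion is easy to sketch. In the toy calculation below the individual loss percentages are invented, but they show how a deliberately overshot 1.8x concept can still land near a 1.3x goal after the usual erosion, while a design with no cushion ends up short of even the status quo:

```python
# Back-of-the-envelope version of the overshooting argument. The starting
# headroom, the number of surprises, and each loss are made-up numbers.
def surviving_speedup(initial, losses_percent):
    """Apply a sequence of small multiplicative performance losses."""
    for loss in losses_percent:
        initial *= (1.0 - loss / 100.0)
    return initial

target = 1.3                       # drop-dead project performance goal
concept = 1.8                      # deliberately overshot concept-phase estimate
losses = [4, 3, 5, 2, 6, 4, 3, 4]  # hypothetical erosion from unwelcome realities
final = surviving_speedup(concept, losses)
print(f"Projected at tapeout: {final:.2f}x (goal {target}x)")
# Starting with no cushion (1.35x), the same losses leave roughly 0.98x.
```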
With both P6 and Willamette, we did, in fact, go through a process much like Dave’s anticipated sequence:
- Early performance estimates are optimistic, and as performance projections gradually get more accurate, they yield a net loss in expected product performance.
- As the design team continues implementation, they constantly face implementation decisions that trade off performance, die size, schedule, and power. Over time, the cumulative effect of all these small trade-offs is a noticeable net loss in performance.
- Projected die size tends to grow continuously from the same root causes so, eventually, the project faces a double threat with no degree of freedom left from which to seek relief.
When we reached this state in P6, we basically suspended the project for two weeks, purposely placing the performance analysis team's concerns on project center stage. Anyone not directly engaged in the performance war was tasked with finding ways to reduce die size. These "all hands on deck" exercises actually work quite well, and tend to give everyone a chance to step back from their daily grind and revisit the bigger picture. With that new vantage point, they can often spot things that would otherwise have gone unnoticed.

* Psychological Stuff.
Although it seems to surprise the public every time it happens, well-intentioned and well-educated experts can start with the same basic set of facts and reach opposite conclusions. In the same sense, performance simulations, especially the early ones, require a great deal of interpretation and are thus subject to the whims and personalities of their wielders. Some design teams are conservative and will not commit to any target that they are not sure they can hit. (Our Pentium Pro team was one of these, at least partly because Randy Steck, Fred Pollack, and I believed in this approach so strongly.) Other Intel teams had internalized company politics and reflected that understanding in their performance goals – promise them what they ask for now; if you later fall short, they will have forgotten about it, and even if they haven't, you can apologize later. Besides, tomorrow may never come. (The sentiment was expressed by a leader of one of the company's other design teams. It was said over the second beer of a three-beer discussion, which is often the moment of truth.) Still other teams would aggressively choose the highest numbers end to end, on the grounds that (a) they are very smart people, (b) there is plenty of time left before the project's end, and (c) nobody can prove the numbers cannot turn out this way.

Management would then direct us to go reconcile our differences with Project Z and report back as to which team had had to change their official POR. Almost always, after a perfunctory attempt at reconciling the two points of view, neither team changed anything and we would all forget about the episode until it repeated about six months later.

I wish I could offer bulletproof, hard-won advice on this topic that anyone could follow and avoid the unpleasantness implied above, but I can’t. When I was a boy, my mother warned me that I could not change anyone else’s behavior, just my own. In the same way, I believe strongly that engineers must think straight and talk straight: tell themselves and their management the truth as best as they can.

* Simulator Wars
Many of us were uncomfortable with how difficult Willamette's NDFA was turning out to be, but few of us had the software skills to actually do something about the problem. One of us did: Mike Haertel complained long and loud about the problem, and when it became clear to him that we were going to try to stay with NDFA rather than formally commissioning what we believed would be an even riskier start-from-scratch simulator, he asked if he could write one himself. It is not uncommon for bright, creative engineers to become frustrated with their tools, to convince themselves that they could do much better, and to importune management to let them try.

Mike pulled it off in eye-popping fashion. He called his new simulator “Willy” and convincingly showed its intrinsic advantages over NDFA. He even had a better user interface. The Willamette architects loved it.

We were informed that a central research group in the company had just created the New Standard Simulator Which We All Must Use (NSSWWAMU)…

Now, corporate “everyone do things the same way, please” initiatives are not new. They appear with regularity every six months or so, and some actually yield real results. Others are an annoying distraction from the real work of designing competitive microprocessors.

PROJECT MANAGEMENT
The realization phase is the quintessentially management-driven part of a project.

* Awards, Rewards, and Recognition
Finding the right incentives to get a large design team to its maximum output and then keep it there for years is crucial to a project's success. Obviously, getting paid for work done is a key motivator, and design teams are full of professionals who will always try to give a day's work for a day's pay. But if that is all management is getting from their team, real trouble is afoot.
The difference between getting maximum output from a team and getting only a day's work for a day's pay cannot be overstated. An absolute requirement for achieving a world-class product is a design team that is fully committed at every level and at all hours. The soundest technical ideas I have had, or have witnessed others contribute, have come during showers, or while walking around aimlessly or driving – any activity other than sitting at a desk actively pondering the problem. Something about being outside the pressurized corporate environment while engaged in a task that can be performed without your entire consciousness tends to free up the creative muse and let the really good ideas bubble up into view. When your design team is routinely reporting creative solutions to their troubles, and a large fraction of those are coming from pondering in off-hours moments, you know that your team is fully engaged and giving the design everything they have.

But for most people, an even stronger incentive than "What's in it for me?" is "What will my peers think?" The act of engineering a product is pitting your training, talent, intellect, and ability against both competitors and nature itself. This is the draw, and the scary part, of an engineering career, since no matter how many times you have succeeded at similar projects, the current project could always turn out to be the one that crashes and burns because the design team came up one good idea short. And the only people on earth who really understand just how truly great your efforts have been are your peers on the same design team. Coming from a peer, a little encouragement that your design is on the right track can boost your morale and confidence more than any other incentive.

Pay raises and stock options are important, but they are private transactions. Bill Daniels had it right – there is nothing like hearing your name called in an auditorium full of your peers to make you work like a madman so that it can happen again.

* The Dark Side of Awards
Don’t reward firefighting by arsonists.

Giving out awards is an exercise in compromise no less difficult than the design itself.

* Project Management by Grass Cutting
Executives must provide the necessary motivation on the worker's own terms… In other words, though the company might be organized to bestow certain financial rewards on its employees, and employees will seldom turn those down if offered, the reality is that many other forms of compensation or recognition are also effective, and you have to talk to the employees to find out what they are.

Later in the project, we began to notice that the project was exacting a toll, not only on the engineers directly, but also on their spouses and children. It occurred to us that these people were also part of the team, and that reaching out to our extended team might have a salutary effect on our engineers. One highly successful tactic was our Thursday Family Night dinners – we invited whole families to join us at the engineers’ dinner. The design team loved it because they got a rare opportunity to enjoy their family, the engineers’ spouses loved it because they did not have to prepare dinner, and the kids loved it because nobody yelled at them when they returned to the dessert table for seconds.

So you might not think a $20 gift would particularly excite them. But it wasn't the price of the gift that mattered. It was that a peer noticed their work and found it exemplary enough to want to draw corporate attention to it. Peer recognition is powerful indeed.

It might seem as though these little rewards are too mundane to deserve much attention from project leadership, but I don't think so. When a team is really cranking at full output, every person on that team has committed their livelihood, their career, and their sacred honor to succeeding no matter what it takes. They have every right to expect their management to exude this same take-no-prisoners attitude and to look out for their welfare during the sprint to the project's end. It is an article of faith, a covenant between engineers and their managers, that is fundamental to the success of the whole enterprise. Management should welcome every opportunity to reinforce the idea that they are holding up their end of this bargain.

* Marginal Return from Incremental Heads
No matter how well you plan, and no matter how hard you work, your project will inevitably run into schedule pressure.

On some unfortunate day, you will have to announce the news of the necessary schedule slip to your upper management. This is exactly the kind of problem executives know how to handle. They can castigate the project managers severely for not having kept their project on track despite said executive's constant reminders of its overriding importance. They can use the opportunity to announce that they have lost faith in these leaders and will now investigate every nook and cranny of the project instead of continuing to take their word on anything. They can also lighten the load on the project by removing features or adding heads.
By the way, it is not necessarily a given that removing features actually reduces overall schedule pressure. It depends on the actual features, how much time remains before tapeout, and how deeply each feature is embedded in the design. But these subtleties are not part of the executive's schedule slip calculus, and he is not much in the mood to hear you argue them just now.
For an architecture team in the middle to late stages of a design, adding heads would only drain more time and energy off the chip and into training. When I respectfully declined, the VP gave me a shocked look and spluttered, “But you can’t decline additional heads! As soon as you take on a management role, you can no longer say ‘I cannot use additional heads.’ It would be a sign of managerial incompetence.”

* Project Tracks
A new flagship microprocessor development effort involves several hundred design engineers, a new process technology from the fabrication plants, new design tools, a supervisory staff that has often just been promoted into their current jobs, and a great deal of uncertainty about product features, targets, microarchitecture, and circuits. With such a huge number of unknowns, it is virtually impossible to predict a project schedule a priori, no matter how fervently upper management demands it.
Instead, we took our best informed guesses and inflated them per hard-earned experiences of the past. Then we modulated those guesses to account for several important related factors, including our VP’s management style, what recent other projects had projected and how those projections had fared, and what our own design team was telling us about tools, personnel, and project difficulty.
These early estimates are useful only for very high-level planning, such as corporate road map and project costing efforts, project staffing, and management of related projects such as chip sets and tools. To really determine where a project is relative to its plan, you must track the project directly.
Randy understood this before anyone else and put a mechanism in place by which every project engineer would, on a weekly basis, report what he or she had accomplished that week and how much they estimated they had left to do before tapeout. Then Randy's Unix tools would roll all those individual estimates up to the project level. The difference between the combined estimate of remaining work and the combined estimate of what had been accomplished so far was, we hoped, proportional to how much work the project had left to do.
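The mechanism itself is simple to picture. The sketch below is a toy Python version of the roll-up idea (the original was a set of Unix scripts; the record format and the numbers here are invented for illustration):

```python
# Toy version of the weekly roll-up: per-engineer estimates are summed into
# project-level "accomplished" and "remaining" totals, whose difference is
# tracked week over week. Record format and figures are invented.
from collections import namedtuple

Report = namedtuple("Report", "engineer week done_days remaining_days")

def roll_up(reports, week):
    """Sum individual estimates into project-level totals for one week."""
    this_week = [r for r in reports if r.week == week]
    done = sum(r.done_days for r in this_week)
    remaining = sum(r.remaining_days for r in this_week)
    return done, remaining

reports = [
    Report("alice", 12, done_days=4.0, remaining_days=35.0),
    Report("bob",   12, done_days=5.5, remaining_days=60.0),
]
done, remaining = roll_up(reports, week=12)
print(f"week 12: {done} engineer-days done, {remaining} estimated to tapeout")
```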
Things are never so simple, of course…

The work-remaining curve never actually did intersect with the accomplishment curve, mostly because it is no longer worth tracking when the project gets to within a few weeks of tapeout. We had always expected that when the difference between those curves reached zero we would have tapeout day. But that does not happen. Instead, as the engineers accumulate experience, they find more and more things they wish they had time to do. Validation learns a lot of very useful information as the project proceeds: which tests are finding lots of bugs and which are not, and which functional units are clean and which seem to be bug-infested. Validators will also pay attention to design errata being found in other chips in the company to make absolutely sure the company is never subjected to the agony of finding essentially the same embarrassing bug in two different chips. In general, validators will always be able to think up many more testing scenarios than any practical project will have the time and resources to carry out. And since their careers and reputations are on the line, they want to perform those tests. Project managers often have to modulate such work inflation, or the chip will tape out too late.

* Flexibility Is Required of All
At the end of every engineering project is a crunch phase, during which the team is working at or beyond its maximum sustainable output and the mindset is very much that of a fight to the death. At that point, the engineers have been working on the project for years. All the fun and the limitless-horizons "this is going to be the greatest chip ever made" thinking has now given way to the cold realities of compromise. Or worse: by now all the project engineers have become painfully aware of whatever shortcomings the project now includes. Years have elapsed since the early days of boundless optimism, and the competition has long since leaked details of what they are working on. Their designs are not as close to fruition as the press believes (and maybe as the competition believes), but the engineers on this project cannot help but compare that paper tiger to their almost finished creation and wince.

Obviously, if we knew how to conceive a bulletproof, guaranteed-to-work microarchitecture on day 1, we would not have to spend days 2 through 730 continuing to work on it. We do not know how to do miraculous conceptions like that. What we do know how to do is this: conceive promising approaches to problems, refine them until our confidence is high, combine them with other good ideas, and stop when we believe the final product will hit the project's targets.

I believe that doing microarchitecture development the way we did on the P6 is the optimum method. The architects get some time to think, and they tell management and the design team when their ideas have matured to the point of usability. Like everyone else, architects learn as they go. They build models, they test their theories, and they try out their ideas any way they can. If they are taking the proper types and number of risks, then some of these ideas will not work. Many ideas that work well in isolation or on a single benchmark do not play well with others. It is not uncommon to find two ideas that each generate a few percent performance increase by themselves, but when implemented together jointly degrade performance by a percent or two.

Designing at the limits of human intellect is a messy affair, and I believe it has to be. The danger and schedule pain of design changes are real, but so are competition and learning. Projects must trust their senior technical leadership and project managers to make good judgments about when a change is worth making. Attempting to shut this process off by applying more up-front pressure to the architects does nothing useful, and I can testify from personal experience that it damages working relationships all around.

* The Simplification Effort
Complexity, in the context of microprocessor design, is a living, growing monster lurking in the corridors of your project, intent on simultaneously degrading your product's key goals and hiding the fact that it has done so.

The complexity has many costs, but among the worst is the impact on final product quality in terms of the number and severity of bugs (errata). The more complicated a design, the more difficult is the task of the design team to get that design right, and the larger the challenge facing the validation team. While it is hard to quantify, it feels as though the validation team’s job grows as some exponential function of the design complexity.
I used to go to bed at night thinking about what aspects of the P6 project I might be overlooking. One such night in 1992, I realized that this daydreaming had developed a pattern: It kept returning to the topic of overall project complexity. I knew that we were accumulating complexity on a daily basis, and I knew that this complexity would cost us in design time, validation effort, and possibly in the number of design errata that might appear in the final product. What could be done about it? I briefly pondered going on a one-man crusade to ferret out places in the design where the injected complexity was not worth the cost, but there was too much work to do and not enough time.
The last time in the project that I had found myself facing a task too big to handle alone, I had successfully enlisted dozens of other people on the project and together we got it done. Was there a way to do that again? Back in the BRTL days, the design engineers did not have more pressing concerns and were relatively easy to conscript, but the project had since found its groove and everyone was incredibly busy all of the time. So asking them to put down their design tasks and help me with a possible simplification mission would not be a low-cost effort.
On the other hand, enlisting the design engineers themselves might have some tangible benefits besides the additional sheer "horsepower" devoted to the task. They were the source of some of the added complexity, so they knew where to look. That could save considerable time and effort. And once they saw that their project leadership felt so strongly about this topic that we were willing to suspend the project for a couple of weeks in order to tackle it, perhaps the engineers would find ways to avoid adding unnecessary complexity thereafter.
We launched the P6 Simplification Effort, explaining to all why some complexity is necessary but anything beyond the necessary is an outright loss, and got very good cooperation from the engineering ranks. Within two weeks we had constructed a list of design changes that looked as though they would be either neutral or positive in terms of project goals and would also make noticeable improvements to our overall product complexity. This experiment was widely considered to be a success.
Just as I always do with die diets (mid-project forced marches to make the silicon die smaller, mostly by throwing features off the chip), I wondered if some better up-front project management might have avoided the need for the Simplification Effort. I don't think so. I think it is useful at a certain stage of a design project to remind everyone that there are goals that are not stated and are not easy to measure but are still worth pursuing. Perhaps stopping the project periodically has the same effect that "off-site" events have on corporate groups – it gives people time to take a fresh look at what they are doing and the direction in which they are going, and this is very often a surprisingly high-leverage activity.

Chapter 5 THE PRODUCTION PHASE

Two male engineering students were crossing the campus when one said, "Where did you get such a great bike?" The second engineer replied, "Well, I was walking along yesterday minding my own business when a beautiful woman rode up on this bike. She threw the bike to the ground, took off all her clothes, and said, 'Take what you want.'" The first engineer nodded approvingly and said, "Good choice; the clothes probably wouldn't have fit."

Early in the design program, ideas flowed like water – the more the better. Architects, managers, and marketing people were encouraged to roam the product space, the technology possibilities, and user models to find compelling product features and breakthroughs. As the project evolved through its refinement phase, the vast sweep of possibilities was winnowed to only a few of the most promising. The realization phase settled on one of those semifinalists and developed it to a prototype stage. This is a nice, logical sequence that makes sense to most technical folks, who are often not prepared for what happens next, even though they think they are: production.

Their mindset is that they have designed the product they set out to create and now it is someone else's job to make tens of millions of them. How hard can that be, compared to the intellectually Herculean task that has now been accomplished? The answer is: very hard, and it requires a whole new set of skills.

The production engineering team (the corporate production engineers plus a substantial fraction of the original design team) must provide silicon functionality and show that circuits and new features work as intended with the compilers and other tools. The chip's power dissipation must be within the expected range, the clock rate must hit the product goal, the chip must operate correctly over the entire target temperature and voltage ranges, the system must demonstrate the expected performance, and testers must create the test vectors that help drive production yield to its intended range. Any of these requirements could become problematic, so a great deal of highly creative engineering must literally be on call.
Meanwhile, the marketing team is preparing the customers so that they will be ready when early production units arrive. These customers are preparing their own systems around this new chip, so they often have questions, suggestions, and concerns that require technical expertise to resolve. Technical documents must be updated and distributed. Collateral such as tools, performance models, and reference designs must be tuned to represent the chip's final production version.

The production team's responsibility, then, is to finish the job the design team has begun. In much too short a time, they must polish a raw, brand-new design into a form suitable for the safe shipment of tens of millions of copies. Tension is constant and comes from all sides: Management screams about schedules and product cost; validation constantly lectures on the dangers of another FDIV; marketing and field sales remind you that your project is late and that only an inspired performance by them will stave off the unbelievably strong competition that you have ineptly allowed to flourish; and the fab plant manager reminds everyone of how many millions of dollars a day will be lost if the chip isn't ready on time and the plants run out of things to make.

OF IMMEDIATE CONCERN
In this phase, cleverness gets you only so far; brute force, in the form of a lot of hard work by a lot of people, must take you the rest of the way.

* Functional Correctness
The first barrier to knock down is any remaining functional inaccuracy. It is a mistake to think that postsilicon validation is just an extension of presilicon testing. The two have unique advantages and disadvantages.

* Speed Paths
With a cutting-edge, flagship design, however, odds are that the chips will not be as fast as the design team intended. This is hardly surprising, since it takes only one circuit path detour or some overlooked design corner to limit the entire chip's speed. Typically, tens to hundreds of these speed paths are in a chip's initial build, and the production engineering team must identify and fix them before mass production can begin.

* Chip Sets and Platforms
In the end, the one thing you can count on is that surprises are inevitable, and the production team will have to find them, identify them, and decide which must be fixed (and how) and which can be lived with. This takes time, people, and a good working relationship with the early development partners.

SUCCESS FACTORS
Because the production engineering team comprises design, validation, management, marketing, and product engineers, it must balance a variety of concerns on a very tight schedule. There is no time to explore nuances in someone’s point of view. Communication must be direct, succinct, and frequent.
To satisfy this requirement, we created the "war room," a designated room for daily meetings, during which the team assimilated new data and decisions, planned out the next day's events, and coordinated with management.

* Prioritizing War Room Issues
The war room team must successfully juggle a steady stream of sightings, confirmed bugs, new features, marketing issues, upper management directives, and the constant crushing pressure of a tight schedule. On a daily basis, however, its most important function is to prioritize the list of open items to ensure that they can be disposed of within the required schedule.

* Managing the Microcode Patch Space

PRODUCT CARE AND FEEDING

Test Vectors

* Performance Surprises
Correctly executing code is only a prerequisite.
So even though presilicon testing has reasonably wrung out the initial silicon, the SRTL that defined that silicon has had relatively minimal performance verification. Therefore, performance surprises await early silicon, and such surprises are never in your favor.
Benchmarks are the only plausible way to tune a developing microarchitecture and find at least some of those surprises.
Benchmarks are supposed to represent real code in all the important performance-related ways, while being much more manageable in terms of slow simulations. But which benchmarks? And who chooses them?
In yet another feat of deft judgment, the design team (in particular the performance analyst) must consciously predict which software applications will be the ones that really matter in a few years, and then find or create benchmarks and a performance prediction method on which they can predicate the entire project’s success or failure. The trick is to fit all the benchmarks into the acceptable simulation overhead space, so one benchmark cannot take up so much space that it becomes the only one that gets analyzed presilicon.
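One way to picture that budgeting problem is as a simple selection under a simulation-time cap. The sketch below is only illustrative; the benchmark names, their simulation costs, the budget, and the per-benchmark share limit are all made-up numbers:

```python
# A sketch of the budgeting problem described above: fit a set of benchmarks
# into a fixed presilicon simulation budget without letting any single one
# dominate. Benchmarks, sizes, and the budget are invented for illustration.
BUDGET_HOURS = 200          # total weekly RTL-simulation budget (hypothetical)
MAX_SHARE = 0.35            # no benchmark may take more than 35% of the budget

candidates = {               # benchmark -> estimated simulation hours per run
    "transaction_mix": 90,
    "spreadsheet_macro": 40,
    "web_server_trace": 60,
    "compile_kernel": 55,
    "full_database_load": 180,   # important, but too big to run as-is
}

selected, used = [], 0
for name, hours in sorted(candidates.items(), key=lambda kv: kv[1]):
    if hours > MAX_SHARE * BUDGET_HOURS:
        continue                 # must first be trimmed into a representative slice
    if used + hours <= BUDGET_HOURS:
        selected.append(name)
        used += hours

print(selected, f"{used}/{BUDGET_HOURS} hours")
```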
Another pitfall is to incorrectly anticipate how the buyer's usage model will change. We were somewhat guilty of this with P6, having conceived and designed a workstation engine in 1990 that ended up being used as a Web server in 1996. We got lucky, however, because our basic design turned out to be compatible with the performance rules of an I/O-intensive Web server workload. But it is always better to correctly anticipate the workloads that will be of interest and then find ways to model those workloads that are compatible with the methods used to analyze presilicon performance.

* Feature Surprise
We had many meetings with Microsoft, swapping notes and ideas about the future of DRAM sizes and speeds, hard-drive evolution trends, and our respective product road maps.

What all these objectors [over 16-bit versus 32-bit performance] fail to see is that design is the art of compromise. You cannot have it all. In fact, you cannot have even most of it. The Iron Law of Design says that, at best, you can have a well-executed product that corresponds to a well-conceived vision of technological capability/feasibility and emerging user demand, and that if you strike a better compromise than your competitors, that is as good as it gets. If you forget this law, your design will achieve uniform mediocrity across most of its targets, fail utterly at some of them, and die a well-deserved and unlamented death.

* Making Hard Decisions
You cannot explore every idea to equal depth, cover every base, hedge every bet, and refuse to make any decisions until all the data is available. All the data is never available. This is true not only in engineering, but in every important human endeavor, like marriage, family, and choosing a job or home. To choose one path among several is to fundamentally exclude other sets of possibilities; you cannot have it both ways.
LDS Elder Robert D. Hales says, "The wrong course of action, vigorously pursued, is preferable to the right course pursued in a weak or vacillating manner." Engineering is about taking calculated risks, pushing technology into new areas where knowledge is imperfect, and if you take enough risks, some of them will go against you. The trick in a project's concept phase is to know when and where you are taking risks and to make sure you can either live with a failure (by permanently disabling an optional but hoped-for new feature, for example) or have viable backup plans in place. And never forget Will Swope's dictum: Your backup plan must be taken as seriously as your primary plan; otherwise, it is not really a backup plan. Thinking you have a backup plan when you really do not is much more dangerous than purposely having none. The Space Shuttle Challenger's second O-ring exemplifies this trap.
To choose A is not to choose B. People who try too hard to get both, as a way of avoiding the difficult choice between them, will end up with neither.

* Executive Pedagogy and Shopping Carts
An important part of product care and feeding is to expect the unexpected.

I knew I could always be boring and pedantic, and give the listener a straight-up data dump. If they could not keep up, too bad.

But that is just not my style and, anyway, I was proud of our design and I really wanted our executives to understand, at least to some extent, how thoroughly cool it was. So I decided to pitch the talk at my mother – a smart person, but one with no technical background.

I also happen to like analogy because it supports one of my pet conjectures: Computer design looks a lot more mysterious than it is because familiar ideas tend to be hidden by engineers who rely heavily on the passive voice and routinely forget to eschew obfuscation. Actually, computer science has very few original concepts. Once you get past the buzzwords and acronyms, you can fairly easily explain the ideas using a range of familiar contexts.

MANAGING TO THE NEXT PROCESSOR

Part of the production phase necessarily involves looking beyond the current project. Engineers love novelty, so they will not normally do the same thing over and over unless they are compelled to do so. The exception is the basis for much marketing folklore: engineers who will not stop polishing long enough to get the product to market. Typically, however, most of us look forward to skipping the mundane issues of speed paths, performance divots, and functional errata, along with the daily management browbeating to meet impossible schedules. The temptation can be overwhelming to chuck it all and jump into the marvelous new project that is welcoming fresh ideas.

THE WINDOWS NT SAGA

All computing platforms have their unique set of quirks and vagaries. We dealt with the RS6k’s by dedicating a set of very talented software engineers to serve as the design team’s first line of defense. These folks were invaluable at figuring out when an unexpected problem was a designer’s own pilot error, the fault of the tools being used, a design flaw in the workstation or its operating system, or some combination of these possibilities.

An engineer who worked on the controls for the jet engines on the Boeing 747 told me that he and a colleague went along on that engine's first flight. A military pilot once told me that a parachutist commonly packs his own chute, and when he can't, the person who did pack it has to jump, too, using a randomly chosen chute. Both practices tend to focus the practitioner's full attention on the task and to expose that person to any flaws in its execution.

What we did not anticipate was that, for many reasons, it is better to rewrite tools from scratch when migrating them to Windows from Unix.

Many people believe that if you throw a frog into a pot of boiling water, it will jump right back out, but if you put it into room-temperature water and gradually heat it to boiling, it will stay in it until it is too late. [The moral of the story is valid, but the story itself is an urban legend. In reality, frogs have more sense. They try to get out of the increasingly hot water, with an urgency proportional to temperature – http://www.snopes.com/critters/wild/frogboil.htm. Who makes these things up, anyway?] That is how our forced migration away from Unix felt. Because the tools environment seemed to get gradually more reliable, we stuck to the plan, hoping that when the software-tools folks found and fixed the few really big bugs, our design tool chain would once again exhibit the overall level of reliability to which we had become accustomed. But it never did.

In classic Intel tradition, they did not ask permission, but simply brought up Linux on several hundred validation servers and re-ported whatever tools were needed. Within a few days, they were once again merrily running their tests and enjoying the kind of computing system stability we had not seen in several years.

PRODUCT ROLLOUT

Like most large companies, Intel carefully stage-manages product rollouts. Senior marketing executives collect information from the technical people who created the product and combine that information with their own imaginations to come up with the glitzy extravaganzas you see at rollout affairs. Rollouts also require a certain awareness of what you should and should not say during interviews, which takes more training than you might think. (Or maybe just more than I had.)

* On Stage with Andy Grove
Marketing rollouts of new high-tech products usually include selected product users and early adopters, whose job it is to say, “Without Intel’s latest and greatest processor, my life would be devoid of meaning,” and “Now that the Pentium Pro exists, my applications leap tall buildings in a single bound.”

Judgements that turn out wrong are crucially different from judgements that no one ever made, and that difference is what distinguishes great design teams from mediocre ones. Great teams can make wrong calls, but they make all their calls based on the best information available at the time. It is the deliberate, informed decision making that stacks the odds in their favor.

* How Not to Give a Magazine Interview
“How do you think P6 will fare against the forthcoming PowerPC chips?” As I walked past, I thought, “Who cares how fast any chip is if it can’t execute the right kind of code?”

* Speech Training
He correctly noted that most technical presenters are deadly dull, believing their job is to inundate the listener with data and to pack as much as possible onto each PowerPoint foil. If the foil will not accommodate all the data, they use a smaller font size. Yet, as Jerry pointed out, none of us actually likes to sit through such presentations.
We had all been to technical conferences, and the universality of Jerry’s truths was self-evident. Jerry suggested that even technical folks like a good story and that no one was going to remember all the data anyway. He advised us to pick the one or two major ideas we hoped we could get across and then structure the talk around them. He had numerous suggestions for structuring the PowerPoint foils. No more than four bullets per foil. No more than four words per bullet. Place words so that they enhance the graphic, not obliterate it. Build the foils so that the sequence makes sense.
He also had dozens of great ideas on how to present the story. Do not grip the podium; it makes you look like you are scared. And do not hide behind it. Let the audience see you. When they sense your mastery of the material, they will relax and accept the message better.

I definitely said a silent thanks to Jerry and my own general managers for having put me through this training, because I soon found myself alone on an enormous raised platform in front of 2,000 conference attendees, wearing a wireless microphone headset. I felt like Madonna but with a lot more clothes.

* Trash-Talking Helps the Opposition
“The guys who conceived of the Pentium were better than the guys who conceived of the P6.” It doesn’t matter which guys were truly “better”, and Sanders’ opinion on it mattered even less. What got my attention was that the CEO of a rival company had decided to not just trash the P6, but to personally attack its designers as well. Why he thought that was appropriate or useful to AMD, I could not guess, but I took it personally and decided to do with that quote what sports teams do with trash-talking by their rivals: they post the quotes in the locker room for additional inspiration. So maybe I should thank Mr. Sanders, because in the wee hours, when I was tired and wanted to quit for the night, I would see that quote on the wall and feel reinvigorated.

Chapter 6 THE PEOPLE FACTOR

HIRING AND FIRING

* Rational Recruitment
We had cleverly isolated six major technical areas – architecture, microarchitecture, software, logic, circuits, and layout – and determined that successful candidates should be expert in at least one.
The interviewer scored the candidate’s knowledge of that area from 1 to 10, with 1 being clueless and 10 being world-class expert.
Few candidates scored at either extreme; middle scores between 3 and 7 were by far the most common. For calibration, we routinely reminded interviewers that 5 was supposed to be the mean score expected of a candidate.
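To make the calibration concrete, here is a minimal Python sketch of how such per-area scores might be tallied. The six area names come from the text above, but the expert threshold, the function name, and the sample scores are illustrative assumptions, not the team's actual process.

# Hypothetical tally of per-area interview scores (1 = clueless, 10 = world-class expert).
# The six areas are from the text; the threshold and the sample candidate are assumptions.
TECHNICAL_AREAS = [
    "architecture", "microarchitecture", "software",
    "logic", "circuits", "layout",
]
EXPERT_THRESHOLD = 8   # assumed cutoff for "expert in at least one area"

def summarize_candidate(scores):
    """Return the areas where the candidate rates as expert, and the mean score (5 was the expected mean)."""
    rated = {a: s for a, s in scores.items() if a in TECHNICAL_AREAS}
    mean = sum(rated.values()) / len(rated) if rated else None
    experts = [a for a, s in rated.items() if s >= EXPERT_THRESHOLD]
    return experts, mean

candidate = {"architecture": 4, "microarchitecture": 9, "software": 6,
             "logic": 5, "circuits": 3, "layout": 4}
print(summarize_candidate(candidate))   # -> (['microarchitecture'], 5.166...)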

* Hiring and Promotion List
The higher the pay grade, the longer the average period between promotions. Job effectiveness in the higher pay grade is a strong function of a person’s ability to influence peers, and it takes time to build the necessary interpersonal relationships.
We concluded that no single quantifiable metric could reliably predict career success. However, we also felt that fast-trackers have identifiable qualities and show definite trends.
One quality is a whatever-it-takes attitude. All high-output engineers have a willingness to do whatever it takes to make a project succeed, especially their particular corner of it. If that means working weekends occasionally, doing extracurricular research, or rewriting tools, they do it.
Another quality is a solid, unflappable understanding of all the technologies they are using or developing. No one person can know everything, but fast-trackers have the drive to familiarize themselves with all aspects of their work, even those that are not required for their immediate task, and this gives them the necessary credibility to discuss their design with project experts, which in turn drives them up the learning curve that much faster. Finally, these high-output types seemed to innately grasp that they are members of a large team, which means that certain behaviors are very efficient while others are counterproductive. They know, for example, that the classic elbow-your-neighbor scramble up the corporate ladder does not work in a large design team that is competently managed, so they avoid that tactic. Instead, they are the ones always helping everyone else, sometimes directly and sometimes by sharing tools or tricks they have learned or developed.
As employees go higher in the pay grades, this problem becomes less important, because it is an official tenet of Intel’s meritocracy that employees “own their own career”. They are required to operate at their pay grade; it is not enough to just do what the boss tells them to do.

POLICY WARS

* Corporate Disincentives
Chip development projects, or for that matter any product development I have ever been part of, always feel like a log flume ride at an amusement park. These rides begin with a long uphill climb, where things happen rather slowly and the fun is minimal. Then the log-boat floats around several curves, with lots of gratuitous splashing and generally nice views, but it does not feel like you are going very fast or getting anywhere. Then you see the final drop-off and time speeds up. The sense of inevitability mounts, and you have the distinct feeling that you are committed to getting to the end of the ride no matter what.
In the “pre-drop-off” frenzy, key engineers are spending every waking moment working, whether at the plant or from home. Unless you can get a team to this tapeout crunch, death-march phase, you have no hope of meeting your schedule. Nature will conspire to throw obstacle after obstacle into your project’s path, and the only way to prevail is to have every hand on deck, actively resolving issues as they arise. These crunch phases typically last six months, although I have seen them go on for as long as a year. Some would argue that you can do the entire project in crunch mode, but you surely risk burnout and then you would get the exact opposite of the efficiency you are seeking.

…every employee was required to be at work by 8 A.M…
I objected vociferously. Half our engineering team had still been on the premises at midnight the night before, yet they were expected to be back by 8 A.M.? That was obviously ridiculous, and I could easily predict their reactions. “Sure, boss, from now on I’ll be here at 8 A.M. And I’ll leave at 5 P.M. I have no problem cutting back on my hours!”…I said I would throw it away without reading it and that I trusted they were all adult and professional enough to know how to get their jobs done.
A corporate initiative that I particularly disliked was, “Do chips in half the time with half the people.” Talk about an unfunded mandate! The executives could just as easily, and with the same effectiveness, have promoted an initiative for each electron to carry twice the normal charge. As goals go, at least that one would generate interesting discussion, and in that context might even have useful outputs, but as a requirement from above, this kind of wishful thinking is very dangerous. Well-run design teams that mean what they say would be unable to commit to this target, but poorly managed teams might succumb to the ever-present temptation to tell management what they want to hear. For the next few years, that second-rate team would be the darling of the company, until the day they had to deliver the new design. At that point, everyone would realize that the team did indeed achieve a design with half the labor in half the time. They just forgot to create a viable product in the process.
An Intel initiative for several years now has been “Operational Excellence”, or OpX. The basic idea is to execute well – make and meet commitments, do not accept mediocrity, and strive for continuous improvements across the board. So what’s not to like about that? Plenty, but none of it is obvious, and therein lies the danger.
OpX emphasizes exactly the wrong thing. What makes companies like Intel successful is creating profitable products that compete well on the open market. The customer who plunks down hard-earned cash for an Intel machine does not ask, “How did you design this chip?” The product must stand on its own. In the final analysis, it does not matter how it was conceived and executed.
The insidious aspect of OpX is that when a team does create a world-class product, much of its development was indeed performed in ways congruent to OpX’s goals, but in spite of OpX, or independently of it, not because of it. A good design team will naturally make and meet commitments, not accept mediocrity, and strive for continuous improvements across the board in their pursuit of a world-class product. They do not need OpX to spur that thinking.
Conversely, teams that have not conceived a world-class product, or that are simply not up to that challenge, will not benefit from the distractions of constantly analyzing their execution when they ought to be thinking about how their product will fare in the open warfare of the commercial marketplace.

Teams that need OpX need much more than it provides, and teams that don’t need it will be hurt by it.

MANAGEMENT BY OBJECTIVE

From ten thousand feet up, the overall flow of a chip development project and the technology it is developing are reasonably clear. You form a team, acquire or develop the necessary tools, conceive a design, implement it, and validate the result.
But you cannot run a project from ten thousand feet. Projects have to be executed from the trenches, because only from there can you see the myriad details the team must resolve. For this reason, and to strike a reasonable compromise between overhead and benefit, Intel mandates the use of iMBO. Andy Grove described the genesis of this idea in High Output Management, in which he points to two key questions that any management planning effort must address: What is the right target, and how can I measure my progress toward that target?
iMBO’s basic idea is to list a set of objectives the team must accomplish over a quarter. These are things that, if left undone, might jeopardize the project’s overall schedule. Typically, a manager identifies four to eight objectives, some of them carried over because they appeared on the previous quarter’s list but are still not complete. For each objective, the manager also identifies a set of activities by which to judge its completion.
What I found worked best was to “seed” the next quarter’s tentative iMBO list with my own ideas, and then spend 30 minutes of staff time discussing them. Almost always, the team had valuable inputs on which objectives were the best ones for the next quarter, and how those objectives could be achieved and measured. This discussion was often the most valuable aspect of the whole iMBO method.
The other very valuable fallout of using iMBOs came during the quarterly review of how last quarter’s results should be judged. Each quarter, the team that had taken the objective assesses how well they have accomplished it, essentially “grading the quarterly sheet”. A graded quarterly key results sheet might look partially like this:
Graded Q4/92 Objectives/Key Results, P6 Architecture
Objective: Complete BRTL development, AMB:
1  1, Run 95% of all Real Mode tests on BRTL, except for BBL, EBL, DCU, and MOB.
0  2, Run BenchmarkA and BenchmarkB on full model.
1  3, Resolve all SRTL gating issues.
1  4, Resolve 15 simplification issues.
And so on, for typically 4-6 objectives and 4-6 examples under each.
AMB stands for “As Measured By”, and the list of activities corresponds to the judging criteria; the leading 1 or 0 on each line is the grade the team assigned at quarter’s end.
Each objective must be concrete, specific, and meaningful, and the team must be honest in judging its state of completeness at the quarter’s end.
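For readers who think in code, the graded sheet above maps naturally onto a small data structure. The following is a toy sketch only; the class and field names are mine, and the 0/1 grades simply mirror the sample sheet, so it should not be read as Intel's actual tooling.

from dataclasses import dataclass, field

@dataclass
class Objective:
    description: str
    key_results: list                                # the "As Measured By" activities
    grades: list = field(default_factory=list)       # 1 = accomplished, 0 = missed

    def completion(self):
        """Fraction of key results the team judged complete at quarter's end."""
        return sum(self.grades) / len(self.grades) if self.grades else 0.0

brtl = Objective(
    description="Complete BRTL development",
    key_results=[
        "Run 95% of all Real Mode tests on BRTL (except BBL, EBL, DCU, MOB)",
        "Run BenchmarkA and BenchmarkB on full model",
        "Resolve all SRTL gating issues",
        "Resolve 15 simplification issues",
    ],
    grades=[1, 0, 1, 1],
)
print(f"{brtl.description}: {brtl.completion():.0%} of key results met")   # 75%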
It probably seems as though having a team grade its own accomplishments might yield “perfect” results, quarter after quarter. That can sometimes be a problem; managing with iMBOs can be tricky because, although they are an extremely effective tool when used properly, they are also very easy to subvert. A sure way to destroy the iMBOs’ effectiveness, for example, is to tie compensation to the iMBO grading. A great deal of judgment is required to select the right objectives and the best metrics by which to measure them. Any unnatural pressure to score well on iMBOs, meaning any pressure other than having the project turn out as desired, would subvert the process by grade inflation. When every group gets a perfect score every quarter, the iMBOs are no longer filling their role as a planning procedure. And if a manager chooses to be honest even though all the other managers around her are rounding their numbers up, few employees will want to work for her, since it would, in effect, cost them compensation.
I was blessed at Intel with excellent managers, all of whom felt that a perfect score on a quarterly key results sheet might mean your group had an excellent quarter, but it might also mean you were not aggressive enough three months ago when you planned this quarter’s activities or that your ability to fairly score your group’s results was suspect.
Subversion aside, the iMBO process is a valuable tool. As Andy Grove mentions in his book, the act of identifying what you believe are the highest-priority tasks for your group in the next three months is also the act of ruling out certain other tasks. If they are worth doing, include them, and if they are not on the list, do not do them. Writing tasks down in this way has the salutary effect of forcing a team to be honest about what they think is really important, as opposed to what is simply interesting or fun.
The list of intentions also drives out miscommunication among cooperating design groups. If I receive your list of proposed iMBO objectives for next quarter, and I do not see the completion of some task I was relying on you to accomplish, I will assume the worst and go talk to you about it. It is better to find out now that we have our metaphorical wires crossed, rather than three months from now.

WE ARE SO RICH, WE MUST BE GOOD

Yes, that means you win, but it does not mean you are good. I gingerly pointed out some of our own execution errors, holes in our overall product road map, and places where we were not cooperating well between projects. My theme was that we were getting away with these errors, but should not count on continuing to do so in the future.

As Clayton Christensen reminds us, however, all technologies climb similar maturation curves and eventually reach a point where their basic performance becomes satisfactory. By “satisfactory”, he means that more performance is not valued enough to remain a viable differentiating sales factor.

BURNOUT

Burnout occurs when the employee just does not care about the product any more…For the P6 project…We were believers on a mission.
Burnout also follows if engineers lose faith in their management. Engineers will do what it takes to succeed, but nobody likes to feel exploited. They want to know that while they have their heads down, getting the technology right, their teammates in marketing are getting the sales message down and that management is making the right connections to get this new chip into successful volume production. They also need to believe that their own careers will trend in line with their contributions, sacrifices, and successful results.
Our engineers were not burned out. They were tired, and the cure for that was rest, upper management’s acknowledgement of their incredible work, and immersion in the flow of accolades that follow a successful product. The worried executives were overlooking a key factor: our engineers wanted to succeed. Feeling that all their work was not in vain was the essential balm for these tired souls.

Chapter 7 Inquiring Minds Like Yours

* What was Intel thinking with that chip ID tag, which caused such a public uproar?
What was Ford thinking when they designed the Edsel? Or Coca-Cola when they did the New Coke? Beforehand, they thought “This is going to be great!” and afterwards they thought “Whoops! Whose stupid idea was that?”

* What did the P6 team think about Intel’s IPF?
The second problem was, I believe, intrinsic to the charter required of the Intel IPF team. In essence, they were told that their mission was to jointly conceive the world’s greatest instruction set architecture with HP, and then realize that architecture in a chip called Merced by 1997, with performance second to no other processor, for any benchmark you like. The justification for this blanket performance requirement was that if a new architecture was not faster than its predecessors, then why bother with it? Besides, HP’s studies had indicated that the new architecture was so much faster than any other that even if some performance was lost to initial implementation naivete, Merced would still be so fast that it would easily establish the new IPF.
This plan did not go over well with the Oregon design team. At one point, I objected to the executive VP in charge that no company had ever achieved anything like what he was blithely insisting on for Merced. No matter how advanced an architecture, the implementation artifacts of an actual chip would offset its advantages until the design team learned the proper balances between features, design effort, compiler techniques, and so on. Moreover, there are always uncertainties in complex designs, and in new designs most of all. The one thing you do not do with uncertainties is to stack them all end to end and judge them all toward the hoped-for end of the range. With any one issue, you can make an argument that it is likely to turn out at the high end of the desirability range, but you must not do that with every issue simultaneously. Nature does not work that way but, in effect, that is what Merced was assuming.
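The arithmetic behind that warning is easy to make concrete. The probabilities in the sketch below are invented for illustration, but they show why judging every independent uncertainty at the hoped-for end of its range is a plan nature rarely honors: the optimistic outcomes have to happen jointly, so their probabilities multiply.

# Illustrative only: if each independent design risk is 80% likely to turn out
# at the hoped-for end of its range, the chance that all of them do shrinks fast.
def all_optimistic(p_single, n_risks):
    """Chance that n independent uncertainties all land at the hoped-for end."""
    return p_single ** n_risks

for n in (5, 10, 20):
    print(f"{n} risks at 80% each -> {all_optimistic(0.8, n):.1%} chance all go well")
# 5 risks  -> 32.8%
# 10 risks -> 10.7%
# 20 risks -> 1.2%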

“You cannot expect any design team in the world to get so many things right on their first try. And with the number of new ideas in IPF, what Intel should be doing is designing Merced as a research chip, not to be sold or advertised. Take the silicon back to the lab and experiment with it for 18 months. At the end of those 18 months, you’ll know which new ideas worked and which weren’t worth the cost of implementation. Then design a follow-on, keeping the good ideas and tossing out the rest, and that second chip and its follow-ons have a chance to be great. It’s worth investing a year or two at the beginning of a new instruction set architecture that you hope will last for 25 years.”

* How can I become the chief architect of a company such as Intel?
Part of any manager’s job is to counsel his or her employees in planning their careers. Some employees take career planning very seriously – they know where they want to be and how long they think is reasonable to get there. Most are more like I was. Give me interesting work, creating products that are meaningful, and I will do my best to contribute to them. And if I succeed, my career will take care of itself. At least, I hoped it would.

If there is one thing I know about the chief architect job, it is that I could not have done it successfully when I was in my twenties. Nearly all of the things I have ever done, including writing code, writing microcode, designing ECL hardware, designing TTL, designing custom CMOS, writing validation suites, debugging hardware in the lab, doing performance analysis, doing technical documentation and presentations, reading magazines and talking to people at conferences, as well as the voluminous nontechnical reading I do, informed the decisions I would make or help make, the directions in which I wanted to take the team or the product, and how I would go about leading the organization.
Experience matters, and it cannot be substituted for by intelligence, political acumen, or marrying the boss’s daughter.

Find something you are passionate about, and go after it with everything you have. Really apply yourself, holding nothing back, with the aim of achieving excellence, no matter the task, no matter how menial it feels or may seem to others.

No matter what task you have been assigned, take it upon yourself to learn the context for that task. Why was it assigned? Where does it fit in the bigger picture? Why were you asked for a certain result? Is there a better way to achieve what your supervisor was really after? Give her back more than she expected, every single time. Sometimes, merely stepping back from the details of the task is all it takes to see a much better course of action. Other times, you just have to slog through the task. Either way, knock it out of the park. When your management realizes you are reliable in the small things, they will start trusting you on bigger things and your career will jump to a higher energy state.

AND IN CLOSING I’D JUST LIKE TO SAY…
You can spend your entire life as a design engineer and never have the good fortune to find yourself on a team of incredible people like I did on the P6 project, but I hope you do. It is an experience that will elevate your own abilities, give you a feeling of intense acceleration, like a roller coaster heading over the steepest drop, and the incredible sensation of having your own intellect amplified beyond reason by proximity to so many brilliant people. Engineering is fun, but engineering in a project like P6 is sublime.
Years later, when you think back over the project and the great products it created, there will be a feeling of immense satisfaction combined with awe at the sheer talent and dedication that were there in such abundance. In all your design projects, do your part to arrive at such a happy circumstance: Hold every project you find yourself in to the highest standards. Expect your team to attain greatness and never settle for less. If a particular project does not turn out the way you wanted, remember that mistakes are the learning curves with the steepest slopes, and redouble your commitment to the next project. As Goethe said, “Whatever you can do, or dream you can, begin it. Boldness has genius, power, and magic in it.”


Comments

7 replies to “‘The Pentium Chronicles’ Excerpts”

  1. oioi on 2011-01-03 5:37 AM

    “The Race for a New Game Machine” (压力下的角逐) is out of stock almost everywhere. Does anyone have an electronic copy? Please post a link.

  2. fanyu83 on 2011-01-04 1:44 AM

    I did buy a copy. The translation is terrible, but since this isn’t my field anyway, I’m too lazy to go buy the English edition.

  3. 云中漫步 on 2011-01-04 6:44 PM

    Intel has two CPU architecture design teams, one in Oregon and one in Israel. Their architectures are used alternately in Intel’s product line, paired with the fabs’ process upgrades, carrying out Intel’s strategy of alternating architecture and process improvements.

    But it is generally agreed that the Israeli team is more gifted and designs more efficient architectures. The Pentium 4, designed by the fellow you describe, is architecturally pretty much garbage; the architectures from the Israeli team, by contrast, are highly efficient.

    Are there any articles by the architects on Intel’s Israel team? I would very much like to read them. Thank you.

  4. wr on 2011-01-05 8:34 AM

    Where can I download a PDF of The Pentium Chronicles?
    Skip the Chinese edition of 压力下的角逐; the translation is terrible.

  5. KISS on 2011-01-06 1:11 AM

    Only the first chapter of “The Pentium Chronicles” is available electronically; it was once published as a standalone article in IEEE Computer, though I forget which issue.

    The Chinese edition of 压力下的角逐 is still readable, heh, just don’t expect too much... the book itself is only so-so.

  6. James on 2011-01-19 7:19 PM

    I have The Pentium Chronicles in DjVu format. If anyone who needs it finds some hosting space, I can upload it.

    I am reading the book now. Unlike a purely technical book, its language is a bit challenging; best to keep a good dictionary at hand.

  7. leon on 2011-01-25 10:58 PM

    Could the oioi in the first comment be the same oioi from Jandan (煎蛋)?