Programming Massively Parallel Processors: A Hands-on Approach, Second Edition
David B. Kirk and Wen-mei W. Hwu



The course on which this book is based is organized in phases. The second phase is a series of 10 lectures that give students a conceptual understanding of the CUDA memory model, the CUDA threading model, GPU hardware performance features, modern computer system architecture, and the common data-parallel programming patterns needed to develop a high-performance parallel application.

These lectures are based on Chapters 4 through 7. The performance of their matrix multiplication codes increases by about 10 times through this period.

The students also complete assignments on convolution, vector reduction, and prefix scan during this period. In the third phase, once the students have established solid CUDA programming skills, the remaining lectures cover computational thinking, a broader range of parallel execution models, and parallel programming principles. These lectures are based on Chapter 8 and beyond, and voice and video recordings of the lectures are available online.

Tying It All Together: The Final Project

While the lectures, labs, and chapters of this book help lay the intellectual foundation for the students, what brings the learning experience together is the final project.

It incorporates five innovative aspects. First, students are encouraged to base their final projects on problems that represent current challenges in the research community. To seed the process, the instructors recruit several major computational science research groups to propose problems and serve as mentors. The mentors are asked to contribute a one-to-two-page project specification sheet that briefly describes the significance of the application, what the mentor would like to accomplish with the student teams on the application, the technical skills (particular types of Math, Physics, or Chemistry courses) required to understand and work on the application, and a list of web and traditional resources that students can draw upon for technical background, general information, and building blocks, along with specific URLs or ftp paths to particular implementations and coding examples.

These project specification sheets also provide students with learning experiences in defining their own research projects later in their careers. Students are also encouraged to contact their potential mentors during their project selection process. Once the students and the mentors agree on a project, they enter into a close relationship, featuring frequent consultation and project reporting.

We, the instructors, attempt to facilitate the collaborative relationship between students and their mentors, making it a very valuable experience for both mentors and students. We usually dedicate six of the lecture slots to project workshops. For example, if a student has identified a project, the workshop serves as a venue to present preliminary thinking, get feedback, and recruit teammates.

Students are not graded during the workshops, in order to keep the atmosphere nonthreatening and enable them to focus on a meaningful dialog with the instructor(s), teaching assistants, and the rest of the class. The workshop schedule is designed so the instructor(s) and teaching assistants can take some time to provide feedback to the project teams and so that students can ask questions.

Presentations are limited to 10 minutes so there is time for feedback and questions during the class period. This limits the class size to about 36 presenters, given the length of the lecture slots. All presentations are preloaded into a PC in order to control the schedule strictly and maximize feedback time. Since not all students present at the workshop, we have been able to accommodate up to 50 students in each class, with extra workshop time available as needed.

The instructor(s) and TAs must make a commitment to attend all the presentations and to give useful feedback. Students typically need the most help in answering the following questions. First, are the projects too big or too small for the amount of time available?

Second, is there existing work in the field that the project can benefit from? Third, are the computations being targeted for parallel execution appropriate for the CUDA programming model?

The Design Document

Once the students decide on a project and form a team, they are required to submit a design document for the project. This helps them think through the project steps before they jump into it. The ability to do such planning will be important to their later career success. The design document should discuss the background and motivation for the project, application-level objectives and potential impact, main features of the end application, an overview of their design, an implementation plan, their performance goals, a verification plan and acceptance test, and a project schedule.

The teaching assistants hold a project clinic for final project teams during the week before the class symposium. This clinic helps ensure that students are on track and that they have identified the potential roadblocks early in the process. Student teams are asked to come to the clinic with an initial draft of the following three versions of their application: (1) the best CPU sequential code in terms of performance, with SSE2 and other optimizations that establish a strong serial base of the code for their speedup comparisons; (2) the best CUDA parallel code in terms of performance, which is the main output of the project; and (3) a version of the CPU sequential code that is based on the same algorithm as version 2, using single precision.

This version is used by the students to characterize the parallel algorithm overhead in terms of extra computations involved. Student teams are asked to be prepared to discuss the key ideas used in each version of the code, any floating-point precision issues, any comparison against previous results on the application, and the potential impact on the field if they achieve tremendous speedup.

From our experience, the optimal schedule for the clinic is 1 week before the class symposium. An earlier time typically results in less mature projects and less meaningful sessions. A later time will not give students sufficient time to revise their projects according to the feedback. Six lecture slots are combined into a whole-day class symposium. During the symposium, students use presentation slots proportional to the size of the teams.

During the presentation, the students highlight the best parts of their project report for the benefit of the whole class. The symposium is a major opportunity for students to learn to produce a concise presentation that motivates their peers to read a full paper.

After their presentation, the students also submit a full report on their final project. While this book provides the intellectual contents for these classes, the additional material will be crucial in achieving the overall education goals. Finally, we encourage you to submit your feedback. We would like to hear from you if you have any ideas for improving this book and the supplementary online material.

Of course, we would also like to know what you liked about the book.

David B. Kirk and Wen-mei W. Hwu

Acknowledgments

Their teams created an excellent infrastructure for this course. Calisa Cole helped with the cover. We also thank Jensen Huang for providing a great amount of financial and human resources for developing the course. Jensen also took the time to read the early drafts of the chapters and gave us valuable feedback.

David Luebke has facilitated the GPU computing resources for the course. Jonah Alben has provided valuable insight. Michael Shebanow and Michael Garland have given guest lectures and contributed materials. John Stratton and Chris Rodrigues contributed some of the base material for the computational thinking chapter.

Laurie Talkington and James Hutchinson helped to dictate early lectures that served as the base for the first five chapters. Mike Showerman helped build two generations of GPU computing clusters for the course. Jeremy Enos worked tirelessly to ensure that students have a stable, user-friendly GPU computing cluster to work on their lab assignments and projects.

We acknowledge Dick Blahut who challenged us to create the course in Illinois. His constant reminder that we needed to write the book helped keep us going.


Through that gathering, Blahut was introduced to David and challenged him to come to Illinois and create the course with Wen-mei. We also thank Thom Dunning of the University of Illinois and Sharon Glotzer of the University of Michigan, Co-Directors of the multi-university Virtual School of Computational Science and Engineering, for graciously hosting the summer school version of the course.

Nicolas Pinto tested the early versions of the first chapters in his MIT class and assembled an excellent set of feedback comments and corrections. Steve Lumetta and Sanjay Patel both taught versions of the course and gave us valuable feedback. John Owens graciously allowed us to use some of his slides. Michael Giles reviewed the semi-final draft chapters in detail and identified many typos and inconsistencies.

We are humbled by the generosity and enthusiasm of all the great people who contributed to the course and the book.

Chapter 1: Introduction

For more than two decades, microprocessors based on a single central processing unit (CPU) drove rapid performance increases in computer applications. This relentless drive for performance improvement has allowed application software to provide more functionality, have better user interfaces, and generate more useful results. The users, in turn, demand even more improvements once they become accustomed to these improvements, creating a positive cycle for the computer industry. During the drive, most software developers have relied on the advances in hardware to increase the speed of their applications under the hood; the same software simply runs faster as each new generation of processors is introduced.

This drive, however, has since slowed due to energy-consumption and heat-dissipation issues that have limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU. Virtually all microprocessor vendors have switched to models where multiple processing units, referred to as processor cores, are used in each chip to increase the processing power.

This switch has exerted a tremendous impact on the software developer community [Sutter]. Traditionally, the vast majority of software applications are written as sequential programs, as described in von Neumann's seminal report. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, computer users have become accustomed to the expectation that these programs run faster with each new generation of microprocessors.

Such an expectation is no longer strictly valid from this day onward. A sequential program will only run on one of the processor cores, which will not become significantly faster than those in use today.

Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, thus reducing the growth opportunities of the entire computer industry. Rather, the applications software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster.

This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers.

Now that all new microprocessors are parallel computers, the number of applications that must be developed as parallel programs has increased dramatically.

There is now a great need for software developers to learn about parallel programming, which is the focus of this book. Microprocessor designs have since followed two main trajectories. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores.

The multicores began as two-core processors, with the number of cores approximately doubling with each semiconductor process generation. In contrast, the many-core trajectory focuses more on the execution throughput of parallel applications.

The many-cores began as a large number of much smaller cores, and, once again, the number of cores doubles with each generation.

Many-core processors, especially GPUs, have led the race in floating-point performance for several years: while the performance improvement of general-purpose microprocessors has slowed significantly, GPUs have continued to improve relentlessly.

As of this writing, the ratio between many-core GPUs and multicore CPUs in peak floating-point calculation throughput is about 10 to 1. These are not necessarily achievable application speeds; they are merely the raw speeds that the execution resources in these chips can potentially support. Nevertheless, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution.

Not surprisingly, these computationally intensive parts are also the prime target of parallel programming—when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.

Why is there such a large peak-performance gap between many-core GPUs and general-purpose multicore CPUs? The answer lies in the differences in the fundamental design philosophies of the two types of processors. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution.

More importantly, large cache memories are provided to reduce the instruction and data access latencies of large, complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of this writing, the new general-purpose, multicore microprocessors typically have four large processor cores designed to deliver strong sequential code performance. Memory bandwidth is another important issue: graphics chips have been operating at approximately 10 times the bandwidth of contemporaneously available CPU chips.

In contrast, with simpler memory models and fewer legacy constraints, the GPU designers can more easily achieve higher memory bandwidth. The design philosophy of the GPUs is shaped by the fast growing video game industry, which exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games.

This demand motivates the GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution to date is to optimize for the execution throughput of massive numbers of threads. The hardware takes advantage of a large number of execution threads to find work to do when some of them are waiting for long-latency memory accesses, thus minimizing the control logic required for each execution thread.

Small cache memories are provided to help control the bandwidth requirements of these applications so multiple threads that access the same memory data do not need to all go to the DRAM.

As a result, much more chip area is dedicated to the floating-point calculations.
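This throughput-oriented design is something the programmer exploits directly: by launching far more threads than there are execution units, the hardware always has ready threads to switch to while others wait on DRAM. The following is a minimal sketch of that idea in CUDA C; the kernel name, array size, and launch configuration are our own illustrative assumptions, not values from the book.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread strides over the array, so the grid can contain far more
// threads than execution units. While some threads wait on global-memory
// loads, the hardware scheduler runs others, hiding the DRAM latency.
__global__ void scale(float *data, float alpha, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] = alpha * data[i];   // one load and one store per element
    }
}

int main() {
    const int n = 1 << 24;                    // 16M elements (arbitrary size)
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Deliberately oversubscribe the SMs: 256 blocks of 256 threads gives the
    // scheduler tens of thousands of threads to choose from at any moment.
    scale<<<256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}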


It should be clear now that GPUs are designed as numeric computing engines, and they will not perform well on some tasks on which CPUs are designed to perform well; therefore, one should expect that most applications will use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPUs. Raw performance, however, is not the only consideration when developers choose a processor; several other factors can be even more important. First and foremost, the processors of choice must have a very large presence in the marketplace, referred to as the installation base of the processor.

The reason is very simple. The cost of software development is best justified by a very large customer population. Applications that run on a processor with a small market presence will not have a large customer base.

This has been a major problem with traditional parallel computing systems that have negligible market presence compared to general-purpose microprocessors. Only a few elite applications funded by government and large corporations have been successfully developed on these traditional parallel computing systems.

This has changed with the advent of many-core GPUs. The G80 processors and their successors have shipped many millions of units to date.

This is the first time that massively parallel computing has been feasible with a mass-market product. Such a large market presence has made these GPUs economically attractive for application developers.

Other important decision factors are practical form factors and easy accessibility. Until recently, parallel software applications usually ran on data-center servers or departmental clusters, but such execution environments tend to limit the use of these applications. For example, in an application area such as medical imaging, it is fine to publish a paper based on a large cluster machine, but actual clinical applications on magnetic resonance imaging (MRI) machines are all based on some combination of a PC and special hardware accelerators.

The simple reason is that manufacturers such as GE and Siemens cannot sell MRI machines that require racks of clusters in clinical settings, although such machines are common in academic departmental settings. In fact, the National Institutes of Health (NIH) refused to fund parallel programming projects for some time; they felt that the impact of parallel software would be limited because huge cluster-based machines would not work in the clinical setting.

Yet another important consideration in selecting a processor for executing numeric computing applications is support for the Institute of Electrical and Electronics Engineers (IEEE) floating-point standard. The standard makes it possible to have predictable results across processors from different vendors.

Support for the IEEE floating-point standard in GPUs has improved markedly in recent generations. As a result, one can expect that more numerical applications will be ported to GPUs and yield results comparable to those computed on CPUs. Today, a major remaining issue is that the floating-point arithmetic units of the GPUs are primarily single precision. Applications that truly require double-precision floating point were not suitable for GPU execution; however, this has changed with the recent GPUs, whose double-precision execution speed approaches about half that of single precision, a level that high-end CPU cores achieve.

This makes GPUs suitable for even more numerical applications. Before CUDA, however, harnessing a GPU for general computation required expressing the computation through graphics APIs; this technique was called GPGPU, short for general-purpose programming using a graphics processing unit. Even with a higher-level programming environment, the underlying code was still limited by the graphics APIs, which restricted the kinds of applications that one could actually write for these chips.

Nonetheless, this technology was sufficiently exciting to inspire some heroic efforts and excellent results. Everything changed with the introduction of CUDA: NVIDIA devoted silicon area to facilitate the ease of parallel programming, so this did not represent a change in software alone; additional hardware was added to the chip.

In the G80 and its successor chips for parallel computing, CUDA programs no longer go through the graphics interface at all. Instead, a new general-purpose parallel programming interface on the silicon chip serves the requests of CUDA programs. Some of our students tried to do their lab assignments using the old OpenGL-based programming interface, and their experience helped them to greatly appreciate the improvements that eliminated the need for using the graphics APIs for computing applications.

A CUDA-capable GPU is organized into an array of highly threaded streaming multiprocessors (SMs), each containing a number of streaming processors (SPs) that share control logic and an instruction cache. The GPU also comes with its own graphics DRAM. For graphics applications, these memories hold video images and texture information for three-dimensional (3D) rendering, but for computing they function as very-high-bandwidth, off-chip memory, though with somewhat more latency than typical system memory. For massively parallel applications, the higher bandwidth makes up for the longer latency.

The communication bandwidth between CPU and GPU is also expected to grow as the CPU bus bandwidth of the system memory grows in the future. In addition, special-function units perform floating-point functions such as square root (SQRT), as well as transcendental functions. With its hundreds of SPs, the GT200 exceeds 1 teraflops. Because each SP is massively threaded, it can run thousands of threads per application. A good application typically runs several thousand to roughly 12,000 threads simultaneously on this chip.
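Because these capacities differ from chip to chip, it is worth querying them at run time. The short sketch below uses the standard cudaGetDeviceProperties call from the CUDA runtime API; the program itself is our illustration rather than an example from the book.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Report how much thread-level parallelism this particular GPU supports.
    printf("Device: %s\n", prop.name);
    printf("Streaming multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:              %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:           %d\n", prop.maxThreadsPerBlock);
    printf("Threads the whole chip can hold: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}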

For those who are used to simultaneous multithreading, note that Intel CPUs support 2 or 4 threads per core, depending on the machine model. The G80 chip supports up to several hundred threads per SM, which sums to roughly 12,000 threads for the whole chip. Thus, the level of parallelism supported by GPU hardware is increasing quickly, and it is very important to strive for such levels of parallelism when developing GPU parallel computing applications. As stated earlier, the main motivation for parallel programming is to let applications continue to enjoy speed increases with each new hardware generation. One might ask, though, why applications will continue to demand increased speed.

Many applications that we have today seem to be running quite fast enough. Many others, however, would benefit greatly from more speed, and we invite you to keep reading for examples. The biology research community, for instance, is moving more and more into the molecular level. Microscopes, arguably the most important instrument in molecular biology, used to rely on optics or electronic instrumentation, but there are limitations to the molecular-level observations that we can make with these instruments.

These limitations can be effectively addressed by incorporating a computational model to simulate the underlying molecular activities with boundary conditions set by traditional instrumentation. From the simulation we can measure even more details and test more hypotheses than can ever be imagined with traditional instrumentation alone. These simulations will continue to benefit from the increasing computing speed in the foreseeable future in terms of the size of the biological system that can be modeled and the length of reaction time that can be simulated within a tolerable response time.

These enhancements will have tremendous implications for science and medicine. Similar arguments apply to video and imaging applications such as high-definition television (HDTV). Once we experience the level of detail offered by HDTV, it is very hard to go back to older technology. But consider all the processing that is necessary for that HDTV; it is a very parallel process, as is 3D imaging. In the future, new functionalities such as view synthesis and high-resolution display of low-resolution videos will demand that televisions have more computing power.

Among the benefits offered by greater computing speed are much better user interfaces on consumer devices. Undoubtedly, future versions of these devices will incorporate higher-definition displays, three-dimensional perspectives, and voice and computer-vision based interfaces, requiring even more computing speed. Similar developments are under way in consumer electronic gaming. Imagine driving a car in a game today; the game is, in fact, simply a prearranged set of scenes.

If your car bumps into an obstacle, the course of your vehicle does not change; only the game score changes. Your wheels are not bent or damaged, and it is no more difficult to drive, regardless of whether you bumped your wheels or even lost a wheel.

With increased computing speed, the games can be based on dynamic simulation rather than prearranged scenes. We can expect to see more of these realistic effects in the future—accidents will damage your wheels, and your online driving experience will be much more realistic.

Realistic modeling and simulation of physics effects are known to demand large amounts of computing power. All of the new applications that we mentioned involve simulating a concurrent world in different ways and at different levels, with tremendous amounts of data being processed.

And, with this huge quantity of data, much of the computation can be done on different parts of the data in parallel, although they will have to be reconciled at some point. Techniques for doing so are well known to those who work with such applications on a regular basis. Thus, various granularities of parallelism do exist, but the programming model must not hinder parallel implementation, and the data delivery must be properly managed.
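As a concrete sketch of this pattern of independent work followed by reconciliation, the hedged example below has each thread block compute a partial sum of its slice of the data, after which the host combines the partial results; the names and sizes are our own illustrative choices, not an example from the book.

#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces its own slice of the input to a single partial sum.
__global__ void partialSums(const float *in, float *partial, int n) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in the block
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *h_in = new float[n], *h_partial = new float[blocks];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    partialSums<<<blocks, threads>>>(d_in, d_partial, n);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    double total = 0.0;                       // reconciliation step on the host
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    printf("sum = %.1f (expected %d)\n", total, n);

    cudaFree(d_in); cudaFree(d_partial);
    delete[] h_in; delete[] h_partial;
    return 0;
}

The same independent-work-then-reconcile structure recurs, at much larger scales, in the simulation and imaging applications described above.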

CUDA includes such a programming model along with hardware support that facilitates parallel implementation. We aim to teach application developers the fundamental techniques for managing parallel execution and delivering data. How much speedup can be expected from parallelizing these superapplications?

It depends on the portion of the application that can be parallelized: if the parallelizable portion accounts for only a modest fraction of the execution time, even a very large speedup of that portion yields a small speedup for the application as a whole. Memory bandwidth is another limitation; the trick is to figure out how to get around memory bandwidth limitations, which involves applying one of many transformations to utilize specialized GPU on-chip memories and drastically reduce the number of accesses to the DRAM. One must, however, further optimize the code to get around limitations such as limited on-chip memory capacity. An important goal of this book is to help you fully understand these optimizations and become skilled in them.
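The whole-application limit mentioned at the start of the previous paragraph is Amdahl's law. In our notation (not the book's), if a fraction p of the original execution time is parallelizable and that portion is sped up by a factor s, the overall speedup is

S = 1 / ((1 - p) + p / s)

For example, p = 0.3 and s = 100 give S = 1 / (0.7 + 0.003), which is only about 1.4, so even a hundredfold speedup of the parallel portion leaves the application less than twice as fast overall.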

Keep in mind that the level of speedup achieved over CPU execution can also reflect the suitability of the CPU to the application. Most applications have portions that can be much better executed by the CPU, so one should combine CPU and GPU execution and let each processor do what it does best; this is precisely what the CUDA programming model promotes, as we will further explain in the book. A useful picture is to view a typical application as a peach: much of the code of a real application tends to be sequential.

These portions are considered to be the pit area of the peach; trying to apply parallel computing techniques to these portions is like biting into the peach pit—not a good feeling!

These portions are very difficult to parallelize. CPUs tend to do a very good job on these portions. The good news is that these portions, although they can take up a large portion of the code, tend to account for only a small portion of the execution time of superapplications.

Then come the meat portions of the peach. These portions are easy to parallelize, as are some early graphics applications. The cost and size benefits of GPUs can drastically improve the quality of these applications.

As we will see, the CUDA programming model is designed to cover a much larger section of the peach meat portions of exciting applications. Parallel programming has also long been practiced with models such as the Message Passing Interface (MPI) for clusters and OpenMP for shared-memory systems. MPI is a model where computing nodes in a cluster do not share memory [MPI]; all data sharing and interaction must be done through explicit message passing. MPI has been successful in the high-performance scientific computing domain.

Applications written in MPI have been known to run successfully on cluster computing systems with many thousands of nodes. OpenMP supports shared memory, so it offers the same advantage as CUDA in programming effort; however, it has not been able to scale beyond a couple hundred computing nodes due to thread management overheads and cache coherence hardware requirements.

CUDA achieves much higher scalability with simple, low-overhead thread management and no cache coherence hardware requirements. On the other hand, many superapplications fit well into the simple thread management model of CUDA and thus enjoy the scalability and performance. Several ongoing research efforts aim at adding more automation of parallelism management and performance optimization to the CUDA tool chain.
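To make the contrast concrete, here is a hedged side-by-side sketch of the same element-wise update written with an OpenMP pragma over shared CPU memory and as a CUDA kernel with explicitly managed threads; this is our own illustration, not code from the book.

// OpenMP: the runtime forks a modest number of CPU threads that all share
// the same address space, so no explicit data transfer is needed.
#include <omp.h>
void saxpy_omp(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// CUDA: the programmer launches thousands of lightweight threads and places
// the data in device memory beforehand (allocation and copies not shown).
__global__ void saxpy_cuda(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
// A possible launch: saxpy_cuda<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);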

OpenCL, a standardized parallel programming model developed under the Khronos Group, is closely related. Similar to CUDA, the OpenCL programming model defines language extensions and runtime APIs to allow programmers to manage parallelism and data delivery in massively parallel processors, and many of the performance optimization techniques are common between the two models. The reader might ask why the book is not based on OpenCL. The main reason is that OpenCL was still in its infancy when this book was written. Because programming massively parallel processors is motivated by speed, we expect that most who program massively parallel processors will continue to use CUDA for the foreseeable future.

We will give a more detailed analysis of these similarities later in the book. Our first goal is to teach you how to program massively parallel processors to achieve high performance, and CUDA makes getting started remarkably easy: you can literally write a parallel program in an hour. In particular, we will focus on computational thinking techniques that will enable you to think about problems in ways that are amenable to high-performance parallel computing.
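As a taste of how little code such a first program needs, here is a minimal, hedged vector-addition example in CUDA C (our own illustration; the names and sizes are arbitrary). The host allocates device memory, copies the inputs over, launches one thread per element, and copies the result back.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = float(i); h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;                   // device (GPU) copies of the data
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);   // one thread per element

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %.1f\n", h_c[10]);        // expect 30.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}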

Note that hardware architecture imposes constraints: high-performance parallel programming on most chips requires some knowledge of how the hardware actually works. It will probably take 10 more years before we can build tools and machines so that most programmers can work without this knowledge. We will not be teaching computer architecture as a separate topic; instead, we will teach the essential computer architecture knowledge as part of our discussions on high-performance parallel programming techniques.

Our second goal is to teach parallel programming for correct functionality and reliability, which constitute a subtle issue in parallel computing. Those who have worked on parallel systems in the past know that achieving initial performance is not enough. The challenge is to achieve it in such a way that you can debug the code and support the users. We will show that with the CUDA programming model that focuses on data parallelism, one can achieve both high performance and high reliability in their applications.

We want to help you to master parallel programming so your programs can scale up to the level of performance of new generations of machines. Much technical knowledge will be required to achieve these goals, so we will cover quite a few principles and patterns of parallel programming in this book. We cannot guarantee that we will cover all of them, however, so we have selected several of the most useful and well-proven techniques to cover in detail.

To complement your knowledge and expertise, we include a list of recommended literature. We are now ready to give you a quick overview of the rest of the book. It begins with a brief summary of the evolution of graphics hardware toward greater programmability and then discusses the historical GPGPU movement. A good understanding of these historic developments will help the reader to better understand the current state and the future trends of hardware evolution that will continue to impact the types of applications that will benefit from CUDA.

Chapter 3 introduces CUDA programming. This chapter assumes that readers have had previous experience with C programming. It then covers the thought processes involved in developing a simple CUDA program. Although the objective of Chapter 3 is to teach enough concepts of the CUDA programming model so the readers can write a simple parallel CUDA program, it actually covers several basic skills needed to develop a parallel application based on any parallel programming model.

We use a running example of matrix-matrix multiplication to make this chapter concrete. Chapter 4 covers the thread organization and execution model required to fully understand the execution behavior of threads and basic performance concepts.
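For readers who want a picture of that running example before reaching Chapter 3, the sketch below shows the kind of simple matrix-matrix multiplication kernel such an example typically starts from; it is our own hedged illustration (square n-by-n matrices in row-major order), not the book's code.

// Naive kernel: each thread computes one element P[row][col] of P = M * N,
// reading all operands directly from global memory. Later chapters of such a
// course typically improve on this with shared-memory tiling.
__global__ void matrixMul(const float *M, const float *N, float *P, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += M[row * n + k] * N[k * n + col];
        P[row * n + col] = sum;
    }
}

// A possible launch with 16x16 threads per block:
//   dim3 block(16, 16);
//   dim3 grid((n + 15) / 16, (n + 15) / 16);
//   matrixMul<<<grid, block>>>(d_M, d_N, d_P, n);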

Chapter 5 is dedicated to the special memories that can be used to hold CUDA variables for improved program execution speed. Chapter 6 introduces the major factors that contribute to the performance of a CUDA kernel function. Chapter 7 introduces the floating-point representation and concepts such as precision and accuracy. Although these chapters are based on CUDA, they help the readers build a foundation for parallel programming in general.

We believe that humans understand best when we learn from the bottom up; that is, we must first learn the concepts in the context of a particular programming model, which provides us with a solid footing to generalize our knowledge to other programming models.

As we do so, we can draw on our concrete experience from the CUDA model.


An in-depth experience with the CUDA model also enables us to gain maturity, which will help us learn concepts that may not even be pertinent to the CUDA model. Chapters 8 and 9 are case studies of two real applications, which take the readers through the thought processes of parallelizing and optimizing their applications for significant speedups. For each application, we begin by identifying alternative ways of formulating the basic structure of the parallel execution and follow up with reasoning about the advantages and disadvantages of each alternative.

We then go through the steps of code transformation necessary to achieve high performance. These two chapters help the readers put all the materials from the previous chapters together and prepare for their own application development projects.

Chapter 10 generalizes the parallel programming techniques into problem decomposition principles, algorithm strategies, and computational thinking. It does so by covering the concept of organizing the computation tasks of a program so they can be done in parallel.

We begin by discussing the translational process of organizing abstract scientific concepts into computational tasks, an important first step in producing quality application software, serial or parallel. The chapter then addresses parallel algorithm structures and their effects on application performance, which is grounded in the performance tuning experience with CUDA.

The chapter concludes with a treatment of parallel programming styles and models, allowing the readers to place their knowledge in a wider context. Although we do not go into these alternative parallel programming styles, we expect that the readers will be able to learn to program in any of them with the foundation gained in this book. Chapter 12 offers some concluding remarks and an outlook for the future of massively parallel programming. We revisit our goals and summarize how the chapters fit together to help achieve the goals.

We then present a brief survey of the major trends in the architecture of massively parallel processors and how these trends will likely impact parallel programming in the future.

We conclude with a prediction that these fast advances in massively parallel computing will make it one of the most exciting areas in the coming decade.

References and Further Reading

Hwu, W. The concurrency challenge.
Khronos Group. The OpenCL specification. Beaverton, OR.
Mattson, T. Patterns of parallel programming. Upper Saddle River, NJ.
Message Passing Interface Forum. University of Tennessee.
NVIDIA. CUDA programming guide. Santa Clara, CA: NVIDIA.
Sutter, H. Software and the concurrency revolution. ACM Queue, 3(7).
von Neumann, J. (1945). First draft of a report on the EDVAC. University of Pennsylvania.
Goldstine, H. H. The computer: From Pascal to von Neumann. Princeton, NJ: Princeton University Press.
Wing, J. Computational thinking. Communications of the ACM, 49(3).

Chapter 2: History of GPU Computing

One need not understand graphics algorithms or terminology in order to be able to program these processors.

However, understanding the graphics heritage of these processors illuminates their strengths and weaknesses with respect to major computational patterns. In particular, the history helps to clarify the rationale behind major architectural design decisions of modern programmable GPUs: massive multithreading, relatively small caches compared to those of CPUs, and a bandwidth-centric memory interface design. Insights into the historical developments will also likely give the reader the context needed to project the future evolution of GPUs as computing devices.

Over the years, graphics performance increased from 50 million pixels per second to 1 billion pixels per second, and vertex processing rates grew to 10 million vertices per second.

Although these advancements have much to do with the relentlessly shrinking feature sizes of semiconductor devices, they also have resulted from innovations in graphics algorithms and hardware design that have shaped the native hardware capabilities of modern GPUs.

The remarkable advancement of graphics hardware performance has been driven by the market demand for high-quality, real-time graphics in computer applications.

In an electronic gaming application, for example, one needs to render ever more complex scenes at an ever-increasing resolution at a rate of 60 frames per second. The net result is that over the last 30 years graphics architecture has evolved from being a simple pipeline for drawing wire-frame diagrams to a highly parallel design consisting of several deep parallel pipelines capable of rendering the complex interactive imagery of 3D scenes.

Concurrently, many of the hardware functionalities involved became far more sophisticated and user programmable. In that same era, major graphics application programming interface (API) libraries became popular. An API is a standardized layer of software (i.e., a collection of library functions) that allows applications, such as games, to use software or hardware services and functionality. An API, for example, can allow a game to send commands to a graphics processing unit to draw objects on a display.

An early GPU implements a fixed-function graphics pipeline consisting of several stages. The host interface receives graphics commands and data from the CPU. The commands are typically given by application programs by calling an API function. The host interface typically contains specialized direct memory access (DMA) hardware to efficiently transfer bulk data to and from the host system memory. The host interface also communicates back the status and result data of executing the commands. Before we describe the other stages of the pipeline, we should clarify that the term vertex usually refers to the corner of a polygon.

The GeForce graphics pipeline is designed to render triangles, so the term vertex is typically used in this case to refer to the corners of a triangle.

The surface of an object is drawn as a collection of triangles. The finer the triangles, the better the picture quality typically becomes. The vertex control stage receives parameterized triangle data from the CPU, converts the triangle data into a form that the hardware understands, and places the prepared data into the vertex cache.

The shading is done by the pixel shader hardware. The vertex shader can assign a color to each vertex, but color is not applied to triangle pixels until later. The triangle setup stage further creates edge equations that are used to interpolate colors and other per-vertex data such as texture coordinates across the pixels touched by the triangle.

The raster stage determines which pixels are contained in each triangle. For each of these pixels, the raster stage interpolates the per-vertex values necessary for shading the pixel, including the color, position, and texture position that will be shaded (painted) onto the pixel. The shader stage determines the final color of each pixel, which can be generated as a combined effect of many techniques, such as interpolation of vertex colors, texture mapping, and per-pixel lighting. Many effects that make the rendered images more realistic are incorporated in the shader stage.

A classic example is texture mapping, in which an image such as a world map is wrapped onto a sphere object. Note that the sphere object is described as a large collection of triangles. The raster operation (ROP) stage performs the final raster operations on the pixels, blending the colors of overlapping or adjacent objects for transparency and antialiasing effects.

It also determines the visible objects for a given viewpoint and discards the occluded pixels. A pixel becomes occluded when it is blocked by pixels from other objects according to the given view point.

One of these ROP operations is antialiasing. Consider, for example, three adjacent triangles rendered against a black background. In the aliased output, each pixel assumes the color of one of the objects or the background. The limited resolution makes the edges look crooked and the shapes of the objects distorted.

The problem is that many pixels are partly in one object and partly in another object or the background. Forcing these pixels to assume the color of one of the objects introduces distortion into the edges of the objects.

The antialiasing operation gives each pixel a color that is blended, or linearly combined, from the colors of all the objects and background that partially overlap the pixel. The contribution of each object to the color of the pixel is the amount of the pixel that the object overlaps. Finally, the frame buffer interface (FBI) stage manages memory reads from and writes to the display frame buffer memory; high-resolution displays place a very high bandwidth demand on these accesses. Such bandwidth is achieved by two strategies. One is that graphics pipelines typically use special memory designs that provide higher bandwidth than the system memories.

Second, the FBI simultaneously manages multiple memory channels that connect to multiple memory banks. The combined bandwidth improvement of multiple channels and special memory structures gives the frame buffers much higher bandwidth than their contemporaneous system memories.

Such high memory bandwidth has continued to this day and has become a distinguishing feature of modern GPU design. For two decades, each generation of hardware and its corresponding generation of API brought incremental improvements to the various stages of the graphics pipeline.

Although each generation introduced additional hardware resources and configurability to the pipeline stages, developers were growing more sophisticated and asking for more new features than could be reasonably offered as built-in fixed functions.

The obvious next step was to make some of these graphics pipeline stages into programmable processors. Later GPUs, at the time of DirectX 9, extended general programmability and floating-point capability to the pixel shader stage and made texture accessible from the vertex shader stage.

The GeForce FX added floating-point pixel processors. These programmable pixel shader processors were part of a general trend toward unifying the functionality of the different stages as seen by the application programmer. Subsequent GeForce series were built with separate processor designs dedicated to vertex and pixel processing.

In graphics pipelines, certain stages do a great deal of floating-point arithmetic on completely independent data, such as transforming the positions of triangle vertices or generating pixel colors. This data independence, as the dominating application characteristic, is a key difference between the design assumptions of GPUs and those of CPUs. The opportunity to use hardware parallelism to exploit this data independence is tremendous. The specific functions executed at a few graphics pipeline stages vary with rendering algorithms.

Such variation has motivated the hardware designers to make those pipeline stages programmable. Two particular programmable stages stand out: the vertex shader and the pixel shader. Vertex shader programs map the positions of triangle vertices onto the screen, altering their position, color, or orientation.

Typically, a vertex shader thread reads a floating-point (x, y, z, w) vertex position and computes a floating-point (x, y, z) screen position. Geometry shader programs operate on primitives defined by multiple vertices, changing them or generating additional primitives. A pixel shader program calculates the floating-point red, green, blue, alpha (RGBA) color contribution to the rendered image at its pixel sample (x, y) image position.

These programs execute on the shader stage of the graphics pipeline. For all three types of graphics shader programs, program instances can be run in parallel, because each works on independent data, produces independent results, and has no side effects. This property has motivated the design of the programmable pipeline stages into massively parallel processors. The programmable vertex processor executes the programs designated to the vertex shader stage, and the programmable fragment processor executes the programs designated to the pixel shader stage.

Between these programmable graphics pipeline stages are dozens of fixed-function stages that perform well-defined tasks far more efficiently than a programmable processor could and which would benefit far less from programmability. Together, the mix of programmable and fixed-function stages is engineered to balance extreme performance with user control over the rendering algorithms. Common rendering algorithms perform a single pass over input primitives and access other memory resources in a highly coherent manner.

That is, these algorithms tend to simultaneously access contiguous memory locations, such as all triangles or all pixels in a neighborhood.

Combined with a pixel shader workload that is usually compute limited, these characteristics have guided GPUs along a different evolutionary path than CPUs. In particular, whereas the CPU die area is dominated by cache memories, GPUs are dominated by floating-point datapath and fixed-function logic.

This is illustrated in Figure 2. The unified processor array allows dynamic partitioning of the array to vertex shading, geometry processing, and pixel processing. Because different rendering algorithms present wildly different loads among the three programmable stages, this unification allows the same pool of execution resources to be dynamically allocated to different pipeline stages and achieve better load balance.

By the DirectX 10 generation, the functionality of vertex and pixel shaders had been made identical to the programmer, and a new logical stage was introduced, the geometry shader, to process all the vertices of a primitive rather than vertices in isolation.

The GeForce 8800 was designed with DirectX 10 in mind. Developers were coming up with more sophisticated shading algorithms, and this motivated a sharp increase in the available shader operation rate, particularly floating-point operations.

NVIDIA pursued a processor design with higher operating clock frequency than what was allowed by standard-cell methodologies in order to deliver the desired operation throughput as area efficiently as possible.

High-clock-speed design requires substantially greater engineering effort, thus favoring the design of one processor array rather than two or three, given the new geometry stage. It became worthwhile to take on the engineering challenges of a unified processor—load balancing and recirculation of a logical pipeline onto threads of the processor array—while seeking the benefits of one processor design.

Such a design paved the way for using the programmable GPU processor array for general numeric computing.

An Intermediate Step

While GPU hardware designs evolved toward more unified processors, they increasingly resembled high-performance parallel computers. To access the computational resources, a programmer had to cast his or her problem into native graphics operations so the computation could be launched through OpenGL or DirectX API calls.

To run many simultaneous instances of a compute function, for example, the computation had to be written as a pixel shader.

The collection of input data had to be stored in texture images and issued to the GPU by submitting triangles with clipping to a rectangle shape if that was what was desired. The output had to be cast as a set of pixels generated from the raster operations. The fact that the GPU processor array and frame buffer memory interface were designed to process graphics data proved too restrictive for general numeric applications.

In particular, the output data of the shader programs are single pixels whose memory locations have been predetermined; thus, the graphics processor array was designed with very restricted memory reading and writing capability. More importantly, shaders did not have the means to perform writes to calculated memory addresses, referred to as scatter operations. The only way to write a result to memory was to emit it as a pixel color value and to configure the frame buffer operation stage to write (or blend, if desired) the result to a two-dimensional frame buffer.

Furthermore, the only way to get a result from one pass of computation to the next was to write all parallel results to a pixel frame buffer and then use that frame buffer as a texture map input to the pixel fragment shader of the next stage of the computation. There were also no user-defined data types; most data had to be stored in one-, two-, or four-component vector arrays.

Mapping general computations to a GPU in this era was quite awkward. Nevertheless, intrepid researchers demonstrated a handful of useful applications with painstaking efforts.
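By way of contrast with the multipass, frame-buffer-bound style just described, the programming model discussed next lets the results of one pass simply remain in global memory for the next pass to read. The following is a minimal CUDA sketch, not code from this book; the kernel and buffer names are hypothetical.

// Two passes of a computation: pass 2 reads what pass 1 wrote, directly from
// global memory, with no texture or frame-buffer round trip.
__global__ void pass1(const float *in, float *tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = in[i] * in[i];      // write an intermediate result
}

__global__ void pass2(const float *tmp, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + 1.0f;      // read it back in the next pass
}

// Host side (sketch): pass1<<<grid, block>>>(d_in, d_tmp, n);
//                     pass2<<<grid, block>>>(d_tmp, d_out, n);

In the GPGPU style, the intermediate array tmp would instead have had to be rendered into a frame buffer and rebound as a texture before the second pass could run.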

NVIDIA selected a programming approach in which programmers would explicitly declare the data-parallel aspects of their workload. The designers of the Tesla GPU architecture took another step. The shader processors became fully programmable processors with large instruction memory, instruction cache, and instruction sequencing control logic.

The cost of these additional hardware resources was reduced by having multiple shader processors share an instruction cache and instruction sequencing control logic. This design style works well with graphics applications because the same shader program needs to be applied to a massive number of vertices or pixels. NVIDIA added memory load and store instructions with random byte-addressing capability to support the requirements of compiled C programs.
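A minimal sketch of what this generality makes possible is shown below; it is an illustrative CUDA kernel with hypothetical names, not code from this book. It uses ordinary loads and stores at computed addresses, including the scatter writes that pixel shaders could not express, and it previews the thread hierarchy, barrier synchronization, and atomic operations described next.

// Per-block histogram: each thread reads one input byte and scatters a count
// to a computed address. Assumes bins[] was zeroed on the host beforehand.
__global__ void histogram(const unsigned char *in, unsigned int *bins, int n) {
    __shared__ unsigned int local[256];              // per-block scratch counts

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;                                // cooperative initialization
    __syncthreads();                                 // barrier: all bins cleared

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread hierarchy: block + thread
    if (i < n)
        atomicAdd(&local[in[i]], 1u);                // scatter to a computed address

    __syncthreads();                                 // barrier: all counts recorded
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);               // atomic update of global memory
}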

To nongraphics application programmers, the Tesla GPU architecture introduced a more generic parallel programming model with a hierarchy of parallel threads, barrier synchronization, and atomic operations.

Scalable GPUs

In the early days, workstation graphics systems gave customers a choice in pixel horsepower by varying the number of pixel processor circuit boards installed.

Prior to the mid-1990s, PC graphics scaling was almost nonexistent. There was one option. As 3D-capable accelerators began to appear, there was room in the market for a range of offerings.

In 1998, 3dfx introduced multiboard scaling with their original Scan Line Interleave (SLI) on the Voodoo2, which held the performance crown for its time. At present, for a given architecture generation, four or five separate chip designs are needed to cover the range of desktop PC performance and price points. In addition, there are separate segments in notebook and workstation systems. Functional behavior is identical across the scaling range; one application will run unchanged on any implementation of an architectural family.

By switching to the multicore trajectory, CPUs are scaling to higher transistor counts by increasing the number of nearly-constant-performance cores on a die rather than simply increasing the performance of a single core. At this writing, the industry is transitioning from quad-core to hex- and oct-core CPUs. Programmers are forced to find four- to eight-fold parallelism to fully utilize these processors. Many of them resort to coarse-grained parallelism strategies where different tasks of an application are performed in parallel.

Such applications often must be rewritten to expose more parallel tasks for each successive doubling of core count. In contrast, efficient threading support in GPUs allows applications to expose a much larger amount of parallelism than the available hardware execution resources can run at once, with little or no penalty.

Each doubling of GPU core count provides more hardware execution resources that exploit more of the exposed parallelism for higher performance; that is, the GPU parallel programming model for graphics and parallel computing is designed for transparent and portable scalability.
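As an illustration of this transparent scalability, consider the following minimal CUDA sketch (hypothetical kernel and sizes, not code from this book). The grid size is derived from the data size, not from the number of streaming multiprocessors, so the same program exposes tens of thousands of thread blocks and runs unchanged on GPUs with few or many cores.

#include <cuda_runtime.h>

// Each thread scales one element; the grid is sized by the data.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1 << 24;                       // 16M elements (illustrative)
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // ~65,536 blocks
    // The hardware schedules these blocks onto however many streaming
    // multiprocessors the particular chip provides; no source change is
    // needed when the core count doubles.
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}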

Many GPU-accelerated applications run tens or hundreds of times faster than multicore CPUs are capable of running them. Although many of these applications use single-precision floating-point arithmetic, some problems require double precision. The arrival of double-precision floating point in GPUs enabled an even broader range of applications to benefit from GPU acceleration.

In addition, GPUs will continue to enjoy vigorous architectural evolution. Despite their demonstrated high performance on data parallel applications, GPU core processors are still of relatively simple design.

More aggressive techniques will be introduced with each successive architecture to increase the actual utilization of the calculating units. Because scalable parallel computing on GPUs is still a young field, novel applications are rapidly being created. By studying them, GPU designers will discover and implement new machine optimizations.

Chapter 12 provides more details of such future trends.