Multi-core: Adding a New Dimension to Computing
Md. Tanvir Al Amin
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka-1000, Bangladesh
Email: [email protected]
Abstract—The invention of the transistor in 1948 started a new era in technology, called solid-state electronics. Since then, sustained development and advancement in electronics and fabrication techniques have caused devices to shrink in size, driving the quest for increasing density and clock speed. That quest has suddenly come to a halt due to fundamental bounds imposed by physical laws. But demand for ever more computational power is still prevalent in the computing world. As a result, the microprocessor industry has started exploring the technology along a different dimension. The speed of a single work unit (CPU) is no longer the concern; instead, increasing the number of independent processor cores packed into a single package has become the new focus. Such processors are commonly known as multi-core processors. Scaling performance by using multiple cores has gained so much attention from academia and industry that not only desktops but also laptops, PDAs, cell phones, and even embedded devices today contain these processors. In this paper, we explore state-of-the-art technologies for multi-core processors and existing software tools to support parallelism. We also discuss present and future trends of research in this field. From our survey, we conclude that the next few decades are going to be marked by the success of this “ubiquitous parallel processing”.
I. INTRODUCTION
In 1965, Intel co-founder Gordon Moore predicted [1] that the transistor density of semiconductor chips would double roughly every 18 months, which is commonly known as Moore's Law. Shrinking the feature size by a factor of x means the clock rate can be increased nearly by a factor of x, and the number of transistors per unit area can go up by a factor of x². Due to sustained advancement in VLSI fabrication technology, the computing industry had been experiencing this evolution of computational power through clock speed, until a wall was hit due to fundamental physics. As fabrication density increases, manufacturing costs go up and, more importantly, power density becomes higher. Moreover, though computing power increases linearly with clock speed, heat dissipation rises with the square or cube of the clock frequency, depending on the electrical model, and clock frequencies beyond 5 GHz easily melt chips [2]. Thus the “free lunch” of performance, in the form of ever faster clock speeds, is over.

This physical boundary has forced microprocessor vendors to take a new route: increasing performance through parallelism. It is now commonly accepted that, as serial processing has reached a technology edge, it is parallel processing which can save the day, if applied correctly [3]. Parallel processing was once the topic of mainframe- or grid-powered data centers only, but now it is the story of the mass consumer market, as we have so-called “multi-core processors”: two or more independent processor cores in a single package [4]. Multi-core processors, though they leverage the power of parallel processing, are different from the multiprocessor systems of supercomputers or mainframes, in both technical and application frameworks. Multiprocessor systems have either uniform or non-uniform memory access, and processor-to-processor communication implies more loss in signal, as off-chip delay is far greater than on-chip delay [4]. On the other hand, multi-core systems work with shared memory, and in many cases shared caches.
Processor-to-processor communication is not expensive, as both cores are in a single package.

Amdahl's law limits the speedup obtainable from parallelism [2]. If a fraction p of a computation can be run in parallel and the rest must run serially, the maximum speedup is 1/(1 − p). This law applies to data centers or batch-processing jobs, but the scenario with multi-core processors is a bit more optimistic: consumer-level computers, due to multitasking, have inherent thread-level parallelism. This environment ensures a performance boost through multi-core, because even if software is not written specifically for multi-core in a multithreaded way, there are already lots of threads running on a personal computer. Use of multi-core processors easily improves response time and perceived performance here.

In this paper, we discuss existing technologies for multi-core processors. Almost all microprocessor manufacturers have produced multi-core versions of their products. Dual-core or quad-core processors from Intel or AMD are available in the market for desktops, laptops, or small servers [4]. The latest graphics processors (GPUs) from NVidia or ATI (AMD) are also multi-core. Broadcom, Cavium, and TI are already producing multi-core embedded processors for various DSP or communication needs [4]. Even gaming consoles are not apart from this multi-core wave. We also discuss “RAMP”, an approach from UC Berkeley [3]: low clock-speed manycore processors on FPGA boards, designed as a testbed for the scalability of multi-core applications.

As multi-core operation is another form of parallel computing, scalability depends on how parallel the tasks are, especially when we have lots of cores. Thus, software redesign exploiting parallelism and multithreading with load balancing is necessary for the ultimate performance gain. Several tools have already been developed for this [2].
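Amdahl's bound above generalizes to a finite core count: with a fraction p of the work parallelizable across n cores, speedup is 1/((1 − p) + p/n), which tends to 1/(1 − p) as n grows. A minimal sketch (the function name is ours, not from the paper):

```python
def amdahl_speedup(p, n):
    """Maximum speedup when a fraction p of the work can run in
    parallel on n cores and the remaining (1 - p) must run serially."""
    return 1.0 / ((1.0 - p) + p / n)

# With 90% of the work parallel, 4 cores give only about 3.08x,
# and no number of cores can exceed 1 / (1 - 0.9) = 10x.
```

This is why the paper stresses that software must expose as large a parallel fraction as possible: the serial remainder, not the core count, quickly becomes the limit.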
In this paper, we discuss Cilk++, NI LabVIEW, Intel TBB, and several other frameworks.

We observe that the industry is approaching a transition from multi-core to manycore. Intel has already started its Tera-scale computing research. Tilera released the TILE64, a 64-core processor. A new form of Moore's law is now in effect: “The number of cores in a physical processor will double in every technology generation.”

II. PARALLEL COMPUTING REVOLUTION
A. ILP and TLP
Chip designers have always been in the quest of increasing the throughput of a processor. The first and obvious approach is to divide each instruction into multiple steps and use pipelining [5]. As multiple instructions are executed concurrently, this is termed ILP (Instruction-Level Parallelism). ILP can be increased further by making the pipeline stages multiple-issue, i.e., replicating some of the internal components and launching multiple instructions in every pipeline stage. Techniques like loop unrolling, superscalar execution, dynamic prediction, and register renaming were devised [5] to find independent instructions. But these techniques proved inefficient for hard-to-predict code, and it seemed that parallelism could be exploited better if independent execution paths were already specified in the program code [4]. This resulted in the idea of TLP (Thread-Level Parallelism), i.e., multiple threads on multiple cores.
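The TLP idea, expressed in code, is simply that the programmer hands the hardware explicitly independent execution paths. A minimal Python sketch (an illustration, not from the paper; on a multi-core CPU the operating system can schedule these threads on separate cores):

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # each call is an independent unit of work with no shared state,
    # so the runtime is free to execute the calls concurrently
    return x * x

# four worker threads expose thread-level parallelism to the OS scheduler
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Unlike ILP, which the hardware must discover at run time, the independence here is declared up front by the program structure.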
B. Multi-core Processors
A multi-core processor contains two or more independent processing cores in a single package. A dual-core processor contains two cores in a single die (Fig. 1), and a quad-core processor contains four cores in one or two dies. Each “core” independently implements optimizations such as superscalar execution, pipelining, and multithreading. The cores share the same interconnection to the rest of the system; hence the bus and memory are shared in general. This is a difference from typical multiprocessing, where each processor in general has separate memory. But, as the cores are independent, they can effectively run separate threads in parallel. Thus a multi-core microprocessor is an implementation of multiprocessing in a single physical package, and it is most efficient when each core is presented with independent parallel tasks. These CPUs use either a homogeneous architecture (each processor similar in power and responsibility) or a heterogeneous architecture (hierarchical processor arrangement) [4].

Whether the cores share a single coherent cache at the highest on-device cache level or have separate caches is largely implementation- and application-dependent. The number of cores fabricated on a die is also design-dependent. A processor with all cores on a single die is called a monolithic processor.

III. PRESENT MULTI-CORE PROCESSORS
A. IBM
The IBM POWER4 chip, released in 2001, implemented the 64-bit PowerPC [6] architecture. It was the first commercial multi-core microprocessor, with two cores on a single die. Its 174 million transistors were fabricated using a 180 nm CMOS 8S3 SOI process. It consumed 115 W of power with a Vdd of 1.6 V.
Fig. 1. An example configuration of a dual-core processor
Fig. 2. The high-k transistors
B. Intel
Dual-core Intel processors based on the NetBurst microarchitecture (Pentium D) were not successful, due to high power consumption and cooling problems. The first successful Intel multi-core processors were based on the Intel Core microarchitecture [7], which was actually targeted at the mobile platform. The Core 2 microarchitecture was a revision of Core, and several processors like the Core 2 Duo, Core 2 Quad, and Core 2 Extreme for desktop PCs were released. The Core 2 Duo contained two processors on a single die, and the Core 2 Quad contained two such dies in a single package. Before 2007, these processors were fabricated using a 65 nm process. However, Intel developed a new MOS transistor technology, replacing the ordinary SiO2 insulator with a Hafnium-based high-k gate oxide (Fig. 2), which scaled the process down to 45 nm [8]. Newer multi-core processors based on this process, codenamed “Penryn” (Fig. 3), define the state of the art for Intel.

Multi-core Xeon CPUs for server-line computers include several processors, among them Clovertown, Harpertown, and the latest, Dunnington. Dunnington is a single-die hexa-core processor (Fig. 4), with three unified 3 MB L2 caches, 96 KB of L1 cache, and 16 MB of L3 cache. Its TDP is less than 130 W.

C. AMD
Like Intel, AMD also produces similar chips, named Athlon 64 X2, Opteron, Phenom, and so on. Their quad-core processor is codenamed Barcelona (Fig. 5). AMD, after its acquisition of ATI, also produces multi-core stream processors. FireStream is such a processor, with 10 cores, each core having 16 5-issue-wide superscalar stream processors.

Fig. 3. 45 nm process wafers using high-k transistors
Fig. 4. Die micrograph of Intel's Dunnington hexa-core processor
D. Gaming Consoles
The PlayStation 3 features the Cell processor. The “Cell Broadband Engine” architecture was jointly developed by Sony, Toshiba, and IBM. Cell is an octa-core processor having a novel memory coherence architecture (Fig. 6). The architecture emphasizes efficiency per watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code.

The Xbox 360 features another multi-core processor, named Xenon. This processor is based on IBM's PowerPC instruction set architecture and consists of three independent cores on a single die. Each of the cores supports two symmetric hardware threads (SMT), for a total of six hardware threads available to games. Each individual core also includes 32 KB of L1 instruction cache and 32 KB of L1 data cache.

Fig. 5. AMD quad-core Barcelona die shot
Fig. 6. PlayStation 3 Cell octa-core processor
E. NVidia
NVidia produces GPUs of the GeForce series and GPGPUs (general-purpose GPUs) of the Tesla series (CUDA framework). These processors, and the latest generation-9 and generation-10 GPUs, are multi-core as well. Several other manufacturers, like Sun, ARM, Cavium, Broadcom, and Infineon, also have their own multi-core processors. Some of these are solely for embedded applications, and some are for specialized communication or DSP.
F. RAMP
RAMP (Research Accelerator for Multiple Processors) is an approach for evaluating newer hardware and software architectures on multi-core processors. It can be termed an “academic manycore” built on FPGA boards [3]. RAMP is a real-world simulator for large manycore systems, i.e., it uses soft-core processors with the entire logic built on FPGA boards. Presently they run at 90-150 MHz and are capable of executing real applications as unmodified binaries. RAMP processors have been built with 256, 768, or 1008 soft-core processors and are ideal for exploring heterogeneous chip architectures. Fig. 7 shows one such processor, RAMP Blue v3.0. The research group at Berkeley describes RAMP as a vehicle to lure more researchers to the parallel challenge and decrease the time to parallel salvation.
IV. SOFTWARE ARCHITECTURE
The general vision for multi-core based software is scalability, code simplicity, and modularity [2]. To exploit the processing power offered by multi-core, potential opportunities for multithreading and load balancing must be recognized by the programmer. Moreover, writing efficient parallel and multithreaded programs needs careful reasoning, because there are now provisions for hard-to-track bugs created by wrong synchronization, race conditions, or deadlock. Several concurrency platforms are available today.

LabVIEW (Laboratory Virtual Instrumentation Engineering Workbench) from National Instruments features a graphical language named “G”, which is a dataflow programming language and is inherently capable of parallel execution.

Fig. 7. RAMP Blue V3.0, 1008 cores, rack and server
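The synchronization hazards mentioned above are easy to reproduce. The following Python sketch (an illustration, not from the paper) shows the classic lost-update race: `counter += 1` is a read-modify-write, so without the lock two threads can interleave and lose increments.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # without this lock, two threads may both read the same old
        # value, add 1, and write back, silently losing an update
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 with the lock; typically less without it
```

Tools like the Cilkscreen race detector discussed below exist precisely because the unlocked version of this program usually works in small tests and fails only under real concurrent load.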
Message passing via MPI was the classical method of choice for the HPC (High-Performance Computing) community. Scientific codes written with MPI have been ported to multi-core systems, and they run quite well.

Intel Threading Building Blocks (TBB) is an open-source C++ template library containing data structures and algorithms for writing task-based multithreaded applications.

OpenMP is an open-source concurrency platform supporting multithreaded programming through the Fortran and C/C++ languages.

Cilk++ from Cilk Arts extends C++ to multi-core via three keywords: cilk_spawn, cilk_sync, and cilk_for (parallel loops). The Cilkscreen race detector can be used to test a Cilk++ binary program for its parallel correctness.

V. FUTURE OF MULTI-CORE COMPUTING
It is now widely agreed that multi-core is going to be the wave of the future. But there are differing views [9] about the processor organization, hierarchy, and resulting software model of the future. One group believes that the cores should be heterogeneous in nature, using specialty cores for different jobs, scheduled by a high-level API. The other group opposes this idea and argues that multi-core chips need to be homogeneous collections of general-purpose cores to keep the software model simple.

Intel's Tera-scale research program [10] aims to scale today's multi-core architectures to hundreds of cores. Their vision is to create a platform capable of performing teraflops of operations on terabytes of data. The Teraflops Research Chip (also called Polaris) is the first processor prototype developed by Intel's Tera-scale Computing Research Program. A brief demo was presented at the IDF in September 2006, and a working model was shown at the 2007 ISSCC (Fig. 8).

The chip contains 80 cores, constructed using a 65 nm CMOS process [11]. Each core contains two programmable floating-point engines and one 5-port message-passing router. Running at 3.16 GHz, it achieved 1.01 TFLOPS [12] with a total power consumption of 62 W and an on-chip temperature of 383 K. Increasing the frequency to 5.7 GHz and the power to 265 W increased performance to 1.81 TFLOPS and later 2 TFLOPS.

Fig. 8. Teraflops Research Chip as demonstrated in 2007

Multi-core processors will continue to dominate at least until some new device theory offering new limits for density, power efficiency, and computational speed becomes available. Present barriers imposed by complex software requirements will be resolved as more and more research initiatives are taken on parallel and multi-core programming. Hence we can conclude that, by doubling the number of cores every technology generation, we are soon going to enter the realm of tera-scale computing.

REFERENCES
[1] P. Gelsinger, “Moore's law, the genius lives on,” IEEE Solid-State Circuits Society Newsletter, vol. 20, pp. 18–20, Sept. 2006.
[2] C. E. Leiserson and I. B. Mirman, How To Survive the Multicore Software Revolution.
[4] “Multi-core (computing),” Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Multi-core(computing)
[5] D. A. Patterson and J. L. Hennessy, Computer Organization and Design.
[7] Intel Technology Journal, vol. 10, May 2006.
[8] M. T. Bohr, R. S. Chau, T. Ghani, and K. Mistry, “The high-k solution.” [Online]. Available: http://spectrum.ieee.org/semiconductors/design/the-highk-solution
[9] R. Merritt, “CPU designers debate multi-core future,”