Honghui Lu | Researchain

Publication

Featured researches published by Honghui Lu.

IEEE Computer | 1996

TreadMarks: shared memory computing on networks of workstations

Cristiana Amza; Alan L. Cox; Sandhya Dwarkadas; Peter J. Keleher; Honghui Lu; Ramakrishnan Rajamony; Weimin Yu; Willy Zwaenepoel

Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value.

International Conference on Computer Communications | 2004

Peer-to-peer support for massively multiplayer games

Björn Knutsson; Honghui Lu; Wei Xu; Bryan Hopkins

We present an approach to support massively multiplayer games on peer-to-peer overlays. Our approach exploits the fact that players in MMGs display locality of interest, and therefore can form self-organizing groups based on their locations in the virtual world. To this end, we have designed scalable mechanisms to distribute the game state to the participating players and to maintain consistency in the face of node failures. The resulting system dynamically scales with the number of online players. It is more flexible and has a lower deployment cost than centralized game servers. We have implemented a simple game we call SimMud, and experimented with up to 4000 players to demonstrate the applicability of this approach.
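The locality-of-interest idea in this abstract can be illustrated with a minimal Python sketch. It is not the SimMud implementation: the grid layout, the `region_size` value, and the helper names `region_of` and `group_by_region` are assumptions made for illustration. It only shows how players might be bucketed into groups by their position in the virtual world.

```python
from collections import defaultdict


def region_of(x, y, region_size=100.0):
    """Map a world position to a grid-region id (hypothetical square grid)."""
    return (int(x // region_size), int(y // region_size))


def group_by_region(players, region_size=100.0):
    """Group player names by the region that contains their position."""
    groups = defaultdict(list)
    for name, (x, y) in players.items():
        groups[region_of(x, y, region_size)].append(name)
    return dict(groups)


# Hypothetical player positions in the virtual world.
players = {"ann": (10.0, 20.0), "bob": (95.0, 5.0), "cy": (150.0, 20.0)}
groups = group_by_region(players)
print(groups)
```

In a design along these lines, state updates for a region would only need to reach the members of that region's group, which is what lets the system scale with the number of players.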

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

Locality aware dynamic load management for massively multiplayer games

Jin Chen; Baohua Wu; Margaret DeLap; Björn Knutsson; Honghui Lu; Cristiana Amza

Most massively multiplayer game servers employ static partitioning of their game world into distinct mini-worlds that are hosted on separate servers. This limits cross-server interactions between players, and exposes the division of the world to players. We have designed and implemented an architecture in which the partitioning of game regions across servers is transparent to players and interactions are not limited to objects in a single region or server. This allows a finer-grain partitioning, which, combined with a dynamic load management algorithm, enables us to better handle transient crowding by adaptively dispersing or aggregating regions from servers in response to quality of service violations. Our load balancing algorithm is aware of the spatial locality in the virtual game world. Based on localized information, the algorithm balances the load and reduces the cross-server communication, while avoiding frequent reassignment of regions. Our results show that locality aware load balancing reduces the average user response time by up to a factor of 6 compared to a global algorithm that does not consider spatial locality and by up to a factor of 8 compared to static partitioning.
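A simplified Python sketch of locality-aware load shedding follows. It is not the algorithm from the paper: the grid adjacency, the `threshold` parameter, and the function names are assumptions. It illustrates one point from the abstract, that an overloaded server hands a region to a server hosting an adjacent region rather than to the globally least-loaded server.

```python
def neighbors(region):
    """Four-connected neighbors of a grid region (hypothetical layout)."""
    x, y = region
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]


def shed_one_region(assignment, load, overloaded, threshold):
    """Move one region from `overloaded` to the lightest adjacent server.

    assignment: dict region -> server name; load: dict region -> load units.
    A move is allowed only if the target stays at or below `threshold`.
    Returns (region, target_server), or None if no move is possible.
    """
    server_load = {}
    for region, server in assignment.items():
        server_load[server] = server_load.get(server, 0) + load[region]

    best = None
    for region, server in assignment.items():
        if server != overloaded:
            continue
        for n in neighbors(region):
            target = assignment.get(n)
            if target is None or target == overloaded:
                continue
            if server_load[target] + load[region] > threshold:
                continue
            if best is None or server_load[target] < server_load[best[1]]:
                best = (region, target)

    if best is not None:
        assignment[best[0]] = best[1]
    return best


# Server C is the lightest overall, but only B hosts a region adjacent to A's.
assignment = {(0, 0): "A", (1, 0): "A", (2, 0): "B", (3, 0): "C"}
load = {(0, 0): 50, (1, 0): 40, (2, 0): 20, (3, 0): 5}
move = shed_one_region(assignment, load, overloaded="A", threshold=80)
print(move)
```

Choosing an adjacent server keeps neighboring regions on as few servers as possible, which is the kind of cross-server communication the abstract says the algorithm tries to reduce.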

Journal of Parallel and Distributed Computing | 2000

OpenMP for Networks of SMPs

Y. Charlie Hu; Honghui Lu; Alan L. Cox; Willy Zwaenepoel

In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.

Conference on High Performance Computing (Supercomputing) | 1995

Message Passing Versus Distributed Shared Memory on Networks of Workstations

Honghui Lu; Sandhya Dwarkadas; Alan L. Cox; Willy Zwaenepoel

The message passing programs are executed with the Parallel Virtual Machine (PVM) library and the shared memory programs are executed using TreadMarks. The programs are Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, Integer Sort (IS) and Embarrassingly Parallel (EP) from the NAS benchmarks; ILINK, a widely used genetic linkage analysis program; and Successive Over-Relaxation (SOR), Traveling Salesman (TSP), and Quicksort (QSORT). Two different input data sets were used for Water (Water-288 and Water-1728), IS (IS-Small and IS-Large), and SOR (SOR-Zero and SOR-NonZero). Our execution environment is a set of eight HP735 workstations connected by a 100 Mbits per second FDDI network. For Water-1728, EP, ILINK, SOR-Zero, and SOR-NonZero, the performance of TreadMarks is within 10% of PVM. For IS-Small, Water-288, Barnes-Hut, 3-D FFT, TSP, and QSORT, differences are on the order of 10% to 30%. Finally, for IS-Large, PVM performs two times better than TreadMarks. More messages and more data are sent in TreadMarks, explaining the performance differences. This extra communication is caused by 1) the separation of synchronization and data transfer, 2) extra messages to request updates for data by the invalidate protocol used in TreadMarks, 3) false sharing, and 4) diff accumulation for migratory data in TreadMarks.

International Symposium on Computer Architecture | 1994

Software versus hardware shared-memory implementation: a case study

Alan L. Cox; Sandhya Dwarkadas; Peter J. Keleher; Honghui Lu; Ramakrishnan Rajamony; Willy Zwaenepoel

We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on an SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.

International Conference on Computer Communications | 2005

DHARMA: distributed home agent for robust mobile access

Yun Mao; Björn Knutsson; Honghui Lu; Jonathan M. Smith

Mobile wireless devices have intermittent connectivity, sometimes intentional. This is a problem for conventional Mobile IP, beyond its well-known routing inefficiencies and deployment issues. DHARMA selects a location-optimized instance from a distributed set of home agents to minimize routing overheads; set management and optimization are done using the PlanetLab overlay network. DHARMA's session support overcomes both transitions between home agent instances and intermittent connectivity. Cross-layer information sharing between the session layer and the overlay network is used to exploit multiple wireless links when available. The DHARMA prototype supports intermittently connected legacy TCP applications in a variety of scenarios and is largely portable across host operating systems. Experiments with DHARMA deployed on more than 200 PlanetLab nodes demonstrate routing performance consistently better than that for best-case Mobile IP.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1997

Compiler and software distributed shared memory support for irregular applications

Honghui Lu; Alan L. Cox; Sandhya Dwarkadas; Ramakrishnan Rajamony; Willy Zwaenepoel

We investigate the use of a software distributed shared memory (DSM) layer to support irregular computations on distributed memory machines. Software DSM supports irregular computation through demand fetching of data in response to memory access faults. With the addition of a very limited form of compiler support, namely the identification of the section of the indirection array accessed by each processor, many of these on-demand page fetches can be aggregated into a single message, and prefetched prior to the access fault. We have measured the performance of this approach for two irregular applications, moldyn and nbf, using the TreadMarks DSM system on an 8-processor IBM SP2. We find that it has similar performance to the inspector-executor method supported by the CHAOS run-time library, while requiring much simpler compile-time support. For moldyn, it is up to 23% faster than CHAOS, depending on the input problem's characteristics; and for nbf, it is no worse than 14% slower. If we include the execution time of the inspector, the software DSM-based approach is always faster than CHAOS. The advantage of this approach increases as the frequency of changes to the indirection array increases. The disadvantage of this approach is the potential for false sharing overhead when the data set is small or has poor spatial locality.
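The fetch aggregation described in this abstract can be illustrated with a minimal Python sketch. The 4096-byte page size, the 8-byte element size, and the helper name `pages_to_prefetch` are assumptions, not details from the paper. The sketch shows only that once a processor's section of the indirection array is known, the distinct pages it will touch can be computed up front and requested together rather than one access fault at a time.

```python
def pages_to_prefetch(indirection, start, end, elem_size=8, page_size=4096):
    """Return the sorted distinct page numbers touched by data[indirection[i]]
    for i in [start, end). Parameter names and defaults are illustrative."""
    pages = {(indirection[i] * elem_size) // page_size for i in range(start, end)}
    return sorted(pages)


# One processor's section of the indirection array (hypothetical values).
indirection = [0, 3, 511, 512, 1024, 513, 2]
pages = pages_to_prefetch(indirection, 0, len(indirection))
print(pages)  # pages that could be requested in one aggregated message
```

Seven indirect accesses here fall on three pages, so a single aggregated request could stand in for three separate demand fetches.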

ACM Transactions on Internet Technology | 2003

Architecture and performance of server-directed transcoding

Björn Knutsson; Honghui Lu; Jeffrey C. Mogul; Bryan Hopkins

Proxy-based transcoding adapts Web content to be a better match for client capabilities (such as screen size and color depth) and last-hop bandwidths. Traditional transcoding breaks the end-to-end model of the Web, because the proxy does not know the semantics of the content. Server-directed transcoding preserves end-to-end semantics while supporting aggressive content transformations. We show how server-directed transcoding can be integrated into the HTTP protocol and into the implementation of a proxy. We discuss several useful transformations for image content, and present measurements of the performance impacts. Our results demonstrate that server-directed transcoding is a natural extension to HTTP, can be implemented without great complexity, and can provide good performance when carefully implemented.

Conference on High Performance Computing (Supercomputing) | 1998

OpenMP on Networks of Workstations

Honghui Lu; Y. Charlie Hu; Willy Zwaenepoel

We describe an implementation of a sizable subset of OpenMP on networks of workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary drawbacks compared to MPI, namely lack of portability to environments other than hardware shared memory machines. In order to support OpenMP execution on NOWs, our compiler targets a software distributed shared memory system (DSM) which provides multi-threaded execution and memory consistency. This paper presents two contributions. First, we identify two aspects of the current OpenMP standard that make an implementation on NOWs hard, and suggest simple modifications to the standard that remedy the situation. These problems reflect differences in memory architecture between software and hardware shared memory and the high cost of synchronization on NOWs. Second, we present performance results of a prototype implementation of an OpenMP subset on a NOW, and compare them with hand-coded software DSM and MPI results for the same applications on the same platform. We use five applications (ASCI Sweep3d, NAS 3D-FFT, SPLASH-2 Water, QSORT, and TSP) exhibiting various styles of parallelization, including pipelined execution, data parallelism, coarse-grained parallelism, and task queues. The measurements show little difference between OpenMP and hand-coded software DSM, but both are still lagging behind MPI. Further work will concentrate on compiler optimization to reduce these differences.
