John William Marshall

Publication

Featured research published by John William Marshall.

Network Processor Design | 2003

A Methodology and Simulator for the Study of Network Processors

Deepak Suryanarayanan; John William Marshall; Gregory T. Byrd

Network processors (NPs) are an emerging class of processors that combine programmable ASICs and microprocessors to implement adaptive network services. NPs pair the flexibility of software solutions with the high performance of custom hardware. The development of such sophisticated hardware requires a holistic methodology that facilitates the study of network processors and their performance under different networking applications and traffic conditions. This combination of study techniques is realized in the component network simulator (ComNetSim). The simulator includes both a traffic-modeling component and a detailed architectural framework, which together allow the study of complete networking applications under varying network traffic conditions. The chapter illustrates a weighted round robin scheduling algorithm adapted to the Toaster architecture. It describes the high-level simulator design, details the Toaster network processor, and covers the implementation of the simulator, including the cycle-accurate model of the Toaster architecture. The chapter also briefly presents the simulator organization along with performance results and analysis.
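The abstract mentions weighted round robin scheduling. As a rough illustration only, the sketch below shows the textbook form of the algorithm in Python; it is not the Toaster-adapted version described in the chapter, and the function name, queue contents, and weights are all assumptions made for the example.

```python
from collections import deque

def weighted_round_robin(queues, weights, rounds):
    """Generic WRR sketch: in each round, serve up to weights[i]
    packets from queues[i] before moving on to the next queue."""
    served = []
    for _ in range(rounds):
        for queue, weight in zip(queues, weights):
            for _ in range(weight):
                if not queue:      # queue is empty: give up the rest of its share
                    break
                served.append(queue.popleft())
    return served

# Hypothetical traffic: queue "a" gets twice the service share of queue "b".
queues = [deque(["a1", "a2", "a3"]), deque(["b1", "b2", "b3"])]
order = weighted_round_robin(queues, weights=[2, 1], rounds=2)
print(order)
```

A cycle-accurate simulator would additionally model per-packet processing time; this sketch only captures the service order.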

Network Processor Design | 2003

Chapter 11 – Cisco Systems–Toaster2

John William Marshall

Cisco's Toaster2 ASIC is a high-performance network processor capable of forwarding millions of packets per second. It has provided the system infrastructure necessary to deliver a variety of products that offer both flexibility and high performance. Toaster2 has also been successful in bridging the performance gap between microprocessor-based forwarding engines and hardwired ASIC implementations. The ASIC is internally organized as a pipelined, multiprocessor, parallel processing engine. Toaster's multiprocessor matrix consists of 16 uniform processors arranged as 4 rows by 4 columns. The Toaster microcontroller (TMC) performs the processing within each element of the matrix. Each TMC has a local instruction RAM and a local memory controller used to access a number of internal and external memory devices. Network processor programming methods continue to evolve toward abstracting the underlying multiprocessor environment, in order to ease the transition from traditional software design to network processor-based software design.

Symposium on SDN Research | 2017

PVPP: A Programmable Vector Packet Processor

Sean Choi; Xiang Long; Muhammad Shahbaz; Skip Booth; Andy Keep; John William Marshall; Changhoon Kim

Recent work on simplifying data plane programming focuses on providing simple, high-level domain-specific languages (DSLs). These languages hide the complex and intricate details of the underlying switching substrate. Programmers write their data plane programs in these languages, which are then compiled to run on a given switch target, which in turn runs on a particular CPU architecture. However, the simplicity and domain-specific nature of these DSLs, together with the lack of flexible switch interfaces that a DSL compiler can target, restrict the ability to optimize generated code. In this work, we present our findings on how the complexity of the interfaces that a software switch target exposes to a compiler can affect the performance of compiled data plane programs. For our experiment platform, we built a P4 compiler called the Programmable Vector Packet Processor (PVPP) targeting the existing Vector Packet Processor (VPP) software switch. P4 is a data plane DSL based on match-action tables, while VPP uses a packet processing node graph model. PVPP compiles a data plane program written in P4 to VPP's internal graph representation. VPP also exposes a sophisticated interface for PVPP to interact with the various features of the underlying architecture, e.g., execution modes, memory types, and batch I/O. Our evaluation shows that PVPP can efficiently exploit these features, resulting in increased performance for the same data plane program.
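To make the node graph model concrete, here is a toy Python sketch in which each node processes a whole batch (vector) of packets and hands sub-batches to its successor nodes. It only illustrates the general idea; it does not reproduce VPP's actual API or PVPP's generated code, and the node names, packet fields, and lookup table are hypothetical.

```python
from collections import defaultdict

def ethernet_input(packets):
    """Hypothetical node: classify a batch by EtherType."""
    out = defaultdict(list)
    for p in packets:
        out["ip6_lookup" if p["ethertype"] == 0x86DD else "drop"].append(p)
    return out

def ip6_lookup(packets):
    """Hypothetical node: exact-match lookup standing in for a match-action table."""
    table = {"2001:db8::1": "port1"}   # assumed table contents
    out = defaultdict(list)
    for p in packets:
        port = table.get(p["dst"])
        if port is None:
            out["drop"].append(p)
        else:
            out["tx"].append({**p, "port": port})
    return out

GRAPH = {"ethernet_input": ethernet_input, "ip6_lookup": ip6_lookup}

def run_graph(packets, start="ethernet_input"):
    """Push a batch through the graph; names not in GRAPH are terminal."""
    pending = {start: list(packets)}
    done = defaultdict(list)
    while pending:
        name, vector = pending.popitem()
        if name not in GRAPH:
            done[name].extend(vector)
            continue
        for nxt, pkts in GRAPH[name](vector).items():
            pending.setdefault(nxt, []).extend(pkts)
    return done

batch = [
    {"ethertype": 0x86DD, "dst": "2001:db8::1"},
    {"ethertype": 0x86DD, "dst": "2001:db8::2"},
    {"ethertype": 0x0800, "dst": "192.0.2.1"},
]
result = run_graph(batch)
```

Processing packets a batch at a time lets each node amortize its per-call overhead across the whole vector, which is the motivation behind batch I/O in this style of software switch.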

Design Automation Conference | 2017

Developing Dynamic Profiling and Debugging Support in OpenCL for FPGAs

Anshuman Verma; Huiyang Zhou; Skip Booth; Robbie King; James Coole; Andy Keep; John William Marshall; Wu-chun Feng

With FPGAs emerging as a promising accelerator for general-purpose computing, there is a strong demand to make them accessible to software developers. Recent advances in OpenCL compilers for FPGAs pave the way for synthesizing FPGA hardware from OpenCL kernel code. Significant challenges remain, however, before this paradigm can be adopted more broadly. This paper presents our efforts in developing dynamic profiling and debugging support in OpenCL for FPGAs. We first propose primitive code patterns, including a timestamp and an event-ordering function, and then develop a framework, which can be plugged easily into OpenCL kernels, to dynamically collect and process run-time information.

Application-Specific Systems, Architectures, and Processors | 2017

OpenCL-based design pattern for line rate packet processing

Jehandad Khan; Peter M. Athanas; Skip Booth; John William Marshall

The ever-changing nature of network technology requires a flexible platform that can change as the technology evolves. In this work, a complete networking switch designed in OpenCL is presented, identifying several high-level constructs that form the building blocks of any network application targeting FPGAs. These include the notion of an on-chip global memory and kernels that process data continuously without intervention from the host. The use of OpenCL is motivated by the ability to change designs rapidly and to make them maintainable by a wider developer community. Pieces of the design that cannot be realized using current OpenCL technology are also identified, and a solution to the problem is presented.

ACM Special Interest Group on Data Communication | 2017

Use of Cuckoo Filters with FD.io VPP for Software IPv6 Routing Lookup

Minseok Kwon; Shailesh Vajpayee; Pragash Vijayaragavan; Arjun Dhuliya; John William Marshall

Filter technologies such as Bloom filters have been used for IP lookup because of their compactness and efficiency. We investigate the performance of cuckoo filters with Cisco's VPP (Vector Packet Processing) for IP lookup. We also introduce a variant called a length-aware cuckoo filter that treats incoming IP addresses discriminatively, and study its performance with VPP. As a proof of concept, we implement cuckoo filters with VPP and test both their functionality and their performance, focusing on the ip6-input node in VPP.
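For readers unfamiliar with cuckoo filters, the following Python sketch shows a standard cuckoo filter using partial-key cuckoo hashing. It is a minimal illustration under assumed parameters (bucket count, bucket size, 8-bit fingerprints, SHA-256 as the hash), not the authors' VPP implementation, and it does not attempt the length-aware variant described above.

```python
import hashlib
import random

class CuckooFilter:
    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR trick below reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: bytes) -> int:
        return self._hash(b"fp" + item) % 255 + 1      # 8-bit, never zero

    def _index(self, item: bytes) -> int:
        return self._hash(item) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # The alternate bucket depends only on the index and the fingerprint.
        return (index ^ self._hash(bytes([fp]))) % self.num_buckets

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False   # filter is considered full; one fingerprint was displaced

    def contains(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

# Hypothetical IPv6 prefixes used as keys.
prefixes = [b"2001:db8::/32", b"2001:db8:1::/48", b"fe80::/10"]
cf = CuckooFilter()
inserted = [cf.insert(p) for p in prefixes]
```

Like a Bloom filter, a lookup can return a false positive but never a false negative for successfully inserted keys, so in a routing context a hit would still need to be confirmed against the real table.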

Proceedings of the First Asia-Pacific Workshop on Networking | 2017

The Case for a Flexible Low-Level Backend for Software Data Planes

Sean Choi; Xiang Long; Muhammad Shahbaz; Skip Booth; Andy Keep; John William Marshall; Changhoon Kim

Recent efforts to simplify network data plane programming focus on providing simple, high-level domain-specific languages (DSLs). In the case of software switches, data plane programs are written in these DSLs and then compiled to run on a CPU-based architecture. However, the simplicity of these DSLs, along with the lack of low-level interfaces exposed by the software switch, restricts compilers from generating optimal data plane programs for CPU-based architectures. In this paper, we argue that increased exposure of low-level interfaces to a software switch would enable more effective data plane programs. To demonstrate this, we present the Programmable Vector Packet Processor (PVPP), which adds programmability to the Vector Packet Processing (VPP) framework. VPP provides fine-grained access to various low-level features of a CPU architecture and offers better performance than other software switches, such as Open vSwitch (OVS), that operate at a higher level of abstraction. However, there is a cost to programming directly with VPP's low-level features: the programmer must have specialized knowledge of the architecture in order to produce an efficient implementation, which makes the program difficult to optimize. PVPP attempts to alleviate this cost by allowing a program written in P4 to be compiled to VPP's internal node-graph representation. Our preliminary results show that PVPP improves the performance of data plane programs by around 30% compared to naïve VPP implementations.

Application-Specific Systems, Architectures, and Processors | 2016

OpenCL-based erasure coding on heterogeneous architectures

Guoyang Chen; Huiyang Zhou; Xipeng Shen; Joshua B. Gahm; Narayan Venkat; Skip Booth; John William Marshall

Erasure coding, Reed-Solomon coding in particular, is a key technique for dealing with failures in scale-out storage systems. However, due to its algorithmic complexity, the performance overhead of erasure coding can become a significant bottleneck in storage systems attempting to meet service level agreements (SLAs). Previous work has mainly leveraged SIMD (single-instruction multiple-data) instruction extensions in general-purpose processors to improve processing throughput. In this work, we exploit state-of-the-art heterogeneous architectures, including GPUs, APUs, and FPGAs, to accelerate erasure coding. We leverage the OpenCL framework for our target heterogeneous architectures and propose code optimizations for each of them. Given their different hardware characteristics, we highlight the different optimization strategies for each target architecture. Defining throughput as the ratio of the input file size to the processing latency, we achieve 2.84 GB/s on a 28-core Xeon CPU, 3.90 GB/s on an NVIDIA K40m GPU, 0.56 GB/s on an AMD Carrizo APU, and 1.19 GB/s (5.35 GB/s if only considering the kernel execution latency) on an Altera Stratix V FPGA, when processing an 836.9 MB zipped file with a 30×33 encoding matrix. In comparison, the single-thread code using Intel's ISA-L library running on the Xeon CPU has a throughput of 0.13 GB/s.
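To put the figures above on a common scale, the short Python sketch below divides each reported throughput by the 0.13 GB/s single-thread ISA-L baseline. The throughput numbers are taken from the abstract; the speedup ratios are a derived illustration, not results stated by the authors, and the dictionary keys are labels chosen for this example.

```python
# Throughputs in GB/s, as quoted in the abstract above.
BASELINE_GBPS = 0.13   # single-thread ISA-L code on the Xeon CPU

throughput_gbps = {
    "28-core Xeon CPU": 2.84,
    "NVIDIA K40m GPU": 3.90,
    "AMD Carrizo APU": 0.56,
    "Altera Stratix V FPGA": 1.19,
    "Altera Stratix V FPGA (kernel only)": 5.35,
}

def throughput(file_size_gb: float, latency_s: float) -> float:
    """Throughput as defined in the abstract: input size over latency."""
    return file_size_gb / latency_s

# Speedup of each platform relative to the single-thread baseline.
speedup = {name: gbps / BASELINE_GBPS for name, gbps in throughput_gbps.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x")
```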

Archive | 2002