Fangyu Zheng | Researchain


Publication

Featured research published by Fangyu Zheng.

international conference on information security | 2014

Exploiting the Floating-Point Computing Power of GPUs for RSA

Fangyu Zheng; Wuqiong Pan; Jingqiang Lin; Jiwu Jing; Yuan Zhao

Asymmetric cryptographic algorithms (e.g., RSA and ECC) have been implemented on Graphics Processing Units (GPUs) for several years. These implementations mainly exploit the highly parallel GPU architecture and port the integer-based algorithms designed for common CPUs to GPUs, offering high performance. However, the great potential cryptographic computing power of GPUs, especially that of their more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we try to fully exploit the floating-point computing power of GPUs for RSA through several designs, including a floating-point-based Montgomery multiplication algorithm, optimizations of the fundamental operations, and use of the latest thread data-sharing instruction, shuffle. In experiments on an NVIDIA GTX Titan, 2048-bit RSA decryption reaches a throughput of 38,975 operations per second, which is 2.21 times the performance of the fastest existing integer-based work and outperforms the previous floating-point-based implementation by a large margin.
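The abstract does not spell out the floating-point representation, so the following is only a minimal sketch of the general idea, under stated assumptions: a large integer is split into small limbs stored as IEEE double-precision values, and the limbs are kept small enough that every product and column sum stays below 2^53 and is therefore exact. The 23-bit limb size and the names `to_limbs`, `mul_float_limbs`, and `from_columns` are illustrative choices, not taken from the paper, and the sketch covers only the multiplication step rather than the full Montgomery algorithm.

```python
LIMB_BITS = 23            # assumed limb size: products fit in 46 bits
BASE = 1 << LIMB_BITS


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian limbs stored as floats."""
    assert x < (1 << (LIMB_BITS * n)), "x does not fit in n limbs"
    return [float((x >> (LIMB_BITS * i)) & (BASE - 1)) for i in range(n)]


def mul_float_limbs(a, b):
    """Schoolbook product of two limb vectors using only float arithmetic.

    Each column sums at most n products below 2**46, so with n <= 128 every
    intermediate value is an integer below 2**53 and is represented exactly.
    """
    n = len(a)
    assert n == len(b) and n <= 128
    cols = [0.0] * (2 * n)
    for i in range(n):
        for j in range(n):
            cols[i + j] += a[i] * b[j]
    return cols


def from_columns(cols):
    """Recombine unnormalised column sums into a Python integer."""
    return sum(int(c) << (LIMB_BITS * k) for k, c in enumerate(cols))


# Usage: multiply two 1024-bit operands held in 45 limbs of 23 bits each.
x = (1 << 1024) - 12345
y = 3 ** 600
assert from_columns(mul_float_limbs(to_limbs(x, 45), to_limbs(y, 45))) == x * y
```

Carry propagation is deferred until the columns are recombined, which is what lets the inner loop consist of multiply-and-add operations alone.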

IEEE Transactions on Information Forensics and Security | 2017

An Efficient Elliptic Curve Cryptography Signature Server With GPU Acceleration

Wuqiong Pan; Fangyu Zheng; Yuan Zhao; Wen Tao Zhu; Jiwu Jing

Over the Internet, digital signature has been an indispensable approach to securing e-commerce and other online transactions requiring authentication. Concerning the computing costs of signature generation and verification, it has become a more and more common practice for security practitioners to outsource such computations from heavily loaded application servers called tenants to dedicated proxies like signature servers in the enterprise private cloud. In this paper, we present our high-performance signature server called Guess. It implements the elliptic curve digital signature algorithm (ECDSA) with 256-b key size on a Linux-powered commodity computer, harnessing a desktop graphics processing unit as a featured cryptographic accelerator. We demonstrate our experience in maximizing the computing power of Guess and also its capability to deliver such power to the tenants, which includes down-to-earth customization and optimization considering various hardware and software factors. Our comprehensive implementation of ECDSA is tested against intensive network traffic. Field experiments show that Guess achieves \(T_s = 8.71 \times 10^6\) operations per second (OPS) for signature generation or \(T_v = 9.29 \times 10^5\) OPS for verification, which is significantly faster than existent prototypes and products. Guess is a universal server that readily supports various categories of elliptic curve cryptographic schemes, such as digital signature, key agreement, and encryption.

workshop on information security applications | 2014

Exploiting the Potential of GPUs for Modular Multiplication in ECC

Fangyu Zheng; Wuqiong Pan; Jingqiang Lin; Jiwu Jing; Yuan Zhao

In traditional multiple-precision large-integer multiplication algorithms, the number of additions required is approximately equal to the number of multiplications. On some platforms, the large number of add instructions accounts for about half of the computing latency of the overall implementation. In this paper, we propose a multiplication algorithm that uses the separated multiply-add-with-carry instruction supported by NVIDIA GPUs. The algorithm reorders the computational sequence so that nearly all additions and carry-flag handling can be combined with the multiplication instructions. The number of add instructions needed decreases from \(O(n^2)\) in the prevailing schoolbook algorithm to \(O(n)\). Our resulting 256-bit modular multiplication and modular squaring over a Mersenne prime achieve 3.3837 billion and 5.9928 billion operations per second, respectively, and reach 96% of the GPU hardware limit. An elliptic curve point multiplication implementation using our algorithm achieves a 43.6% speedup over the fastest existing work.
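The abstract does not name the modulus it uses, so the sketch below only illustrates why moduli of Mersenne form are attractive for modular multiplication: because 2^k ≡ 1 (mod 2^k − 1), reduction needs only shifts, masks, and additions. The prime 2^127 − 1 and the function names are assumptions chosen for illustration, and the GPU instruction scheduling described in the abstract is not modelled.

```python
def mod_mersenne(x, k):
    """Reduce a non-negative integer x modulo m = 2**k - 1 without division.

    Since 2**k is congruent to 1 mod m, the high part of x can simply be
    added onto the low k bits; the loop repeats until the value fits.
    """
    m = (1 << k) - 1
    while x > m:
        x = (x & m) + (x >> k)
    return 0 if x == m else x


def mulmod_mersenne(a, b, k):
    """Modular multiplication over 2**k - 1 using the fold-and-add reduction."""
    return mod_mersenne(a * b, k)


# Usage with the Mersenne prime 2**127 - 1 (an illustrative choice).
K = 127
M = (1 << K) - 1
a, b = 3 ** 70 % M, 5 ** 50 % M
assert mulmod_mersenne(a, b, K) == (a * b) % M
```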

wireless algorithms systems and applications | 2018

Secure and Efficient Outsourcing of Large-Scale Matrix Inverse Computation

Shiran Pan; Qiongxiao Wang; Fangyu Zheng; Jiankuo Dong

Matrix inverse computation (MIC) is one of the fundamental mathematical tasks in linear algebra, and finds applications in many areas of science and engineering. In practice, MIC tasks often involve large-scale matrices and impose prohibitive computation costs on resource-constrained users. As cloud computing gains momentum, a resource-constrained client can choose to outsource a large-scale MIC task to a powerful but untrustworthy cloud. Because the input of and the solution to the MIC task usually contain the client’s private information, appropriate mechanisms should be put in place to address privacy concerns. In this paper, we employ certain matrix transformations and construct an outsourcing scheme, named SEMIC, which can solve the MIC task in a masked yet verifiable manner. Thorough theoretical analysis shows that SEMIC is correct, verifiable, and privacy-preserving. Extensive experimental results demonstrate that SEMIC significantly reduces the computation costs of the client. Compared with the most closely related work, our solution offers enhanced privacy protection without impairing efficiency.
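The abstract does not say which matrix transformations SEMIC uses, so the following is a generic sketch of masked outsourcing rather than the SEMIC construction. It assumes the client hides A as P·A·Q with random scaled-permutation matrices P and Q, lets the "cloud" (here a local `inverse` function) invert the masked matrix, recovers A⁻¹ = Q·(PAQ)⁻¹·P, and checks the answer with one random vector. All names are hypothetical, and dense multiplication is used for brevity even though a real client would exploit the sparsity of P and Q.

```python
from fractions import Fraction
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse(A):
    """Exact Gauss-Jordan inverse; stands in for the untrusted cloud's work.

    Raises StopIteration if A is singular (no pivot can be found).
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        scale = 1 / M[c][c]
        M[c] = [v * scale for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]


def random_mask(n, rng):
    """Random scaled permutation matrix: one nonzero per row and column."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [[Fraction(rng.randint(1, 97)) if j == perm[i] else Fraction(0)
             for j in range(n)] for i in range(n)]


def outsource_inverse(A, rng):
    n = len(A)
    P, Q = random_mask(n, rng), random_mask(n, rng)
    masked = matmul(matmul(P, A), Q)            # client: hide A
    returned = inverse(masked)                  # cloud: (PAQ)^-1 = Q^-1 A^-1 P^-1
    A_inv = matmul(matmul(Q, returned), P)      # client: unmask
    r = [[Fraction(rng.randint(1, 1000))] for _ in range(n)]
    if matmul(A, matmul(A_inv, r)) != r:        # cheap check: A (A^-1 r) == r
        raise ValueError("cloud returned an incorrect result")
    return A_inv


# Usage: a 2x2 matrix with determinant 1.
assert outsource_inverse([[2, 1], [5, 3]], random.Random(1)) == [[3, -1], [-5, 2]]
```

The random-vector check costs only matrix-vector products, which is why verification can be much cheaper for the client than inverting the matrix itself.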

Archive | 2018

Utilizing GPU Virtualization to Protect the Private Keys of GPU Cryptographic Computation

Ziyang Wang; Fangyu Zheng; Jingqiang Lin; Jiankuo Dong

Graphics processing units (GPUs) have become popular parallel computing platforms, an approach known as General-Purpose GPU (GPGPU) computing. Some security researchers have therefore chosen GPUs as cryptographic accelerators to secure massive volumes of transactions. However, despite their popularity and performance, the security issues of GPUs have been ignored. Information may leak under malicious attacks or even during normal GPU computing. Our objective is to protect the confidentiality of cryptographic keys in GPU computing environments and to provide easy-to-use programming with few constraints. In this paper, we propose a Linux prototype of a GPGPU computing solution powered by GPU virtualization technology, which keeps encrypted keys in the guest machine to protect secret keys from leakage even in the event of full system compromise. With API interception and redirection of CUDA, applications in Virtual Machines (VMs) can access the GPU device transparently. In addition, we use virtio, a dedicated virtual I/O device, to transfer data between the virtual and host machines with high performance. In our current study, we evaluate our prototype with a GPU implementation of ECC. We show that it protects the private keys of GPU cryptographic computation and incurs a low performance penalty compared with the native environment, demonstrating that the prototype is secure and requires reasonable overhead.

international conference on information and communication security | 2017

High-Performance Symmetric Cryptography Server with GPU Acceleration

Wangzhao Cheng; Fangyu Zheng; Wuqiong Pan; Jingqiang Lin; Huorong Li; Bingyu Li

With more and more sensitive and private data transferred over the Internet, various security protocols have been developed to secure end-to-end communication. However, in practical situations, applying these protocols degrades the overall performance of the whole system, in which the frequently used symmetric cryptographic operations on the server side are the bottleneck. In this contribution, we present a high-performance symmetric cryptography server. Firstly, the symmetric algorithm SM4 is carefully scheduled on GPUs, including instruction-level implementation and variable location improvement. Secondly, optimization methods are provided to speed up the inefficient data transfer between CPU and GPU. Finally, the overall server architecture is adapted for mass data encryption and delivers 15.96 Gbps of data encryption over the network, 1.23 times that of the fastest existing symmetric cryptographic server. Furthermore, the server can be boosted by 2.02 times with the high-speed pre-calculation technique for long-term-key applications such as IPSec VPN gateways.

Security and Communication Networks | 2017

Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

Jiankuo Dong; Fangyu Zheng; Wuqiong Pan; Jingqiang Lin; Jiwu Jing; Yuan Zhao

Implementations of asymmetric cryptographic algorithms (e.g., RSA and Elliptic Curve Cryptography) on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of their more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through several designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical use of the proposed algorithm, a new method converts the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving overall performance by about 5%. RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption outperforms the fastest existing integer-based result by 23%.
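For readers unfamiliar with the CRT speed-up mentioned in the abstract, here is a minimal textbook sketch in plain integer arithmetic. It is not the paper's GPU or floating-point implementation, and the toy key (p = 61, q = 53, e = 17) and the function name `rsa_crt_decrypt` are assumptions made for illustration only.

```python
def rsa_crt_decrypt(c, p, q, d):
    """RSA decryption via the Chinese Remainder Theorem (Garner's formula).

    Two half-size exponentiations modulo p and q replace one full-size
    exponentiation modulo n = p * q.
    """
    dp, dq = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)              # modular inverse, Python 3.8+
    m1 = pow(c, dp, p)
    m2 = pow(c, dq, q)
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


# Usage with a toy key; real keys use primes of 1024 bits or more.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
message = 65
ciphertext = pow(message, e, n)
assert rsa_crt_decrypt(ciphertext, p, q, d) == message
```

Because each exponentiation works on operands of half the bit length, a 2048-bit decryption reduces to two independent 1024-bit exponentiations, which also makes the work easier to spread across parallel hardware.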

international conference on selected areas in cryptography | 2016

PhiRSA: Exploiting the Computing Power of Vector Instructions on Intel Xeon Phi for RSA

Yuan Zhao; Wuqiong Pan; Jingqiang Lin; Peng Liu; Cong Xue; Fangyu Zheng

Efficient implementations of public-key cryptographic algorithms on general-purpose computing devices facilitate the application of cryptography in communication security. Existing solutions work in two different directions: implementations on GPUs achieve high throughput but high latency, while those on CPUs have low throughput and low latency. Intel Xeon Phi is the first highly parallel coprocessor of the Many Integrated Core (MIC) architecture, with up to 61 cores and one 512-bit Vector Processing Unit (VPU) in each core, which offers the potential to achieve both high throughput and low latency. In this paper, we propose a vector-oriented Montgomery multiplication design based on the vector carry propagation chain (VCPC) method to fully exploit the computing power of vector instructions on Intel Xeon Phi. Two key features of our design sharply reduce the number of instructions: (1) organizing the additions in Montgomery multiplication into four VCPCs to save the overhead of handling carry bits; (2) computing the intermediate scalar variable q in every round without breaking the flow of VCPCs. Furthermore, we offer the optimal Montgomery multiplication implementation of our design on Intel Xeon Phi, which keeps the VPUs fully pipelined and maintains carry bits in vector mask registers. Based on the above, we implement RSA, named PhiRSA, and evaluate it on Intel Xeon Phi 7120P. For 1024-, 2048- and 4096-bit RSA, PhiRSA performs 258,370, 41,803 and 5,358 decryptions per second, and the latencies are 0.94, 5.84 and 45.54 ms, respectively. These results are 4.1 to 8.5 times the performance of existing RSA implementations on Intel Xeon Phi, exhibit high throughput comparable to that on GPUs but with far fewer parallel tasks, and show low latency comparable to that on CPUs.
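To show where the per-round scalar q mentioned in the abstract comes from, here is a scalar, sequential model of word-serial Montgomery multiplication. It is a textbook formulation, not the VCPC design: the vectorisation, the four carry chains, and the mask-register handling are not reproduced, and the 32-bit word size and the function name are assumptions.

```python
def montgomery_multiply(a, b, n, word_bits=32):
    """Return a * b * R^-1 mod n, where R = 2**(word_bits * s).

    Requires n odd and 0 <= a, b < n. One word of a is consumed per round,
    and q is chosen so that the running sum becomes divisible by the word base.
    """
    w = 1 << word_bits
    s = -(-n.bit_length() // word_bits)       # number of words in n
    n_prime = (-pow(n, -1, w)) % w            # -n^-1 mod word base
    t = 0
    for i in range(s):
        a_i = (a >> (word_bits * i)) & (w - 1)
        t += a_i * b
        q = ((t & (w - 1)) * n_prime) % w     # per-round quotient digit
        t = (t + q * n) >> word_bits          # exact division by the word base
    return t - n if t >= n else t             # t < 2n, so one subtraction suffices


# Usage: multiply in the Montgomery domain, then convert back.
n = (1 << 127) - 1                            # any odd modulus works
R = 1 << (32 * 4)                             # 127 bits occupy four 32-bit words
x, y = 3 ** 60 % n, 7 ** 40 % n
xm, ym = x * R % n, y * R % n
zm = montgomery_multiply(xm, ym, n)
assert montgomery_multiply(zm, 1, n) == x * y % n
```

Each round depends on q from the low word of the running sum, which is the data dependency the abstract's second design feature is concerned with.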

information security | 2016

RegRSA: Using Registers as Buffers to Resist Memory Disclosure Attacks

Yuan Zhao; Jingqiang Lin; Wuqiong Pan; Cong Xue; Fangyu Zheng; Ziqiang Ma

Memory disclosure attacks, such as cold-boot attacks and DMA attacks, allow attackers to access all memory contents and therefore pose a great threat to plaintext sensitive data in memory. Register-based and cache-based schemes have been used to implement RSA securely, at the expense of decreased performance. In this paper, we propose another concept, named register buffer, which uses all available registers, whether scalar or vector, as a secure data buffer. The plaintext sensitive data appear only in the register buffer. Based on this concept, we build a secure implementation of 2048-bit RSA, called RegRSA, to defend against memory disclosure attacks. The 1024-bit Montgomery multiplication in RegRSA runs entirely in the register buffer, performing computations with scalar instructions and registers and maintaining intermediate variables in vector registers. Due to the size limitation of the register buffer, several variables outside the Montgomery multiplications are spilled into memory. RegRSA encrypts these variables with AES before saving them to memory. Furthermore, RegRSA employs a windowing method and the CRT speed-up to accelerate RSA, and minimizes the data exchange between registers and memory to reduce the workload of AES encryption/decryption. The evaluation on an Intel Haswell i7-4770R shows that RegRSA achieves a factor of 0.74 of the performance of the regular RSA implementation in OpenSSL and is much faster than PRIME, the existing register-based scheme for 2048-bit RSA. Moreover, RegRSA allows multiple instances to run simultaneously on a multi-core CPU, which makes it more practical for real-world applications.
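The abstract mentions a windowing method without giving its parameters, so the sketch below shows a generic fixed-window modular exponentiation as one possible reading. The 4-bit window and the name `window_pow` are assumptions, and the register-buffer and AES-spilling mechanisms of RegRSA are not modelled.

```python
def window_pow(base, exp, mod, w=4):
    """Fixed-window exponentiation: base**exp mod mod, for exp >= 0 and mod > 1.

    A table of the first 2**w powers is built once; the exponent is then
    consumed w bits at a time, costing w squarings and one table
    multiplication per window.
    """
    table = [1] * (1 << w)
    for i in range(1, 1 << w):
        table[i] = table[i - 1] * base % mod

    result = 1
    windows = -(-exp.bit_length() // w)       # ceiling division
    for k in reversed(range(windows)):
        for _ in range(w):
            result = result * result % mod
        digit = (exp >> (k * w)) & ((1 << w) - 1)
        result = result * table[digit] % mod  # multiply even when digit == 0
    return result


# Usage: compare against Python's built-in modular exponentiation.
assert window_pow(7, 2 ** 200 + 12345, 1000003) == pow(7, 2 ** 200 + 12345, 1000003)
```

A larger window means fewer multiplications but a larger table, which matters in a setting like the one described here, where buffer space is scarce and anything that spills to memory has to be encrypted first.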

communications and networking symposium | 2018