Jizeng Wei
Tianjin University
Publication
Featured research published by Jizeng Wei.
IEEE Transactions on Circuits and Systems II: Express Briefs | 2015
Jingwei Hu; Wei Guo; Jizeng Wei; Ray C. C. Cheung
Finite field inversion is the most computationally intensive field operation in public-key cryptographic algorithms such as elliptic curve cryptography. In this brief, we propose two inversion acceleration techniques for the Itoh-Tsujii algorithm (ITA) over binary extension fields. First, we reformulate the ternary-ITA algorithm to generalize the primitive one, so that a universal procedure for all fields is achieved. Next, we devise a parallel-ITA algorithm to increase the parallelism of ITA. Both techniques are implemented on an FPGA platform, and experiments show that a fast ternary-ITA inverter supporting all NIST fields can be obtained, with a 22.9% timing improvement on average compared to the ITA inverter. In addition, the parallel-ITA inverter is a more balanced design that reduces timing by 25.7% on average compared to the ITA inverter while achieving a 31.3% reduction in area-time product compared to the ternary-ITA inverter.
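For orientation, the classical Itoh-Tsujii idea that this brief builds on can be sketched as follows. This is a minimal illustration of the standard binary addition-chain ITA, not the ternary or parallel variants proposed by the authors. The function names (`gf_mul`, `gf_sqr_n`, `ita_inverse`) are hypothetical, and a bit-serial software multiplier stands in for the hardware field multiplier. The inverse is computed as a^(2^m − 2), built from the intermediate values beta_k = a^(2^k − 1).

```python
def gf_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m); poly is the reduction polynomial including x^m."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:      # degree reached m: reduce
            a ^= poly
    return r


def gf_sqr_n(a, n, m, poly):
    """Square a repeatedly n times, i.e. compute a^(2^n)."""
    for _ in range(n):
        a = gf_mul(a, a, m, poly)
    return a


def ita_inverse(a, m, poly):
    """Invert a in GF(2^m) with the binary addition-chain form of ITA."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    beta, k = a, 1                       # beta_1 = a^(2^1 - 1) = a
    for bit in bin(m - 1)[3:]:           # bits of m-1 after the leading 1
        # doubling step: beta_2k = (beta_k)^(2^k) * beta_k
        beta = gf_mul(gf_sqr_n(beta, k, m, poly), beta, m, poly)
        k *= 2
        if bit == "1":
            # increment step: beta_(k+1) = (beta_k)^2 * a
            beta = gf_mul(gf_mul(beta, beta, m, poly), a, m, poly)
            k += 1
    # k == m - 1 here, so beta^2 = a^(2^m - 2) = a^(-1)
    return gf_mul(beta, beta, m, poly)


if __name__ == "__main__":
    AES_POLY = 0x11B                     # x^8 + x^4 + x^3 + x + 1
    inv = ita_inverse(0x53, 8, AES_POLY)
    print(hex(inv), gf_mul(0x53, inv, 8, AES_POLY))
```

The chain needs only about log2(m − 1) multiplications plus m − 1 squarings, which is why squaring-heavy hardware designs favor it.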
International Symposium on Parallel Architectures, Algorithms and Programming | 2011
Jingwei Hu; Wei Guo; Jizeng Wei; Yisong Chang; Da-Zhi Sun
RSA key generation is of great concern when implementing the RSA cryptosystem on embedded systems because of its long processing latency. In this paper, a novel architecture is presented that speeds up RSA key generation on embedded platforms with limited processing capacity. To exploit more data-level parallelism, the Residue Number System (RNS) is introduced to accelerate RSA key pair generation, since its independent residue elements can be processed simultaneously. A cipher processor based on Transport Triggered Architecture (TTA) is proposed to realize the parallelism at the architecture level. In addition, division is avoided in the proposed architecture, which remarkably reduces the hardware implementation cost. The proposed design is implemented in Verilog HDL and synthesized in a 0.18 µm CMOS process. A rate of 3 key pairs per second can be achieved for 1024-bit RSA key generation at a frequency of 100 MHz.
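The data-level parallelism that RNS offers can be illustrated with a small sketch. The moduli and function names below are assumptions chosen for illustration and are not taken from the paper. Each residue channel is computed independently, so in hardware the channels could run side by side, and the result is recovered with the Chinese Remainder Theorem.

```python
from math import prod

# Hypothetical pairwise-coprime moduli (251 is prime, 253 = 11*23,
# 255 = 3*5*17, 256 = 2^8); a real design would use far wider channels.
MODULI = (251, 253, 255, 256)
M = prod(MODULI)


def to_rns(x):
    """Split x into one small residue per channel."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Channel-wise addition; no carries cross between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Channel-wise multiplication; each channel is independent."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(r):
    """Reconstruct the integer (mod M) with the Chinese Remainder Theorem."""
    total = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        total += ri * Mi * pow(Mi, -1, mi)   # the inverses could be precomputed
    return total % M


if __name__ == "__main__":
    x, y = 123456, 7890
    print(from_rns(rns_mul(to_rns(x), to_rns(y))) == x * y)
```

The result is only correct while the true value stays below the product of the moduli, so the dynamic range has to be sized for the largest intermediate value.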
Australasian Conference on Information Security and Privacy | 2014
Zhaojing Ding; Wei Guo; Liangjian Su; Jizeng Wei; Haihua Gu
In 2005, Yen et al. first proposed the N − 1 attack against cryptosystems implemented with the BRIP and square-multiply-always algorithms. This attack uses the input message N − 1 to obtain relevant side-channel information from the attacked cryptosystem. In this paper we conduct an in-depth study of the N − 1 attack and find that two more special values can also be exploited by an attacker when taken as the input message. Based on this, we present a chosen-message attack against Boscher’s right-to-left exponentiation algorithm, which is a side-channel resistant exponentiation algorithm. Furthermore, the immunity of the Montgomery Powering Ladder against the N − 1 attack is investigated. The result is that the Montgomery Powering Ladder is also vulnerable to the N − 1 attack, although a different approach, derived from the relative doubling attack, is used to retrieve the key. To validate our ideas, we implement the two algorithms in hardware and carry out the attacks on them. The experimental results show that our attacks are powerful against these two algorithms and can be easily mounted with a single power consumption curve.
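The core observation behind the original N − 1 attack can be shown with a toy simulation. This is a simplified sketch of the attack on left-to-right square-multiply-always, not the attacks on Boscher's algorithm or the Montgomery Powering Ladder studied in the paper. The side channel is modeled, as an assumption, by recording the operand of each squaring, and the function names are hypothetical. Because (N − 1)^2 ≡ 1 (mod N), every intermediate value is either 1 or N − 1, and which of the two gets squared next depends on the previous key bit.

```python
def sq_mul_always(msg, key_bits, n, trace):
    """Left-to-right square-and-multiply-always; key_bits is MSB first."""
    r = 1
    for bit in key_bits:
        trace.append(r)        # modeled leak: which operand is being squared
        r = r * r % n
        t = r * msg % n        # multiplication is always performed
        if bit:
            r = t              # result is kept only when the key bit is 1
    return r


def recover_key(n, device):
    """Recover the key bits from one run with the chosen message N - 1."""
    trace = []
    out = device(n - 1, trace)
    # The operand squared in round i+1 is N-1 exactly when key bit i was 1.
    bits = [1 if operand == n - 1 else 0 for operand in trace[1:]]
    # The last bit shows up in the output value itself.
    bits.append(1 if out == n - 1 else 0)
    return bits


if __name__ == "__main__":
    n = 3233                                   # toy modulus 61 * 53
    secret = [1, 0, 1, 1, 0, 0, 1]
    device = lambda msg, trace: sq_mul_always(msg, secret, n, trace)
    print(recover_key(n, device) == secret)
```

In a real measurement the attacker would not see operand values directly but would distinguish the two squaring patterns in the power trace.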
Microprocessors and Microsystems | 2013
Yisong Chang; Jizeng Wei; Wei Guo; Jizhou Sun
A fully programmable vertex shader based on Transport Triggered Architecture (TTA) is proposed in this paper to provide high performance and connectivity efficiency for embedded applications. At the architecture level, fine-grained data transport in the TTA datapath and a multi-threading method are adopted to exploit instruction-level and data-level parallelism, respectively, in graphics applications. The datapath connectivity is optimized mainly by the native architecturally visible bypass in TTA and hybrid result re-collection schemes. At the shader core level, a novel SIMD multi-functional dot-product unit and an area-efficient special function unit are introduced for floating-point processing. The proposed processor achieves a peak capacity of 1.5 GFLOPS and 125 Mvertices/s. For real graphics benchmarks under a 0.18 µm CMOS process, it achieves a 17.6% reduction in hardware cost and a 1.3 times improvement in the performance per logic cost ratio compared to the previous expanded VLIW vertex processor.
Conference on E-Business Technology and Strategy | 2012
Yanhua Liu; Wei Guo; Ya Tan; Jizeng Wei; Da-Zhi Sun
In this paper, we propose an efficient implementation scheme for digital signatures based on the cryptographic algorithm SM2, which is established as the Elliptic Curve Cryptography (ECC) standard of China. Algorithm analysis reveals that the speed bottleneck lies in scalar multiplication, which is time consuming for the master processor to perform. Therefore, a configurable ECC coprocessor is employed in the scheme to improve the processing speed. To improve the efficiency of data transport within the digital signature, an architecture with fine-grained programming and high instruction-level parallelism is employed. The point doubling algorithm is optimized to use fewer intermediate registers, reducing space complexity. The coprocessor significantly improves the speed of the critical steps of the SM2 digital signature. With these improvements, scalar multiplication can be achieved in 3 ms at 80 MHz for 192-bit ECC. The results show that our scheme is competitive for embedded platforms.
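To show why scalar multiplication dominates the cost, here is a minimal double-and-add sketch in affine coordinates. It uses a toy textbook curve, y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1), rather than the SM2 parameters, and it does not reproduce the register-optimized point doubling described in the paper. All names are hypothetical.

```python
# Toy curve parameters (illustration only, not SM2).
P, A, B = 17, 2, 2
G = (5, 1)


def point_add(p1, p2):
    """Add two affine points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                                  # P + (-P) = infinity
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow(2 * y1 % P, -1, P) % P    # tangent slope
    else:
        s = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P         # chord slope
    x3 = (s * s - x1 - x2) % P
    y3 = (s * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, pt):
    """Right-to-left double-and-add: one doubling per bit of k."""
    result, addend = None, pt
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


if __name__ == "__main__":
    print(scalar_mult(2, G))   # doubling of the base point
```

Every bit of the scalar costs a point doubling, and each doubling or addition hides a modular inversion in affine form, which is the kind of work a coprocessor is meant to offload.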
2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip | 2011
Jizeng Wei; Yisong Chang; Wei Guo; Jizhou Sun
An alternative VLIW architecture for the vertex shader datapath, based on transport triggered architecture (TTA), is proposed in detail. This architecture can exploit more instruction-level parallelism (ILP) than a traditional VLIW architecture through fine-grained data transport. The proposed vertex shader architecture also provides a simple, user-optimized interconnection network that efficiently reduces the complexity of interconnection design. The evaluation results show that the proposed architecture achieves an almost 18% reduction in the number of interconnections and a 1.4 times improvement in code density compared with the multi-threaded expanded VLIW architecture (MT-eVLIW).
Asia Pacific Conference on Circuits and Systems | 2008
Jizeng Wei; Wei Guo; Jizhou Sun; Zaifeng Shi
Application-specific instruction processors (ASIP) tailored to application requirements are often at the center of today's embedded systems. Therefore, considerable effort has been spent on constructing tools that assist in co-designing ASIPs. It is desirable that such design toolsets support an automated design flow from application source code down to a synthesizable processor description and optimized machine code. In this paper, we describe such a toolset for the Tcore processor, which is derived from Transport Triggered Architecture (TTA). We have addressed some of the pressing shortcomings found in existing toolsets, especially in the design of the compiler. Finally, we present satisfactory results for an image contrast enhancement algorithm implemented on the Tcore processor under many different configurations through the toolset.
Trust, Security and Privacy in Computing and Communications | 2016
Zhongyuan Hao; Wei Guo; Jizeng Wei; Da-Zhi Sun
Recently, pairing-based cryptography has drawn a lot of attention since it is the key technology for constructing identity-based encryption schemes. The major operations of a pairing are elliptic curve computations defined over finite field arithmetic. Hardware implementation of large prime field operations is critical to improving the practicality of pairing-based cryptography. In this paper, we present a parallel processing cryptoprocessor architecture for optimal ate pairings over BN curves in a prime field. The proposed design contains two arithmetic processing engines that complete prime field operations in parallel. Each engine is built around a unified Fp arithmetic unit to save hardware resources. This architecture can also be used to implement RSA and ECC cryptography schemes in parallel to satisfy industry demand. The design is implemented on a Virtex-5 FPGA device. The results show that our design consumes 10592 Slices and computes the optimal ate pairing within 283,111 clock cycles, achieving a better balance between speed and area than other designs.
International Midwest Symposium on Circuits and Systems | 2011
Bing Liu; Jizeng Wei; Shaofei Shi; Yisong Chang; Wei Guo
Depth Image Based Rendering (DIBR) is the most popular method for generating stereoscopic images. In this paper, a novel pixel-level, fully pipelined hardware accelerator is presented. The proposed architecture, with a division elimination algorithm and a cache window design, achieves real-time rendering speed at low cost. The hardware design is implemented and verified on an FPGA platform. The results show that the design can be applied to handheld devices due to its high efficiency.
PLOS ONE | 2018
Qingran Wang; Wei Guo; Jizeng Wei
Microprocessors in safety-critical systems are extremely vulnerable to hacker attacks and circuit crosstalk, as these can modify binaries and lead programs to run along the wrong control flow paths. It is a significant challenge to design a run-time validation method with few hardware modifications. In this paper, an efficient control flow validation method named DCM (Dual-Processor Control Flow Validation Method) is proposed based on a dual-processor architecture. Since a burst of memory-access-intensive instructions can block the pipeline and cause many wait cycles, DCM assigns the idle pipeline cycles of the blocked processor to the other processor to validate control flow at run time. An extra lightweight monitor unit is needed in each processor, and a special dual-processor communication protocol is also designed to better schedule the redundant computing capacity between the two processors for validation tasks. To further improve efficiency, we also design a software-based self-validation algorithm to help reduce the number of validations. The combination of the hardware and software methods speeds up the validation procedure and protects the control flow paths with different emphases. The cycle-accurate simulator GEM5 is used to simulate two ARMv7-A processors with out-of-order pipelines. Experiments show that the performance overhead of DCM is less than 22% on average across the SPEC 2006 benchmarks.