Qlib: An AI-oriented Quantitative Investment Platform
Xiao Yang, Weiqing Liu, Dong Zhou, Jiang Bian and Tie-Yan Liu
Microsoft Research
{Xiao.Yang, Weiqing.Liu, Zhou.Dong, Jiang.Bian, Tie-Yan.Liu}@microsoft.com

Abstract
Quantitative investment aims to maximize return and minimize risk in a sequential trading period over a set of financial instruments. Recently, inspired by the rapid development and great potential of AI technologies in generating remarkable innovation in quantitative investment, there has been increasing adoption of AI-driven workflows for quantitative research and practical investment. While enriching the quantitative investment methodology, AI technologies have also raised new challenges for the quantitative investment system. In particular, the new learning paradigms for quantitative investment call for an infrastructure upgrade to accommodate the renovated workflow; moreover, the data-driven nature of AI technologies demands an infrastructure with more powerful performance; additionally, there exist unique challenges in applying AI technologies to the different tasks that arise in financial scenarios. To address these challenges and bridge the gap between AI technologies and quantitative investment, we design and develop Qlib, which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment.
Introduction

Quantitative investment, one of the hottest research fields, has been attracting numerous brilliant minds from both academia and the financial industry. In the last decades, with continuous efforts in optimizing the quantitative methodology, the community of professional investors has converged on a well-established yet imperfect quantitative research workflow. Recently, emerging AI technologies have started a new trend in this research field. With increasing attention to exploring AI's great potential in quantitative investment, AI technologies have been widely adopted in practical investment by quantitative researchers.

While AI technologies have been enriching the quantitative investment methodology, they also pose new challenges to the quantitative investment system from multiple perspectives. First, the technological revolution in the quantitative investment workflow, driven by the flexibility of AI technologies, requires new supportive infrastructure. For example, while traditional quantitative investment usually splits the whole workflow into a couple of sub-tasks, including stock trend prediction, portfolio optimization, etc., AI technologies make it possible to establish an end-to-end solution that generates the final portfolio directly. Supporting such an end-to-end solution requires upgrading the current infrastructure due to its data-driven nature.

Meanwhile, AI technologies have to deal with unique problems in new scenarios, which require both plenty of domain knowledge in finance and rich experience in data science. Applying off-the-shelf solutions to quantitative research tasks without any domain adaptation rarely works. Such circumstances lead to urgent demand for a platform that accommodates the modern quantitative research workflow in the age of AI and provides guidance for applying AI technologies in financial scenarios.

Therefore, we propose a new AI-oriented Quantitative Investment Platform called Qlib.
It aims to assist the research efforts of exploring the great potential of AI technologies in quantitative investment, as well as to empower quantitative researchers to create more significant value with AI-driven quantitative investment. Specifically, the AI-oriented framework of Qlib is designed to accommodate AI-based solutions. Moreover, it provides a high-performance infrastructure dedicated to the quantitative investment scenario, which makes many AI research topics possible. In addition, a batch of tools designed for machine learning in the quantitative investment scenario is integrated into Qlib to help users make full use of AI technologies.

At last, we demonstrate some use cases and evaluate the performance of the infrastructure of Qlib by comparing several solutions for a typical task in quantitative investment. The results show that the infrastructure of Qlib, dedicated to quantitative investment, outperforms most existing solutions on this task. The code is available at https://github.com/microsoft/qlib.

In this section, we will first demonstrate the major practical problems a modern quantitative researcher faces when applying AI technologies in quantitative investment, which motivate the birth of Qlib. After that, we will briefly introduce the related work.

Quantitative research workflow revolution
In the traditional investment research workflow, researchers often develop trading signals with linear models [Petkova, 2006] or manually designed rules [Murphy, 1999] based on several factors (factors are similar to features in machine learning) and basic financial data. Then, a trading strategy (typically Barra [Sheikh, 1996]) is followed to generate the target portfolio. At last, researchers evaluate the trading signal and portfolio with a back-testing function.

The rise of AI technologies has launched a technological revolution in traditional quantitative investment, and the traditional quantitative research workflow is too primitive to accommodate such flexible technologies. To show the difference more intuitively, we demonstrate a typical modern research workflow based on AI technologies. It starts with a dataset with many features (typically more than hundreds of dimensions). Manually designing such an amount of features takes lots of time, so it is common to leverage machine learning algorithms to generate them automatically [Potvin et al., 2004; Neely et al., 1997; Allen and Karjalainen, 1999; Kakushadze, 2016]. Generating data [Feng et al., 2019] is another option for constructing a dataset. Based on diverse datasets, researchers have proposed hundreds of machine learning methods to mine trading signals [Sezer et al., 2019], from which the target portfolio can be generated. But such a workflow is not the only choice. Instead of dividing a task into several stages, RL (reinforcement learning) provides an end-to-end solution from the data to the final trading actions directly [Deng et al., 2016]. RL optimizes the trading strategy by interacting with the environment, which is a trading simulator in the financial scenario.
RL needs a responsive simulator rather than the back-testing function of the traditional research workflow. Moreover, most AI algorithms have complicated hyperparameters, which need to be tuned carefully. AI technologies are flexible and already beyond the scope of existing tools designed for traditional methodologies, and building a research workflow based on AI technologies from scratch takes much time.
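To make the distinction concrete, a responsive simulator exposes a step-by-step interface that an RL agent can interact with at every decision point, rather than replaying a finished strategy in one pass. Below is a minimal sketch of such an interface; the class name, reward convention, and position semantics are illustrative assumptions, not Qlib's actual API.

```python
# Minimal sketch of a "responsive" trading simulator: each call to
# step() advances one period and returns feedback, the interaction
# pattern RL needs and a one-shot back-testing function cannot offer.
class TradingSimulator:
    def __init__(self, prices):
        self.prices = prices
        self.t = 0
        self.position = 0.0  # current holding, in [-1, 1]
        self.pnl = 0.0

    def reset(self):
        self.t, self.position, self.pnl = 0, 0.0, 0.0
        return self.prices[0]

    def step(self, action):
        """action: target position in [-1, 1]; returns (obs, reward, done)."""
        price_change = self.prices[self.t + 1] - self.prices[self.t]
        reward = self.position * price_change  # P&L of the held position
        self.pnl += reward
        self.position = action
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self.prices[self.t], reward, done

sim = TradingSimulator([100.0, 101.0, 100.5])
obs = sim.reset()
obs, r, done = sim.step(1.0)   # go long
obs, r2, done = sim.step(0.0)  # close out
```

An RL agent would call `reset()` and `step()` in a loop, using the returned reward as its learning signal.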
High performance requirements for infrastructure
With the emergence of AI technologies, the requirements for infrastructure have changed. Such data-driven methods can leverage a huge amount of data; the amount can reach TB magnitude in the scenario of high-frequency trading. Besides, it is very common to derive thousands of new features (e.g., Alpha101 [Kakushadze, 2016]) from the basic price and volume data, which consist of only five dimensions in total. Some researchers even try to create new factors or features by searching expressions [Allen and Karjalainen, 1999; Neely et al., 1997; Potvin et al., 2004]. Such heavy data processing overburdens researchers and even makes some research topics impossible. These circumstances put forward more stringent performance requirements for the infrastructure.
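To get a feel for how quickly derived features multiply from the five basic columns, consider even the most naive pairwise transformations; the specific ratio features below are made up for illustration and are far simpler than Alpha101-style expressions.

```python
import pandas as pd

# Illustrative only: deriving a batch of features from the five basic
# OHLCV columns. Real factor libraries derive thousands of such columns.
ohlcv = pd.DataFrame({
    "open":   [10.0, 10.5, 10.2],
    "high":   [10.8, 10.9, 10.6],
    "low":    [ 9.9, 10.3, 10.0],
    "close":  [10.5, 10.4, 10.5],
    "volume": [1000, 1200,  900],
})

features = {}
for a in ["open", "high", "low", "close"]:
    for b in ["open", "high", "low", "close"]:
        if a != b:
            # 12 pairwise ratio features from just 4 price columns
            features[f"{a}_over_{b}"] = ohlcv[a] / ohlcv[b]
derived = pd.DataFrame(features)
```

Scaling this from 12 columns to thousands, over years of daily (or intraday) data for thousands of instruments, is what drives the performance requirements described above.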
Obstacles to applying machine learning solutions
Financial data and tasks have their own uniqueness and challenges. Applying machine learning solutions to quantitative research tasks without any adaptation rarely works. Due to the extremely low SNR (signal-to-noise ratio) in financial data, it is very hard to build a successful data-driven strategy in financial markets. Most machine learning algorithms are data-driven and have to deal with this difficulty. Without carefully handling the details, machine learning models can hardly achieve satisfying performance; even a minor mistake can make the model over-fit the noise rather than learn effective patterns. Handling the details correctly requires a lot of domain knowledge of the financial industry. Moreover, typical objectives, such as annualized return, are often not differentiable, which makes it hard to train models on them directly. Defining a reasonable task with appropriate supervised targets is very important for modeling financial data. Such barriers daunt quite a lot of data scientists without much domain knowledge of the financial industry.

Another necessary step in building a machine learning application is hyperparameter optimization. Different machine learning algorithms have different hyperparameter search spaces, each of which has multiple dimensions with different meanings and priorities. Some quantitative researchers come from the traditional financial industry and don't have much knowledge of machine learning. Such a steep learning curve stops many users from extracting the maximum value from machine learning.
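As a concrete example of such a supervised target, one common convention is to turn the non-differentiable return objective into a regression label: predict the next period's return computed from close prices. This is an illustrative choice on our part, not necessarily the label definition shipped with Qlib.

```python
import pandas as pd

# Next-period return as a regression label. The last row has no future
# price, so its label is NaN and is dropped before training.
close = pd.Series([10.0, 10.5, 10.2, 10.8])
label = close.shift(-1) / close - 1  # return over the following period
train_labels = label.dropna()
```

A model trained to predict this label can then be evaluated against the true objectives (annualized return, risk) in back-testing, even though those objectives are not trained on directly.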
In the financial industry, an investment strategy becomes less profitable as more investors follow it. Therefore, financial practitioners, especially quantitative researchers, are never keen to share their own algorithms and tools. OLPS [Li et al., 2016] is the first open-source toolbox for portfolio selection. It consists of a family of classical strategies powered by machine learning algorithms as benchmarks, and a toolkit to facilitate the development of new learning methods. This toolbox only supports Matlab and Octave, which are not compatible with Python, the current mainstream scientific language, and thus not friendly to modern machine learning algorithms. Its framework is also quite simple, while the modern quantitative research workflow based on AI technologies is much more complicated. Other quantitative tools have emerged in recent years. QuantLib [Firth, 2004] only covers part of the modern quantitative research workflow. QUANTAXIS (https://github.com/QUANTAXIS/QUANTAXIS) focuses more on the IT infrastructure than on the research workflow. Quantopian has released a series of open-source tools: 1) Alphalens, a Python library for performance analysis of predictive (alpha) stock factors; 2) Zipline, an event-driven system for back-testing; 3) Pyfolio, a Python library for performance and risk analysis of financial portfolios. All of them only focus on the analysis of trading signals or an investment portfolio.

Overall, Qlib is the first open-source platform that accommodates the workflow of a modern quantitative researcher in the age of AI. It aims to empower every quantitative researcher to realize the great potential of AI technologies in quantitative investment.

In cooperation with quantitative researchers with years of hands-on experience in the financial market, we have encountered all of the above problems and explored all kinds of solutions. Motivated by these circumstances, we implement Qlib to apply AI technologies in quantitative investment.
AI-oriented framework
Qlib is designed in a modularized way based on the modern research workflow to provide maximum flexibility for accommodating AI technologies. Quantitative researchers can extend the modules and build a workflow to try their ideas efficiently. In each module, Qlib provides several default implementations that work well in practical investment. With these off-the-shelf modules, quantitative researchers can focus on the problem they are interested in within a specific module without being distracted by other trivial details. Besides code, computation and data can also be shared in some modules, so Qlib is designed to serve users as a platform rather than a toolbox.
High-performance infrastructure
The performance of data processing is important to data-driven methods such as AI technologies. As an AI-oriented platform, Qlib provides a high-performance data infrastructure, centered on a time-series flat-file database (https://en.wikipedia.org/wiki/Flat-file_database). Such a database is dedicated to scientific computing on financial data. It greatly outperforms current popular storage solutions, such as general-purpose databases and time-series databases, on typical data processing tasks in quantitative investment research. Furthermore, the database provides an expression engine, which accelerates the implementation and computation of factors/features and makes research topics that rely on expression computation possible.

Guidance for machine learning
Qlib is integrated with typical datasets for quantitative investment, on which typical machine learning algorithms can successfully learn patterns with generalization ability. Qlib provides basic guidance for machine learning users and integrates reasonable task definitions that consist of sensible feature spaces and target labels. Typical hyperparameter optimization tools are also provided. With this guidance and reasonable settings, machine learning models can learn patterns with better generalization ability instead of just over-fitting the noise.
Figure 1 shows the overall framework of Qlib. This framework aims to 1) accommodate modern AI technology, 2) help quantitative researchers build a whole research workflow with minimal effort, and 3) leave them maximal flexibility to explore the problems they are interested in without getting distracted by other parts.

Figure 1: modules and a typical workflow built with Qlib

Such a target leads to a modularized design from the perspective of system design. The system is split into several individual modules based on the modern practical research workflow. Most quantitative investment research directions, whether traditional or AI-based, can be regarded as implementations of one or multiple modules' interfaces. In each module, Qlib provides several typical implementations that work well in practical investment. Moreover, the modules provide the flexibility for researchers to override existing methods to explore new ideas. With such a framework, researchers can try new ideas and test the overall performance together with the other modules at minimal cost.

The modules of Qlib are listed in Figure 1 and connected in a typical workflow. Each module corresponds to a typical sub-task in quantitative investment, and an implementation in a module can be regarded as a solution for that task. We will introduce each module and give related examples of existing quantitative research to show how Qlib accommodates them.

It starts with the
Data Server module in the bottom-left corner, which provides a data engine to query and process raw data. With the retrieved data, researchers can build their own datasets in the Data Enhancement module. Researchers have tried many solutions for building better datasets by exploring and constructing effective factors/features [Potvin et al., 2004; Neely et al., 1997; Allen and Karjalainen, 1999; Kakushadze, 2016]; generating datasets for training [Feng et al., 2019] is another research direction for providing dataset solutions. The Model Creator module learns models based on the datasets. In recent years, numerous researchers have explored all kinds of models to mine trading signals from financial datasets [Sezer et al., 2019]. Moreover, meta-learning [Vilalta and Drissi, 2002], which tries to learn to learn, provides a new learning paradigm for the Model Creator module. Given the abundance of methods for modeling financial data in a modern research workflow, a model management system has become a necessary part of the workflow; the Model Manager module is designed to handle this for modern quantitative researchers. With diverse models available, ensemble learning is quite an effective way to enhance the performance and robustness of machine learning models, and it is frequently used in the financial area [Qiu et al., 2014; Yang et al., 2017; Zhao et al., 2017]; it is supported by the Model Ensemble module. The Portfolio Generator module aims to generate a portfolio from the trading signals output by the models, which is known as portfolio management [Qian et al., 2007]; Barra [Sheikh, 1996] provides the most popular solution for this task. With the target portfolio, we provide a high-fidelity trading simulator, the Order Executor module, to examine the performance of a strategy, and Analyser modules to automatically analyze the trading signals, portfolio, and execution results. The Order Executor module is designed as a responsive simulator rather than a back-testing function, which provides the infrastructure for learning paradigms (e.g., RL) that require environment feedback, produced by the Analyser modules.

The data in quantitative investment are in time-series format and updated over time, so the size of the in-sample dataset grows over time. A typical practice to leverage the new data is to update models regularly [Wang et al., 2019b]. Beyond better utilization of the growing in-sample data, dynamically updating models [Yang et al., 2019] and trading strategies [Wang et al., 2019a] can improve performance further due to the dynamic nature of the stock market [Adam et al., 2016]. Therefore, a static set of models and trading strategies, as in the Static Workflow, is obviously not the optimal solution. Dynamic updating of models and strategies is an important research direction in quantitative investment, and the modules in Dynamic Modeling provide interfaces and infrastructure to accommodate such solutions.
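The regular-update practice above can be sketched as re-training on an expanding in-sample window. This is a toy illustration of the scheme, not Qlib's interface; `train_fn` stands in for any model-fitting routine.

```python
# Periodic re-training on an expanding in-sample window: each step adds
# the newly arrived data and fits a fresh model on everything seen so far.
def rolling_retrain(data, train_fn, window=3):
    models = []
    for end in range(window, len(data) + 1):
        models.append(train_fn(data[:end]))  # expanding in-sample set
    return models

# Toy "model": the running mean of the observed values.
means = rolling_retrain([1, 2, 3, 4, 5], train_fn=lambda d: sum(d) / len(d))
```

A dynamic-modeling variant would additionally adapt the model or strategy itself between refits, rather than only growing the training set.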
Financial data
We summarize the data requirements of quantitative research in this section. In quantitative research, the most frequently-used data follow the format
BasicData_T = {x_{i,t,a}}, i ∈ Inst, t ∈ Time, a ∈ Attr

where x_{i,t,a} is a value of basic type (e.g., float, int); Inst denotes the set of financial instruments (e.g., stocks, options, etc.); Time denotes the set of timestamps (e.g., the trading days of a stock market); Attr denotes the set of possible attributes of an instrument (e.g., open price, volume, market value); and T denotes the latest timestamp of the data (e.g., the latest trading date). x_{i,t,a} is the value of attribute a of instrument i at time t.

Besides, instrument pools are necessary information to specify a set of financial instruments that changes over time:

Pool_T = {pool_t}, t ∈ Time, pool_t ⊆ Inst
The S&P 500 Index (https://en.wikipedia.org/wiki/S%26P_500_Index) is a typical example of a Pool.

Data update is an essential feature. Existing historical data do not change over time; only appending new data is necessary. The formalized update operations are

BasicData_T = OldBasicData_T ∪ {x_{i,t,a}^{new}}
BasicData_{T+1} = BasicData_T ∪ {x_{i,T+1,a}}
Pool_{T+1} = Pool_T ∪ {pool_{T+1}}

User queries can be formalized as
Query = {x_{i,t,a} | i ∈ pool_t, pool_t ∈ Pool_query, a ∈ Attr_query, time_start ≤ t ≤ time_end}

which represents a query for some attributes of the instruments in a specific pool over a specific time range.

Such requirements are quite simple, and many off-the-shelf open-source solutions support such operations. We classify them into three categories and list popular implementations in each:

• General-purpose databases: MySQL [MySQL, 2001], MongoDB [Chodorow, 2013]
• Time-series databases: InfluxDB [Naqvi et al., 2017]
• Data files for scientific computing: data organized as numpy [Oliphant, 2006] arrays or pandas [McKinney, 2011] dataframes

General-purpose databases support data with diverse formats and structures. Besides, they provide lots of sophisticated mechanisms, such as indexing, transactions, the entity-relationship model, etc. Most of these add heavy dependencies and unnecessary complexity for a specific task rather than solving the key problems of a specific scenario. Time-series databases optimize the data structures and queries for time-series data, but they are still not designed for quantitative research, where the data are usually kept in a compact array-based format for scientific computation to take advantage of hardware acceleration. It saves a great amount of time if the data keep this compact array-based format from the disk all the way to the end clients without format transformation. However, both general-purpose and time-series databases store and transfer the data in formats designed for general purposes, which is inefficient for scientific computation.

Due to the inefficiency of databases, array-based data have gained popularity in the scientific community. Numpy arrays and pandas dataframes are the mainstream implementations in scientific computation; they are often stored as HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format) or pickle (https://docs.python.org/3/library/pickle.html) files on disk. Data in such formats have light dependencies and are very efficient for scientific computing.
However, such data are stored in a single file and are hard to update or query. After investigating the above storage solutions, we find that none fits the quantitative research scenario very well. It is necessary to design a customized solution for quantitative research.

File storage design
Figure 2 demonstrates the file storage design. As shown in the left part of the figure, Qlib organizes files in a tree structure. Data are separated into folders and files according to frequencies, instruments, and attributes.

Figure 2: The description of the flat-file database; the left part is the structure of the files; the right part is the content of the files

All attribute values are stored as binary data in a compact fixed-width format so that indexing by bytes becomes possible. The shared timeline is stored separately in a file named "calendar.txt". The data file of attribute values sets its first 4 bytes to the index value on the timeline to indicate the start timestamp of the series of data. With the start time index, Qlib can align all the values on the time dimension.

The data are stored in a compact format that is efficient to combine into arrays for scientific computation. While it achieves performance comparable to array-based data in scientific computation, it also meets the data update requirements of the quantitative investment scenario. All data are arranged in time order, so new data can be updated by appending, which is quite efficient. Adding and removing attributes or instruments is also straightforward and efficient, because they are stored in separate files. Such a design is extremely lightweight; without the overheads of databases, Qlib achieves high performance.
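Based on this description, the on-disk layout can be mimicked in a few lines of numpy: a leading 4-byte value holding the calendar index of the first observation, followed by fixed-width float values in time order. The exact dtype and layout here are our reading of the paper, not a specification of Qlib's actual format.

```python
import os
import tempfile
import numpy as np

def write_feature_bin(path, start_index, values):
    # First 4 bytes: index into the shared calendar ("calendar.txt");
    # rest: one little-endian float32 per trading day, in time order.
    data = np.concatenate(([start_index], values)).astype("<f4")
    data.tofile(path)

def read_feature_bin(path):
    raw = np.fromfile(path, dtype="<f4")
    return int(raw[0]), raw[1:]

path = os.path.join(tempfile.mkdtemp(), "open.bin")
write_feature_bin(path, 42, [10.0, 10.5, 11.0])
start, values = read_feature_bin(path)  # appending new days = appending floats
```

Because every record has a fixed width and a known start index, a reader can seek directly to the bytes for any time range without scanning the file.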
Expression Engine
It is quite a common task to develop new factors/features based on basic data, and such tasks take a large proportion of many quantitative researchers' time. Both implementing such factors in code and computing them are time-consuming. Therefore, Qlib provides an expression engine to minimize the effort of such tasks.

A factor/feature is in essence a function that transforms basic data into target values, and this function can be broken down into a combination of a series of expressions. The expression engine is designed based on this idea. With the expression engine, quantitative researchers can implement new factors/features by writing expressions instead of complicated code. For example, the Bollinger Bands technical indicator [Bollinger, 2002] is a widely used technical factor, and its upper bound can be implemented with just the simple expression "(MEAN($close, N)+2*STD($close, N)-$close)/MEAN($close, N)". Such an implementation is simple, readable, reusable, and maintainable. Users can easily build a dataset with just a series of simple expressions. Searching expressions to construct effective trading signals is a typical research topic which has been explored by many researchers [Allen and Karjalainen, 1999; Neely et al., 1997; Potvin et al., 2004]; an expression engine is an essential tool for such research.
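For reference, the expression above corresponds to the following hand-written pandas computation, which is what the expression engine spares users from writing and maintaining. This is a sketch: we assume the MEAN/STD operator semantics match pandas rolling defaults (in particular the sample standard deviation), which the paper does not specify.

```python
import pandas as pd

def bollinger_upper_signal(close: pd.Series, n: int) -> pd.Series:
    # Hand-written equivalent of the expression
    # (MEAN($close, N) + 2*STD($close, N) - $close) / MEAN($close, N)
    mean = close.rolling(n).mean()
    std = close.rolling(n).std()
    return (mean + 2 * std - close) / mean

close = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0, 11.5])
signal = bollinger_upper_signal(close, n=3)  # first n-1 entries are NaN
```

With the expression engine, this whole function collapses into the one-line string above, and the engine can parse, cache, and parallelize its evaluation across instruments.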
Cache system

Figure 3: The disk cache system of Qlib; the expression cache saves expression computation time; the dataset cache saves data combination time
To avoid replicated computation, Qlib has a built-in cache system consisting of a memory cache and a disk cache.
In-memory cache
When Qlib computes factors/features with its expression engine, it parses each expression into a syntax tree. The computed results of all nodes are stored in an LRU (Least Recently Used) cache in memory, so replicated computation of the same (sub-)expressions can be avoided.
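The effect of such memoization can be demonstrated with Python's `functools.lru_cache` standing in for the per-node result cache; the call counter is only there to show that the second evaluation never recomputes.

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=128)
def mean_expr(prices: tuple) -> float:
    # Stand-in for evaluating one syntax-tree node, e.g. MEAN($close, N).
    calls["n"] += 1
    return sum(prices) / len(prices)

prices = (10.0, 11.0, 12.0)
a = mean_expr(prices)
b = mean_expr(prices)  # served from the LRU cache; not recomputed
```

In the real engine the cache key is the (sub-)expression together with its inputs, so any two features sharing a sub-expression share its computation.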
Disk cache
A typical data processing workflow in quantitative investment can be divided into three steps: fetching the original data, computing expressions, and combining the data into arrays for scientific computation. Computing expressions and combining data are very time-consuming, and much time can be saved by caching shared intermediate data. In practical data processing tasks, many intermediate results can be shared; for example, the same expression computation can be shared by different data processing tasks. Therefore, Qlib designed a 2-level disk cache mechanism, shown in Figure 3. The left part is the original data described in Section 3.3. The first level is the expression cache, which saves all computed expressions to the disk cache; its data structure is the same as that of the original data. With the expression cache, the same expression is computed only once. After the expression cache comes the dataset cache, which stores the combined data to save combination time. The cached data of both levels are arranged by time and indexable on the time dimension, so the disk cache can be shared even when the query time range changes. Moreover, Qlib supports data updates by appending new data thanks to this time-ordered arrangement, which makes data maintenance much easier.
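A drastically simplified sketch of the disk-cache idea follows: computed results are written to disk once, keyed by a canonical expression string, and later queries read them back instead of recomputing. The real Qlib cache additionally stores time-indexed binary data and supports appends; the file naming and pickle format here are invented for illustration.

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def _cache_path(key: str) -> str:
    # One file per cached object, named by a hash of the canonical key.
    return os.path.join(CACHE_DIR, hashlib.md5(key.encode()).hexdigest() + ".pkl")

def cached(key, compute):
    """Return the cached value for `key`, computing and storing it on a miss.
    The same helper serves both cache levels: expression results and
    combined datasets, each keyed by a canonical string."""
    path = _cache_path(key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = compute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value

hits = {"n": 0}
def compute_expr():
    hits["n"] += 1
    return [1.0, 2.0, 3.0]

first = cached("Mean($close,5)", compute_expr)
second = cached("Mean($close,5)", compute_expr)  # read from disk, not recomputed
```

Keeping the cached values time-indexed, as Qlib does, is what lets a single cache entry serve queries over different time ranges and be extended by appending new days.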
As we discussed in Section 2, guidance for machine learning algorithms is very important. Qlib provides typical datasets for machine learning algorithms, along with typical task settings such as data pre-processing and learning targets, so researchers don't have to explore everything from scratch. Such guidance provides lots of domain knowledge for researchers starting their journey in this research area.

Table 1: Performance comparison of different storage solutions

For most machine learning algorithms, hyperparameter optimization is a necessary step toward better generalization. Although it is important, it takes a lot of effort and is quite repetitive. Therefore, Qlib provides a
Hyperparameters Tuning Engine (HTE) to make this task easier. HTE provides an interface to define a hyperparameter search space Θ and then searches for the best hyperparameters θ automatically. In a typical financial task of modeling time-series data, new data arrive in sequence over time. To leverage the new data, models have to be re-trained periodically. The new best hyperparameters θ change, but are often close to the previous best hyperparameters. HTE provides a mechanism dedicated to hyperparameter optimization for financial tasks: it generates a new distribution over the hyperparameter search space for a better chance of reaching the best point with fewer trials. The distribution for searching θ can be formalized as

p_new(x) = p_prior(x) · φ_{θ_prev, σ}(x) / E_{x ∼ p_prior}[φ_{θ_prev, σ}(x)]

where p_prior is the original hyperparameter search space; φ_{θ_prev, σ}(x) is the density of N(θ_prev, σ); and θ_prev is the best hyperparameter from the last model training. The domain of the hyperparameter search space remains the same, but the probability density around θ_prev increases. Qlib provides a
Config-Driven Pipeline Engine (CDPE) to help researchers more easily build the whole research workflow shown in Figure 1. The user can define a workflow with just a simple config file like the one in Figure 4 (some trivial details are replaced by "..."). Such an interface is not mandatory; users retain maximal flexibility to build a quantitative research workflow in code, like building blocks.

The performance of data processing is important to data-driven methods such as AI technologies. As an AI-oriented platform, Qlib provides a solution for data storage and processing. To demonstrate the performance of Qlib, we compare it with the other solutions discussed in Section 3.3:
HDF5, MySQL, MongoDB, InfluxDB, and Qlib under different cache settings. Qlib +E -D denotes Qlib with the expression cache enabled and the dataset cache disabled, and so forth.
Figure 4: A Configuration example of CDPE
The task for each solution is to create a dataset from the basic OHLCV (the open, high, low, close prices and trading volume of a stock) daily data of a stock market, which involves data query and processing. The final dataset consists of 14 factors/features derived from the OHLCV data (e.g., "Std($close, 5)/$close"). The data range from 1/1/2007 to 1/1/2020. The stock pool consists of 800 stocks each day and changes daily.

Besides comparing the total time of each solution, we break down the task into the following steps for more detail:

• Load Data: load the OHLCV data or cache into RAM in the array-based format for scientific computation.
• Compute Expr.: compute the derived factors/features.
• Convert Index: applies only to Qlib. Because Qlib doesn't store the indices (i.e., timestamp, stock id) in the original data, it has to set up the data indices.
• Filter Data: filter the stock data by a specific pool. For example, SP500 involves more than 1 thousand stocks in total but includes only 500 stocks on any given day. Data not included in SP500 on a specific day should be filtered out, even if the stock has ever been in SP500. It is impossible to filter data while loading, because some derived features rely on historical OHLCV data.
• Combine Data: concatenate all the data of different stocks into a single piece of array-based data.

As we can see in Table 1, Qlib's compact storage achieves similar size and loading speed to the dedicated scientific HDF5 data file. The databases take too much time loading data; after looking into the underlying implementations, we find that the data go through too many layers of interfaces and unnecessary format transformations in both the general-purpose and time-series database solutions, which greatly slows down the loading process. Due to the memory cache of Qlib, Qlib -E -D saves about 24% of the Compute Expr. time. Moreover, Qlib provides the expression cache and dataset cache mechanisms. With the expression cache enabled in Qlib +E -D, 80.4% of the Compute Expr. time is saved when no expression cache is missed. Combining the factors/features into one piece of array-based data for each stock accounts for the majority of the remaining time of Qlib +E -D, which is included in the Compute Expr. step. Beyond the computation cost, the most time-consuming step is data combination; the dataset cache is designed to reduce this overhead. As shown in the Qlib +E +D column, the time cost is further reduced.

Moreover, Qlib can leverage multiple CPU cores to accelerate computation. As we can see in the last line of Table 1, the time cost is significantly reduced for Qlib with multiple CPUs. Qlib +E +D can't be accelerated further because it just reads the existing cache and computes almost nothing.
Qlib is an open-source platform under continuous development; more detailed documentation can be found in its GitHub repository (https://github.com/microsoft/qlib/). Many features not introduced in detail in this paper (e.g., a data service with a client-server architecture, an analysis system, and automatic deployment on the cloud) are also covered in the online repository. Contributions are welcome.
In this paper, we present the practical problems of modern quantitative researchers in the age of AI. Based on these practical problems, we design and implement Qlib, which aims to empower every quantitative researcher to realize the great potential of AI technologies in quantitative investment.