Qlib: An AI-oriented Quantitative Investment Platform
Xiao Yang, Weiqing Liu, Dong Zhou, Jiang Bian and Tie-Yan Liu
Microsoft Research
{Xiao.Yang, Weiqing.Liu, Zhou.Dong, Jiang.Bian, Tie-Yan.Liu}@microsoft.com

Abstract
Quantitative investment aims to maximize return and minimize risk in a sequential trading period over a set of financial instruments. Recently, inspired by the rapid development and great potential of AI technologies in generating remarkable innovation in quantitative investment, there has been increasing adoption of AI-driven workflows for quantitative research and practical investment. While enriching the quantitative investment methodology, AI technologies have also raised new challenges for the quantitative investment system. In particular, the new learning paradigms for quantitative investment call for an infrastructure upgrade to accommodate the renovated workflow; moreover, the data-driven nature of AI technologies demands an infrastructure with more powerful performance; additionally, there exist unique challenges in applying AI technologies to the different tasks that arise in financial scenarios. To address these challenges and bridge the gap between AI technologies and quantitative investment, we design and develop Qlib, which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment.
Introduction

Quantitative investment, one of the hottest research fields, has been attracting numerous brilliant minds from both academia and the financial industry. In the last decades, with continuous efforts in optimizing the quantitative methodology, the community of professional investors has converged on a well-established yet imperfect quantitative research workflow. Recently, emerging AI technologies have started a new trend in this research field. With increasing attention to exploring AI's great potential in quantitative investment, AI technologies have been widely adopted in practical investment by quantitative researchers.

While AI technologies have been enriching the quantitative investment methodology, they also pose new challenges to the quantitative investment system from multiple perspectives. First, the technological revolution in the quantitative investment workflow, driven by the flexibility of AI technologies, requires new supportive infrastructure. For example, while traditional quantitative investment usually splits the whole workflow into a couple of sub-tasks, including stock trend prediction, portfolio optimization, etc., AI technologies make it possible to establish an end-to-end solution that generates the final portfolio directly. Supporting such an end-to-end solution requires upgrading the current infrastructure due to its data-driven nature.

Meanwhile, AI technologies have to deal with unique problems in new scenarios, which require both plenty of domain knowledge in finance and rich experience in data science. Applying off-the-shelf solutions to quantitative research tasks without any domain adaptation rarely works. Such circumstances lead to urgent demand for a platform that accommodates the modern quantitative research workflow in the age of AI and provides guidance for applying AI technologies in financial scenarios.

Therefore, we propose a new AI-oriented Quantitative Investment Platform called Qlib.
It aims to assist the research efforts of exploring the great potential of AI technologies in quantitative investment, as well as to empower quantitative researchers to create more significant value with AI-driven quantitative investment. Specifically, the AI-oriented framework of Qlib is designed to accommodate AI-based solutions. Moreover, it provides a high-performance infrastructure dedicated to the quantitative investment scenario, which makes many AI research topics possible. In addition, a batch of tools designed for machine learning in the quantitative investment scenario is integrated into Qlib to help users make full use of AI technologies.

At last, we demonstrate some use cases and evaluate the performance of the infrastructure of Qlib by comparing several solutions for a typical task in quantitative investment. The results show that the infrastructure of Qlib, dedicated to quantitative investment, outperforms most existing solutions on this task. The code is available at https://github.com/microsoft/qlib.

In this section, we will first demonstrate the major practical problems a modern quantitative researcher faces when applying AI technologies in quantitative investment, which motivate the birth of Qlib. After that, we will briefly introduce the related work.

Quantitative research workflow revolution
In the traditional investment research workflow, researchers often develop trading signals with linear models [Petkova, 2006] or manually designed rules [Murphy, 1999] based on several factors (factors are similar to features in machine learning) and basic financial data. Then, a trading strategy (typically Barra [Sheikh, 1996]) is followed to generate the target portfolio. At last, researchers evaluate the trading signal and portfolio with a back-testing function.

The rise of AI technologies has launched a technological revolution in traditional quantitative investment, and the traditional quantitative research workflow is too primitive to accommodate such flexible technologies. To show the difference more intuitively, we demonstrate a typical modern research workflow based on AI technologies. It starts with a dataset with many features (typically more than hundreds of dimensions). Manually designing such an amount of features takes lots of time, so it is common to leverage machine learning algorithms to generate them automatically [Potvin et al., 2004; Neely et al., 1997; Allen and Karjalainen, 1999; Kakushadze, 2016]. Generating data [Feng et al., 2019] is another option for constructing a dataset. Based on diverse datasets, researchers have proposed hundreds of machine learning methods to mine trading signals [Sezer et al., 2019], from which the target portfolio can be generated. But such a workflow is not the only choice. Instead of dividing a task into several stages, RL (reinforcement learning) provides an end-to-end solution from the data to the final trading actions directly [Deng et al., 2016]. RL optimizes the trading strategy by interacting with the environment, which is a trading simulator in the financial scenario.
RL needs a responsive simulator rather than the back-testing function of the traditional research workflow. Moreover, most AI algorithms have complicated hyperparameters, which need to be tuned carefully. AI technologies are flexible and already beyond the scope of existing tools designed for traditional methodologies, and building a research workflow based on AI technologies from scratch takes much time.
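To make the distinction concrete, a responsive simulator exposes a step-by-step interface that an RL agent can interact with at every decision point, rather than replaying a finished strategy in one pass. Below is a minimal sketch of such an interface; the class name, reward convention, and position semantics are illustrative assumptions, not Qlib's actual API.

```python
# Minimal sketch of a "responsive" trading simulator: each call to
# step() advances one period and returns feedback, the interaction
# pattern RL needs and a one-shot back-testing function cannot offer.
class TradingSimulator:
    def __init__(self, prices):
        self.prices = prices
        self.t = 0
        self.position = 0.0  # current holding, in [-1, 1]
        self.pnl = 0.0

    def reset(self):
        self.t, self.position, self.pnl = 0, 0.0, 0.0
        return self.prices[0]

    def step(self, action):
        """action: target position in [-1, 1]; returns (obs, reward, done)."""
        price_change = self.prices[self.t + 1] - self.prices[self.t]
        reward = self.position * price_change  # P&L of the held position
        self.pnl += reward
        self.position = action
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self.prices[self.t], reward, done

sim = TradingSimulator([100.0, 101.0, 100.5])
obs = sim.reset()
obs, r, done = sim.step(1.0)   # go long
obs, r2, done = sim.step(0.0)  # close out
```

An RL agent would call `reset()` and `step()` in a loop, using the returned reward as its learning signal.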
High performance requirements for infrastructure
With the emergence of AI technologies, the requirements for infrastructure have changed. Such data-driven methods can leverage a huge amount of data; the amount can reach TB magnitude in the scenario of high-frequency trading. Besides, it is very common to derive thousands of new features (e.g., Alpha101 [Kakushadze, 2016]) from the basic price and volume data, which consist of only five dimensions in total. Some researchers even try to create new factors or features by searching expressions [Allen and Karjalainen, 1999; Neely et al., 1997; Potvin et al., 2004]. Such heavy data processing overburdens researchers and even makes some research topics impossible. These circumstances put forward more stringent performance requirements for the infrastructure.
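To get a feel for how quickly derived features multiply from the five basic columns, consider even the most naive pairwise transformations; the specific ratio features below are made up for illustration and are far simpler than Alpha101-style expressions.

```python
import pandas as pd

# Illustrative only: deriving a batch of features from the five basic
# OHLCV columns. Real factor libraries derive thousands of such columns.
ohlcv = pd.DataFrame({
    "open":   [10.0, 10.5, 10.2],
    "high":   [10.8, 10.9, 10.6],
    "low":    [ 9.9, 10.3, 10.0],
    "close":  [10.5, 10.4, 10.5],
    "volume": [1000, 1200,  900],
})

features = {}
for a in ["open", "high", "low", "close"]:
    for b in ["open", "high", "low", "close"]:
        if a != b:
            # 12 pairwise ratio features from just 4 price columns
            features[f"{a}_over_{b}"] = ohlcv[a] / ohlcv[b]
derived = pd.DataFrame(features)
```

Scaling this from 12 columns to thousands, over years of daily (or intraday) data for thousands of instruments, is what drives the performance requirements described above.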
Obstacles to applying machine learning solutions
Financial data and tasks have their own uniqueness and challenges. Applying machine learning solutions to quantitative research tasks without any adaptation rarely works. Due to the extremely low SNR (signal-to-noise ratio) in financial data, it is very hard to build a successful data-driven strategy in financial markets. Most machine learning algorithms are data-driven and have to deal with this difficulty. Without carefully handling the details, machine learning models can hardly achieve satisfying performance; even a minor mistake can make the model over-fit the noise rather than learn effective patterns. Handling the details correctly requires a lot of domain knowledge of the financial industry. Moreover, typical objectives, such as annualized return, are often not differentiable, which makes it hard to train models on them directly. Defining a reasonable task with appropriate supervised targets is very important for modeling financial data. Such barriers daunt quite a lot of data scientists without much domain knowledge of the financial industry.

Another necessary step in building a machine learning application is hyperparameter optimization. Different machine learning algorithms have different hyperparameter search spaces, each of which has multiple dimensions with different meanings and priorities. Some quantitative researchers come from the traditional financial industry and don't have much knowledge of machine learning. Such a steep learning curve stops many users from extracting the maximum value from machine learning.
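As a concrete example of such a supervised target, one common convention is to turn the non-differentiable return objective into a regression label: predict the next period's return computed from close prices. This is an illustrative choice on our part, not necessarily the label definition shipped with Qlib.

```python
import pandas as pd

# Next-period return as a regression label. The last row has no future
# price, so its label is NaN and is dropped before training.
close = pd.Series([10.0, 10.5, 10.2, 10.8])
label = close.shift(-1) / close - 1  # return over the following period
train_labels = label.dropna()
```

A model trained to predict this label can then be evaluated against the true objectives (annualized return, risk) in back-testing, even though those objectives are not trained on directly.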
In the financial industry, an investment strategy becomes less profitable as more investors follow it. Therefore, financial practitioners, especially quantitative researchers, are never keen to share their own algorithms and tools. OLPS [Li et al., 2016] is the first open-source toolbox for portfolio selection. It consists of a family of classical strategies powered by machine learning algorithms as benchmarks, and a toolkit to facilitate the development of new learning methods. This toolbox only supports Matlab and Octave, which are not compatible with Python, the current mainstream scientific language, and thus not friendly to modern machine learning algorithms. Its framework is also quite simple, while the modern quantitative research workflow based on AI technologies is much more complicated. Other quantitative tools have emerged in recent years. QuantLib [Firth, 2004] only covers part of the modern quantitative research workflow. QUANTAXIS (https://github.com/QUANTAXIS/QUANTAXIS) focuses more on the IT infrastructure than on the research workflow. Quantopian has released a series of open-source tools: 1) Alphalens, a Python library for performance analysis of predictive (alpha) stock factors; 2) Zipline, an event-driven system for back-testing; 3) Pyfolio, a Python library for performance and risk analysis of financial portfolios. All of them only focus on the analysis of trading signals or an investment portfolio.

Overall, Qlib is the first open-source platform that accommodates the workflow of a modern quantitative researcher in the age of AI. It aims to empower every quantitative researcher to realize the great potential of AI technologies in quantitative investment.

In cooperation with quantitative researchers with years of hands-on experience in the financial market, we have encountered all of the above problems and explored all kinds of solutions. Motivated by these circumstances, we implement Qlib to apply AI technologies in quantitative investment.
AI-oriented framework
Qlib is designed in a modularized way based on the modern research workflow to provide maximum flexibility for accommodating AI technologies. Quantitative researchers can extend the modules and build a workflow to try their ideas efficiently. In each module, Qlib provides several default implementations that work well in practical investment. With these off-the-shelf modules, quantitative researchers can focus on the problem they are interested in within a specific module without being distracted by other trivial details. Besides code, computation and data can also be shared in some modules, so Qlib is designed to serve users as a platform rather than a toolbox.
High-performance infrastructure
The performance of data processing is important to data-driven methods such as AI technologies. As an AI-oriented platform, Qlib provides a high-performance data infrastructure, centered on a time-series flat-file database (https://en.wikipedia.org/wiki/Flat-file_database). Such a database is dedicated to scientific computing on financial data. It greatly outperforms current popular storage solutions, such as general-purpose databases and time-series databases, on typical data processing tasks in quantitative investment research. Furthermore, the database provides an expression engine, which accelerates the implementation and computation of factors/features and makes research topics that rely on expression computation possible.

Guidance for machine learning
Qlib is integrated with typical datasets for quantitative investment, on which typical machine learning algorithms can successfully learn patterns with generalization ability. Qlib provides basic guidance for machine learning users and integrates reasonable task definitions that consist of sensible feature spaces and target labels. Typical hyperparameter optimization tools are also provided. With this guidance and reasonable settings, machine learning models can learn patterns with better generalization ability instead of just over-fitting the noise.
Figure 1 shows the overall framework of Qlib. This framework aims to 1) accommodate modern AI technology, 2) help quantitative researchers build a whole research workflow with minimal effort, and 3) leave them maximal flexibility to explore the problems they are interested in without getting distracted by other parts.

Figure 1: modules and a typical workflow built with Qlib

Such a target leads to a modularized design from the perspective of system design. The system is split into several individual modules based on the modern practical research workflow. Most quantitative investment research directions, whether traditional or AI-based, can be regarded as implementations of one or multiple modules' interfaces. In each module, Qlib provides several typical implementations that work well in practical investment. Moreover, the modules provide the flexibility for researchers to override existing methods to explore new ideas. With such a framework, researchers can try new ideas and test the overall performance together with the other modules at minimal cost.

The modules of Qlib are listed in Figure 1 and connected in a typical workflow. Each module corresponds to a typical sub-task in quantitative investment, and an implementation in a module can be regarded as a solution for that task. We will introduce each module and give related examples of existing quantitative research to show how Qlib accommodates them.

It starts with the
Data Server module in the bottom-left corner, which provides a data engine to query and process raw data. With the retrieved data, researchers can build their own datasets in the Data Enhancement module. Researchers have tried many solutions for building better datasets by exploring and constructing effective factors/features [Potvin et al., 2004; Neely et al., 1997; Allen and Karjalainen, 1999; Kakushadze, 2016]; generating datasets for training [Feng et al., 2019] is another research direction for providing dataset solutions. The Model Creator module learns models based on the datasets. In recent years, numerous researchers have explored all kinds of models to mine trading signals from financial datasets [Sezer et al., 2019]. Moreover, meta-learning [Vilalta and Drissi, 2002], which tries to learn to learn, provides a new learning paradigm for the Model Creator module. Given the abundance of methods for modeling financial data in a modern research workflow, a model management system has become a necessary part of the workflow; the Model Manager module is designed to handle this for modern quantitative researchers. With diverse models available, ensemble learning is quite an effective way to enhance the performance and robustness of machine learning models, and it is frequently used in the financial area [Qiu et al., 2014; Yang et al., 2017; Zhao et al., 2017]; it is supported by the Model Ensemble module. The Portfolio Generator module aims to generate a portfolio from the trading signals output by the models, which is known as portfolio management [Qian et al., 2007]; Barra [Sheikh, 1996] provides the most popular solution for this task. With the target portfolio, we provide a high-fidelity trading simulator, the Order Executor module, to examine the performance of a strategy, and Analyser modules to automatically analyze the trading signals, portfolio, and execution results. The Order Executor module is designed as a responsive simulator rather than a back-testing function, which provides the infrastructure for learning paradigms (e.g., RL) that require environment feedback, produced by the Analyser modules.

The data in quantitative investment are in time-series format and updated over time, so the size of the in-sample dataset grows over time. A typical practice to leverage the new data is to update models regularly [Wang et al., 2019b]. Beyond better utilization of the growing in-sample data, dynamically updating models [Yang et al., 2019] and trading strategies [Wang et al., 2019a] can improve performance further due to the dynamic nature of the stock market [Adam et al., 2016]. Therefore, a static set of models and trading strategies, as in the Static Workflow, is obviously not the optimal solution. Dynamic updating of models and strategies is an important research direction in quantitative investment, and the modules in Dynamic Modeling provide interfaces and infrastructure to accommodate such solutions.
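The regular-update practice above can be sketched as re-training on an expanding in-sample window. This is a toy illustration of the scheme, not Qlib's interface; `train_fn` stands in for any model-fitting routine.

```python
# Periodic re-training on an expanding in-sample window: each step adds
# the newly arrived data and fits a fresh model on everything seen so far.
def rolling_retrain(data, train_fn, window=3):
    models = []
    for end in range(window, len(data) + 1):
        models.append(train_fn(data[:end]))  # expanding in-sample set
    return models

# Toy "model": the running mean of the observed values.
means = rolling_retrain([1, 2, 3, 4, 5], train_fn=lambda d: sum(d) / len(d))
```

A dynamic-modeling variant would additionally adapt the model or strategy itself between refits, rather than only growing the training set.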
Financial data
We summarize the data requirements of quantitative research in this section. In quantitative research, the most frequently-used data follow the format
BasicData_T = {x_{i,t,a}}, i ∈ Inst, t ∈ Time, a ∈ Attr

where x_{i,t,a} is a value of basic type (e.g., float, int); Inst denotes the set of financial instruments (e.g., stocks, options, etc.); Time denotes the set of timestamps (e.g., the trading days of a stock market); Attr denotes the set of possible attributes of an instrument (e.g., open price, volume, market value); and T denotes the latest timestamp of the data (e.g., the latest trading date). x_{i,t,a} is the value of attribute a of instrument i at time t.

Besides, instrument pools are necessary information to specify a set of financial instruments that changes over time:

Pool_T = {pool_t}, t ∈ Time, pool_t ⊆ Inst
The S&P 500 Index (https://en.wikipedia.org/wiki/S%26P_500_Index) is a typical example of a Pool.

Data update is an essential feature. Existing historical data do not change over time; only appending new data is necessary. The formalized update operations are

BasicData_T = OldBasicData_T ∪ {x_{i,t,a}^{new}}
BasicData_{T+1} = BasicData_T ∪ {x_{i,T+1,a}}
Pool_{T+1} = Pool_T ∪ {pool_{T+1}}

User queries can be formalized as
Query = {x_{i,t,a} | i ∈ pool_t, pool_t ∈ Pool_query, a ∈ Attr_query, time_start ≤ t ≤ time_end}

which represents a query for some attributes of the instruments in a specific pool over a specific time range.

Such requirements are quite simple, and many off-the-shelf open-source solutions support such operations. We classify them into three categories and list popular implementations in each:

• General-purpose databases: MySQL [MySQL, 2001], MongoDB [Chodorow, 2013]
• Time-series databases: InfluxDB [Naqvi et al., 2017]
• Data files for scientific computing: data organized as numpy [Oliphant, 2006] arrays or pandas [McKinney, 2011] dataframes

General-purpose databases support data with diverse formats and structures. Besides, they provide lots of sophisticated mechanisms, such as indexing, transactions, the entity-relationship model, etc. Most of these add heavy dependencies and unnecessary complexity for a specific task rather than solving the key problems of a specific scenario. Time-series databases optimize the data structures and queries for time-series data, but they are still not designed for quantitative research, where the data are usually kept in a compact array-based format for scientific computation to take advantage of hardware acceleration. It saves a great amount of time if the data keep this compact array-based format from the disk all the way to the end clients without format transformation. However, both general-purpose and time-series databases store and transfer the data in formats designed for general purposes, which is inefficient for scientific computation.

Due to the inefficiency of databases, array-based data have gained popularity in the scientific community. Numpy arrays and pandas dataframes are the mainstream implementations in scientific computation; they are often stored as HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format) or pickle (https://docs.python.org/3/library/pickle.html) files on disk. Data in such formats have light dependencies and are very efficient for scientific computing.
However, such data are stored in a single file and are hard to update or query. After investigating the above storage solutions, we find that none fits the quantitative research scenario very well. It is necessary to design a customized solution for quantitative research.

File storage design
Figure 2 demonstrates the file storage design. As shown in the left part of the figure, Qlib organizes files in a tree structure. Data are separated into folders and files according to frequencies, instruments, and attributes.

Figure 2: The description of the flat-file database; the left part is the structure of the files; the right part is the content of the files

All attribute values are stored as binary data in a compact fixed-width format so that indexing by bytes becomes possible. The shared timeline is stored separately in a file named "calendar.txt". The data file of attribute values sets its first 4 bytes to the index value on the timeline to indicate the start timestamp of the series of data. With the start time index, Qlib can align all the values on the time dimension.

The data are stored in a compact format that is efficient to combine into arrays for scientific computation. While it achieves performance comparable to array-based data in scientific computation, it also meets the data update requirements of the quantitative investment scenario. All data are arranged in time order, so new data can be updated by appending, which is quite efficient. Adding and removing attributes or instruments is also straightforward and efficient, because they are stored in separate files. Such a design is extremely lightweight; without the overheads of databases, Qlib achieves high performance.
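Based on this description, the on-disk layout can be mimicked in a few lines of numpy: a leading 4-byte value holding the calendar index of the first observation, followed by fixed-width float values in time order. The exact dtype and layout here are our reading of the paper, not a specification of Qlib's actual format.

```python
import os
import tempfile
import numpy as np

def write_feature_bin(path, start_index, values):
    # First 4 bytes: index into the shared calendar ("calendar.txt");
    # rest: one little-endian float32 per trading day, in time order.
    data = np.concatenate(([start_index], values)).astype("<f4")
    data.tofile(path)

def read_feature_bin(path):
    raw = np.fromfile(path, dtype="<f4")
    return int(raw[0]), raw[1:]

path = os.path.join(tempfile.mkdtemp(), "open.bin")
write_feature_bin(path, 42, [10.0, 10.5, 11.0])
start, values = read_feature_bin(path)  # appending new days = appending floats
```

Because every record has a fixed width and a known start index, a reader can seek directly to the bytes for any time range without scanning the file.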
Expression Engine
It is quite a common task to develop new factors/features based on basic data, and such tasks take a large proportion of many quantitative researchers' time. Both implementing such factors in code and computing them are time-consuming. Therefore, Qlib provides an expression engine to minimize the effort of such tasks.

A factor/feature is in essence a function that transforms basic data into target values, and this function can be broken down into a combination of a series of expressions. The expression engine is designed based on this idea. With the expression engine, quantitative researchers can implement new factors/features by writing expressions instead of complicated code. For example, the Bollinger Bands technical indicator [Bollinger, 2002] is a widely used technical factor, and its upper bound can be implemented with just the simple expression "(MEAN($close, N)+2*STD($close, N)-$close)/MEAN($close, N)". Such an implementation is simple, readable, reusable, and maintainable. Users can easily build a dataset with just a series of simple expressions. Searching expressions to construct effective trading signals is a typical research topic which has been explored by many researchers [Allen and Karjalainen, 1999; Neely et al., 1997; Potvin et al., 2004]; an expression engine is an essential tool for such research.
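For reference, the expression above corresponds to the following hand-written pandas computation, which is what the expression engine spares users from writing and maintaining. This is a sketch: we assume the MEAN/STD operator semantics match pandas rolling defaults (in particular the sample standard deviation), which the paper does not specify.

```python
import pandas as pd

def bollinger_upper_signal(close: pd.Series, n: int) -> pd.Series:
    # Hand-written equivalent of the expression
    # (MEAN($close, N) + 2*STD($close, N) - $close) / MEAN($close, N)
    mean = close.rolling(n).mean()
    std = close.rolling(n).std()
    return (mean + 2 * std - close) / mean

close = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0, 11.5])
signal = bollinger_upper_signal(close, n=3)  # first n-1 entries are NaN
```

With the expression engine, this whole function collapses into the one-line string above, and the engine can parse, cache, and parallelize its evaluation across instruments.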
Cache system

Figure 3: The disk cache system of Qlib; the expression cache saves expression computation time; the dataset cache saves data combination time
To avoid replicated computation, Qlib has a built-in cache system consisting of a memory cache and a disk cache.
In-memory cache
When Qlib computes factors/features with its expression engine, it parses each expression into a syntax tree. The computed results of all nodes are stored in an LRU (Least Recently Used) cache in memory, so replicated computation of the same (sub-)expressions can be avoided.
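The effect of such memoization can be demonstrated with Python's `functools.lru_cache` standing in for the per-node result cache; the call counter is only there to show that the second evaluation never recomputes.

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=128)
def mean_expr(prices: tuple) -> float:
    # Stand-in for evaluating one syntax-tree node, e.g. MEAN($close, N).
    calls["n"] += 1
    return sum(prices) / len(prices)

prices = (10.0, 11.0, 12.0)
a = mean_expr(prices)
b = mean_expr(prices)  # served from the LRU cache; not recomputed
```

In the real engine the cache key is the (sub-)expression together with its inputs, so any two features sharing a sub-expression share its computation.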
Disk cache
A typical data processing workflow in quantitative investment can be divided into three steps: fetching the original data, computing expressions, and combining the data into arrays for scientific computation. Computing expressions and combining data are very time-consuming, and much time can be saved by caching shared intermediate data. In practical data processing tasks, many intermediate results can be shared; for example, the same expression computation can be shared by different data processing tasks. Therefore, Qlib designed a 2-level disk cache mechanism, shown in Figure 3. The left part is the original data described in Section 3.3. The first level is the expression cache, which saves all computed expressions to the disk cache; its data structure is the same as that of the original data. With the expression cache, the same expression is computed only once. After the expression cache comes the dataset cache, which stores the combined data to save combination time. The cached data of both levels are arranged by time and indexable on the time dimension, so the disk cache can be shared even when the query time range changes. Moreover, Qlib supports data updates by appending new data thanks to this time-ordered arrangement, which makes data maintenance much easier.
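A drastically simplified sketch of the disk-cache idea follows: computed results are written to disk once, keyed by a canonical expression string, and later queries read them back instead of recomputing. The real Qlib cache additionally stores time-indexed binary data and supports appends; the file naming and pickle format here are invented for illustration.

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def _cache_path(key: str) -> str:
    # One file per cached object, named by a hash of the canonical key.
    return os.path.join(CACHE_DIR, hashlib.md5(key.encode()).hexdigest() + ".pkl")

def cached(key, compute):
    """Return the cached value for `key`, computing and storing it on a miss.
    The same helper serves both cache levels: expression results and
    combined datasets, each keyed by a canonical string."""
    path = _cache_path(key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = compute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value

hits = {"n": 0}
def compute_expr():
    hits["n"] += 1
    return [1.0, 2.0, 3.0]

first = cached("Mean($close,5)", compute_expr)
second = cached("Mean($close,5)", compute_expr)  # read from disk, not recomputed
```

Keeping the cached values time-indexed, as Qlib does, is what lets a single cache entry serve queries over different time ranges and be extended by appending new days.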
As we discussed in Section 2, guidance for machine learning algorithms is very important. Qlib provides typical datasets for machine learning algorithms, along with typical task settings such as data pre-processing and learning targets, so researchers don't have to explore everything from scratch. Such guidance provides lots of domain knowledge for researchers starting their journey in this research area.

Table 1: Performance comparison of different storage solutions

For most machine learning algorithms, hyperparameter optimization is a necessary step toward better generalization. Although it is important, it takes a lot of effort and is quite repetitive. Therefore, Qlib provides a
Hyperparameters Tuning Engine (HTE) to make this task easier. HTE provides an interface to define a hyperparameter search space Θ and then searches for the best hyperparameters θ automatically. In a typical financial task of modeling time-series data, new data arrive in sequence over time. To leverage the new data, models have to be re-trained periodically. The new best hyperparameters θ change, but are often close to the previous best hyperparameters. HTE provides a mechanism dedicated to hyperparameter optimization for financial tasks: it generates a new distribution over the hyperparameter search space for a better chance of reaching the best point with fewer trials. The distribution for searching θ can be formalized as

p_new(x) = p_prior(x) · φ_{θ_prev, σ}(x) / E_{x ∼ p_prior}[φ_{θ_prev, σ}(x)]

where p_prior is the original hyperparameter search space; φ_{θ_prev, σ}(x) is the density of N(θ_prev, σ); and θ_prev is the best hyperparameter from the last model training. The domain of the hyperparameter search space remains the same, but the probability density around θ_prev increases. Qlib provides a
Config-Driven Pipeline Engine (CDPE) to help researchers more easily build the whole research workflow shown in Figure 1. The user can define a workflow with just a simple config file like the one in Figure 4 (some trivial details are replaced by "..."). Such an interface is not mandatory; users retain maximal flexibility to build a quantitative research workflow in code, like building blocks.

The performance of data processing is important to data-driven methods such as AI technologies. As an AI-oriented platform, Qlib provides a solution for data storage and processing. To demonstrate the performance of Qlib, we compare it with the other solutions discussed in Section 3.3:
HDF5, MySQL, MongoDB, InfluxDB, and Qlib under different cache settings. Qlib +E -D denotes Qlib with the expression cache enabled and the dataset cache disabled, and so forth.
Figure 4: A Configuration example of CDPE
The task for each solution is to create a dataset from the basic OHLCV (the open, high, low, close prices and trading volume of a stock) daily data of a stock market, which involves data query and processing. The final dataset consists of 14 factors/features derived from the OHLCV data (e.g., "Std($close, 5)/$close"). The data range from 1/1/2007 to 1/1/2020. The stock pool consists of 800 stocks each day and changes daily.

Besides comparing the total time of each solution, we break down the task into the following steps for more detail:

• Load Data: load the OHLCV data or cache into RAM in the array-based format for scientific computation.
• Compute Expr.: compute the derived factors/features.
• Convert Index: applies only to Qlib. Because Qlib doesn't store the indices (i.e., timestamp, stock id) in the original data, it has to set up the data indices.
• Filter Data: filter the stock data by a specific pool. For example, SP500 involves more than 1 thousand stocks in total but includes only 500 stocks on any given day. Data not included in SP500 on a specific day should be filtered out, even if the stock has ever been in SP500. It is impossible to filter data while loading, because some derived features rely on historical OHLCV data.
• Combine Data: concatenate all the data of different stocks into a single piece of array-based data.

As we can see in Table 1, Qlib's compact storage achieves similar size and loading speed to the dedicated scientific HDF5 data file. The databases take too much time loading data; after looking into the underlying implementations, we find that the data go through too many layers of interfaces and unnecessary format transformations in both the general-purpose and time-series database solutions, which greatly slows down the loading process. Due to the memory cache of Qlib, Qlib -E -D saves about 24% of the Compute Expr. time. Moreover, Qlib provides the expression cache and dataset cache mechanisms. With the expression cache enabled in Qlib +E -D, 80.4% of the Compute Expr. time is saved when no expression cache is missed. Combining the factors/features into one piece of array-based data for each stock accounts for the majority of the remaining time of Qlib +E -D, which is included in the Compute Expr. step. Beyond the computation cost, the most time-consuming step is data combination; the dataset cache is designed to reduce this overhead. As shown in the Qlib +E +D column, the time cost is further reduced.

Moreover, Qlib can leverage multiple CPU cores to accelerate computation. As we can see in the last line of Table 1, the time cost is significantly reduced for Qlib with multiple CPUs. Qlib +E +D can't be accelerated further because it just reads the existing cache and computes almost nothing.
Qlib is an open-source platform under continuous development; more detailed documentation can be found in its GitHub repository (https://github.com/microsoft/qlib/). Many features not introduced in detail in this paper (e.g., a data service with a client-server architecture, an analysis system, and automatic deployment on the cloud) are also covered in the online repository. Contributions are welcome.
In this paper, we present the practical problems of modern quantitative researchers in the age of AI. Based on these practical problems, we design and implement Qlib, which aims to empower every quantitative researcher to realize the great potential of AI technologies in quantitative investment.