Analysis of Models for Decentralized and Collaborative AI on Blockchain
Justin D. Harris
Microsoft, Montreal/Toronto, Canada [email protected]
Abstract.
Machine learning has recently enabled large advances in artificial intelligence, but these results can be highly centralized. The large datasets required are generally proprietary; predictions are often sold on a per-query basis; and published models can quickly become out of date without effort to acquire more data and maintain them. Published proposals to provide models and data for free for certain tasks include Microsoft Research's Decentralized and Collaborative AI on Blockchain. The framework allows participants to collaboratively build a dataset and use smart contracts to share a continuously updated model on a public blockchain. The initial proposal gave an overview of the framework omitting many details of the models used and the incentive mechanisms in real world scenarios. For example, the Self-Assessment incentive mechanism proposed in their work could have problems such as participants losing deposits and the model becoming inaccurate over time if the proper parameters are not set when the framework is configured. In this work, we evaluate the use of several models and configurations in order to propose best practices when using the Self-Assessment incentive mechanism so that models can remain accurate and well-intended participants that submit correct data have the chance to profit. We have analyzed simulations for each of three models: Perceptron, Naive Bayes, and a Nearest Centroid Classifier, with three different datasets: predicting a sport with user activity from Endomondo, sentiment analysis on movie reviews from IMDB, and determining if a news article is fake. We compare several factors for each dataset when models are hosted in smart contracts on a public blockchain: their accuracy over time, balances of a good and bad user, and transaction costs (or gas) for deploying, updating, collecting refunds, and collecting rewards. A free and open source implementation for the Ethereum blockchain and simulations written in Python is provided at https://github.com/microsoft/0xDeCA10B. This version has updated gas costs using newer optimizations written after the original publication.
Keywords: decentralized AI, blockchain, Ethereum, crowdsourcing, incremental learning
© Springer Nature Switzerland AG 2020. Z. Chen et al. (Eds.): ICBC 2020, LNCS 12404. The final authenticated version is available online at https://doi.org/10.1007/978-3-030-59638-5_10.

1 Introduction
The advancement of popular blockchain based cryptocurrencies such as Bitcoin [1] and Ethereum [2] has inspired research in decentralized applications that leverage these publicly available resources. One application that can greatly benefit from decentralized public blockchains is the collaborative training of machine learning models, allowing users to improve a model in novel ways [3]. There exist several proposals to use blockchain frameworks to enable the sharing of machine learning models. In DInEMMo, access to trained models is brokered through a marketplace allowing contributors to profit based on a model's usage, but it limits access to just those who can afford the price [4]. DanKu proposes a framework for competitions by storing already trained models in smart contracts, which do not allow for continual updating [5]. Proposals to change Proof-of-Work (PoW) to be more utilitarian by training machine learning models have also gained in popularity, such as A Proof of Useful Work for Artificial Intelligence on the Blockchain [6]. These approaches can incite more technical centralization, such as harboring machine learning expertise, siloing proprietary data, and restricting access to machine learning model predictions (e.g. charged on a per-query basis). In the crowdsourcing space, a decentralized approach called CrowdBC has been proposed to use a blockchain to facilitate crowdsourcing [7].

To address centralization in machine learning, frameworks to share machine learning models on a public blockchain while keeping the models free to use for inference have been proposed. One example is Decentralized and Collaborative AI on Blockchain from Microsoft Research [3]. That work focuses on the description of several possible incentive mechanisms to encourage participants to add data to train a model. This is the continuation of previous work in [3] by the author.

The system proposed in [3] is modular: different models or incentive mechanisms (IMs) can be used and seamlessly swapped; however, some IMs might work better for different models and vice-versa. The models considered here can be efficiently updated with one sample at a time, making them useful for deployment on Proof-of-Work (PoW) blockchains [3] such as the current public Ethereum [2] blockchain. The first is a Naive Bayes classifier, chosen for its applicability to many types of problems [8]. Then, a Nearest Centroid Classifier [9]. Finally, a single layer Perceptron model [10].

We evaluated the models on three datasets that were chosen as examples of problems that would benefit from collaborative scenarios where many contributors can improve a model in order to create a shared public resource. The scenarios were: predicting a sport with user activity from Endomondo [11], sentiment analysis on movie reviews from IMDB [12], and determining if a news article is fake [13]. In all of these scenarios, users benefit from having a direct impact on improving a model they frequently use and not relying on a centralized authority to host and control the model. Transaction costs (or gas) for each operation were also compared since these costs can be significant for the public Ethereum blockchain.

The Self-Assessment IM allows ongoing verification of data contributions without the need for a centralized party to evaluate them. Here are the highlights of the IM as explained in [3]:

– Deploy: One model, h, already trained with some data is deployed.
– Deposit: Each data contribution with data x and label y also requires a deposit, d.
Data and meta-data for each contribution are stored in a smart contract.
– Refund: To claim a refund on their deposit, after a time t has passed and if the current model, h, still agrees with the originally submitted classification, i.e. if h(x) == y, then the contributor can have their entire deposit d returned.
  • We now assume that (x, y) is "verified" data.
  • The successful return of the deposit should be recorded in a tally of points for the wallet address.
– Take: A contributor that has already had data verified in the
Refund stage can locate a data point (x, y) for which h(x) ≠ y and request to take a portion of the deposit, d, originally given when (x, y) was submitted.

If the sample submitted, (x, y), is incorrect, then within time t, other contributors should submit (x, y′) where y′ is the correct or at least generally preferred label for x and y′ ≠ y. This is similar to how one generally expects bad edits to popular Wikipedia [14] articles to be corrected in a timely manner.

As proposed, the Self-Assessment IM could result in problems such as participants losing deposits and the model becoming inaccurate if the proper parameters are not set when the framework is initially deployed. In this work, we analyze the choice of several possible supervised models and configurations with the Self-Assessment IM in order to find best practices.
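To make the lifecycle concrete, here is a minimal Python sketch of the Deposit and Refund stages (the Take stage is sketched later alongside the reward formula). The class and method names are our own illustration and do not correspond to the actual smart contract interface from [3]:

import time

class SelfAssessmentIM:
    """Illustrative sketch of the Self-Assessment incentive mechanism.

    `model` must expose update(x, y) and predict(x); `refund_wait_s`
    is the time t from the text, in seconds.
    """

    def __init__(self, model, refund_wait_s):
        self.model = model
        self.refund_wait_s = refund_wait_s
        self.contributions = {}  # id -> (x, y, deposit, submit time, contributor)
        self.num_verified = {}   # contributor address -> tally of verified data

    def add_data(self, contribution_id, x, y, deposit, contributor):
        # Deposit stage: record the data and meta-data, then update the model.
        self.contributions[contribution_id] = (x, y, deposit, time.time(), contributor)
        self.model.update(x, y)

    def refund(self, contribution_id):
        # Refund stage: after time t, if the model still agrees with the
        # submitted label, the full deposit is returned and (x, y) is "verified".
        x, y, deposit, submitted, contributor = self.contributions[contribution_id]
        if time.time() - submitted >= self.refund_wait_s and self.model.predict(x) == y:
            self.num_verified[contributor] = self.num_verified.get(contributor, 0) + 1
            del self.contributions[contribution_id]
            return deposit
        return 0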
2 Models

In this section, we outline several choices of machine learning model for use with Decentralized and Collaborative AI on Blockchain as proposed in [3]. The model architecture chosen relates closely to the incentive mechanism chosen. In this work, we analyze models for the Self-Assessment incentive mechanism as it appeals to the decentralized nature of public blockchains, in that a centralized organization should not need to maintain the IM, for example, by funding it [3].

For our experiments, we mainly consider supervised classifiers because they can be used for many applications and can be easily evaluated using test sets. In order to keep transaction costs low, we first propose to leverage work in the Incremental Learning space [15] by using models capable of efficiently updating with one sample. Transaction costs, or "gas" as they are called in Ethereum [2], are important for most public blockchains as a way to pay for the computation cost of executing a smart contract.

2.1 Naive Bayes

The first model is a Naive Bayes classifier, chosen for its applicability to many types of problems [8]. The Naive Bayes classifier assumes each feature in the model is independent, which is what helps make computation fast when updating and predicting. To update the model, we just need to update several counts such as the number of data points seen, the number of times each feature was seen, the number of times each feature was seen for each class, etc. When predicting, all of these counts are used for the features present in the sparse sample to compute the most likely class for the sample using Bayes' Rule [8].
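As an illustrative Python sketch (our own, mirroring the description above rather than the contract code), the count layout and Laplace smoothing are assumptions; sparse binary features are given as a list of present feature indices:

from collections import defaultdict
import math

class SparseNaiveBayes:
    """Naive Bayes over sparse binary features (illustrative sketch)."""

    def __init__(self, num_classes, smoothing=1.0):
        self.num_classes = num_classes
        self.smoothing = smoothing
        self.class_counts = [0] * num_classes
        # feature_counts[c][f] = times feature f was seen with class c.
        self.feature_counts = [defaultdict(int) for _ in range(num_classes)]

    def update(self, x, y):
        # x is a list of feature indices present in the sample; y is the class.
        self.class_counts[y] += 1
        for f in x:
            self.feature_counts[y][f] += 1

    def predict(self, x):
        total = sum(self.class_counts)
        best_class, best_score = 0, -math.inf
        for c in range(self.num_classes):
            # log P(c) plus the sum of log P(f | c), with Laplace smoothing.
            score = math.log((self.class_counts[c] + self.smoothing) /
                             (total + self.smoothing * self.num_classes))
            denom = self.class_counts[c] + 2 * self.smoothing
            for f in x:
                score += math.log((self.feature_counts[c][f] + self.smoothing) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class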
2.2 Nearest Centroid Classifier

A Nearest Centroid Classifier computes the average point (or centroid) of all points in a class and classifies new points by the label of the centroid that they are closest to [9]. It can also be easily adapted to support multiple classifications (which we do not do for this work). For this model, we keep track of the centroid for each class and update it using the cumulative moving average method [16]. Therefore we also need to record the number of samples that have been given for each class. Updating the model with one sample only needs to update the centroid for the given class, not for the other classes. This model can be used with dense data representations.
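A minimal Python sketch of this update, assuming dense feature vectors (our illustration, not the contract code):

import numpy as np

class NearestCentroidClassifier:
    """Dense Nearest Centroid Classifier with cumulative moving average
    updates (illustrative sketch)."""

    def __init__(self, num_classes, num_features):
        self.centroids = np.zeros((num_classes, num_features))
        self.counts = np.zeros(num_classes, dtype=int)

    def update(self, x, y):
        # Cumulative moving average: only the centroid for class y changes.
        n = self.counts[y]
        self.centroids[y] = (self.centroids[y] * n + np.asarray(x)) / (n + 1)
        self.counts[y] = n + 1

    def predict(self, x):
        distances = np.linalg.norm(self.centroids - np.asarray(x), axis=1)
        return int(np.argmin(distances))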
2.3 Perceptron

A single layer Perceptron model is a useful linear model for binary classification [10]. We evaluate this model because it can be used for sparse data like text as well as dense data. The Perceptron's update algorithm only updates the weights if the model currently misclassifies the sample. This is good for our system since it should help avoid overfitting. The model can be efficiently updated by just adding or subtracting, depending on the sample's label, the values for the features of the sample to or from the model's weights.
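A small Python sketch of the update rule described above (the learning rate and 0/1 label encoding are our assumptions):

import numpy as np

class Perceptron:
    """Single layer Perceptron for binary classification (illustrative sketch).
    Labels are 0 or 1; weights only change on a misclassification."""

    def __init__(self, num_features, learning_rate=1.0):
        self.weights = np.zeros(num_features)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return 1 if np.dot(self.weights, np.asarray(x)) + self.bias > 0 else 0

    def update(self, x, y):
        if self.predict(x) != y:
            # Add the sample's features for label 1, subtract them for label 0.
            direction = 1 if y == 1 else -1
            self.weights += self.learning_rate * direction * np.asarray(x)
            self.bias += self.learning_rate * direction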
3 Datasets

Three datasets were chosen as examples of problems that would benefit from collaborative scenarios where many contributors can improve a model in order to create a shared public resource. In each scenario, the users of an application that would use such a model benefit by having a direct impact on improving the model they frequently use and not relying on a centralized authority to host and control the model.
3.1 Fake News Detection

Given the text for a news article, the task is to determine if the story is reliable or not [13]. We convert each text to a sparse representation using the term frequency of bigrams, considering only the top 1000 bigrams by frequency count in the training set. While solving fake news detection is likely too difficult for simple models, a detector would greatly benefit from decentralization: freedom from being biased by a centralized authority.
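The featurization can be sketched as follows in Python; the tokenization (lowercased whitespace splitting) is an assumption, as the exact preprocessing in our public code may differ:

from collections import Counter

def top_bigrams(texts, k=1000):
    """Find the k most frequent word bigrams across the training texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return {bigram: i for i, (bigram, _) in enumerate(counts.most_common(k))}

def featurize(text, vocabulary):
    """Sparse term-frequency representation restricted to the chosen bigrams."""
    words = text.lower().split()
    counts = Counter(b for b in zip(words, words[1:]) if b in vocabulary)
    return {vocabulary[b]: c for b, c in counts.items()}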
3.2 Activity Prediction

The FitRec datasets contain information recorded from the use of participants' fitness trackers during certain activities [11]. In order to predict if someone was biking or running, we used the following features: heart rate, maximum speed, minimum speed, average speed, median speed, and gender. We did some simple feature engineering with those features, such as using average heart rate divided by minimum heart rate (sketched below). As usual, all of our code is public.

Fitness trackers and the start-ups developing them have gained in popularity in recent years. A user considering purchasing a new tracker might not trust that the manufacturer developing it will still be able to host a centralized model in a few years. The company could go bankrupt or just discontinue the service. Using a decentralized model gives users confidence that the model they need will be available for a long time, even if the company is not. This should even give them the assurance to buy the first version of a product, knowing that it should improve without them being forced into buying a later version. Even if the model does get corrupted, applications can easily revert to an earlier version on the blockchain, still giving users the service they need [3].
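A sketch of this feature engineering in Python; the record field names here are hypothetical and do not match the actual FitRec schema:

import statistics

def activity_features(record):
    """Build the dense feature vector for one workout record (illustrative)."""
    speeds = record["speeds"]            # hypothetical: list of speed readings
    heart_rates = record["heart_rates"]  # hypothetical: list of heart rate readings
    return [
        statistics.mean(heart_rates),
        max(speeds),
        min(speeds),
        statistics.mean(speeds),
        statistics.median(speeds),
        1 if record["gender"] == "male" else 0,
        # An example of the simple engineered features mentioned above:
        statistics.mean(heart_rates) / min(heart_rates),
    ]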
3.3 IMDB Movie Review Sentiment Analysis

The dataset of 25,000 IMDB movie reviews is a dataset for sentiment analysis where the task is to predict if the English text of a movie review is positive or negative [12]. We used word-based features limited to the 1000 most common words in the dataset. This particular sentiment analysis dataset was chosen for this work because of its size and popularity. Even though this dataset focuses on movie reviews, in general, a collaboratively built model for sentiment analysis can be applicable in many scenarios, such as a system to monitor posts on social media. Users could train a shared model when they flag posts or messages as abusive, and this model could be used by several social media services to provide a more pleasant experience for their users.
4 Experiments

We conducted experiments for the three datasets with each of the three models. Experiments ran in simulations to quickly determine the outcome of different configurations. The code for our simulations is all public. Each simulation starts with a model trained on 10% of the training data. The simulation then iterates over the rest of the samples in the training set, submitting each sample once.

For simplicity, we assumed that each scenario has just two agents representing the two main types of user groups: "good" and "bad". We refer to these as agents since they may not be real users but could be programs, possibly even generating data to submit. The "good" agent almost always submits correct data with the label as provided in the dataset, as a user would normally submit correct data in a real-world use case. The "bad" agent represents those that wish to decrease the model's performance, so the "bad" agent always submits data with the opposite of the label provided in the dataset. Since the "bad" agent is trying to corrupt the model, they are willing to deposit more (when required) to update the model. This allows them to update the model more quickly after the model has already been updated. The "good" agent only updates the model if the deposit required to do so is low, otherwise they will wait until later. They also check the model's recent accuracy on the test set before submitting data. In the real world, it is important for people to monitor the model's performance and determine if it is worth trying to improve it or if it is totally corrupt. If the model's accuracy is around 85% then it can be assumed to be okay and not overfitting, so ideally it should be safe to submit new data. If incorrect data was always submitted, or submitted too often by "bad" agents, then the model's accuracy should decrease and honest users would most likely lose their deposits because their data would not satisfy the refund criteria of the IM. We use loose terms here like "should" and "likely" because it is difficult to be general in terms of all types of models. For example, a rule-based model could certainly be used that memorizes training data. As long as no duplicate data is submitted with different labels, a rule-based model would allow each participant to get their deposits back and the analysis would be trivial. The characteristics of the agents are compared in Table 1, and their submission behavior is sketched below.
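The agents' submission decisions can be sketched as follows in Python (the probabilities follow Table 1; the function names are ours):

import random

def good_agent_submits(test_accuracy):
    """The "good" agent submits with probability (100 * accuracy + 15)%,
    capped at 100% (parameters from Table 1; illustrative sketch)."""
    return random.random() < min(1.0, test_accuracy + 0.15)

def choose_label(true_label, is_bad_agent):
    """The "bad" agent always flips the binary label; the "good" agent
    mislabels with probability 0.01%."""
    if is_bad_agent:
        return 1 - true_label
    if random.random() < 0.0001:
        return 1 - true_label
    return true_label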
Table 1. Characteristics of the agents' behaviors.

Characteristic                "Good" Agent             "Bad" Agent
Starting Balance              10,000                   10,000
Average Maximum Deposit       50                       100
Deposit Standard Deviation    10                       3
Average Time Between Updates  10 minutes               60 minutes
P(incorrect label)            0.01%                    100%
P(submitting)                 (100 × accuracy + 15)%   100%

Each agent must wait 1 day before claiming a refund for "good" data or reporting the data as "bad". This was referred to as t in our original paper. When reporting data as "bad", an agent can take an amount from the initial deposit proportional to the percent of "verified" contributions they have. Using the notation in our initial paper, this can be written as r(c_r, d) = d × n(c_r) / Σ_{all c} n(c), where n(c) is the number of "verified" contributions for contributor c (a helper implementing this is sketched at the end of this section). After 9 days, either agent can claim the entire remaining deposit for a specific data sample. This was t_a in our original paper.

For each dataset, we compared:

– The change of each agent's balance over time. While using the IM, an agent may lose deposits to the other agent, reclaim their deposit, or profit by taking deposits from the other agent. We monitor balances in order to determine if it can be beneficial for an agent to participate by submitting data, whether it be correct or incorrect.
– The change of the model's accuracy with respect to a fixed test set over time. In a real-world scenario, it would be important for users to monitor the accuracy as a proxy to measure if they should continue to submit data to the model. If the accuracy declines, then it could mean that "bad" agents have corrupted the model.
– The "ideal" baseline of the model's accuracy on the test set if the model were to be trained on all of the simulation data. In the real world, this would of course not be available because the data would not be known yet.

We also compared Ethereum gas costs (i.e. transaction costs) for the common actions that are done in the framework. The Update gas cost shown for each model was measured when the model did not agree with the provided label and so needed its weights to be updated. Otherwise, the Perceptron Update method would cost only slightly more than prediction, because a Perceptron model does not get updated if it currently predicts the same classification as the label it is given for a data sample. The gas cost of predicting is not shown because prediction can be done "off-chain" (without creating a transaction), which incurs no gas cost since it does not involve writing data to the blockchain. However, predicting is the most expensive operation inside of Refund and Report, so the cost of doing prediction "on-chain" can be estimated using those operations. Contracts were compiled with the "solc-js" compiler using Solidity 0.6.2.
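As referenced above, a small Python helper implementing the reward formula r(c_r, d) (an illustrative sketch; in practice the amount taken is also bounded by the remaining deposit for the reported sample):

def take_reward(deposit, reporter_verified, total_verified):
    """Amount a reporter c_r may take from a bad contribution's deposit d:
    r(c_r, d) = d * n(c_r) / (sum of n(c) over all contributors c)."""
    if total_verified == 0:
        return 0.0
    return deposit * reporter_verified / total_verified

# Example: with 30 of the 40 verified contributions overall, a reporter
# can take 75% of a 100-unit deposit.
assert take_reward(100, 30, 40) == 75.0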
4.1 Fake News

With each model, the "good" agent was able to profit and the "bad" agent lost funds. As can be seen in Fig. 1, the difference in balances was most significant with the Perceptron model. The Perceptron model had the highest accuracy, yet the Naive Bayes model was able to surpass its baseline accuracy.

Fig. 1. Plot of simulations with the Fake News dataset.

The Perceptron model had the lowest gas cost, as shown in Table 2. The deployment costs for the Naive Bayes and Sparse Nearest Centroid models were much higher because almost all of the 1000 features effectively need to be set twice (once for each class). For the Sparse Nearest Centroid Classifier, prediction (which happens in Refund and Reward) did not need to go through each dimension, because the distance to each centroid can be calculated by storing the magnitude of each centroid and then using the sparse input data to find the difference from the magnitude just for the few features in the sparse input (see the sketch below). Updating the Sparse Nearest Centroid Classifier does not need to update every feature because we store some extra information (mainly the new denominator) when we update each feature. At prediction time, we use the correct denominator.
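The magnitude trick can be sketched in Python as follows; we assume the squared magnitude of each centroid is what is stored (an illustration of the idea, not the contract's fixed-point implementation):

def sparse_squared_distance(x, centroid, centroid_sq_magnitude):
    """Squared distance from a sparse sample to a dense centroid, touching
    only the sample's non-zero features.

    x: dict mapping feature index -> value
    centroid: dense sequence of centroid values
    centroid_sq_magnitude: precomputed sum of squares of all centroid entries
    """
    result = centroid_sq_magnitude
    for f, value in x.items():
        # Replace the centroid-only term with the true squared difference.
        result -= centroid[f] ** 2
        result += (value - centroid[f]) ** 2
    return result

This works because for features absent from the sparse input, the squared difference is just the centroid value squared, which is already included in the stored magnitude.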
Table 2. Ethereum gas costs for each model for the Fake News dataset. Data samples had 15 integer features representing the presence of the top bigrams from the training data. In brackets are approximate USD values from September 2020 with a modest gas price of 4 gwei and ETH valued at 373 USD.

Action      Naive Bayes  Sparse Nearest Centroid  Sparse Perceptron
Deployment  55,511,446   67,139,037               (46.20 USD)
Update      281,447      356,345                  (0.39 USD)
Refund      172,216      176,797                  (0.21 USD)
Reward      136,800      141,253                  (0.15 USD)
4.2 Activity Prediction

As seen in Fig. 2, with each model, the "good" agent can profit while the "bad" agent wastes significant funds. The Naive Bayes (NB) and Nearest Centroid Classifier (NCC) models performed very well on this type of data, hardly straying from the ideal baseline. The linear Perceptron, on the other hand, was much more sensitive to data from the "bad" agent and its accuracy dropped significantly several times before finally recovering. This could be because the Perceptron does not update if it already agrees with the classification provided, so it might not have been able to gain as much reinforcement from correct data as the other classifiers.

The Perceptron model usually had the lowest gas cost, as shown in Table 3. The gas costs were fairly close for each action amongst the models, especially compared to the other datasets. This is mostly because there are very few features (just 9) for this dataset. Dense versions of the models are very expensive when there are many features.

Fig. 2. Plot of simulations with the Activity Prediction dataset.
4.3 IMDB Movie Reviews

Fig. 3 shows that with every model the "good" agent can profit while the "bad" agent loses most or all of its initial balance. All models maintained their accuracy with this type of data, with the Naive Bayes model performing the best. This is likely because there are so many features.
Table 3. Ethereum gas costs for each model for the Activity Prediction dataset. Data samples had 9 integer features. In brackets are approximate USD values from September 2020 with a modest gas price of 4 gwei and ETH valued at 373 USD.

Action      Naive Bayes  Dense Nearest Centroid  Dense Perceptron
Deployment  10,113,606   9,734,985               (13.39 USD)
Update      (0.33 USD)   243,164                 227,047
Refund      151,070      146,790                 (0.20 USD)
Reward      115,525      111,245                 (0.15 USD)
Fig. 3. Plot of simulations with the IMDB Movie Review Sentiment Analysis dataset.

By only a small amount, the Naive Bayes model beats the Perceptron model for the lowest Update gas cost. The gas costs for all actions are shown in Table 4. As with the Fake News dataset, the Update cost for the Sparse Nearest Centroid Classifier was low because many dimensions can be skipped. The Sparse Nearest Centroid Classifier and Naive Bayes models had much higher deployment costs since the amount of data to store was effectively double: each feature needs to be set for each of the two classes.
Table 4. Ethereum gas costs for each model for the IMDB Movie Review Sentiment Analysis dataset. Data samples had 20 integer features representing a movie review with the presence of 20 words that were in the 1000 most common words in the training data. In brackets are approximate USD values from September 2020 with a modest gas price of 4 gwei and ETH valued at 373 USD.

Action      Naive Bayes  Sparse Nearest Centroid  Sparse Perceptron
Deployment  55,423,682   67,136,669               (46.07 USD)
Update      (0.50 USD)   422,476                  332,927
Refund      189,954      196,375                  (0.22 USD)
Reward      154,538      160,831                  (0.16 USD)
5 Conclusion

In all experiments, the Perceptron model was consistently the cheapest to use. This was mostly because the size of the model was much smaller than the other two models, which need to store information for each class, effectively twice the amount of information that the Perceptron needs to store. While each model was expensive to deploy, this is a one-time cost. This cost is far less than the comparable cost of hosting a web service with the model for several months.

Most models were able to maintain their accuracy, except for the volatile Perceptron on the Activity Prediction dataset. Even if a model gets corrupted with incorrect data, it can be forked from an earlier time when its accuracy on a hidden test set was higher. It can also be retrained with data identified as "good" while it was deployed. It is important for users to be aware of the accuracy of the model on some hidden test set. Users can maintain their own hidden test sets or possibly use a service supplied by an organization that publishes a rating for a model based on the test sets it holds.

The balance plots look mostly similar across the experiments because the "good" agent was already careful and because we set a constant wait time of 9 days for either agent to claim the remaining deposit for a data contribution. The "good" agent honestly submitted correct data and only did so when they thought the model was reliable; this helped ensure that they could recover their deposits and earn rewards for reporting many contributions from the "bad" agent. When the "bad" agent is able to corrupt the model, it can successfully report a portion of the contributions from the "good" agent as bad, because the model no longer agrees with those contributions. The "bad" agent cannot claim a majority of these deposits when reporting a contribution, since they do not have as many "verified" contributions as the "good" agent. This leaves a remaining amount for which either agent must wait 9 days before taking the entire remaining deposit, hence the periodic-looking patterns in the balance plots every 9 days. The pattern continues throughout the simulation because there is always data for which the deposit cannot be claimed by either agent after just the initial refund wait time of 1 day.

Future work analyzing more scenarios is encouraged and easy to implement with our open source tools at https://github.com/microsoft/0xDeCA10B/tree/master/simulation. For example, one could change the initial balances of each agent to determine how much a "good" agent needs to spend to stop a much more resourceful "bad" agent willing to corrupt a model.
References
1. Nakamoto, S., et al.: Bitcoin: A peer-to-peer electronic cash system (2008)
2. Buterin, V.: A next generation smart contract & decentralized application platform (2015)
3. Harris, J.D., Waggoner, B.: Decentralized and collaborative AI on blockchain. 2019 IEEE International Conference on Blockchain (Blockchain) (Jul 2019)
4. Marathe, A., Narayanan, K., Gupta, A., Pr, M.: DInEMMo: Decentralized incentivization for enterprise marketplace models. 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW) (2018) 95–100
5. Kurtulmus, A.B., Daniel, K.: Trustless machine learning contracts; evaluating and exchanging machine learning models on the Ethereum blockchain (2018)
6. Lihu, A., Du, J., Barjaktarevic, I., Gerzanics, P., Harvilla, M.: A proof of useful work for artificial intelligence on the blockchain (2020)
7. Li, M., Weng, J., Yang, A., Lu, W., Zhang, Y., Hou, L., Liu, J., Xiang, Y., Deng, R.H.: CrowdBC: A blockchain-based decentralized framework for crowdsourcing. IEEE Transactions on Parallel and Distributed Systems (6) (June 2019) 1251–1266
8. Webb, G.I.: Naïve Bayes. In: Encyclopedia of Machine Learning. Springer US, Boston, MA (2010) 713–714
9. Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences (10) (2002) 6567–6572
10. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review (6) (1958) 386
11. Ni, J., Muhlstein, L., McAuley, J.: Modeling heart rate and activity data for personalized fitness recommendation. In: The World Wide Web Conference. WWW '19, New York, NY, USA, Association for Computing Machinery (2019) 1343–1353
12. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, Association for Computational Linguistics (June 2011) 142–150
13. UTK Machine Learning Club: Fake News — Kaggle (2020) [Online; accessed 07-January-2020]
14. Wikipedia contributors: Wikipedia — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Wikipedia (2020) [Online; accessed 08-January-2020]
15. Schlimmer, J.C., Fisher, D.: A case study of incremental concept induction. In: Proceedings of the Fifth AAAI National Conference on Artificial Intelligence. AAAI'86, AAAI Press (1986) 496–501
16. Wikipedia contributors: Moving average — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Moving_average