IPFS and Friends: A Qualitative Comparison of Next Generation Peer-to-Peer Data Networks

Abstract

Decentralized, distributed storage offers a way to reduce the impact of data silos as often fostered by centralized cloud storage. While the intentions of this trend are not new, the topic gained traction due to technological advancements, most notably blockchain networks. As a consequence, we observe that a new generation of peer-to-peer data networks emerges. In this survey paper, we therefore provide a technical overview of the next generation data networks. We use select data networks to introduce general concepts and to emphasize new developments. Specifically, we provide a deeper outline of the Interplanetary File System and a general overview of Swarm, the Hypercore Protocol, SAFE, Storj, and Arweave. We identify common building blocks and provide a qualitative comparison. From the overview, we derive future challenges and research goals concerning data networks.


arXiv [cs.NI], Feb

Erik Daniel and Florian Tschorsch

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.


Index Terms: Data Networks, Blockchain Networks, Peer-to-Peer Networks, Overlay Networks

I. INTRODUCTION

Erik Daniel and Florian Tschorsch are with the Department of Distributed Security Infrastructures at Technische Universität Berlin, 10587 Berlin, Germany; e-mail: [email protected] and fl[email protected].

Nowadays, users store and share data by using cloud storage providers in one way or another. Cloud storage is organized centrally, where the storage infrastructure is typically owned and managed by a single logical entity. Such cloud storage providers are responsible for storing, locating, providing, and securing data.

While cloud storage can have many economic and technical advantages, it also raises a series of concerns. The centralized control and governance leads to data silos that may affect accessibility, availability, and confidentiality. Data access might, for example, be subject to censorship. At the same time, data silos are a valuable target for breaches and for acquiring data for sale, which puts security and privacy at risk. In general, users lose their self-determined control and delegate it to a cloud provider.

One direction to break free from data silos and to reduce trust assumptions is peer-to-peer data networks. Under this umbrella term, we summarize data storage approaches that build upon a peer-to-peer (P2P) network and include aspects of data storage, replication, distribution, and exchange. As is typical for P2P networks, peers interact directly, build an overlay network, share resources, and can make autonomous local decisions. Consequently, P2P data networks strive to jointly manage and share storage.

P2P data networks are not a new technology, though. Many older P2P networks can be classified as data networks as well. The popularity of P2P technologies emerged in 1999 with the audio file sharing network Napster, closely followed by Gnutella for sharing all types of files [1]. Napster and Gnutella marked the beginning and were followed by many other P2P networks focusing on specialized application areas or novel network structures. For example, Freenet [2] realizes anonymous storage and retrieval. Chord [3], CAN [4], and Pastry [5] provide protocols to maintain a structured overlay network topology. In particular, BitTorrent [6] received a lot of attention from both users and the research community. BitTorrent introduced an incentive mechanism to achieve Pareto efficiency, trying to improve network utilization and achieve a higher level of robustness.

The recent advancements in P2P technologies affected the areas of distributed file systems [7] and content distribution technologies [8]. This trend also falls under the umbrella of data networks in general and P2P data networks in particular.

One component that seemed to be missing in P2P file sharing systems was a way to improve long-term storage and availability of files. With the introduction of Bitcoin [9] in 2008, the P2P idea in general and joint data replication in particular gained new traction. Distributed ledger technologies provide availability, integrity, and Byzantine fault tolerance in a distributed system. In particular, cryptocurrencies showed their potential as a monetary incentive mechanism in a decentralized environment. These and additional trends and developments, e.g., Kademlia [10] and information-centric networking [11], led to the invention of what we denote the next generation of P2P data networks.

In this survey paper, we provide a technical overview of the new generation of P2P data networks. We show how these new systems are built, how they utilize the experience and research results from previous systems, as well as new developments and advancements over the last decade. We identify building blocks, similarities, and trends of these systems. While some of the systems are building blocks themselves for other applications, e.g., decentralized applications (DApps), we focus on two main system aspects: content distribution and distributed storage. Furthermore, we provide insights into the incentive mechanisms deployed for retrieving or storing files, or both. To this end, we focus on select systems with interesting mechanisms, different use cases, and different degrees of content and user privacy. Our overview focuses on concepts and abstracts from implementation details to extract general insights. Yet, it should be noted that the systems are prone to change due to ongoing development. Our survey paper makes use of a wide range of sources, including peer-reviewed papers, white papers, as well as documentation, specifications, and source code.

Specifically, we focus on IPFS [12], Swarm [13], the Hypercore Protocol [14], SAFE [15], Storj [16], and Arweave [17]. In particular, IPFS has gained popularity as a storage layer for blockchains [18, 19, 20, 21, 22, 23, 24] and was the subject of a series of studies [25, 26, 27, 28, 29, 30, 31, 32, 33, 34].

Furthermore, we put our overview of these systems in context to preceding systems and research directions, namely BitTorrent, information-centric networking, and blockchains. By contrasting precursor systems, we sketch the evolution of data networks and are able to discuss the advancements of the next generation in depth.

From our overview we are able to extract the building blocks and interesting aspects of P2P data networks. While all systems allow distributed content sharing and storage, they seem to focus on either of the aspects. That is, each system aims to serve a slightly different purpose with different requirements and points of focus. This leads to different design decisions in network organization, file lookup, degree of decentralization, redundancy, and privacy. For example, Storj aims for a distributed cloud storage while the Hypercore Protocol focuses on distributing large datasets. Similarly, IPFS aims to replace the client-server structure of the web and therefore needs a stronger focus on data lookup than BitTorrent, where mainly each file is located in its own overlay network. At the same time, we found many similarities in the approach of building data networks, for example, using Kademlia to structure the network or find peers, splitting files into pieces, or incentivizing different tasks to increase functionality.

The remainder is structured as follows: The survey transitions from a system view, over a component view, to a research perspective on data networks. As part of the system view, we first provide background information on technological precursors of data networks (Section III). Subsequently, we introduce “IPFS and Friends” and provide a detailed technical overview of the next generation of data networks (Section IV and Section V). Lastly, we mention related systems and concepts (Section V-F). As part of the component view, we derive the building blocks of data networks and share insights gained from the technical overview (Section VI). Finally, we transition to a research perspective and identify research areas and open challenges (Section VII). Section II references related survey papers and Section VIII concludes this survey.

II. RELATED SURVEYS

In this section, we guide through the broad landscape of data networks and provide additional references to related survey papers. In contrast to the existing literature, we provide a comparative overview of next generation data networks, i.e., P2P data networks. We focus on storage and content sharing independent of the utilization of a blockchain.

Androutsellis-Theotokis and Spinellis [8] give a state-of-the-art (2004) overview of P2P content distribution technologies, providing a broad overview of the previous generation. Other previous works also take closer looks at the previous generation with a focus on specific P2P data networks (e.g., FreeNet and Past) [7, 35] or decentralized file systems in general (e.g., Google FS and Hadoop Distributed FS) [36].

Research on next generation data networks particularly focuses on the interaction with blockchains. Huang et al. [37] mainly cover IPFS and Swarm, and Benisi et al. [38] cover them with an even stronger focus on the blockchain aspects. Casino et al. [39] take a closer look at the immutability of decentralized storage and its consequences and possible threats.

Fig. 1: Conceptional overview of BitTorrent. (Figure labels: seed 𝑆, leechers 𝐿, tracker 𝑇, peer, file.torrent, announce, pieces 𝑐.)

III. PRECURSORS

The next generation of data networks uses ideas of precursor systems. In this section, we provide an introduction to three important precursor systems, which influenced the design of the presented data networks, specifically, BitTorrent, information-centric networking, and blockchains.

A. BitTorrent

The BitTorrent protocol [6] is a P2P file sharing protocol. It has an incentive structure controlling the download behavior, attempting to achieve fair resource consumption. The goal of BitTorrent is to provide a more efficient way to distribute files compared to using a single server. This is achieved by utilizing the fact that files are replicated with each download, making the file distribution self-scalable.

Files are exchanged in torrents. In general, each torrent is a P2P overlay network responsible for one file. To exchange a file with the BitTorrent protocol, a .torrent file is created, containing meta-data of the file and a contact point, a tracker. It is also possible to define multiple files in a .torrent file. The torrent file needs to be made available, e.g., on a web server, before the file can be shared. The tracker serves as a bootstrapping node for the torrent. Peers that have complete files are called seeders and peers still missing chunks are called leechers. Leechers request chunks and serve simultaneously as download points for already downloaded chunks.

A conceptional overview of how BitTorrent deals with files can be seen in Fig. 1. The roles and their interaction are as follows: a peer gets the .torrent file, contacts the tracker 𝑇 listed in the .torrent file, gets a list of peers, connects to the peers, and becomes a leecher. In the figure, the peer 𝑆 serves as a seed of the file and the peers 𝐿_𝑖 represent the leechers requesting the different chunks. As illustrated for the .torrent file, the file is split into chunks 𝑐_𝑖. After a leecher has successfully acquired all chunks, it becomes a new seed. Seed 𝑆 and the leechers build the torrent network for the file. Other files are distributed in different torrent networks with possibly different peers.

Instead of the presented centralized trackers, there are also trackerless torrents. In a trackerless torrent, seeds are found with a distributed hash table (DHT). The client derives the key from the torrent file and the DHT returns a list of available peers for the torrent. The BitTorrent client can use a predetermined node or a node provided by the torrent file for bootstrapping the DHT.

The feature that made BitTorrent unique (and probably successful) is the explicit incentivization of peers to exchange data, which is implemented in the file sharing strategies rarest piece first and tit-for-tat. Rarest piece first describes the chunk selection of BitTorrent. It minimizes chunk overlap, making file exchange more robust against node churn. The chunks that are most uncommon in the network are preferably selected for download. Tit-for-tat describes the bandwidth resource allocation mechanism. In BitTorrent, peers decide to whom they upload data based on the data downloaded from a peer. This should prevent leechers from only downloading without providing any resources to others.

BitTorrent is well researched [40, 41, 42] and has stood the test of time. The BitTorrent Foundation and Tron Foundation developed BitTorrent Token (BTT) [43], which serves as an additional blockchain-based incentive layer to increase the availability and persistence of files.

B. Information-Centric Networking

Another precursor we want to mention is information-centric networking (ICN). Even though ICN is not a P2P data network, some of its ideas and concepts are at least similar to some data networks. Contrary to P2P data networks, ICN proposes to change the network layer. The routing and flow of packets should change from point-to-point location search to requesting content directly from the network. As an example, let us assume we want to retrieve some data, e.g., a website, and we know that this website is available at example.com. First, we request the location of the host of the site via DNS, i.e., the IP address. Afterwards, we establish a connection to retrieve the website. In ICN, we would request the data directly and would not address the host where the data is located. Any node storing the website could provide the data immediately.

One way to enable such a mechanism and to ensure data integrity is to use hash pointers (or more generically content hashes) to reference content. The content of a file is used as input of a cryptographic hash function, e.g., SHA-3. The resulting digest can then be used to identify the content and the client can verify the integrity of the file locally. The cryptographic properties of the hash function, most importantly pre-image and collision resistance, ensure that nobody can replace or modify the input data without changing its digest.

Jacobson et al. [44] proposed content-centric networking, where these content requests are interest packets. Owner(s) of the content can then directly answer the interest packet with data packets containing the content. This requires other mechanisms for flow control, routing, and security on an infrastructure level. Interest packets are broadcast and peers sharing interest in data can share resources. There are multiple projects dealing with ICN, e.g., Named Data Networking [45] (NDN). With nTorrent [46], Mastorakis et al. propose an extension of NDN to implement a BitTorrent-like mechanism in NDN. Further information on ICN can be found in [11]. Since ICN typically requires a revised network layer, many of the concepts are realized as P2P networks. Most prominently, IPFS integrates ideas of ICN, which we discuss in the following section.
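The hash-pointer idea described in this subsection can be illustrated with a short sketch. The snippet is only an illustration under simplifying assumptions: the names `content_id`, `ContentStore`, and `fetch_verified` are hypothetical, nodes are in-memory objects rather than network hosts, and SHA3-256 stands in for whichever hash function a concrete system uses. It shows that a client may accept data from any node, because it recomputes the digest locally and compares it with the requested name.

```python
import hashlib
from typing import Optional


def content_id(data: bytes) -> str:
    """Name content by its SHA3-256 digest (hex encoded)."""
    return hashlib.sha3_256(data).hexdigest()


class ContentStore:
    """Toy node cache: maps content hashes to bytes, independent of location."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def put(self, data: bytes) -> str:
        name = content_id(data)
        self._blobs[name] = data
        return name

    def get(self, name: str) -> Optional[bytes]:
        return self._blobs.get(name)


def fetch_verified(name: str, nodes: list) -> bytes:
    """Ask nodes in turn and accept the first reply whose digest matches the name."""
    for node in nodes:
        data = node.get(name)
        if data is not None and content_id(data) == name:
            return data
    raise LookupError("no node returned valid content for " + name)
```

In this sketch a node that returns modified bytes is simply skipped, so the client does not need to trust whichever node happens to answer.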

C. Blockchain

The introduction of Bitcoin [9] in 2008 enabled new possibilities for distributed applications. Bitcoin is an ingenious, intricate combination of ideas from the areas of linked timestamping, digital cash, P2P networks, Byzantine fault tolerance, and cryptography [47, 48]. One of the key innovations that Bitcoin brought forward was an open consensus algorithm that actively incentivizes peers to be compliant. To this end, it uses the notion of coins, which are generated in the process, i.e., mining.

While the term blockchain typically refers to an entire system and its protocols, it also refers to a particular data structure, similar to a hash chain or tree. That is, a blockchain orders blocks that are linked to their predecessor with a cryptographic hash. This linked data structure ensures the integrity of the blockchain data, e.g., transactions. The blockchain’s consistency is secured by a consensus algorithm, e.g., in Bitcoin the Nakamoto consensus. For more details on Bitcoin and blockchains, we refer to [48].

In a nutshell, a blockchain provides distributed, immutable, and ordered storage. Unfortunately, the feasibility of a purely blockchain-based data network is limited, due to a series of scalability problems and limited on-chain storage capacity [49, 50]. Moreover, storing large amounts of data in a blockchain that was designed as a medium of exchange and store of value, i.e., cryptocurrencies such as Bitcoin, leads to high transaction fees. However, research and development of blockchains shows the feasibility of blockchain-based data networks, e.g., Arweave (cf. Section V-E).

In general, however, cryptocurrencies allowing decentralized payments can be used in P2P data networks as an incentive structure. As we will elaborate in the following, such an incentive structure can increase the robustness and availability of data networks and therefore address weaknesses of previous generations.

IV. INTERPLANETARY FILE SYSTEM (IPFS)

The Interplanetary File System (IPFS) [12] is a bundle of subprotocols and a project initiated by Protocol Labs. IPFS aims to improve the web’s efficiency and to make the web more decentralized and resilient. IPFS uses content-based addressing, where content is not addressed via a location but via its content. The way IPFS stores and addresses data, with its deduplication properties, allows efficient storage of data.

Through IPFS it is possible to store and share files in a decentralized way, increasing censorship-resistance for its content. IPFS can be used to deploy websites, building a distributed web. It is used as a storage service complementing blockchains, enabling many different applications on top of IPFS [18, 19, 20, 21, 22, 23, 24].

Since IPFS uses content-based addressing, it focuses mainly on immutable data. IPFS, however, supports updatable addresses for content by integrating the InterPlanetary Name System (IPNS). IPNS allows the linking of a name (hash of a public key) with the content identifier of a file. By changing the mapping of fixed names to content identifiers, file updates can be realized. Please note, however, that content identifiers are unique and file specific.

Fig. 2: Example IPFS network topology. (Figure labels: nodes 𝑁; legend: Kademlia, Random.)

In addition, IPFS employs its own incentive layer, i.e., Filecoin [51], to ensure the availability of files in the network. Yet, IPFS works independently from Filecoin and vice versa. This is a prime example of how a cryptocurrency can be integrated to incentivize peers.
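The relation between fixed IPNS names and changing content identifiers can be sketched as follows. This is a toy model, not the IPNS protocol: `NameRegistry`, `cid_of`, and the sequence counter are hypothetical names, the "public key" is just random bytes, and the signature check that would let only the key holder publish is deliberately omitted.

```python
import hashlib
import os


def cid_of(content: bytes) -> str:
    """Stand-in content identifier: the SHA-256 digest of the content."""
    return hashlib.sha256(content).hexdigest()


class NameRegistry:
    """Toy stand-in for IPNS: a fixed name points to a replaceable CID."""

    def __init__(self) -> None:
        self._records: dict = {}  # name -> (sequence number, cid)

    @staticmethod
    def name_for(public_key: bytes) -> str:
        # The name is the hash of the publisher's public key.
        return hashlib.sha256(public_key).hexdigest()

    def publish(self, public_key: bytes, cid: str) -> str:
        # A real system would verify a signature here (omitted in this sketch).
        name = self.name_for(public_key)
        sequence = self._records.get(name, (0, ""))[0] + 1
        self._records[name] = (sequence, cid)
        return name

    def resolve(self, name: str) -> str:
        return self._records[name][1]


key = os.urandom(32)  # placeholder for a real public key
registry = NameRegistry()
name = registry.publish(key, cid_of(b"draft of the survey"))
registry.publish(key, cid_of(b"final survey"))  # same name, new CID
```

The name stays stable across updates while each version of the file keeps its own content identifier, which is the behaviour the text attributes to IPNS.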

A. General Functionality

IPFS uses the modular P2P networking stack libp2p. In fact, libp2p came into existence from developing IPFS. In IPFS, nodes are identified by a node id. The node id is the hash of their public key. For joining the network, the IPFS development team deployed some bootstrap nodes. By contacting these nodes, a peer can learn of new peers. The peers with which a node is connected form its swarm. Peers can be found via a Kademlia-based DHT. The communication between connected peers can be encrypted. While IPFS uses Kademlia, its connections are not completely determined by Kademlia. In IPFS, a node establishes a connection to newly discovered nodes and then tries to put them in buckets. Connections are closed randomly once a threshold is reached [32]. Fig. 2 shows an exemplary network using the Kademlia structure of Fig. 3 (solid lines) and random connections (dashed lines). To this end, we assume that the network consists of 13 nodes with 8 bit identifiers.

Fig. 3: Kademlia tree with 13 nodes and random ids.

IPFS uses content-based addressing. An object (file, list, tree, commit) is split into chunks or blocks. Each block is identifiable by a content identifier (CID), which can be created based on a recipe from the content. From these blocks a Merkle directed acyclic graph (DAG) is created. The root of the Merkle DAG can be used to retrieve the file. IPFS employs block deduplication: each stored block has a different CID. This facilitates file versioning, where a newer version of the file shares a lot of blocks with the older version. In this case, only the differences between the versions need to be stored instead of two complete Merkle DAGs. The blocks have an added wrapper specifying the UNIXFS type of the block.

As an example, we assume the survey and an earlier draft are stored on IPFS. Fig. 4 is a simplified representation of the Merkle DAGs of the two files. Each node represents a chunk and the label represents the node CID, the content hash. The DAG is created from bottom to top, since the CID of an intermediate node depends on its descendants. The actual data is located in the leaves. In the final version additional information was appended to the content, which results in a different root node and additional nodes. Therefore, in our example, 𝐹 is the root CID of the draft and 𝐹′ the root of the finished survey.

Fig. 4: Simplified IPFS file structure visualizing Merkle DAGs of CIDs and the concept of deduplication. (a) File 𝐹 = ((𝑎, 𝑏), (𝑐, 𝑑)). (b) File 𝐹′ = ((𝑎, 𝑏), (𝑐, 𝑑′), (𝑒, 𝑓)).

The blocks themselves are stored on devices or providers. The DHT serves as a look-up for data providers. As in Kademlia, nodes with node ids closest to the CID store the information about the content providers. A provider can announce that it is storing specific blocks. The possession of blocks needs to be reannounced in a certain time frame.

The actual exchange of blocks is handled by the Bitswap Protocol. Each node has a want, have, and do not want list. The different lists contain CIDs which the node wants, has, or does not want. CIDs on a do not want list are not even cached and simply dropped on receipt. A node sends the CIDs on its want list to the connected neighbors, its swarm. Neighbors in possession of this block send the block and a recipe for creating the CID. The node can then verify the content by building the CID from the recipe. If no neighbor possesses a wanted CID, IPFS performs a DHT lookup. After a successful DHT lookup, a node possessing the CID is added to the swarm and afterwards the added node is sent the want list.

For a peer to download a file it needs to know the root CID. After acquiring the CID of an object’s Merkle DAG root, it can put this root CID on the want list and the previously described Bitswap/DHT takes over. The root block gives information about its nodes, resulting in new CIDs which have to be requested. Subsequent CID requests are not sent to all neighbors. The neighbors answering the root CID are prioritized and are grouped in a session. Since version 0.5, Bitswap sends a WANT-HAVE message for subsequent requests to multiple peers in the session and an optimistic WANT-BLOCK message to one peer. The WANT-HAVE message asks if the peer possesses the block and WANT-BLOCK messages request the block directly. If a block is received, other pending requests can be canceled with a CANCEL message [34]. Previously, neighbors were asked for the block simultaneously, resulting in possibly receiving a block multiple times. Once all leaves of the tree are acquired, the file is locally available. Files are not uploaded to the network; only possession is announced.

Using our previous example of the stored surveys, we assume the earlier draft, 𝐹, is available at the author’s and coauthor’s nodes, and the final version, 𝐹′, is available at the author’s and coauthor’s nodes as well as at the nodes of three reviewers. There is no additional replication due to the protocol.

IPFS does not have any implicit mechanisms for repairing and maintaining files or ensuring redundancy and availability in the network. Files can be “pinned” to prevent a node from deleting blocks locally. Otherwise content is only cached and can be deleted via garbage collection at any point in time. Furthermore, files cannot be intentionally deleted on other nodes; deletes always happen locally only. For a file to disappear, it needs to be removed from every cache and every pinning node. For storage guarantees Filecoin exists.

Filecoin [51] employs a storage and retrieval market for storing and retrieving files. While the storage and retrieval markets handle their tasks slightly differently, the main principle is the same. There are three different orders: bid, ask, and deal. The bid order is a notification of the client that it wants to store or retrieve files. The ask order is a notification from a storage or retrieval node announcing storage or retrieval conditions. The deal order is the actual deal between bid and ask orders.

The trustworthiness of storage nodes is secured using a blockchain-based structure with proof of space-time and proof of replication. The Filecoin network is responsible for punishing dishonest nodes. The storage market is for storing content over time. The retrieval market is for compensating the provision of files via payment channels.

B. Features

IPFS is very flexible: it supports multiple transport and network protocols as well as multiple cryptographic hash functions. To make this possible, IPFS uses multi-address and multi-hash.

Multi-address is a path structure for encoding addressing information. It allows a peer to announce its contact information (e.g., IPv4 and IPv6), transport protocol (e.g., TCP and UDP), and port.

Multi-hash is used to support multiple different hash functions. The digest value is prepended with the digest length and the hash function type. Multi-hashes are used for the IPFS node id and as part of the CID.

The CID in IPFS is used for identifying blocks. A CID is a cryptographic hash of its content with added meta data. The meta data includes the used hashing algorithm and its length (multi-hash), the encoding format (InterPlanetary Linked Data), and the version. In other words, the multi-hash prepended with encoding information is InterPlanetary Linked Data (IPLD), and IPLD prepended with version information is the IPFS CID.

While IPFS itself has no mechanism to ensure redundancy/availability, IPFS Cluster allows the creation and administration of an additional overlay network of nodes, separate from the IPFS main network. IPFS Cluster helps to ensure data redundancy and data allocation in a defined swarm. The cluster manages pinned data, maintains a configured number of replicas, repins content if necessary, and considers free storage space while selecting nodes for pinning data. IPFS Cluster needs a running IPFS node. IPFS Cluster uses libp2p for its networking layer.

IPFS Cluster ensures horizontal scalability of files without any incentives. It can be used by a content provider to increase availability without relying on caching in the network. Filecoin can be used to incentivize others to store files.
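The layering of multi-hash and CID described in this subsection can be sketched in a few lines. The sketch makes several assumptions that are not taken from the text: the numeric codes (0x12 for SHA2-256, 0x55 for a raw block, 0x01 for the CID version) follow the public multiformats table, each field is written as a single byte although the real formats use variable-length integers, and `make_cid`, `parse_cid`, and `BlockStore` are hypothetical names. The small block store illustrates why identical blocks are stored only once when blocks are keyed by CID.

```python
import hashlib

# Assumed code points, taken from the public multiformats table.
SHA2_256 = 0x12  # hash function type
RAW = 0x55       # encoding format: raw binary block
CID_V1 = 0x01    # CID version


def multihash(data: bytes) -> bytes:
    """Layout: <hash function type><digest length><digest>."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256, len(digest)]) + digest


def make_cid(data: bytes, codec: int = RAW) -> bytes:
    """Layout: <version><encoding format><multihash>."""
    return bytes([CID_V1, codec]) + multihash(data)


def parse_cid(cid: bytes) -> dict:
    """Split a CID built by make_cid back into its labelled fields."""
    version, codec, hash_type, length = cid[0], cid[1], cid[2], cid[3]
    digest = cid[4:]
    if len(digest) != length:
        raise ValueError("digest length does not match the length prefix")
    return {"version": version, "codec": codec, "hash_type": hash_type,
            "length": length, "digest": digest}


class BlockStore:
    """Blocks keyed by CID: identical blocks occupy a single entry."""

    def __init__(self) -> None:
        self.blocks: dict = {}

    def add_file(self, data: bytes, chunk_size: int = 4) -> list:
        cids = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            cid = make_cid(chunk)
            self.blocks[cid] = chunk  # same CID, same bytes: nothing new is stored
            cids.append(cid)
        return cids


store = BlockStore()
draft = store.add_file(b"aaaabbbbccccdddd")
final = store.add_file(b"aaaabbbbccccDDDDeeee")
```

Here the two versions share their first three blocks, so the store ends up holding six blocks rather than nine. Real CIDs are usually shown in a text encoding, which this sketch leaves out.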

C. Discussion

IPFS uses many interesting concepts. Concepts like content addressing and deduplication could improve retrieval times and storage overhead.

The flexible design makes it harder to get into the topic of IPFS. While encryption is supported in IPFS, there are no additional mechanisms for increasing the privacy of its participants. The want and have lists might provide sensitive information about the participants. IPFS could have similar privacy problems to BitTorrent. Furthermore, for good and bad, it is not possible to prevent replication or enforce deletion of content once released.

IPFS is a popular research topic. Next to investigations of possible use cases for IPFS, IPFS itself is also investigated [25, 26, 27, 28, 29, 30, 31, 32, 33, 34], with researchers analyzing performance and efficiency of the system.

V. RELATED P2P DATA NETWORKS

Next to IPFS, many data networks are in development. We give an overview of five other data networks, pointing out their main concepts. A summary and comparison of BitTorrent, IPFS, and the following data networks can be seen in TABLE I.

A. Swarm

Swarm [13] is a P2P distributed platform for storing and delivering content developed by the Ethereum Foundation. It provides censorship-resistance by not allowing any deletes, as well as upload and forget properties. Swarm is built for Ethereum [54] and therefore in some parts depends on and shares design aspects of Ethereum.

TABLE I: General overview of the different data networks.

System | Main Goal and Distinct Feature | File Persistence | Token | Mutability
BitTorrent [6] | Efficient file distribution utilizing tit-for-tat to provide Pareto optimality | not guaranteed | BitTorrent-Token [43] | -
IPFS [12, 52] | Decentralized web achieving fast distribution through content addressing and wide compatibility | not guaranteed | Filecoin [51] | IPNS
Swarm [13, 53] | Decentralized storage and communication infrastructure backed by a sophisticated Ethereum-based incentive mechanism | not guaranteed | Ethereum [54] | ENS, Feeds
Hypercore [14, 55] | Simple sharing of large mutable data objects (folder synchronization) between selected peers | not guaranteed | - | yes
SAFE [15, 56] | Autonomous data and communications network using self-encryption and self-authentication for improved decentralization and privacy | public guaranteed, private deletable | Safecoin | specific
Storj [16, 57] | Decentralized cloud storage that protects the data from Byzantine nodes with erasure codes and a reputation system | determined lifetime, deletable on request | Centralized Payments | yes
Arweave [17, 58] | Permanent storage in a blockchain-like structure including content filtering | blockweave | Arweave token | -

The aim of Swarm is the provision of decentralized storage and streaming functionality for the web3 stack. Swarm is the “hard disk of the world computer” as envisioned by the Ethereum Foundation.

Similar to IPFS, Swarm uses content-based addressing. In Swarm, the content-based addressing further decides the storage location. To ensure availability, Swarm introduces areas of responsibility. An area of responsibility consists of the close neighbours of a node. The nodes in an area of responsibility should provide chunk redundancy. Mutability is supported through versioning, keeping each version of the file. Feeds, specially constructed and addressed chunks, and the Ethereum Name Service (ENS) are used for finding the mutated files. ENS is a standard defined in the Ethereum Improvement Proposal 137 [59]. It provides the ability to translate addresses into human-readable names. In contrast to IPNS, ENS is implemented as a smart contract on the Ethereum blockchain.

To ensure compliant node behavior, Swarm provides an incentive layer. The incentive structure is based on SWAP, SWEAR and SWINDLE. The Swarm Accounting Protocol (SWAP) handles the balancing of data exchange between nodes. The balance can be settled with cheques, which can be interpreted as a simple one-way payment channel. SWarm Enforcement And Registration (SWEAR) and Secured With INsurance Deposit Litigation and Escrow (SWINDLE) shall ensure persistence of content. Furthermore, Swarm’s incentive structure has postage stamps, which provide a mechanism against junk uploads, and also a lottery mechanism to incentivize the continued storage of chunks.
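The accounting idea behind SWAP, mutual service that cancels out and a cheque once the imbalance grows too large, can be sketched as follows. This is a simplified model under assumptions of our own: `SwapLedger`, `Cheque`, the unit price per chunk, and the payment threshold are hypothetical, and treating cheques as cumulative totals is one way to model a one-way payment channel rather than a statement about Swarm's wire format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cheque:
    beneficiary: str
    cumulative_amount: int  # total paid so far; a newer cheque supersedes older ones


@dataclass
class SwapLedger:
    """Per-peer accounting from the point of view of one node."""
    payment_threshold: int = 100                  # hypothetical accounting units
    balances: dict = field(default_factory=dict)  # peer -> what the peer owes us
    issued: dict = field(default_factory=dict)    # peer -> cumulative cheque total

    def served(self, peer: str, chunks: int, price: int = 1) -> None:
        """We delivered chunks to the peer, so the peer owes us more."""
        self.balances[peer] = self.balances.get(peer, 0) + chunks * price

    def received(self, peer: str, chunks: int, price: int = 1) -> Optional[Cheque]:
        """We consumed chunks from the peer; issue a cheque if our debt is too high."""
        balance = self.balances.get(peer, 0) - chunks * price
        cheque = None
        if -balance >= self.payment_threshold:
            self.issued[peer] = self.issued.get(peer, 0) + (-balance)
            cheque = Cheque(peer, self.issued[peer])
            balance = 0  # the debt is settled off the ledger
        self.balances[peer] = balance
        return cheque
```

As long as two peers serve each other roughly equally, no cheque is ever issued; only a persistent imbalance results in a payment.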

Discussion:

Swarm provides interesting incentive concepts. Settling unbalanced retrieval with cheques provides a faster and cheaper way to settle discrepancies than relying on blockchain transactions. The postage stamps with the lottery give an additional incentive for storing chunks. Additionally, while it does cost to upload content, nodes can earn back the cost by actively serving chunks to participants.

Feeds can provide user-defined space in the network. Through pinning and recovery feeds, Swarm can mitigate the disadvantage of the Distributed Immutable Store for Chunks (DISC), where the location cannot be freely chosen, which would be possible with a normal DHT.

However, Swarm clearly depends on the Ethereum ecosystem. While this is advantageous for the incentive structure, since Ethereum is actively developed and has a broad user base, it also requires users to depend on Ethereum.

Furthermore, the postage stamps give a clear link to a user uploading content. While Swarm provides a certain degree of sender anonymity, the upload pseudonymity might limit available content.

While Swarm has a potentially large user base due to its high compatibility and integration with Ethereum, research on use cases or research investigating Swarm’s mechanisms is rare. The connection of Swarm and Ethereum could be one reason for a lack of research, since Swarm seems less complete than IPFS and Ethereum itself still offers many research opportunities.

B. Hypercore Protocol/Dat

The Hypercore Protocol [14, 60] (formerly Dat Protocol) supports incremental versioning of the content and meta data similar to Git. The Hypercore Protocol consists of multiple sub-components. While strictly speaking Hypercore is one of the sub-components, for simplicity we use the term to reference the Hypercore Protocol in general. In Hypercore, data is stored in a directory-like structure and, similar to BitTorrent, each directory is handled by its own network. The protocol supports different storage modes, where each node can decide which data of a directory and which versions of the data it wants to store. Furthermore, the protocol supports subscription to live changes of all or any files in a directory. All communication in the protocol is encrypted. In order to find and read the data it is necessary to know a specific read key.

The protocol is designed to share large amounts of mutable data. The motivation for creating the protocol was to prevent link rot and content drift of scientific literature. The protocol allows sharing of only part of the data with random access.

Hypercore can be understood as sharing a folder. Files in a folder can be modified, added, and deleted, which means that mutable files are supported.

Discussion:

Hypercore allows sharing of data by exchanging a public key. It is possible to acquire a specific version and only specific regions of the data. This makes sharing simple, especially for large datasets, and allows mutable data. The protocol natively concentrates on sharing collections of files, which broadens the usability of the protocol.

Due to the encryption and a discovery key, the protocol ensures confidentiality. A public key allows the calculation of the discovery key, but the public key cannot be derived from the discovery key. This prevents others from reading the data. A downside of Hypercore is the lack of additional authentication mechanisms beyond the public key, which prevents additional fine-grained access control. Furthermore, it still leaks meta data since the discovery key is only a pseudonym.

Hypercore has no incentive structure for replicating data and the data persistence relies on its participants.

Research utilizing or analyzing Hypercore/Dat is rare. While the protocol seems well developed and usable, research seems to focus on IPFS instead.
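The one-way relation between public key and discovery key can be sketched with a keyed hash. The use of BLAKE2b, the constant `b"hypercore"`, and the 32-byte output are assumptions of this sketch and should be checked against the Hypercore implementation before relying on them.

```python
import hashlib


def discovery_key(public_key: bytes) -> bytes:
    """Derive a discovery key from a public key with a keyed hash.

    Assumption: a 32-byte keyed BLAKE2b digest over a fixed label. The
    hash is one-way, so knowing the discovery key does not reveal the
    public key that is needed to read the data.
    """
    return hashlib.blake2b(b"hypercore", key=public_key, digest_size=32).digest()


public_key = bytes(range(32))  # placeholder key material, not a real key
print(discovery_key(public_key).hex())
```

Peers that already know the public key can compute the same discovery key and find each other, while observers of the discovery key only see a stable pseudonym.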

C. Secure Access For Everyone (SAFE)

The Secure Access For Everyone (SAFE) network [15, 61] is designed to be a fully autonomous decentralized data and communication network. Even authentication follows a self-authentication [62] mechanism, which does not rely on any centralized component. The main goal of SAFE is to provide a network which everyone can join and use to store, view, and publish data without leaving a trace of their activity on the machine. This would allow participants to publish content with low risk of persecution.

SAFE supports three different data types: Map, Sequence, and Blob. The data can be further divided into public and private data. Map and Sequence are Conflict-free Replicated Data Types, which is important in case of mutable data to ensure consistency. The Blob is for immutable data. All data in the SAFE network is encrypted, even public data. The encryption algorithm used is self-encryption [63], which uses the file itself to encrypt the file. A file is split into at least three fixed-size chunks. Each chunk is hashed and encrypted with the hash of the previous chunk, i.e., $n - 1$ where $n$ is the current chunk. Afterwards, the encrypted chunk gets obfuscated with the chunk at position $n - 2$. In case of SAFE, the obfuscated chunks are stored in the network. For decrypting, a data map is created during the encryption process. The data map contains information about the file and maps the hash of obfuscated chunks to the hash of the real chunks. For public data the decryption keys are provided by the network. While private data can be deleted, public data should be permanent. Therefore mutable data can only be private. A Name Resolution System allows human-readable addresses for retrieving data.

In the SAFE network, storing data is charged with the network's own currency, i.e., Safecoin. The Safecoin balance of the clients is monitored by client managers and approved/rejected with the help of SAFE's consensus mechanisms. Nodes can earn Safecoin by farming, i.e., providing content to requesters.

Discussion:

The self-authentication, self-encryption, and the network organization give the user a high degree of control over their data. The absence of central components reduces single points of failure. Furthermore, privacy and to a certain degree anonymity are key features of the SAFE network. The network requires authentication for storing data only. Retrieving data is mediated via a client-selected proxy, which provides pseudonymous communication. Safecoin is intended to provide an incentive layer which ensures the availability and reliability of the network.

Paul et al. [64] provided a first security analysis of SAFE in 2014, concerning confidentiality, integrity, and availability as well as possible attacks. In 2015, Jacob et al. [65] analyzed the security of the network with respect to authenticity, integrity, confidentiality, availability, and anonymity. The authors explained how the self-authentication and the decentralized nature could potentially be exploited to reveal personal data of single entities.

SAFE has been in development since 2006 and considers recent research and developments, but remains (at the time of writing) in its alpha phase. We feel that SAFE has the potential to establish the topic of anonymity as a unique feature when compared to the other data networks.
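The self-encryption steps described for SAFE can be sketched as follows. This is a toy model under several assumptions: a SHA-256 based keystream stands in for a real cipher, chunk indices wrap around so that the first chunks also have predecessors, the pad for chunk $n$ is derived from the hash of the chunk two positions earlier, and all function names are hypothetical. It is not the algorithm of the SAFE code base and is not secure.

```python
import hashlib


def _keystream(seed: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def _xor(data: bytes, pad: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, pad))


def self_encrypt(data: bytes, chunk_size: int):
    """Return (data_map, stored): the data map and the obfuscated chunks."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if len(chunks) < 3:
        raise ValueError("self-encryption needs at least three chunks")
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    data_map, stored = [], {}
    for n, chunk in enumerate(chunks):
        # Encrypt with the hash of chunk n-1 (negative indices wrap around).
        encrypted = _xor(chunk, _keystream(b"enc" + hashes[n - 1], len(chunk)))
        # Obfuscate with a pad derived from chunk n-2.
        obfuscated = _xor(encrypted, _keystream(b"pad" + hashes[n - 2], len(chunk)))
        address = hashlib.sha256(obfuscated).digest()
        stored[address] = obfuscated
        # The data map links the stored chunk's hash to the real chunk's hash.
        data_map.append((address, hashes[n]))
    return data_map, stored


def self_decrypt(data_map, stored) -> bytes:
    hashes = [plain_hash for _, plain_hash in data_map]
    out = []
    for n, (address, plain_hash) in enumerate(data_map):
        obfuscated = stored[address]
        encrypted = _xor(obfuscated, _keystream(b"pad" + hashes[n - 2], len(obfuscated)))
        chunk = _xor(encrypted, _keystream(b"enc" + hashes[n - 1], len(obfuscated)))
        if hashlib.sha256(chunk).digest() != plain_hash:
            raise ValueError("chunk does not match the data map")
        out.append(chunk)
    return b"".join(out)
```

Only the holder of the data map can reverse the process, since the keys are the hashes of the plaintext chunks; the final hash check illustrates why data is only restorable if it is the right data.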

D. Storj

Storj [16] is a P2P storage network. The discussed version is 3.0. It concentrates on high durability of data, low latency, and high security and privacy for stored data. End-to-end encryption for communication, file locations, and files is supported. For high durability of files, or in other words better availability of files in the network, Storj uses erasure codes. Furthermore, low bandwidth consumption is also one main design goal. The protocol assumes object sizes on the order of megabytes or more; smaller objects are supported, but storing them could be less efficient. In Storj, decentralization is interpreted as no single operator being solely responsible for the operation of the system. In a decentralized system, trust and Byzantine failure assumptions are important. Storj assumes that there are no altruistic, always well-behaved nodes, that a majority of nodes are rational and behave maliciously only when they profit, and that a minority are Byzantine malicious nodes.

Storj aims to be a decentralized cloud storage. Storj Labs Inc. wants to provide an alternative to centralized storage providers. For this purpose, Storj provides compatibility with the Amazon S3 application programming interface to increase the general acceptance and ease the migration for new users. Since Storj provides cloud storage, users are allowed to store and retrieve data as well as delete, move, and copy data.

To ensure the cooperation of the rational nodes, Storj provides an incentive system. The incentive system rewards storage nodes for storing and providing content. Nodes are monitored with audits and evaluated via a reputation system.
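The principle of the erasure codes that Storj relies on, namely that any $k$ of $n$ pieces suffice to rebuild an object, can be illustrated with a small Reed-Solomon style sketch. It works over the prime field with 257 elements and uses Lagrange interpolation; the field choice, the striping, and the function names are assumptions for illustration and do not reflect Storj's implementation.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, k: int, n: int) -> dict:
    """Split data into stripes of k bytes and return n pieces (requires k <= n < P).

    Piece x holds, for every stripe, the value at x of the polynomial that
    passes through the stripe's bytes at positions 1..k.
    """
    if len(data) % k:
        raise ValueError("pad the data to a multiple of k first")
    pieces = {x: [] for x in range(1, n + 1)}
    for start in range(0, len(data), k):
        stripe = [(x, data[start + x - 1]) for x in range(1, k + 1)]
        for x in pieces:
            pieces[x].append(_interpolate(stripe, x))
    return pieces


def decode(available: dict, k: int) -> bytes:
    """Rebuild the data from any k available pieces."""
    xs = sorted(available)[:k]
    if len(xs) < k:
        raise ValueError("not enough pieces to reconstruct the data")
    out = bytearray()
    for s in range(len(available[xs[0]])):
        points = [(x, available[x][s]) for x in xs]
        out.extend(_interpolate(points, x) for x in range(1, k + 1))
    return bytes(out)


pieces = encode(b"peer-to-peer", k=4, n=7)
survivors = {x: pieces[x] for x in (2, 5, 6, 7)}  # three pieces were lost
print(decode(survivors, k=4))
```

With $k = 4$ and $n = 7$ the storage overhead is $7/4$ of the original size, yet any three pieces may be lost; storing full copies would need four replicas to tolerate the same number of losses.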

Discussion:

Storj employs some concepts that are unique when compared to other P2P data networks. The Amazon S3 compatibility might promote the decentralized storage system. The erasure codes add overhead to storing files, but during a file retrieval only the necessary number of pieces needs to be downloaded. Storj uses Reed-Solomon erasure codes [66]. A $(k, n)$ erasure code typically encodes an object into $n$ pieces in such a way that only $k$ pieces are necessary to recreate the object. Storj chooses four values for each object: $k$, $m$, $o$, and $n$. Here, $k$ represents the minimum number of pieces required to reconstruct the data, $m$ is a buffer for repair, $o$ is a buffer for churn, and $n$ is the total number of pieces. Erasure codes provide a higher redundancy with less overhead compared to storing the pieces multiple times. The decentralization of storage through the erasure codes, together with adequate storage node selection and the help of a reputation system, increases the protection against data breaches.

Storj has mainly two node types, satellite and storage nodes. The satellite nodes administrate the storage process and the maintenance of files. The encryption of meta data and even file paths adds an additional protection of meta data. However, satellite nodes are important parts of the network and partition the network, since files available at one satellite are not available at another satellite. This promotes centralization in the form of the satellite. While satellites cannot share the meta data with possible third parties due to the encryption, it is still possible to leak access patterns.

While Storj is deployed and can indeed be used, applications and research on the topic are rather rare. De Figueiredo et al. [67] analyzed the Storj network and identified the satellite nodes as possible vectors for Denial-of-Service attacks. They modified the implementation of the storage node's connection handling and successfully took down a satellite node, rendering payment and file retrieval impossible for some time. Another study showed a different attack on data networks. Zhang et al. [68] showed, in Storj v2.0, the possibility to upload unencrypted data to storage nodes, which can be used to frame owners of storage nodes. Nonetheless, Storj's provided privacy guarantees, resilience, acquirable meta data, or the possibility for everyone to deploy the different nodes could provide valuable insights for cloud storage.

E. Arweave

The Arweave protocol [17] utilizes a blockchain-like structure, a blockweave, to provide a mechanism for permanent on-chain data storage as well as payment for storage. In the blockweave, a block points to its direct predecessor and to a recall block, which is deterministically chosen based on the information of the previous block. While the weave is immutable and provides censorship-resistance of its data, every node can decide to refuse accepting content. If a sufficiently large number of nodes refuses content, the inclusion of unwanted content is prevented. Arweave utilizes Wildfire, a protocol similar to BitTorrent's tit-for-tat, to rank nodes, reducing communication latencies in the network.

Arweave aims to provide eternal permanent storage of data, preserving and time-stamping information in an immutable way. The data is stored on-chain on the blockweave and is therefore immutable and only removable through forking the weave. The blockweave provides decentralized storage for the permaweb.

Storage and maintenance of the blockweave and its data is ensured through Arweave's cryptocurrency: Arweave tokens. The tokens are used for rewarding miners and as payment for sending transactions.
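How a recall block can be chosen deterministically from the previous block is sketched below. The exact derivation used by Arweave is not given here, so hashing the previous block hash and reducing it modulo the current height is an assumption of this sketch, as is the function name.

```python
import hashlib


def recall_block_index(previous_block_hash: bytes, current_height: int) -> int:
    """Pick the index of the recall block from the previous block's hash.

    Assumption: the hash is re-hashed and reduced modulo the number of
    existing blocks, so every node derives the same index.
    """
    if current_height < 1:
        raise ValueError("the weave must contain at least one block")
    digest = hashlib.sha256(previous_block_hash).digest()
    return int.from_bytes(digest, "big") % current_height


print(recall_block_index(b"\x00" * 32, 1000))
```

Because the index depends on a hash that is unknown before the previous block exists, a miner cannot predict which old block it will need and is therefore better off storing as many blocks as possible.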

Discussion:

The Arweave protocol provides on-chain storage on a blockchain-like structure. This gives the storage advantages and disadvantages similar to those of a blockchain. Arweave provides time-stamping, transparency, incentives, and immutable storage. The data is stored through transactions, so authors of data are pseudonymous.

One of the biggest problems of blockchains is scalability. Arweave tries to reduce these problems by utilizing blockshadows, a mechanism similar to compact blocks, explained in Bitcoin Improvement Proposal 152 [69], and Wildfire for fast block propagation, reducing fork probability. Furthermore, the usage of the Block Hash List and the Wallet List should reduce the initial cost of participation. With version 2.0 Arweave introduced a hard fork to improve scalability, decoupling data from transactions. Instead of including the data in the transaction, a Merkle root of the data is included. This improves transaction propagation speed, since the data is no longer necessary to forward the transaction.

Due to the pseudo-random recall block, nodes are incentivized to store many blocks to maximize their mining reward. This increases the replication of data. However, not every node necessarily stores every block or content; every node decides for itself, based on content filters, which data it stores. Requesting content might become complicated, since nodes are requested opportunistically in the hope that they store the content.

Research directly about Arweave is sparse at best. However, this can be explained by the broad range of emerging blockchain-based protocols, and research about blockchains can be at least partly applied to Arweave.
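The effect of committing to data with a Merkle root instead of embedding the data can be sketched as follows. The tree layout (SHA-256, duplicated last node on odd levels, distinct prefixes for leaves and inner nodes) is an assumption for illustration and not Arweave's exact construction.

```python
import hashlib


def merkle_root(chunks) -> bytes:
    """Compute a Merkle root over a non-empty list of data chunks."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Distinct prefixes keep leaf hashes and inner hashes apart.
    level = [hashlib.sha256(b"\x00" + chunk).digest() for chunk in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


root = merkle_root([b"chunk-0", b"chunk-1", b"chunk-2"])
print(len(root), root.hex())
```

The transaction then only carries the 32-byte root regardless of the data size, while any later change to a chunk yields a different root and is detectable.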

F. Honorable Mentions and Related Concepts

Next to our detailed overview of select P2P data networks, we provide additional literature on other systems and concepts concerning the current generation of P2P data networks. In particular, there are some paper concepts providing different and interesting ideas for P2P content sharing.

Sia [70] aims to be a decentralized cloud storage platform. A file is split into chunks, which are encrypted and then stored via erasure coding on multiple storage nodes. The location of chunks is stored as metadata. Sia uses a blockchain to incentivize storage and retrieval of data. The conditions for and duration of storing the data are fixed in storage contracts. The data owner is responsible for file health.

Fukumitsu et al. [71] propose a peer-to-peer-type storage system, where even meta-data, necessary for reconstructing the stored files, is stored in the network and can be retrieved with an ID, a password, and a timestamp. The authors assume an unstructured P2P network where each node can offer different services. Nodes regularly broadcast necessary information about themselves, e.g., offered services and their IP address. An important component of the scheme are storage node lists stored on a blockchain. The storage node list is a randomly ordered list of selected nodes offering storage services. Data is stored in parts and the storage process is split into two phases: storing user data and storing data necessary for reconstructing user data. User data is encrypted, divided into parts, and the parts are stored on nodes selected from the currently available storage nodes. The parts can be requested using restore keys. For reconstructing user data the decryption key and pairs of storage node and restore keys are necessary. Therefore, the data is replicated on other nodes. A user creates an ID and password pair and selects a storage list. The data is encrypted with the hash of ID, password, and storage list. Storage nodes are chosen deterministically from the storage list. The restore key for the parts is the hash of the storage list and the hash of a piece index, the ID, and the password. This scheme allows fetching data without storing information on the user device.

Jia et al. [72] propose OblivP2P, a mechanism implementing ideas from oblivious RAM to hide data access patterns. While the authors mention that their mechanism is applicable to other peer-to-peer systems, they focus on a BitTorrent-like system with a tracker.

Qian et al. [73] propose Garlic Cast, a mechanism for improving anonymity in an overlay network. Peers do not request and search content directly. Instead, a peer searches for proxies and the proxies exchange and request the content. Messages between a peer and its proxy are exchanged via a security-enhanced information dispersal algorithm (IDA). An IDA is a form of erasure coding where $k$ of $n$ pieces are sufficient to reconstruct the object. The security-enhanced IDA first encrypts a message, splits the message and key into $n$ fragments with a $k$-threshold IDA, and sends cloves, i.e., messages containing a key and a message fragment. Proxies are discovered via random walks: cloves requesting peers to be a proxy are sent to neighbors with a random clove sequence number, and each neighbor randomly forwards the clove and maintains the state of successor and predecessor. A peer with two cloves with the same sequence number can recover the request and, if it volunteers to be a proxy, returns a reply to the requester.

Other paper concepts utilize a blockchain for access control and to store data locations instead of using it as a supplementary incentive mechanism, e.g., Blockstack [74], which maintains meta-data on the blockchain and relies on external data stores for the actual storage of data. There are also concepts using distributed ledger technologies for access control, e.g., Calypso [75], which uses a skipchain-based identity and access management allowing auditable data sharing. However, these systems and systems concentrating only on selling data via the blockchain are outside of the scope of this survey.

TABLE II: Summary of the building blocks.

| Category | BitTorrent | IPFS/Filecoin | Swarm | Hypercore | SAFE | Storj | Arweave |
|---|---|---|---|---|---|---|---|
| Network | | | | | | | |
| Topology | Unstructured | Hybrid | Kademlia | Unstructured | Kademlia | Kademlia | Unstructured |
| File Handling | | | | | | | |
| File Look-up | DHT, Central | DHT, Opportunistic | DHT | DHT | DHT | Central | Opportunistic |
| Storage | File | Blocks | Chunks | Files | Chunks | Segments | Files |
| Storage Location | Random | Random | Addressed | Random | Addressed | Random | |
| File Replication | Passive | Passive, Caching | Active/Passive, Caching | Passive | Active, Caching | – | Passive |
| Information Security | | | | | | | |
| Confidentiality | – | – | Manifests | Public-key | Self-authentication | Satellite nodes | – |
| Integrity | Meta-data file | Content-addressing | Content-addressing | Meta-data file | Content-addressing, self-encryption | Satellite nodes | Blockweave |
| Availability | Replication, Incentives | Replication, Incentives | Replication, Erasure Codes, Incentives | Replication | Replication, Incentives | Erasure Codes, Incentives | Replication, Incentives |
| Incentivization | | | | | | | |
| Upload | Free | Free | Charge | Free | Charge | Free | Charge |
| Reward (Storing) | – | For Time | For/Over Time | – | – | For Time | Over Time |
| Punish (Storer) | – | Misbehavior | Misbehavior | – | – | Misbehavior | – |
| Chunk/File Trade | Monitor | Monitor | Monitor | – | – | Monitor | Monitor |
| Retrieval Only | Charge (optional) | Charge (optional) | Charge imbalance | – | Reward | Charge | – |

VI. DISCUSSION OF BUILDING BLOCKS

After gaining an initial understanding of each system, we take a closer look at all systems, identifying similarities and distinct differences. In this discussion, we also include BitTorrent as a prominent example from a previous generation of data networks. By comparing these systems and reviewing literature on the topic, we identify building blocks and open challenges in P2P data networks. In particular, we identified the areas network architecture, file handling, information security, and incentivization as the most relevant technical aspects. In the following, we take these building blocks and derive a taxonomy. In TABLE II, we provide a summary of the building blocks.

A. Network Architecture

Each of the considered data networks builds an overlay network to communicate with other peers. While many ways exist to organize an overlay network [3, 5], we clearly see a dominance of Kademlia [10]. With the exception of Arweave, each network uses a Kademlia-based DHT one way or another; if not for the overlay network itself then at least for peer discovery.

Despite using Kademlia, the networks are organized differently upon closer inspection. IPFS, Swarm, and SAFE use the DHT also to structure the network. SAFE, however, additionally separates the network into sections, where each section organizes itself with so-called elders. Swarm creates a Kademlia topology, where the identity directly decides the neighbors. SAFE and Swarm can therefore be classified as structured overlay networks. While IPFS also uses a DHT, a peer connects to every peer it encounters until the number of connections exceeds a certain limit [32], which basically leads to an unstructured overlay network. Yet, IPFS also has structured components, which make use of the DHT. Storj uses the DHT to learn about peers. Regardless, each storage node decides how many resources it provides to a satellite and with which satellite it cooperates. Furthermore, cooperation between satellites and storage nodes is controlled with a reputation system for satellites and storage nodes. In BitTorrent and Hypercore, the DHT does not influence the neighbor selection, leading to an unstructured overlay. In BitTorrent, the connections between the peers are decided based on tit-for-tat.

Arweave is an exception as it does not use a DHT at all. Arweave uses a gossip protocol similar to Bitcoin, where peers announce their neighbors and known addresses. Concerning network organization, Arweave has no strict structure for its neighbor selection, although it uses Wildfire, a tit-for-tat based mechanism to rank peers and drop connections from unresponsive/unpopular peers.

An overview of the presented categorization with respect to the network architecture is provided in Fig. 5.

Fig. 5: Overview of the different network architectures. Unstructured: Arweave, Storj, Hypercore, IPFS, BitTorrent. Structured: Chord; Kademlia (IPFS, Swarm, SAFE); Pastry.
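The notion of closeness that Kademlia-based overlays build on is the XOR metric between node identifiers. The following minimal sketch shows the distance, the bucket a peer falls into, and the selection of the closest known peers; the function names and the tiny 4-bit identifiers are chosen for illustration only.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two identifiers."""
    return a ^ b


def bucket_index(own_id: int, other_id: int) -> int:
    """Index of the routing bucket for other_id: position of the highest differing bit."""
    distance = xor_distance(own_id, other_id)
    if distance == 0:
        raise ValueError("a node does not keep itself in a bucket")
    return distance.bit_length() - 1


def closest(target: int, peers, count: int):
    """Return the `count` known peers that are closest to the target identifier."""
    return sorted(peers, key=lambda peer: xor_distance(peer, target))[:count]


peers = [0b1001, 0b0000, 0b1100]
print(closest(0b1000, peers, count=2))  # the two peers nearest to 1000
```

In a structured overlay such as Swarm, this distance also determines the neighbors of a node, whereas unstructured networks use it only to locate peers or content records.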

B. File Handling

The file handling is another core component of a data network and clearly more diverse than the network organization. We provide an overview of our taxonomy in Fig. 6, which we divide into storage and file look-up mechanisms.

A common pattern with respect to storage is that in each data network, immutable files or at least immutable data blobs are preferred. Mutability and intentional deletion of files is a feature rather than the default.

Depending on the respective protocol, the files are split into pieces either during the exchange (BitTorrent, Hypercore) or the file is stored in pieces located on potentially different devices. Splitting files into pieces increases the storage overhead due to additional meta data. At the same time, though, it improves the retrieval process in case of large files. Arweave does not split files into pieces. Instead, it uses transactions to store files, which become part of a block in the blockweave.

Fig. 6: Overview of file storage and look-up mechanisms. Look-up: Opportunistic (Arweave, IPFS); Kademlia (BitTorrent (trackerless), IPFS, Swarm, Hypercore, SAFE); Central Components (BitTorrent (tracker), Storj). Storage: Chunked/Pieces, either Content-Addressed (Swarm, SAFE) or Random (IPFS, Swarm, Storj); File-based (Arweave, BitTorrent, Hypercore).

While chunking is in general a common feature, the storage is irregular. BitTorrent and Hypercore concentrate more on exchanging data than on using the network to store data on their behalf. This results in a high probability of all chunks being present on one device. The storage is rather file-based since the aim is the possession of all chunks to possess the file.

IPFS and Swarm split the files into pieces and build a Merkle Tree/DAG. The root is then sufficient to retrieve the file. Each piece can be addressed and retrieved by itself and individually stored on separate nodes. In IPFS, the location of chunks is "random" in the sense that each node can determine by itself if it stores a certain chunk. In Swarm, a chunk's storage location is tied to its address. However, similar to IPFS, other nodes can also decide to additionally store chunks.

SAFE splits the files into chunks and encrypts the chunks with each other. Similar to Swarm, a chunk is content-addressed and the content decides the storage location.

Storj splits the files into erasure-encoded pieces, reducing the required trust in single nodes. The storage location of the pieces is decided randomly and distributed over the available storage nodes cooperating with the responsible satellite node.

The chunking of files also influences the look-up process. The request is either referencing a chunk/file directly or a chunk pointing to other chunks. The chunks are in general retrieved from neighbors. The request to neighbors can be directed or random via a broadcast. In case of Arweave and IPFS, the file look-up can be considered opportunistic as peers are queried without knowledge about the peers' possession of the chunks/file. In Storj, a central component is available to send direct requests. In the other data networks, however, peers utilize a DHT for the look-up. In IPFS, the DHT is used as a backup look-up if the opportunistic request fails. Since in BitTorrent and Hypercore the overlay network deals with a specific file or a group of files, we have to differentiate here: a neighbor is expected to possess at least part of a file. Therefore, the peer discovery can be considered as a directed request. To this end, BitTorrent uses either a central component (i.e., a tracker) or a DHT (i.e., trackerless). Hypercore uses a DHT.

C. Information Security

Confidentiality, integrity, and availability (CIA) are important aspects of information security. These aspects provide additional challenges and gain additional importance in the distributed setting of data networks. In a distributed system where data is potentially stored on different unsupervised devices, it is hard to protect the data or control access to data. Since the data comes from many untrusted devices, the integrity needs to be guaranteed. We can generally expect improved availability, e.g., due to the redundant storage and distribution of data. However, considering availability as long-term file persistence remains a challenge. Any node could delete content and arbitrarily join or leave the network, which results in files becoming unavailable.

Keeping content and meta-data confidential from other participants is difficult in a distributed environment. Even nodes storing data are possible information leaks. Encryption is the main instrument to protect the data in distributed systems. The encryption prevents other parties from reading the content of files despite fetching or storing the data. An additional protection against storage nodes is chunking of files. By chunking the file and ideally distributing the chunks on different nodes, a storage node is unable to identify content. Swarm, SAFE, and Storj distribute the chunks during the storage process. In the other data networks, the distribution is less prominent, or in case of Arweave not present at all.

Another aspect which protects the content of data is access control. Access control in the presented data networks is mostly realized through distributing decryption keys. The exchange of the decryption key is mainly handled by the concerned parties directly outside of the data network. BitTorrent, IPFS, and Arweave employ no additional access control. However, some data networks also provide additional mechanisms. In Storj, satellite nodes verify and authorize access requests. Data access is additionally restricted by satellites, since a satellite cannot grant access to data submitted to another satellite. SAFE uses self-authentication to authenticate access to private data. Swarm provides access control through so-called manifests. In Hypercore, it is necessary to know the public key of the directory for discovering peers and decrypting the communication. This provides an additional distinction between write and read access.

For the integrity of data, it is possible to rely on and trust the data provider. However, in a distributed system it is hard to trust all peers. The presented data networks utilize hash functions to ensure integrity. The hash value has to be known in advance and therefore might require out-of-band communication. Given a hash and the algorithm used for the hash, content can be verified by regenerating the hash and comparing it with the given hash.

The usage of hash functions differs between the networks. In BitTorrent and Hypercore, the hash is provided by a file containing meta-data. IPFS, Swarm, and SAFE use the hash for content-addressing, meaning the content decides the address and content is retrieved by its address. Therefore, the acquired data can be directly verified. Additionally, SAFE uses self-encryption, where data is only restorable if it is the right data. Storj relies on the satellite nodes, which perform random audits on storage nodes utilizing hashes. Furthermore, satellite and storage nodes are evaluated with a reputation system to increase their credibility. In Arweave, data is stored in a blockweave, which is similar to a blockchain. Each block confirms its predecessor by including a hash pointer and therefore provides data integrity.

Due to node failure or maintenance, nodes can become unavailable, eventually decreasing the availability of stored chunks. Therefore, to improve availability, multiple copies of chunks might be required. Long-term availability is a serious problem of P2P systems in general. The availability of content can be increased through active, passive, and cache-based replication. In Fig. 7, we provide an overview of the different availability mechanisms used by data networks. Popular content profits from cache-based replication, which can happen naturally through requests and as an optimization. Next to replication, erasure codes can also increase the availability. While they introduce a per-chunk storage overhead, files and missing chunks can be reconstructed without acquiring all chunks. Incentive mechanisms can improve replication mechanisms and ensure redundancy through monetary means. Note that we discuss incentivization in a separate section.

Fig. 7: Overview of availability mechanisms. Erasure Codes: Swarm, Storj. Replication: Passive (BitTorrent, IPFS, Swarm, Hypercore, Arweave); Cache-based (IPFS, Swarm, SAFE); Active (Swarm, SAFE). Incentives: see Fig. 8.

BitTorrent and Hypercore rely only on passive replication and therefore on volunteers hosting files. Arweave's blockweave utilizes passive replication, ensuring replicas of blocks, and therefore of the content, on the participants. However, every node can decide which content it stores based on its content policies. This means that not all content is available on all nodes. IPFS uses cache-based replication in addition to the passive replication through pinning of chunks. SAFE uses cache-based replication and has data managers which are responsible for actively maintaining a few redundant copies of chunks. Storj uses erasure codes instead of replication, providing a certain safety margin against segment loss. Furthermore, the satellite nodes are responsible for auditing storage nodes and repairing files as necessary. Swarm utilizes four methods: erasure codes, passive replication through pinning, cache-based replication, and active replication with the nearest neighbor set.

Fig. 8: Overview of different incentive mechanisms (data networks marked with an asterisk do not use monetary incentivization in this category). Exchange: Retrieval (BitTorrent Token, Filecoin, SAFE, Storj, Arweave*); Trade (BitTorrent*, IPFS*, Swarm). Storage: Reward (Storer) for Continuous Storage (Arweave, Storj, Swarm) or for a Storage Time Period (Swarm, Filecoin); Punish (Storer) (Storj, Filecoin, Swarm); Charge (Upload) (Swarm, SAFE, Arweave).
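The integrity check enabled by content-addressing, as described in this section, can be sketched with a small in-memory store. SHA-256 as the hash function and the class name `ContentStore` are assumptions for illustration; the surveyed networks use their own hash functions and address formats.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the address of a blob is its hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Regenerate the hash and compare it with the requested address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
address = store.put(b"hello data network")
print(address, store.get(address))
```

Since the requester already knows the address, it can verify data received from any untrusted peer without trusting the provider.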

D. Incentivization

Incentives are crucial in open/public P2P networks to moti-vate compliant behavior. Otherwise, we have to rely on altru-ism and benign peers. In the presence of “selﬁsh” or maliciouspeers, this however might lead to an deteriorated data network.Most of the presented data networks employ some kind ofincentive mechanism. An exception is Hypercore, which doesnot employ an incentive mechanism and is excluded from thefollowing observation. An overview of the different incentivemechanisms is provided in Fig. 8.One aspect of the incentive mechanism is compensation.While actions can be rewarded or punished with preferentialtreatment or depriving services, the data networks employ their own additional compensation methods. The compensation canbe considered as a monetary incentive. The data networks usecryptocurrencies or crypto-tokens, which can be earned byor used to pay for services. In BitTorrent, BitTorrent Tokensupplements the service. The BitTorrent Token [43] is a TRC-10 utility token of the TRON blockchain [76]. IPFS itself doesnot employ a currency. But it uses Filecoin [51] to complementits protocol to incentivize data reliability/availability. Likewise,the other data networks use a cryptocurrency or token one wayor the other to compensate services. Speciﬁcally, Swarm usesEthereum (ether) [54, 77], SAFE uses Safecoins [61], Storjuses ERC-20 STORJ tokens [16, 78], and Arweave [17] usesits own cryptocurrency.Another aspect is the purpose of the incentive mechanism.We observe two different incentive purposes: promoting partic-ipation and increasing availability. Participation is stimulatedby regulating content retrieval. In all presented data networks,peers keep track of the exchanged data. 
They can be further differentiated into trade relationships, where the received and sent data are compared, and one-sided observations, where peers are evaluated based on retrieved data.

Except for SAFE, all presented data networks use reputation or monetary incentives to prevent free-riding and promote active cooperation. SAFE has a reputation system, and a certain reputation is necessary to be an active participant in decisions. However, concerning the exchange of files, while SAFE rewards peers for answering requests, it neither punishes peers for slow responses nor charges clients for reading/consuming bandwidth. BitTorrent, IPFS, and Swarm compare sent and received data. BitTorrent punishes unresponsive, free-riding peers by disconnecting from these peers, refusing further service. Additionally, the BitTorrent Token can be used to compensate peers that offer chunks. Swarm similarly punishes uncooperative peers, to which data is only sent but from which none is ever received, by disconnecting them. However, Swarm also allows rebalancing the scale by issuing cheques to peers, compensating for a lack of sent pieces. In IPFS, the Bitswap protocol ranks peers based on sent and received data. Additionally, in Filecoin, content retrieval is charged and peers providing the content are compensated with filecoin. Arweave monitors the responsiveness of peers and ranks them, rewarding high-ranking peers with preferential treatment. In Storj, satellite nodes compensate storage nodes for the provided bandwidth. Storj does not compensate the storage node directly and instead accumulates the used bandwidth.

It is interesting to note that the compensation of file retrievals in Filecoin, Swarm, and Storj is similar to a payment channel [79, 80], i.e., a bilateral channel between two peers used to exchange (micro-)payments instantaneously. Payment channels are backed by a cryptocurrency but do not require every update to be committed to the blockchain and therefore promise improved scalability.
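The payment-channel idea can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions, not the implementation of Filecoin, Swarm, or Storj: the names `Voucher` and `PaymentChannel` are hypothetical, vouchers are assumed to carry a cumulative amount, and signatures as well as the ledger itself are omitted, so the counter `on_chain_transactions` merely stands in for blockchain interactions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Voucher:
    """Off-chain promise that the payer owes the payee `total` in aggregate."""
    seq: int    # strictly increasing, so a newer voucher supersedes older ones
    total: int  # cumulative amount owed, not a per-piece increment


class PaymentChannel:
    """Toy bilateral channel (hypothetical; no signatures, no real ledger)."""

    def __init__(self, deposit: int) -> None:
        self.deposit = deposit          # locked up front by the payer
        self.latest = Voucher(seq=0, total=0)
        self.on_chain_transactions = 1  # the opening deposit

    def pay(self, amount: int) -> Voucher:
        """Issue a new voucher, e.g., for one retrieved piece of a file."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        total = self.latest.total + amount
        if total > self.deposit:
            raise ValueError("voucher would exceed the locked deposit")
        self.latest = Voucher(self.latest.seq + 1, total)
        return self.latest  # exchanged between the two peers only

    def settle(self) -> tuple[int, int]:
        """Cash in the latest voucher; returns (paid to payee, refund)."""
        self.on_chain_transactions += 1
        return self.latest.total, self.deposit - self.latest.total
```

In this model, ten calls to `pay(3)` on a channel with a deposit of 100 followed by one `settle()` yield `(30, 70)` with only two ledger interactions in total, which is the scalability argument made above.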
Filecoin uses payment channels for the retrieval process: files are retrieved in small pieces and each piece is compensated. Swarm’s chequebook contract behaves similarly to a payment channel, where off-chain payments can be cashed in at any point in time. In Storj, the bandwidth is monitored by allocating a pre-determined amount of bandwidth.

The availability of files also benefits from participation. By compensating file retrieval, nodes gain an incentive to cache files and answer requests. However, long-term availability is also important. Additionally, storing data on other devices might require an additional incentive for peers to accept the content. Therefore, the incentive mechanisms of some data networks focus on rewarding and punishing storage nodes.

IPFS’s Filecoin, Swarm, Storj, and Arweave reward nodes for storing data. The reward is either for storing the data for a specific time period or for storing it over time without a defined limit. In the first case, the time period is defined and nodes are pre- or postpaid; misbehaving storage nodes are then punished or not compensated. In IPFS’s Filecoin, users rent specific storage for a time period. In Swarm, storage guarantees are sold. Swarm, Storj, and Arweave reward nodes for storing data over a long time without defined time constraints. In Swarm, storage nodes can participate in a lottery if they store certain chunks and might be rewarded for the continued storage. In Storj, storage nodes are compensated in time intervals for the data they stored during the interval; in case of storage failures, the reward is instead used for file repair, compensating the new nodes. In Arweave, the network is paid to store data for the long term. When a node creates a new block, proving storage of data, the node is compensated for its continued provision of storage capacity.

Punishment of nodes is used to guarantee storage in case of prepaid storage. If a node breaks its storage promises, it loses funds.
A missed audit in Filecoin or a failure to prove storage in Swarm reduces an escrow deposit of the storage node. In Storj, part of the payment to new storage nodes is held in escrow until the storage nodes have gained enough reputation. The escrow is kept if the node leaves the network too early. In Arweave, instead of being punished, nodes can no longer be rewarded if they stop storing blocks.

SAFE and Swarm charge for the initial upload of data. However, this is a protection against arbitrary uploads rather than an increase in availability. Swarm finances the lottery with the upload fee. In Arweave, the upload of data is paid for with transaction fees. Part of the fees goes to the miner and part is kept by the network.

VII. RESEARCH AREAS AND OPEN CHALLENGES

The previous generation of data networks had different network architectures, structured and unstructured, and used incentive mechanisms mainly to promote cooperation and prevent uncooperative behavior, e.g., free-riding, mainly with reputation systems [8]. Other incentive structures were also explored. The next generation mainly uses Kademlia-based architectures and employs incentive structures to increase availability and long-term persistence.

The previous generation already faced some challenges, which still apply to the next generation data networks. In 2005, Hasan et al. [7] identified certain challenges that peer-to-peer systems have to overcome to gain acceptance for real-life scenarios. These include deployment, naming, access control, DDoS attack protection, preventing junk data, and churn protection. We observe that the next generation data networks address these problems and provide possible solutions. However, the degree of maturity, the interaction with other mechanisms, and the adoption rate need more consideration.

In the literature review for the search of current generation data networks, we found a large body of literature utilizing or analyzing IPFS. Analyses of other systems are sparse at best. One reason could be a lack of actual deployment, a small user base, or a lack of implementation. Another reason, which this survey tries to address, is in our opinion a lack of concise and structured documentation. Some of the presented systems make it hard to get into the system, to understand the concepts, and to show that the system is valid.

We observe five main challenges of data networks, which provide new opportunities for research: performance, confidentiality and access control, security, anonymity, and naming. An overview of existing research can be found in TABLE III.

TABLE III: Overview of research on data networks.

Paper   System   Short Description

Performance and Structure
[26]    IPFS     Read and write performance
[27]    IPFS     Cluster IoT data sharing
[28]    IPFS     Enhancing with ICN
[29]    IPFS     Meta-data storage on blockchain
[30]    IPFS     On mobile devices
[32]    IPFS     Network mapping
[33]    IPFS     Network crawler
[34]    IPFS     Improving Bitswap

Confidentiality and Access Control
[20]    IPFS     Blockchain-based, encryption
[21]    IPFS     Blockchain-based, modified client
[22]    IPFS     Blockchain-based, modified application
[23]    IPFS     Blockchain-based, encryption
[81]    IPFS     Delegated content erasure

Security
[25]    IPFS     Using for malware
[31]    IPFS     Eclipse attack
[64]    SAFE     CIA and possible attacks
[65]    SAFE     Security analysis
[67]    Storj    Denial-of-Service attack
[68]    Storj    Storing unencrypted data

A. Performance

A research direction that some researchers already pursue is the performance of the systems. Investigating performance, i.e., read/write times, storage overhead, file look-up, and churn resistance, through simulations or tests can be used to identify new use cases and to fortify claims that a system might replace centralized counterparts. IPFS developed “Testground” for testing and benchmarking P2P systems at scale. In that sense, the performance of Testground and its ability to replicate real systems is also an area worthy of research. Other research analyzes the performance of IPFS, e.g., the read and write latency [26, 29], using IPFS Cluster for Internet of Things data sharing [27], improving the system [28, 34], or analyzing the network [32, 33]. Heinisuo et al. [30] showed that IPFS needs improvement to be used on mobile devices because high network traffic drains the battery. Research concerning IPFS’s competitors is lacking.

B. Confidentiality and Access Control

The past and present generations of data networks provide some confidentiality and access control, but the systems are designed for public data rather than private data. The knowledge that nodes gain while storing data needs to be researched; this concerns not only information about the content of data but also meta-data like access patterns. The security of the existing access control needs to be investigated. There are research proposals for access control with blockchains [20, 21, 22, 23]; however, the immutability of blockchains makes this questionable for private and personal data. Another aspect concerning private data is deleting data. While it is useful for censorship-resistance to prevent the deletion of data, the possibility to delete personal, malicious, or illegal data might raise the acceptance of data networks. For example, Politou et al. [81] propose a mechanism for deleting content in IPFS. Investigating and improving the existing systems increases the trust in data networks. Increased trust in the confidentiality and the protection from unwarranted access can open these systems for storing private and personal data.

C. Security

There are also other research areas like security or the use of the systems to spread malware [25]. For security, it is important to know how a system withstands known attacks, e.g., Prünster et al. [31] show an eclipse attack on IPFS, as well as to investigate the existence of new attack vectors. For example, Storj mentions the possibility of an “Honest Geppetto” attack, where an attacker honestly operates many storage nodes for a long time, effectively controlling a large part of the storage capabilities. This control allows taking data hostage or taking down the data in general, rendering the data network inoperable. Another example is Frameup [68], where unencrypted data is stored on storage nodes, which could lead to legal issues. Storing arbitrary data might also pose a risk to the storage device. Interestingly, security is the research area where we observe research beyond IPFS.

D. Anonymity

Next to confidentiality, which concerns data security and privacy, protecting the privacy of individuals is another relevant aspect; in particular, anonymity, which describes the inability to identify an individual in a group of individuals, i.e., unlinkability [82].

With respect to anonymity, various entities can be protected in data networks: the content creator, the storage node, and the user requesting content. Among the previous generation data networks, especially Freenet [2] and GNUnet [83] focused on protecting the identity of the different entities.

Due to the incentive mechanisms and the resulting charging of individuals, it is hard to guarantee anonymity, as at least pseudonyms are required. As soon as the incentive mechanism is used, information about the requester is gained. A distributed ledger recording transactions, e.g., in Filecoin, Ethereum Swarm, and Arweave, can reveal additional information, and as a result participants are pseudonymous. When a central component authorizes requests and deals with incentivization, e.g., satellite nodes in Storj, the requester, the storage node, and the central component know each other. In case of incentivizing requests, the requesting node and storage nodes are revealed. The identity of requesters can be partly secured via forwarding strategies or proxies, e.g., in Swarm and SAFE.

The first generation had systems like Freenet, which aimed for anonymity and censorship-resistance. The anonymity of the current generation seems to fall behind that of the first generation. Despite advances in anonymous communication with mixnets or Tor [84], there are no data networks providing strong anonymity. In general, the provided anonymity guarantees and further enhancements need to be investigated. This includes the anonymity-utility trade-off and an analysis of different attacker models. Anonymity is not only important to protect the privacy of individuals, but is also important to guarantee the claimed censorship-resistance.
If the identity of storage nodes can be easily inferred, it is possible that, even though the network protects against deletion, law enforcement can enforce censorship. This is a concern especially for systems like Swarm, where the location of a stored chunk is predetermined and node identity is linked to Ethereum pseudonyms.

E. Naming

Naming, in particular providing human-readable names in a distributed system, is a known challenge. The problem and its adjacent challenges are captured by Zooko’s Triangle [85]. It describes the difficulty of building a namespace that is distributed (without a central authority), secure (clear-cut resolution), and human-readable.

In all systems, the addressing of data lacks either distribution (tracker-based BitTorrent and Storj) or human-readability (trackerless BitTorrent, Hypercore, IPFS, Swarm, and SAFE). BitTorrent is a good example: the tracker is a central authority, and in the case of trackerless BitTorrent the human-readable torrent is addressed with the not so readable infohash (hash of the torrent). In v3.0 of Storj, the satellite is a central component.

The lack of human-readability is a result of self-authenticating data, where the data determines the address or the name of the data. If the data is changed, the address changes. Therefore, human-readability is supported through a different mechanism, a naming scheme independent of the content. An exception is Hypercore. In Hypercore, the data group is bound to the public key and the mutability inside the group is secured through versioning.

One solution to provide human-readability is name resolution. Name resolution allows the mapping of keys to self-authenticating content. The name resolution can provide human-readability and support for versioning of files. However, due to the possibility of updating the value and delays in propagation, one could argue that security is violated, even if the key is unique. Independent of Zooko’s Triangle, the name resolution announces content and gives ambiguous character strings meaning; it should only be used for public data, unless the name resolution provides access control.

To this end, IPFS, Swarm, and SAFE provide some kind of naming service.
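The interplay of self-authenticating addresses and name resolution described above can be sketched in a few lines. This is an illustration only: a hex-encoded SHA-256 digest is an assumed stand-in for the address formats of the real systems, and `ContentStore`, `NameRegistry`, and all method names are hypothetical rather than part of any surveyed data network.

```python
from __future__ import annotations

import hashlib


def address_of(data: bytes) -> str:
    """Self-authenticating address: derived from the content alone."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Immutable store keyed by content address (hypothetical)."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        addr = address_of(data)
        self._blocks[addr] = data
        return addr

    def get(self, addr: str) -> bytes:
        data = self._blocks[addr]
        # The requester can verify the reply without trusting the storage node.
        if address_of(data) != addr:
            raise ValueError("content does not match its address")
        return data


class NameRegistry:
    """Mutable mapping from a stable name to the current content address."""

    def __init__(self) -> None:
        self._records: dict[str, str] = {}

    def publish(self, name: str, addr: str) -> None:
        # Overwriting the record is what makes named content appear mutable.
        self._records[name] = addr

    def resolve(self, name: str) -> str:
        return self._records[name]
```

Storing a changed file yields a new address, while the name stays the same and only its record is updated. A resolver that has not yet seen the update still returns the old address, which is the propagation concern raised above.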
IPFS, in fact, provides two naming services, IPNS and DNSLink, which are used for different purposes. IPNS is used for mapping the hash of a public key to an IPFS CID, allowing mutable data. DNSLink uses DNS TXT records for mapping domain names to an IPFS address.

Swarm also provides two naming systems: single-owner chunks and ENS [59]. Single-owner chunks provide a data identification based on an owner and an identifier, providing a secure, non-human-readable key with an updatable value. The Ethereum Name Service (ENS) is similar to DNS, where a record is mapped to an address.

Swartz [86] argued that a blockchain-based name service provides all three properties of Zooko’s triangle. Anybody can register a name on the blockchain, providing decentralization; the name can be anything, providing human-readability; and the tamperproof ledger ensures unique names, providing security. Following this line of argument, systems like Namecoin, Blockstack [74], and ENS, which adopt the idea of a blockchain-based name system, were developed. Although these systems exist, except for Swarm with ENS none of the presented systems seems to provide a solution for Zooko’s triangle. However, due to the lack of transaction finality and possible blockchain forks, it could be argued that blockchain-based systems violate strong security aspects and only provide eventual security.

VIII. CONCLUSION

In this survey paper, we studied an emerging new generation of P2P data networks. In particular, we investigated new developments and technical building blocks. From our qualitative comparison, we can conclude that, except for the overlay structure, the various data networks explore different solutions with respect to file management, availability, and incentivization. In particular, explicit incentive mechanisms, e.g., using a cryptocurrency or some sort of token, seem to be ubiquitous. Since many systems combine naming services and content addressing in a distributed architecture, they have the potential to reconcile the system properties of human readability, security, and decentrality as captured by Zooko’s triangle. In general, P2P data networks have become part of the research agenda, either as a basis for other applications or as a research object itself. Yet, many challenges remain. We therefore believe that this new generation of P2P data networks provides many exciting future research opportunities.

REFERENCES

[1] S. Saroiu, P. K. Gummadi, and S. D. Gribble, “Measurement study of peer-to-peer file sharing systems,” in Multimedia Computing and Networking 2002, International Society for Optics and Photonics, vol. 4673, SPIE, Dec. 2001, pp. 156–170.
[2] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong, “Freenet: A distributed anonymous information storage and retrieval system,” in PET ’00: Proceedings of the International Workshop on Designing Privacy Enhancing Technologies: Design Issues in Anonymity and Unobservability, Berkeley, CA, USA, Jul. 2000, pp. 46–66.
[3] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in SIGCOMM ’01: Proceedings of the 2001 ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, San Diego, CA, USA, Aug. 2001, pp. 149–160.
[4] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, “A scalable content-addressable network,” in SIGCOMM ’01: Proceedings of the 2001 ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, San Diego, CA, USA, Aug. 2001, pp. 161–172.
[5] A. Rowstron and P. Druschel, “Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems,” in Middleware ’01: Proceedings of the 2001 IFIP/ACM International Conference on Distributed Systems Platforms, Heidelberg, Germany, Nov. 2001, pp. 329–350.
[6] B. Cohen, “Incentives build robustness in bittorrent,” in P2PEcon ’03: Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA, USA, Jun. 2003, pp. 68–72.
[7] R. Hasan, Z. Anwar, W. Yurcik, L. Brumbaugh, and R. Campbell, “A survey of peer-to-peer storage techniques for distributed file systems,” in ITCC ’05: Proceedings of the 2005 International Conference on Information Technology: Coding and Computing, vol. 2, Las Vegas, NV, USA, Apr. 2005, pp. 205–213.
[8] S. Androutsellis-Theotokis and D. Spinellis, “A survey of peer-to-peer content distribution technologies,” ACM Computing Surveys, vol. 36, no. 4, pp. 335–371, 2004.
[9] S. Nakamoto, Bitcoin: A peer-to-peer electronic cash system
IPTPS ’02: Proceedings of the 1st International Workshop on Peer-to-Peer Systems, Cambridge, MA, USA, Mar. 2002, pp. 53–65.
[11] B. Ahlgren, C. Dannewitz, C. Imbrenda, D. Kutscher, and B. Ohlman, “A survey of information-centric networking,” IEEE Communications Magazine, vol. 50, no. 7, pp. 26–36, 2012.
[12] J. Benet, “IPFS - content addressed, versioned, P2P file system (draft 3),” Protocol Labs, Tech. Rep., Jul. 2014.
[13] V. Trón, The book of swarm, online, v1.0 pre-release, Jun. 2020.
[14] M. Ogden, K. McKelvey, M. Buus Madsen, and Code for Science, “Dat - distributed dataset synchronization and versioning,” Dat Foundation, Tech. Rep., Jan. 2018.
[15] N. Lambert and B. Bollen, “The safe network a new, decentralised internet,” 2014.
[16] Storj Labs Inc., “Storj: A decentralized cloud storage network framework v3.0,” Storj Labs, Inc., Tech. Rep., Oct. 2018.
[17] S. Williams, V. Diordiiev, L. Berman, I. Raybould, and I. Uemlianin, “Arweave: A protocol for economically sustainable information permanence,” arweave.org, Tech. Rep., Nov. 2019.
[18] M. S. Ali, K. Dolui, and F. Antonelli, “Iot data privacy via blockchains and IPFS,” in IOT ’17: Proceedings of the 7th International Conference on the Internet of Things, Linz, Austria, Oct. 2017, 14:1–14:7.
[19] R. Norvill, B. B. F. Pontiveros, R. State, and A. Cullen, “IPFS for reduction of chain size in ethereum,” in iThings/GreenCom/CPSCom/SmartData ’18: Proceedings of the 2018 IEEE International Conference on Internet of Things and IEEE Green Computing and Communications and IEEE Cyber, Physical and Social Computing and IEEE Smart Data, Halifax, NS, Canada, Aug. 2018, pp. 1121–1128.
[20] S. Wang, Y. Zhang, and Y. Zhang, “A blockchain-based framework for data sharing with fine-grained access control in decentralized storage systems,” IEEE Access, vol. 6, pp. 38 437–38 450, 2018.
[21] M. Steichen, B. Fiz, R. Norvill, W. Shbair, and R. State, “Blockchain-based, decentralized access control for IPFS,” in iThings/GreenCom/CPSCom/SmartData ’18: Proceedings of the 2018 IEEE International Conference on Internet of Things and IEEE Green Computing and Communications and IEEE Cyber, Physical and Social Computing and IEEE Smart Data, Halifax, NS, Canada, Aug. 2018, pp. 1499–1506.
[22] S. Khatal, J. Rane, D. Patel, P. Patel, and Y. Busnel, “Fileshare: A blockchain and ipfs framework for secure file sharing and data provenance,” in MoSICom ’20: International Conference on Modelling, Simulation & Intelligent Computing, Dubai, United Arab Emirates, Jan. 2020.
[23] V.-H. Hoang, E. Lehtihet, and Y. Ghamri-Doudane, “Privacy-preserving blockchain-based data sharing platform for decentralized storage systems,” in Networking ’20: Proceedings of the 19th IFIP Networking Conference, Paris, France, Jun. 2020, pp. 280–288.
[24] J. Xu, K. Xue, S. Li, H. Tian, J. Hong, P. Hong, and N. Yu, “Healthchain: A blockchain-based privacy preserving scheme for large-scale health data,” IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8770–8781, 2019.
[25] C. Patsakis and F. Casino, “Hydras and IPFS: a decentralised playground for malware,” International Journal of Information Security, vol. 18, no. 6, pp. 787–799, 2019.
[26] J. Shen, Y. Li, Y. Zhou, and X. Wang, “Understanding I/O performance of IPFS storage: A client’s perspective,” in IWQoS ’19: Proceedings of the International Symposium on Quality of Service, Phoenix, AZ, USA, Jun. 2019, 17:1–17:10.
[27] S. Muralidharan and H. Ko, “An interplanetary file system (IPFS) based iot framework,” in ICCE ’19: Proceedings of the 37th IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA, Jan. 2019, pp. 1–2.
[28] O. Ascigil, S. Reñé, M. Król, G. Pavlou, L. Zhang, T. Hasegawa, Y. Koizumi, and K. Kita, “Towards peer-to-peer content retrieval markets: Enhancing IPFS with ICN,” in ICN ’19: Proceedings of the 6th ACM Conference on Information-Centric Networking, Macao, SAR, China, Sep. 2019, pp. 78–88.
[29] E. Nyaletey, R. M. Parizi, Q. Zhang, and K.-K. R. Choo, “Blockipfs - blockchain-enabled interplanetary file system for forensic and trusted data traceability,” in BLOCKCHAIN ’19: Proceedings of the 2019 International Conference on Blockchain, Atlanta, GA, USA, Jul. 2019, pp. 18–25.
[30] O.-P. Heinisuo, V. Lenarduzzi, and D. Taibi, “Asterism: Decentralized file sharing application for mobile devices,” in MobileCloud ’19: Proceedings of the 7th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, Newark, CA, USA, Apr. 2019, pp. 38–47.
[31] B. Prünster, A. Marsalek, and T. Zefferer, “Total eclipse of the heart – disrupting the interplanetary file system,” 2020.
[32] S. Henningsen, M. Florian, S. Rust, and B. Scheuermann, “Mapping the interplanetary filesystem,” in Networking ’20: Proceedings of the 19th IFIP Networking Conference, Paris, France, Jun. 2020, pp. 289–297.
[33] S. Henningsen, S. Rust, M. Florian, and B. Scheuermann, “Crawling the ipfs network,” in Networking ’20: Proceedings of the 19th IFIP Networking Conference, Paris, France, Jun. 2020, pp. 679–680.
[34] A. De la Rocha, D. Dias, and Y. Psaras, “Accelerating content routing with bitswap: A multi-path file transfer protocol in ipfs and filecoin,” p. 11, 2021.
[35] F. Ashraf, A. Naseer, and S. Iqbal, “Comparative analysis of unstructured P2P file sharing networks,” in ICISDM ’19: Proceedings of the 3rd International Conference on Information System and Data Mining, Houston, TX, USA, Apr. 2019, pp. 148–153.
[36] T. D. Thanh, S. Mohan, E. Choi, S. Kim, and P. Kim, “A taxonomy and survey on distributed file systems,” in NCM ’08: Proceedings of the 4th International Conference on Networked Computing and Advanced Information Management, vol. 1, Gyeongju, South Korea, Sep. 2008, pp. 144–149.
[37] H. Huang, J. Lin, B. Zheng, Z. Zheng, and J. Bian, “When blockchain meets distributed file systems: An overview, challenges, and open issues,” IEEE Access, vol. 8, pp. 50 574–50 586, 2020.
[38] N. Z. Benisi, M. Aminian, and B. Javadi, “Blockchain-based decentralized storage networks: A survey,” Journal of Network and Computer Applications, vol. 162, p. 102 656, 2020.
[39] F. Casino, E. Politou, E. Alepis, and C. Patsakis, “Immutability and decentralized storage: An analysis of emerging threats,” IEEE Access, vol. 8, pp. 4737–4744, 2019.
[40] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, “The bittorrent P2P file-sharing system: Measurements and analysis,” in IPTPS ’05: Proceedings of the 4th International Workshop on Peer-To-Peer Systems, Ithaca, NY, USA, Feb. 2005, pp. 205–216.
[41] A. R. Bharambe, C. Herley, and V. N. Padmanabhan, “Analyzing and improving a bittorrent networks performance mechanisms,” in INFOCOM ’06: Proceedings of the 25th IEEE International Conference on Computer Communications, Barcelona, Catalunya, Spain, pp. 1–12.
[42] R. L. Xia and J. K. Muppala, “A survey of bittorrent performance,” IEEE Communications Surveys and Tutorials, vol. 12, no. 2, pp. 140–158, 2010.
[43] BitTorrent Foundation, “Bittorrent (btt) white paper,” BitTorrent Foundation, Tech. Rep., Feb. 2019.
[44] V. Jacobson, D. K. Smetters, J. D. Thornton, M. F. Plass, N. H. Briggs, and R. L. Braynard, “Networking named content,” in CoNext ’09: Proceedings of the 5th ACM International Conference on Emerging Networking Experiments and Technologies, Rome, Italy, Dec. 2009, pp. 1–12.
[45] L. Zhang, A. Afanasyev, J. Burke, V. Jacobson, P. Crowley, C. Papadopoulos, L. Wang, and B. Zhang, “Named data networking,” Computer Communication Review, vol. 44, no. 3, pp. 66–73, 2014.
[46] S. Mastorakis, A. Afanasyev, Y. Yu, and L. Zhang, “Ntorrent: Peer-to-peer file sharing in named data networking,” in ICCCN ’17: Proceedings of the 26th International Conference on Computer Communication and Networks, Vancouver, BC, Canada, Jul. 2017, pp. 1–10.
[47] A. Narayanan and J. Clark, “Bitcoin’s academic pedigree,” ACM Queue, vol. 15, no. 4, p. 20, 2017.
[48] F. Tschorsch and B. Scheuermann, “Bitcoin and beyond: A technical survey on decentralized digital currencies,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 2084–2123, 2016.
[49] A. Gervais, G. O. Karame, K. Wüst, V. Glykantzis, H. Ritzdorf, and S. Capkun, “On the security and performance of proof of work blockchains,” in CCS ’16: Proceedings of the 23rd ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, Oct. 2016, pp. 3–16.
[50] B. Joseph, A. Miller, J. Clark, A. Narayanan, J. A. Kroll, and E. W. Felten, “Sok: Research perspectives and challenges for bitcoin and cryptocurrencies,” in SP ’15: Proceedings of the 36th IEEE Symposium on Security and Privacy, San Jose, CA, USA, May 2015, pp. 104–121.
[51] Protocol Labs, “Filecoin: A decentralized storage network,” Protocol Labs, Tech. Rep., Jul. 2017.
[52] P. Labs, IPFS - github, https://github.com/ipfs, Accessed: 2021-01.
[53] Ethersphere, Ethersphere - github, https://github.com/ethersphere, Accessed: 2021-01.
[54] G. Wood. (2014). “Ethereum: A secure decentralised generalised transaction ledger,” [Online]. Available: http://gavwood.com/Paper.pdf.
[55] H. P. developers, Hypercore protocol - github, https://github.com/hypercore-protocol, Accessed: 2021-01.
[56] MaidSafe, Safe network - github, https://github.com/safenetwork, Accessed: 2021-01.
[57] S. Labs, Storj labs - github, https://github.com/Storj/, Accessed: 2021-01.
[58] ArweaveTeam, Arweave - github, https://github.com/ArweaveTeam, Accessed: 2021-01.
[59] N. Johnson, Eip-137 - ethereum domain name service - specification, https://eips.ethereum.org/EIPS/eip-137, Accessed: 2021-01.
[60] D. Keall, How dat works, https://datprotocol.github.io/how-dat-works/, Accessed: 2021-01.
[61] MaidSafe, The safe network primer, https://primer.safenetwork.org/, Accessed: 2021-01.
[62] D. Irvine, “Self-authentication,” Tech. Rep., Sep. 2010.
[63] D. Irvine, “Self encrypting data,” Tech. Rep., Jun. 2015.
[64] G. Paul, F. Hutchison, and J. Irvine, “Security of the maidsafe vault network,” in WWRF ’14: Wireless World Research Forum Meeting 32, Morocco, May 2014.
[65] F. Jacob, J. Mittag, and H. Hartenstein, “A security analysis of the emerging p2p-based personal cloud platform maidsafe,” in IEEE Trustcom/BigDataSE/ISPA ’15: Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, vol. 1, Helsinki, Finland, Aug. 2015, pp. 1403–1410.
[66] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” Journal of the society for industrial and applied mathematics, vol. 8, no. 2, pp. 300–304, 1960.
[67] S. De Figueiredo, A. Madhusudan, V. Reniers, S. Nikova, and B. Preneel, “Exploring the storj network: A security analysis,” 2021.
[68] X. Zhang, J. Grannis, I. Baggili, and N. L. Beebe, “Frameup: An incriminatory attack on storj: A peer to peer blockchain enabled distributed storage system,” Digital Investigation, vol. 29, pp. 28–42, 2019.
[69] M. Corallo. (Apr. 2016). “Bip 152: Compact block relay.” Accessed: 2021-02, [Online]. Available: https://github.com/bitcoin/bips/blob/master/bip-0152.mediawiki.
[70] D. Vorick and L. Champine, “Sia: Simple decentralized storage,” Nebulous Inc., Tech. Rep., Nov. 2014.
[71] M. Fukumitsu, S. Hasegawa, J.-y. Iwazaki, M. Sakai, and D. Takahashi, “A proposal of a secure p2p-type storage scheme by using the secret sharing and the blockchain,” in AINA ’17: Proceedings of the 31st IEEE International Conference on Advanced Information Networking and Applications, Taipei, Taiwan, Mar. 2017, pp. 803–810.
[72] Y. Jia, T. Moataz, S. Tople, and P. Saxena, “Oblivp2p: An oblivious peer-to-peer content sharing system,” in USENIX Security ’16: Proceedings of the 25th USENIX Security Symposium, Austin, TX, USA, Aug. 2016, pp. 945–962.
[73] C. Qian, J. Shi, Z. Yu, Y. Yu, and S. Zhong, “Garlic cast: Lightweight and decentralized anonymous content sharing,” in ICPADS ’16: Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, Wuhan, China, Dec. 2016, pp. 216–223.
[74] M. Ali, J. Nelson, R. Shea, and M. J. Freedman, “Blockstack: A global naming and storage system secured by blockchains,” in USENIX ATC ’16: Proceedings of the 2016 USENIX Annual Technical Conference, Denver, CO, USA, Jun. 2016, pp. 181–194.
[75] E. Kokoris-Kogias, E. C. Alp, S. D. Siby, N. Gailly, L. Gasser, P. Jovanovic, E. Syta, and B. Ford, “Calypso: Auditable sharing of private data over blockchains,” Cryptology ePrint Archive, 2018/209, Tech. Rep., 2018.
[76] T. Foundation, “TRON advanced decentralized blockchain platform - whitepaper version 2.0,” TRON Foundation, Tech. Rep., Dec. 2018.
[77] V. Trón, A. Fischer, D. A. Nagy, Z. Felföldi, and N. Johnson, “Swap, swear and swindle incentive system for swarm,” Ethereum Foundation, Tech. Rep., May 2016.
[78] F. Vogelsteller and V. Buterin, Eip-20 - erc-20 token standard, https://github.com/ethereum/EIPs/blob/master/EIPS/eip-20.md, Accessed: 2021-01.
[79] J. Poon and T. Dryja, “The bitcoin lightning network: Scalable off-chain instant payments,” Jan. 2016.
[80] P. McCorry, M. Möser, S. F. Shahandashti, and F. Hao, “Towards bitcoin payment networks,” in ACISP ’2016: Proceedings of the 21st Australasian Conference on Information Security and Privacy, Melbourne, VIC, Australia, 2016, pp. 57–76.
[81] E. Politou, E. Alepis, C. Patsakis, F. Casino, and M. Alazab, “Delegated content erasure in ipfs,” Future Generation Computer Systems, vol. 112, pp. 956–964, 2020.
[82] A. Pfitzmann and M. Köhntopp, “Anonymity, unobservability, and pseudonymity - a proposal for terminology,” in PET ’00: Proceedings of the International Workshop on Designing Privacy Enhancing Technologies: Design Issues in Anonymity and Unobservability, Berkeley, CA, USA, 2000, pp. 1–9.
[83] K. Bennett, C. Grothoff, T. Horozov, I. Patrascu, and T. Stef, “Gnunet - a truly anonymous networking infrastructure,” In: Proc. Privacy Enhancing Technologies Workshop (PET, Tech. Rep., 2002.
[84] R. Dingledine, N. Mathewson, and P. F. Syverson, “Tor: The second-generation onion router,” in USENIX Security ’04: Proceedings of the 13th USENIX Security Symposium, San Diego, CA, USA, 2004, pp. 303–320.
[85] Z. Wilcox-O’Hearn, Names: Distributed, secure, human-readable: Choose two, https://web.archive.org/web/20011020191610/http://zooko.com/distnames.html, Accessed: 2021-01.
[86] A. Swartz,