CoronaZ: another distributed systems project
Simulating a contact tracing application in a scalable environment
Stefan Ciprian Voinea
Università degli Studi di Padova [email protected]
Stefan Vladov
Technical University of Munich [email protected]
Fabian Rensing
Paderborn University fabian.rensing@helsinki.fi
I. INTRODUCTION
This brief paper describes CoronaZ, a project for the Distributed Systems course at the University of Helsinki. All the code of the project is publicly available in a GitHub repository [1].
The project simulates a contact tracing application where each node represents a person (or a unique device attached to someone) that sends signals to other nodes when in range and communicates the collected data to a server using the publish-subscribe pattern. The server, called broker, can then be polled by a node called consumer, which sends the data to a database. A front-end application then requests this data and displays the movement and the latest updates in the browser.
The idea came from simulating this kind of movement with Arduino boards capable of communicating among themselves using the nrf24l01 and with the broker using the esp8266. Unfortunately this was not possible given the relatively strict amount of time that each of the students involved could dedicate to the project and the waiting time to get the necessary hardware.

II. TECHNOLOGICAL CHOICES
In this section we describe the technologies we chose and explain why we preferred them over the alternatives.
• Programming language: we chose Python because it is a fast language for creating prototypes, and all group members are fluent in it;
• MQTT broker: we chose Apache Kafka as the broker for our project since it is one of the most used brokers on the market and has a large community supporting it. Apache Kafka also handles scaling and integration with other systems well;
• Containers: we chose Docker and docker-compose given the ease of building and spawning nodes. As explained in III, a docker-compose.yml handles the base components such as the broker and the database;
• Database: we chose MongoDB for its high scalability and ease of use. Additionally, its dynamic schemas allow us to be very flexible with our data model;
• Back-end: we chose NodeJS with Express, as it is a popular back-end for web applications and some group members had prior experience with it;
• Front-end: we chose React for its fast development speed and the familiarity of some group members with it.
A. System requirements
The system requirements for this project are simple:
• Docker and docker-compose;
• Ubuntu or another Linux system with jq installed, for starting the project using the init-project.sh bash script (partly working with git bash on Windows).

III. ARCHITECTURE
We can divide our project into six major areas: nodes, broker, DB consumer, database, back-end and front-end.

Fig. 1. Architecture of the
CoronaZ project.
A. Nodes
Each node can be considered a single component, since nodes can be spawned separately from one another. When a node spawns in the map it is placed at a random position. Every second the node will “move” in a random direction, broadcast a message containing its position and UUID (Universally Unique Identifier), and listen to the messages broadcast by other nodes.
Nodes introduced into the simulation can be infected or non-infected (or safe). When a node is infected it will stay in its last position and will not move for a certain amount of time, as specified by the infection_cooldown parameter, which defines the number of seconds that the node stays in place. After this time, the node has an “immunity period” in which it can move again and, if it gets in contact with an infected node, it will not get infected. This period has the same length as the infection_cooldown parameter.
From a gamification perspective, the nodes in our system are also called humans and zombies. Humans are the nodes considered safe, while zombies are the nodes that have been infected.
In our simulation the nodes can all connect to each other since they are all in the same network, as explained in IV, and each of them can hear the data sent in broadcast by the others. In a more realistic situation nodes would only be capable of hearing the signals of nodes near them, as in Fig. 1, where node A can communicate only with node B, while nodes B, C and D are close enough to hear each other's signals.
Here is an example of a message that a node sends in broadcast:

{
  "uuid": "ff0a1bda-34b9-11eb-b339 ... ",
  "position": [1, 5],
  "infected": false,
  "timestamp": "2020-12-02 16:19 ... ",
  "alive": true
}

After a node exceeds its lifetime in seconds, before exiting it will send a last message in which the value of alive is false. This signals the front-end that this node no longer has to be displayed.
Each node has a unique UUID that is generated in Python using the uuid module [2]. It is created from a combination of the MAC address and the IP address of the machine the script runs on and the timestamp of when the process starts.
The node images are built via a Dockerfile that uses the small Linux distribution Alpine to run the Python scripts. The nodes' Python scripts are all loaded into the Docker image and the main.py file is executed as CMD when the container starts. This configuration also allows running containers with arguments, thus deciding whether a node starts as an infected node or as a safe node.
The work of each node is split across four threads:
• the main thread, which starts all the other threads and controls them (main logic of the program);
• the broadcasting thread, in which the commands for broadcasting the message are executed;
• the listening thread, which waits for incoming messages;
• the Kafka server connection thread, which checks whether the connection with the Kafka server is still up.
For connectivity the nodes use the socket library and send UDP messages in broadcast to the unassigned port, via a random port chosen by the library. The sender's IP address and the port the message has been sent from can be seen in the logs printed on screen. Setting the debugging level to INFO helps in seeing the messages that the nodes exchange with one another and those that are sent to Kafka. The messages to be sent to the Kafka server are collected via the get_next_broadcast_message method and sent using the send_message method.
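As a rough illustration of the broadcasting side, the following sketch builds a message in the shape shown above and sends it as a UDP broadcast datagram. The port number and function names here are our assumptions for illustration, not the project's actual code:

```python
import json
import socket
import uuid
from datetime import datetime

BROADCAST_PORT = 5005  # hypothetical port; the project picks its own


def make_broadcast_message(node_uuid, position, infected, alive=True):
    """Build a broadcast payload in the shape shown above."""
    return {
        "uuid": str(node_uuid),
        "position": list(position),
        "infected": infected,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "alive": alive,
    }


def broadcast(message, port=BROADCAST_PORT):
    """Send the JSON-encoded message as a UDP broadcast datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_BROADCAST must be enabled before sending to the broadcast address.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(json.dumps(message).encode(), ("<broadcast>", port))
    sock.close()


# uuid1 mixes a host identifier with a timestamp, one UUID per node
node_id = uuid.uuid1()
msg = make_broadcast_message(node_id, position=(1, 5), infected=False)
```

A listening thread would mirror this with a socket bound to the same port, reading datagrams with `recvfrom`.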
B. MQTT broker
Apache Kafka is composed of the Kafka container and the Zookeeper container. Kafka is an important part of the project since it is the broker in the publish-subscribe pattern. Each node, after it has collected the IDs of the other nodes near it, will send the list of these IDs, along with other information, to the broker. The topic used by the nodes in our code is “coronaz”. It contains all the information sent by the nodes to the broker, which later forwards it to the consumer when asked for it.
Here is an example of a message that is sent to the broker:

{
  "uuid": "5603b252-36de-11eb ... ",
  "position": [72, 33],
  "infected": false,
  "timestamp": "2020-12-05 09:43 ... ",
  "alive": true,
  "contacts": [
    { "uuid": "563eafe6-36de-11eb ... ", "timestamp": "2020-12-05 09:43 ... " },
    { "uuid": "56fd5d12-36de-11eb ... ", "timestamp": "2020-12-05 09:43 ... " },
    { "uuid": "567a64c7-36de-11eb ... ", "timestamp": "2020-12-05 09:43 ... " },
    { "uuid": "56b292fc-36de-11eb ... ", "timestamp": "2020-12-05 09:43 ... " },
    { "uuid": "575030c3-36de-11eb ... ", "timestamp": "2020-12-05 09:43 ... " },
    { "uuid": "563eafe6-36de-11eb ... ", "timestamp": "2020-12-05 09:43 ... " }
  ]
}
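Assembling such a report can be sketched as follows; the function name is ours, and the commented producer call is only an assumption of how a client library like kafka-python would be used, not the project's exact code:

```python
import json
from datetime import datetime


def build_report(node_uuid, position, infected, alive, contacts):
    """Assemble the per-node report in the shape shown above.

    `contacts` is a list of (uuid, timestamp) pairs collected from
    nearby nodes since the last report."""
    return {
        "uuid": node_uuid,
        "position": list(position),
        "infected": infected,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "alive": alive,
        "contacts": [{"uuid": u, "timestamp": t} for u, t in contacts],
    }


report = build_report("5603b252-...", (72, 33), False, True,
                      [("563eafe6-...", "2020-12-05 09:43:00")])
payload = json.dumps(report).encode()  # bytes handed to the Kafka producer
# e.g. with kafka-python (an assumption, not the project's code):
#   KafkaProducer(bootstrap_servers="kafka:9092").send("coronaz", payload)
```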
If a message is received from the same node twice, the receiving node will discard the first message that arrived.

C. DB consumer
The DB consumer can be considered a single entity since it is independent both from Kafka and from Mongo. The consumer is subscribed to the Kafka topic that contains the new messages from the nodes, in our case “coronaz”. When a new message arrives, the consumer gets it and, every ten messages (or when a node dies), aggregates them into a JSON document that is sent to the MongoDB database.
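This batching rule can be sketched as below; the class and callback names are hypothetical, not taken from the project's source:

```python
class BatchBuffer:
    """Sketch of the consumer's batching rule: flush the buffered node
    messages every `batch_size` messages, or as soon as a message reports
    the node as no longer alive."""

    def __init__(self, flush, batch_size=10):
        self.flush = flush          # callback that writes a batch to MongoDB
        self.batch_size = batch_size
        self.buffer = []

    def add(self, message):
        self.buffer.append(message)
        # flush on a full batch, or immediately when a node dies
        if len(self.buffer) >= self.batch_size or not message.get("alive", True):
            self.flush(list(self.buffer))
            self.buffer.clear()


flushed = []
buf = BatchBuffer(flushed.append, batch_size=3)
for _ in range(3):
    buf.add({"uuid": "n1", "alive": True})   # third add triggers a flush
buf.add({"uuid": "n1", "alive": False})      # dead node flushes immediately
```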
D. Database
Like the other components, the database runs in a Docker container. It accepts and stores incoming data from the consumer. The database is accessed by the back-end, which forwards the data to the front-end, refreshed every second in order to always display the latest status.
Fig. 2. CoronaZ database structure.
In the database, each document uses the node's UUID as its key, and the value represents the latest update of the node (from its last message).
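This keep-only-the-latest-update behaviour can be sketched with a plain dict standing in for the MongoDB collection; the function name is ours, and the pymongo call in the comment is only an indication of how it would map to a real collection:

```python
def apply_update(store, message):
    """Keep only the latest update per node, keyed by its UUID.

    `store` is a plain dict standing in for the MongoDB collection; with
    pymongo this would roughly be an upsert:
        collection.update_one({"_id": uuid}, {"$set": doc}, upsert=True)
    """
    store[message["uuid"]] = {k: v for k, v in message.items() if k != "uuid"}


db = {}
apply_update(db, {"uuid": "n1", "position": [1, 5], "alive": True})
apply_update(db, {"uuid": "n1", "position": [2, 5], "alive": True})  # overwrites
```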
E. Back-end
For the back-end of the project, NodeJS provides a single REST endpoint for the front-end. This is GET /data, which connects to the MongoDB database and queries all the documents containing the state of the nodes.
F. Front-end
The front-end of the project, built with
React and
MaterialUI, shows the evolution of the system and the simulations. A slider allows the user to select one of the states queried from the database, which is then visualized on the map underneath. Moreover, this also updates the statistics at the top of the page. A realism mode switches between using circles or icons. Safe nodes are colored blue, infected nodes red and dead nodes gray. The statistics are the following: “Total nodes”, “Zombies”, “Deaths” and “Dead zombies”, which respectively represent the total number of nodes, the number of infected nodes, the number of dead nodes and the number of nodes that died infected.
Fig. 3. CoronaZ front-end.
IV. DOCKER NETWORKING
Since this project utilizes Docker containers, there is a Docker network called coronaZ, defined in the docker-compose.yml file, which connects all the components. This network allows each node to get its own IP address and its own ports for broadcasting and listening. Setting the network mode in another way, for example with network_mode: host, is inadvisable due to interoperability issues between Docker for Windows and Linux, as well as the problems it creates with IP and port assignment.

V. SCALABILITY AND FAULT TOLERANCE
We have tested the scalability and the fault tolerance of the project in the following ways:
• unexpectedly shutting down the MongoDB database: when MongoDB fails, the user is presented with a loading gif in the front-end, communicating that an error happened in the system. The DB consumer will hold the messages until the database becomes reachable again;
• unexpectedly shutting down the DB consumer: if the DB consumer goes offline, Kafka will still hold the messages in the “coronaz” queue until asked for them. The database will not be updated until the consumer comes back up;
• unexpectedly shutting down the back-end: as with the database, if the back-end fails a loading gif is presented to the user in the front-end. When it comes back up, it serves the front-end again and the user sees the current map and simulation;
• unexpectedly shutting down the broker: this makes the nodes unable to communicate with the rest of the system; they will only send messages among themselves. These messages are stored in the nodes until the broker becomes available again. When the broker comes back up, the messages will contain all the contacts that have been recorded by each node, but not all the movements made. The next message will contain the movement from the current position, and the movements made during the downtime of the broker will be lost. This is acceptable since the exact movements are not relevant in a contact tracing application; the most important data comes from the contacts;
• adding more nodes: when a node is added it starts communicating with the other nodes already in the system and starts sending messages to the broker;
• unexpectedly shutting down a node: if a node shuts down unexpectedly, the last message with the “"alive": false” parameter will not be sent to the server. This causes the front-end to keep showing the node as a fixed point on the map. The rest of the system is not affected by a node shutting down.
In general, for the errors that can be perceived in the front-end, such as the database or the back-end server not responding, the user does not really need to know which exact component went down. This is also for security reasons: should a malicious user want to find vulnerabilities, without the specific logs it is harder to find the exact breaking point of the project. Also, the user does not need to concern himself with the various components which make up the whole network; he perceives the system as one entity.

VI. SIMULATION
To start the project we have made an init-project.sh script that asks for the parameters with which the simulation will take place, executes the docker-compose (up and down) commands and manages the number of nodes by spawning as many as the user wants.

Fig. 4. CoronaZ front-end.
The script will ask for the run parameters and will set them in the file; otherwise it will run with a set of defaults. The default parameters for the run are (in JSON format):

{
  "field_width": 100,
  "field_height": 100,
  "scale_factor": 5,
  "zombie_lifetime": 120,
  "infection_radius": 2,
  "infection_cooldown": 15
}
These parameters stand for:
• “field_width”: width of the map;
• “field_height”: height of the map;
• “scale_factor”: scales the map and the nodes so they can be viewed better in the front-end;
• “zombie_lifetime”: lifetime in seconds of the nodes;
• “infection_radius”: the distance within which nodes can infect other nodes;
• “infection_cooldown”: the “cooldown” period in seconds during which an infected node stands still in order to “cure”. This value is also used as the “immunity period” in which the node cannot get infected.

Fig. 5. CoronaZ simulation at start.
Fig. 6. CoronaZ simulation when first node is cured.
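The infection rule above reduces to a distance check gated by the immunity period; the following sketch uses our own function and parameter names, not the project's:

```python
import math


def can_infect(infected_pos, other_pos, infection_radius, immune):
    """A node gets infected when it is within `infection_radius` of an
    infected node and is not in its immunity period."""
    dx = infected_pos[0] - other_pos[0]
    dy = infected_pos[1] - other_pos[1]
    # Euclidean distance against the configured infection_radius
    return (not immune) and math.hypot(dx, dy) <= infection_radius
```

With the default infection_radius of 2, a node one cell away diagonally (distance ≈ 1.41) would be infected, while one three cells away would not.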
VII. PERFORMANCE EVALUATION
A “simulation.pcapng” file is present in the essay folder and contains part of a run of the system. From this file, various message lengths can be seen: for example, the length in bytes of a message sent in broadcast by each node is (just the payload, with headers). For the messages sent to the server by the nodes, the minimum size is (with headers); the rest depends on the number of contacts the node had.

Fig. 7. CoronaZ simulation when nodes die (from natural causes).

For a simulation with twenty nodes, each of them sends one message in broadcast per second, which means 189B. In the “worst case scenario” where all the nodes are in infecting range, the length of the message sent to the server would be . So just one node sends B + 2. KB = 2. KB each second, which means that for a simulation of seconds, the total amount of data sent by the nodes is . KB ∗ ∗ 20 = 321. KB ∗ 20 = 6. MB.
Latency does not represent an obstacle in our system, since each node stores the messages that need to be sent to the server and the data handled is not time sensitive as in other real-time applications. Network performance for the nodes is discussed in VIII-A.

VIII. FUTURE WORK
Given the nature of the project there won't be future releases, but in this section we discuss what could be done to improve the project. There are various improvements that could make CoronaZ a more interesting and stable simulator; we will tackle them based on the areas defined in III.
A. Nodes
The nodes' logic could be improved by adding more intelligent behaviour. This would result in better movements on the map and, possibly, in avoiding infected nodes. The throughput of the network could be improved by shortening the messages and choosing a different message format (like differential updates). This would reduce the bytes exchanged not only between the nodes, but especially between the nodes and the Kafka server.
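A differential update of this kind could look like the following sketch, which sends only the fields that changed since the previous message; the function name and message shapes are ours, for illustration only:

```python
def diff_update(previous, current):
    """Sketch of a differential update: keep only the fields that changed
    since the previous message, plus the node's UUID to identify it."""
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    changed["uuid"] = current["uuid"]
    return changed


prev = {"uuid": "n1", "position": [1, 5], "infected": False, "alive": True}
curr = {"uuid": "n1", "position": [2, 5], "infected": False, "alive": True}
delta = diff_update(prev, curr)  # only position changed, so only it is sent
```

The receiver would then merge each delta into its last known full state for that UUID.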
B. MQTT broker
In order to achieve better network performance, scaling the broker would be the best option. A dynamic scaling rule could be created and the server replicas could be hidden behind a proxy. This would not only improve the performance of the network but also add fault tolerance in case one of the replicas fails.
C. DB consumer
As with the broker server, scaling the DB consumer and making each consumer read from a separate queue or a separate server would improve network performance and fault tolerance.
D. Database
Currently the database only holds one simulation, but it could be expanded to hold multiple ones, preferably as different MongoDB collections. As MongoDB is very configurable, one could also tweak factors such as replication.
E. Back-end
Currently the back-end only supports one endpoint. An important improvement would be to allow the front-end to make more intelligent queries, for example only requesting the newest states instead of all of them, to reduce bandwidth consumption. Additionally, more complex functions such as selecting a specific simulation or user profiles could be implemented.
F. Front-end
The front-end is feature complete in terms of our project scope. A further improvement could be an option to select a specific simulation or to have user profiles.

IX. CONCLUSIONS
This project represents a simulation of a contact tracing application where each node that spawns in the network communicates its position to the others in range and sends the collected data to a central server using the publish-subscribe pattern. The running system implements the basic goals and functionalities requested in the given programming task sheet. As explained in V, it supports scalability and fault tolerance, while the end user of the system can decide how many nodes to add, thus scaling the system horizontally. In VII there is an analysis of how the system performs and how much data is sent over the network during the simulation. Improvements, both in the system's architecture and in its performance, are discussed in VIII. This project has these three characteristics: naming and node discovery, synchronization and consistency, and fault tolerance.

REFERENCES

[1]
CoronaZ repository: https://github.com/cipz/CoronaZ/
[2] uuid — UUID objects according to RFC 4122: https://docs.python.org/3.8/library/uuid.html