Improving Traffic Safety Through Video Analysis in Jakarta, Indonesia

Abstract

This project presents the results of a partnership between the Data Science for Social Good fellowship, Jakarta Smart City and Pulse Lab Jakarta to create a video analysis pipeline for the purpose of improving traffic safety in Jakarta. The pipeline transforms raw traffic video footage into databases that are ready to be used for traffic analysis. By analyzing these patterns, the city of Jakarta will better understand how human behavior and built infrastructure contribute to traffic challenges and safety risks. The results of this work should also be broadly applicable to smart city initiatives around the globe as they improve urban planning and sustainability through data science approaches.


João Caldeira ∗
Department of Physics, University of Chicago

Alex Fout ∗
Statistics, Colorado State University

Aniket Kesari ∗
Jurisprudence & Social Policy, University of California, Berkeley

Raesetje Sefala ∗
Machine Learning, University of the Witwatersrand

Joseph Walsh
Center for Data Science and Public Policy, University of Chicago

Katy Dupre
Center for Data Science and Public Policy, University of Chicago

Muhammad Rizal Khaefi
Pulse Lab Jakarta

Setiaji
Jakarta Smart City

George Hodge
Pulse Lab Jakarta

Zakiya Aryana Pramestri
Pulse Lab Jakarta

Muhammad Adib Imtiyazi
Jakarta Smart City

∗ J. Caldeira, A. Fout, A. Kesari, and R. Sefala contributed equally to this work. The source code developed in this project is available at https://github.com/dssg/jakarta_smart_city_traffic_safety_public/. Preprint. Work in progress.

The World Health Organization’s Global status report on road safety 2015 estimates that over 1.2 million people die each year in traffic accidents [1]. Nearly 2000 such fatalities occur annually in the city of Jakarta, Indonesia, making it one of the most dangerous cities in the world for traffic safety. Many of these deaths are preventable through effective city planning.

Jakarta has experienced rapid population growth over the last 50 years, from roughly two million people in 1970 to more than 10 million today. With this growth comes a rise in vehicle ownership and congestion, and these factors inevitably lead to an increase in the number of traffic incidents.

One of the core problems with using machine learning and other data-driven techniques in traffic safety analysis is that it is difficult to collect high-quality data. In partnership with Jakarta Smart City (JSC) and Pulse Lab Jakarta (PLJ), a team of fellows at the Data Science for Social Good (DSSG) fellowship at the University of Chicago was formed to tackle this problem. Our team was given access to video footage of seven traffic cameras spread over Jakarta. We developed a video analysis pipeline that furnishes JSC and PLJ with the ability to generate rich databases that contain massive amounts of information about traffic behaviors. We hope our work provides a roadmap for applying machine learning throughout the developing world in the context of smart cities and urban planning more broadly.

In addition to the technical application of the video analysis pipeline, we want this project to provide a template for others who hope to successfully deploy machine learning and data-driven systems in the developing world.
Through intense cooperation between the fellowship team in Chicago and the project partners in Jakarta, we gleaned insights into how to effectively build a system that is likely to be used by a partner in the developing world. Specifically, we became attuned to the need for mapping technical solutions to social problems that are articulated by people working in the field, understanding cultural context and awareness, and creating a feasible deployment strategy. These lessons should be invaluable to the many researchers and data scientists who wish to partner with NGOs, governments, and other entities that are working to use machine learning in the developing world.

JSC provided approximately 700GB of 1024 by 768 pixel video footage taken from seven locations across Jakarta, chosen to represent varying geography, infrastructure, and traffic behavior. Additional video footage from other locations was downloaded using JSC’s public data portal. Starting from these videos, we were tasked with generating quantitative data that could be used for more standard traffic analysis.

In order to evaluate our results, we needed to obtain annotated videos. This was done by hand-labeling vehicles and pedestrians in a sample of our videos using the Computer Vision Annotation Tool (CVAT) [2].

Before translating raw video into structured data, extensive work had to be done so all partners had a common vision of the policy interventions that the Jakarta authorities hoped to deploy given better traffic information. We established that in the medium term, they are interested in learning the best places to station “traffic stewards” and build traffic lights. In the long term, they are interested in learning where bigger infrastructure projects may be most successful.

In addition to these specific interventions, we also set out to define the scope of problematic traffic behaviors that the city hopes to curtail. In this case, we are concerned with a few specific behaviors, including vehicles driving against traffic, motorcycles and scooters driving on pedestrian surfaces like sidewalks, and illegal stopping or parking. Once we understood the most dangerous driving behaviors, and the policy levers available, we were able to think about how to map social policy problems to technical solutions. This mapping informed the specific data that we generated. We detail our particular choice of computer vision methods in Section 3.2.

Project partner websites: https://smartcity.jakarta.go.id/ (JSC), http://pulselabjakarta.org/ (PLJ), https://dssg.uchicago.edu/ (DSSG).

Results

Our pipeline was created with a “streaming” approach, which breaks a video into individual frames at the beginning of the pipeline. It then passes these frames through a system of workers and queues. Essentially, each worker is given a particular “task” (e.g. object detection) that it performs on each frame. Once it finishes a task, it sends that frame to the next queue, where the frame waits until the next worker is ready to process it. Frame order is preserved, and at the end, a worker puts frames back together to output the original video with any new annotations or analysis. The workers also output quantitative information about object counts, direction, etc., which can be loaded into a database.

The pipeline is modular, so any worker can be replaced by a different algorithm. Modularity is a key feature, allowing a user of the pipeline to optimize its performance on their specific task of interest. This also avoids loading large uncompressed videos into memory as implied by a batched approach, permits simultaneous execution of multiple tasks, and permits load balancing by adding or removing workers from tasks as necessary. There are some apparent limitations to this decision, namely that GPU computations utilized by many machine learning algorithms are optimized for batch computation, and workers cannot use yet-unseen frames when performing a task, which limits the exploitation of temporal dependence between frames. We note that both of these concerns can be addressed through use of appropriate buffering in the workers, which is exactly how we perform efficient classification with YOLOv3 (see below).
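The worker-and-queue design described above can be illustrated with a minimal sketch using only the Python standard library. This is not the project's implementation: the names `worker`, `run_pipeline`, `detect`, and `estimate_motion` are hypothetical, frames are represented as plain dictionaries, and one thread per task stands in for whatever process model the real pipeline uses. With a single worker per task and first-in-first-out queues, frame order is preserved automatically.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker; frames and task results must not be None


def worker(task, inbox, outbox):
    """Apply `task` to every frame taken from `inbox` and forward it to `outbox`."""
    while True:
        frame = inbox.get()
        if frame is SENTINEL:
            outbox.put(SENTINEL)  # pass the end-of-stream marker downstream
            break
        outbox.put(task(frame))


def run_pipeline(frames, tasks, queue_size=8):
    """Stream frames through one worker per task and return them in order."""
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(tasks) + 1)]
    workers = [
        threading.Thread(target=worker, args=(task, queues[i], queues[i + 1]))
        for i, task in enumerate(tasks)
    ]
    results = []

    def collect():
        # Final stage: gather processed frames (a real pipeline would
        # reassemble the video and write counts to a database here).
        while True:
            item = queues[-1].get()
            if item is SENTINEL:
                break
            results.append(item)

    collector = threading.Thread(target=collect)
    for thread in workers + [collector]:
        thread.start()
    for frame in frames:
        queues[0].put(frame)  # blocks when the first queue is full
    queues[0].put(SENTINEL)
    for thread in workers + [collector]:
        thread.join()
    return results


# Hypothetical stand-ins for the detection and motion-estimation tasks.
def detect(frame):
    return {**frame, "objects": ["car"]}


def estimate_motion(frame):
    return {**frame, "direction_deg": 90.0}


if __name__ == "__main__":
    out = run_pipeline([{"index": i} for i in range(5)], [detect, estimate_motion])
    print([f["index"] for f in out])
```

Because the queues are bounded, only a handful of frames are held in memory at once, which is the property the streaming approach is meant to provide. Swapping a task for a different algorithm only requires passing a different function in the `tasks` list.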

We developed several modules that make up the pipeline and are directly related to Jakarta’s specific policy requirements. Detection and classification are necessary components as they determine what objects in a frame are motor vehicles. The results of detection and classification provide the foundation for detecting specific traffic behaviors. We use YOLOv3 trained on the COCO dataset [3]. Figure 1 (left) shows an example of our detection and classification results.

Motion estimation is similarly important because it helps determine when a vehicle is traveling the wrong way. We chose to use optical flow as it allowed us to extract the information of whether an object was moving and in what direction, without any additional training. Using the Lucas-Kanade optical flow algorithm [4] in conjunction with Shi-Tomasi feature detection [5], we were able to calculate the direction of movement for every detected vehicle in a frame. We used the existing implementation available in OpenCV [6]. Figure 1 (right) illustrates this result.

Figure 1: On the left, detection/classification with YOLOv3. On the right, motion detection with Lucas-Kanade Optical Flow.

Finally, we needed to classify the different regions of the image into different classes, such as road or sidewalk, in order to determine whether motor vehicles were moving in an illegal way. For this task, one can use semantic segmentation, which classifies each pixel as belonging to one of several different classes. We used a pretrained version of the WideResNet-38 model described in [7]. The result of semantic segmentation on one of the stretches of road we had data for can be seen in Figure 2. More granular classification of different segments of road, such as encoding the correct direction to drive in or where crosswalks exist, can then be added by hand in each intersection.

Figure 2: A scene pre- and post-segmentation.

Combining these methods, we can answer questions such as, “Is this vehicle traveling on the wrong side of the road?” or “Is this motorcycle illegally parked on a sidewalk?” Figure 3 shows one example of this. In this case, our system flagged four instances of a car moving in the wrong lane within a three-day span. In fact, three of these instances occurred in the same 2-hour period. One can imagine the utility such a system could provide, as an analyst can quickly identify that this intersection sees problematic behavior at particular days and times. This insight can then be used to inform interventions such as building a traffic light or median, or deploying a traffic steward at a busy time of day.

Figure 3: Examples of driving on the wrong side of the road found by our pipeline.
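To make the combination of motion estimation and hand-annotated lane direction concrete, the following is a minimal sketch of a wrong-way check. It is an illustration under stated assumptions rather than the project's code: the function names, the representation of motion as a 2D pixel displacement per frame, and the `min_speed` and `max_angle_deg` thresholds are all hypothetical choices.

```python
import math


def angle_between_deg(u, v):
    """Unsigned angle in degrees between two non-zero 2D vectors."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        raise ValueError("a zero-length vector has no direction")
    cosine = (u[0] * v[0] + u[1] * v[1]) / norm
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.degrees(math.acos(cosine))


def is_wrong_way(motion, lane_direction, min_speed=1.0, max_angle_deg=120.0):
    """Flag a vehicle whose motion opposes the annotated lane direction.

    `motion` is the vehicle's estimated displacement (e.g. from optical flow),
    `lane_direction` is the hand-annotated legal direction of travel.
    Both thresholds are assumed values that would need tuning per camera.
    """
    if math.hypot(*motion) < min_speed:
        return False  # too slow to have a reliable direction; treat as stopped
    return angle_between_deg(motion, lane_direction) > max_angle_deg


if __name__ == "__main__":
    lane = (1.0, 0.0)  # legal traffic flows left to right in the image
    print(is_wrong_way((-5.0, 0.2), lane))  # moving right to left
    print(is_wrong_way((4.0, 1.0), lane))   # moving with traffic
```

The same `angle_between_deg` helper could also be averaged over hand-labeled examples to measure how far estimated motion deviates from true motion.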

We evaluated object detection, classification, and motion detection by comparing our model outputs to the ground truth.

For detection, we measure precision and recall. In this case, recall is the proportion of objects of interest which are correctly identified as objects, regardless of the predicted class. Precision is the proportion of detections which are true objects of interest. To evaluate these metrics for this specific problem, we go through all boxes predicted by our model in decreasing order of confidence. Ideally, the box drawn by the model will exactly align with the box drawn by the human, but in practice there will be differences. We used an “Intersection Over Union” (IOU) approach to determine whether two boxes were the same. If the IOU between the predicted box and a true box is above our chosen threshold, we take those boxes to refer to the same object. Then we check if the predicted class is the same as the true class.

There are many parameters in the models that can be changed, including the IOU threshold considered in evaluation. We took a similar approach to classification evaluation, and allowed the pipeline to vary thresholds for objectness and labeling. Doing a grid search across these thresholds yielded several confusion matrices. An example result from this evaluation can be seen in Figure 4b. For this choice of object threshold, a large proportion of objects are not detected and those that are show varying levels of accuracy.

We point out that some class confusion may be immaterial to the partners’ ultimate intervention decisions. For example, consider an intersection where illegal left turns can pose a risk to opposing traffic. City planners would benefit from knowing whether large, heavy vehicles are making such illegal left turns, but it may be less important to distinguish between buses and trucks, as both pose comparable risks. However, confusing a motorbike for a pedestrian may contribute to a misconceived understanding of ground truth which will lead to improper policy decisions. Indeed, in our own evaluation, we noticed that tuk-tuks and mini-buses (which are Jakarta-specific and not in the YOLOv3 training set) were generally correctly characterized as something close to a car, but motorcycles and bicycles were frequently confused. Therefore this implementation of object detection and classification needs improvement. Thankfully this is a well studied problem in image analysis and there are several options for doing so. One of the chief goals in the near future is to experiment with various alternative models to find one better suited for the Jakarta context.

Figure 4a: Example Precision-Recall plot for car detection and classification throughout all annotated videos, with IOU threshold set to 0.25. Note that with a confidence threshold of 0.5, we label 50% of the cars in our labeled videos correctly, with 70% precision. These results can be improved for instance by selecting only cameras with clearer perspectives.

Figure 4b: Confusion matrix, normalized so columns sum to 1. The objectness threshold is 0.4, while the label threshold is 0.1. One important finding is that while our algorithm does not have labels specific to Jakarta such as tuk-tuks and minibuses, we detect them roughly as well as cars or trucks, labeling them as cars, buses or trucks. If one is only interested in the behavior of the vehicles and not so much in the identification, it might not be necessary to fine-tune the model to detect these categories specifically. Other performance issues should be addressed by testing different object detection algorithms.
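The detection evaluation described above (IOU matching of predicted boxes, visited in decreasing order of confidence, against hand-labeled boxes) can be sketched as follows. This is a simplified illustration, not the evaluation code used in the project: boxes are assumed to be `(x1, y1, x2, y2)` tuples, each true box is matched at most once, class labels are ignored as in the detection metric, and the function names are hypothetical. The default threshold of 0.25 mirrors the value quoted for Figure 4a.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, truths, iou_threshold=0.25):
    """Detection precision and recall for one frame.

    `predictions` is a list of (box, confidence) pairs and `truths` a list of
    hand-labeled boxes. Predictions are visited in decreasing confidence and
    greedily matched to the best still-unmatched true box.
    """
    unmatched = list(range(len(truths)))
    true_positives = 0
    for box, _confidence in sorted(predictions, key=lambda p: p[1], reverse=True):
        best = max(unmatched, key=lambda i: iou(box, truths[i]), default=None)
        if best is not None and iou(box, truths[best]) > iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
    predictions = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
    print(precision_recall(predictions, truths))
```

Sweeping a confidence cutoff over the predictions and recomputing these two numbers at each cutoff would trace out a precision-recall curve of the kind shown in Figure 4a.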

Figure 4: Evaluation of object detection and classification.

We also evaluated motion detection. For optimal settings, the average angle between detected and true motion is 11.0°. We can also use motion detection to effectively find vehicles moving in the wrong direction on a particular road, as we show in Figure 3.

We set out to provide value to Jakarta by demonstrating what current technology could achieve as they prepare to deploy a complete video analysis system. Our ultimate aspiration is that this project will provide a template for how data scientists, local governments, NGOs, and the private sector can come together to advance urban policy. More than half the world’s population now lives in cities, and this number continues to grow. As cities around the world grapple with rapidly growing populations, giving them the tools necessary to effectively manage transit will help guarantee their future safety and prosperity.

Starting in 2017, JSC has been building a big data infrastructure, and is looking to integrate the pipeline into its existing systems. This will consist of two distinct phases. The first phase will test the system by deploying it on a sample of CCTV cameras in Jakarta. Assuming these tests are successful, the system will then be integrated into every CCTV camera in the city. The second phase will focus on creating information systems that ensure that the results are disseminated to the relevant agencies in the Government of Jakarta.

The first phase will identify several roads in Jakarta that represent two categories: problematic and safe. Roads will be categorized by mining traffic data in collaboration with the Jakarta Transport Authority. After classifying roads, JSC will deploy the system on a sample of the CCTVs that monitor problematic roads, and then record the output (e.g. the number of detected cars, motorcycles, etc.). JSC will then validate and tune the classification, motion, and segmentation models. Once the models are well calibrated and fully deployed in the initial sample, JSC, together with the Jakarta Transport Authority, will gather and interpret results, and then formulate and implement interventions on problematic roads. The effects of these interventions will be monitored, so the interventions can be continuously updated accordingly.

In the second phase, JSC will connect all of its CCTVs to the system. They will perform validation and verification, and make necessary model improvements. Once the models output information correctly and seamlessly, JSC will build information systems that support reports and/or a dashboard that will help various agencies in the Government of Jakarta understand model outputs, and hopefully improve decision making.

We hope our work illuminates the promise of using data to improve urban life around the globe. The code developed for this project is available on GitHub and we hope it proves valuable to anyone who wishes to develop or deploy a similar system and methods.

References

[1] World Health Organization. Global status report on road safety 2015. 2015. Accessed: 2018-10-02.
[2] OpenCV. Computer vision annotation tool (CVAT). https://github.com/opencv/cvat, 2018. Accessed: 2018-11-20.
[3] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[4] Bruce D Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 1981 DARPA Image Understanding Workshop, pages 121–130, 1981.
[5] Jianbo Shi and Carlo Tomasi. Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994.
[6] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[7] Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. In-place activated BatchNorm for memory-optimized training of DNNs. In