Improving Traffic Safety Through Video Analysis in Jakarta, Indonesia
João Caldeira, Alex Fout, Aniket Kesari, Raesetje Sefala, Joseph Walsh, Katy Dupre, Muhammad Rizal Khaefi, Setiaji, George Hodge, Zakiya Aryana Pramestri, Muhammad Adib Imtiyazi
João Caldeira∗, Department of Physics, University of Chicago
Alex Fout∗, Statistics, Colorado State University
Aniket Kesari∗, Jurisprudence & Social Policy, University of California, Berkeley
Raesetje Sefala∗, Machine Learning, University of the Witwatersrand
Joseph Walsh, Center for Data Science and Public Policy, University of Chicago
Katy Dupre, Center for Data Science and Public Policy, University of Chicago
Muhammad Rizal Khaefi, Pulse Lab Jakarta
Setiaji, Jakarta Smart City
George Hodge, Pulse Lab Jakarta
Zakiya Aryana Pramestri, Pulse Lab Jakarta
Muhammad Adib Imtiyazi, Jakarta Smart City
Abstract
This project presents the results of a partnership between the Data Science for Social Good fellowship, Jakarta Smart City and Pulse Lab Jakarta to create a video analysis pipeline for the purpose of improving traffic safety in Jakarta. The pipeline transforms raw traffic video footage into databases that are ready to be used for traffic analysis. By analyzing these patterns, the city of Jakarta will better understand how human behavior and built infrastructure contribute to traffic challenges and safety risks. The results of this work should also be broadly applicable to smart city initiatives around the globe as they improve urban planning and sustainability through data science approaches.

∗ J. Caldeira, A. Fout, A. Kesari, and R. Sefala contributed equally to this work. The source code developed in this project is available at https://github.com/dssg/jakarta_smart_city_traffic_safety_public/.
Preprint. Work in progress. arXiv [cs.CY].

The World Health Organization’s Global status report on road safety 2015 estimates that over 1.2 million people die each year in traffic accidents [1]. Nearly 2000 such fatalities occur annually in the city of Jakarta, Indonesia, making it one of the most dangerous cities in the world for traffic safety. Many of these deaths are preventable through effective city planning.

Jakarta has experienced rapid population growth over the last 50 years, from roughly two million people in 1970 to more than 10 million today. With this growth comes a rise in vehicle ownership and congestion, and these factors inevitably lead to an increase in the number of traffic incidents. One of the core problems with using machine learning and other data-driven techniques in traffic safety analysis is that it is difficult to collect high-quality data. In partnership with Jakarta Smart City (JSC) and Pulse Lab Jakarta (PLJ), a team of fellows at the Data Science for Social Good (DSSG) fellowship at the University of Chicago was formed to tackle this problem. Our team was given access to video footage of seven traffic cameras spread over Jakarta. We developed a video analysis pipeline that furnishes JSC and PLJ with the ability to generate rich databases that contain massive amounts of information about traffic behaviors. We hope our work provides a roadmap for applying machine learning throughout the developing world in the context of smart cities and urban planning more broadly.

In addition to the technical application of the video analysis pipeline, we want this project to provide a template for others who hope to successfully deploy machine learning and data-driven systems in the developing world.
Through intense cooperation between the fellowship team in Chicago and the project partners in Jakarta, we gleaned insights into how to effectively build a system that is likely to be used by a partner in the developing world. Specifically, we became attuned to the need for mapping technical solutions to social problems that are articulated by people working in the field, understanding cultural context and awareness, and creating a feasible deployment strategy. These lessons should be invaluable to the many researchers and data scientists who wish to partner with NGOs, governments, and other entities that are working to use machine learning in the developing world.

JSC provided approximately 700 GB of 1024 by 768 pixel video footage taken from seven locations across Jakarta, chosen to represent varying geography, infrastructure, and traffic behavior. Additional video footage from other locations was downloaded using JSC’s public data portal. Starting from these videos, we were tasked with generating quantitative data that could be used for more standard traffic analysis.

In order to evaluate our results, we needed to obtain annotated videos. This was done by hand-labeling vehicles and pedestrians in a sample of our videos using the Computer Vision Annotation Tool (CVAT) [2].
Before translating raw video into structured data, extensive work had to be done so all partners had a common vision of the policy interventions that the Jakarta authorities hoped to deploy given better traffic information. We established that in the medium term, they are interested in learning the best places to station “traffic stewards” and build traffic lights. In the long term, they are interested in learning where bigger infrastructure projects may be most successful.

In addition to these specific interventions, we also set out to define the scope of problematic traffic behaviors that the city hopes to curtail. In this case, we are concerned with a few specific behaviors, including vehicles driving against traffic, motorcycles and scooters driving on pedestrian surfaces like sidewalks, and illegal stopping or parking. Once we understood the most dangerous driving behaviors, and the policy levers available, we were able to think about how to map social policy problems to technical solutions. This map informed the specific data that we generated. We detail our particular choice of computer vision methods in Section 3.2.

Jakarta Smart City: https://smartcity.jakarta.go.id/
Pulse Lab Jakarta: http://pulselabjakarta.org/
Data Science for Social Good: https://dssg.uchicago.edu/

Results
Our pipeline was created with a “streaming” approach, which breaks a video into individual frames at the beginning of the pipeline. It then passes these frames through a system of workers and queues. Essentially, each worker is given a particular “task” (e.g. object detection) that it performs on each frame. Once it finishes a task, it sends that frame to the next queue, where the frame waits until the next worker is ready to process it. Frame order is preserved, and at the end, a worker puts frames back together to output the original video with any new annotations or analysis. The workers also output quantitative information about object counts, direction, etc., which can be loaded into a database.

The pipeline is modular, so any worker can be replaced by a different algorithm. Modularity is a key feature, allowing a user of the pipeline to optimize its performance on their specific task of interest. The streaming design also avoids loading large uncompressed videos into memory as implied by a batched approach, permits simultaneous execution of multiple tasks, and permits load balancing by adding or removing workers from tasks as necessary. There are some apparent limitations to this decision, namely that GPU computations utilized by many machine learning algorithms are optimized for batch computation, and workers cannot use yet-unseen frames when performing a task, which limits the exploitation of temporal dependence between frames. We note that both of these concerns can be addressed through use of appropriate buffering in the workers, which is exactly how we perform efficient classification with YOLOv3 (see below).
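To make the worker-and-queue design concrete, the following is a minimal sketch in Python using only the standard library. It is not the project's implementation: the names `worker`, `batched_worker`, `run_pipeline`, `to_record`, and `fake_detect_batch` are hypothetical, frames are represented by plain integers, and the "detector" is a placeholder standing in for a batched model call. The sketch assumes one thread per stage connected by FIFO queues, which is one simple way to preserve frame order, and shows how a worker can buffer frames to feed a batch-oriented task.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream (frames must not be None)


def worker(task, inbox, outbox):
    """Apply `task` to each frame taken from `inbox` and pass the result on."""
    while True:
        frame = inbox.get()
        if frame is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(task(frame))


def batched_worker(batch_task, batch_size, inbox, outbox):
    """Buffer frames so that `batch_task` can process several at once."""
    buffer = []
    while True:
        frame = inbox.get()
        if frame is not SENTINEL:
            buffer.append(frame)
        # Flush when the buffer is full or the stream has ended.
        if buffer and (len(buffer) == batch_size or frame is SENTINEL):
            for result in batch_task(buffer):
                outbox.put(result)
            buffer = []
        if frame is SENTINEL:
            outbox.put(SENTINEL)
            break


def run_pipeline(frames, stages, maxsize=8):
    """Chain stages with bounded FIFO queues; one thread per stage keeps order."""
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(stages) + 1)]

    def feed():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(SENTINEL)

    threads = [threading.Thread(target=feed)]
    for i, (target, args) in enumerate(stages):
        threads.append(
            threading.Thread(target=target, args=(*args, queues[i], queues[i + 1]))
        )
    for t in threads:
        t.start()

    results = []
    while True:  # the caller's thread drains the final queue
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


def to_record(frame_index):
    """Hypothetical first task: wrap a frame index in a record."""
    return {"frame": frame_index}


def fake_detect_batch(records):
    """Placeholder for a batched detector; the object count is made up."""
    return [dict(r, num_objects=r["frame"] % 3) for r in records]


if __name__ == "__main__":
    stages = [(worker, (to_record,)), (batched_worker, (fake_detect_batch, 4))]
    for row in run_pipeline(range(10), stages):
        print(row)
```

Bounded queues keep memory use proportional to the queue size rather than to the video length, and swapping a stage only requires passing a different task function.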
We developed several modules that make up the pipeline, and are directly related to Jakarta’s specific policy requirements. Detection and classification are necessary components as they determine what objects in a frame are motor vehicles. The results of detection and classification provide the foundation for detecting specific traffic behaviors. We use YOLOv3 trained on the COCO dataset [3]. Figure 1 (left) shows an example of our detection and classification results.

Motion estimation is similarly important because it helps determine when a vehicle is traveling the wrong way. We chose to use optical flow as it allowed us to extract the information of whether an object was moving and in what direction, without any additional training. Using the Lucas-Kanade optical flow algorithm [4] in conjunction with Shi-Tomasi feature detection [5], we were able to calculate the direction of movement for every detected vehicle in a frame. We used the existing implementation available in OpenCV [6]. Figure 1 (right) illustrates this result.

Figure 1: On the left, detection/classification with YOLOv3. On the right, motion detection with Lucas-Kanade Optical Flow.

Finally, we needed to classify the different regions of the image into different classes, such as road or sidewalk, in order to determine whether motor vehicles were moving in an illegal way. For this task, one can use semantic segmentation, which classifies each pixel as belonging to one of several different classes. We used a pretrained version of the WideResNet-38 model described in [7].
The result of semantic segmentation on one of the stretches of road we had data for can be seen in Figure 2. More granular classification of different segments of road, such as encoding the correct direction to drive in or where crosswalks exist, can then be added by hand in each intersection.

Figure 2: A scene pre- and post-segmentation.

Combining these methods, we can answer questions such as, “Is this vehicle traveling on the wrong side of the road?” or “Is this motorcycle illegally parked on a sidewalk?” Figure 3 shows one example of this. In this case, our system flagged four instances of a car moving in the wrong lane within a three-day span. In fact, three of these instances occurred in the same 2-hour period. One can imagine the utility such a system could provide, as an analyst can quickly identify that this intersection sees problematic behavior at particular days and times. This insight can then be used to inform interventions such as building a traffic light or median, or deploying a traffic steward at a busy time of day.

Figure 3: Examples of driving on the wrong side of the road found by our pipeline.
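As an illustration of how motion estimates and hand-annotated road directions might be combined, here is a small Python sketch of a wrong-way check. It rests on several assumptions that are not taken from the project code: each detected vehicle comes with a list of per-frame feature displacements `(dx, dy)` in pixels (the kind of output a sparse optical flow tracker provides for points inside a bounding box), each road region carries a hand-labeled allowed direction in degrees, and the thresholds `min_speed` and `max_deviation_deg` are hypothetical values chosen for the example.

```python
import math


def mean_flow(displacements):
    """Average (dx, dy) displacement of the tracked points of one vehicle."""
    n = len(displacements)
    if n == 0:
        return (0.0, 0.0)
    mean_dx = sum(dx for dx, _ in displacements) / n
    mean_dy = sum(dy for _, dy in displacements) / n
    return (mean_dx, mean_dy)


def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in [0, 180] degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def is_wrong_way(displacements, allowed_direction_deg,
                 min_speed=1.0, max_deviation_deg=120.0):
    """Flag a vehicle whose motion opposes the annotated lane direction.

    Vehicles moving slower than `min_speed` pixels per frame are treated as
    stationary and never flagged (they may be candidates for a separate
    illegal-parking check instead).
    """
    dx, dy = mean_flow(displacements)
    if math.hypot(dx, dy) < min_speed:
        return False
    heading_deg = math.degrees(math.atan2(dy, dx))
    return angular_difference(heading_deg, allowed_direction_deg) > max_deviation_deg


if __name__ == "__main__":
    lane_direction = 0.0  # hypothetical annotation: traffic flows toward +x
    with_traffic = [(3.0, 0.2), (2.8, -0.1), (3.1, 0.0)]
    against_traffic = [(-3.0, 0.1), (-2.9, -0.2)]
    print(is_wrong_way(with_traffic, lane_direction))
    print(is_wrong_way(against_traffic, lane_direction))
```

Averaging the displacement vectors before taking the angle makes the heading less sensitive to a few badly tracked points than averaging per-point angles would be.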
We evaluated object detection, classification, and motion detection by comparing our model outputs to the ground truth.

For detection, we measure precision and recall. In this case, recall is the proportion of objects of interest which are correctly identified as objects, regardless of the predicted class. Precision is the proportion of detections which are true objects of interest. To evaluate these metrics for this specific problem, we go through all boxes predicted by our model in decreasing order of confidence. Ideally, the box drawn by the model will exactly align with the box drawn by the human, but in practice there will be differences. We used an “Intersection Over Union” (IOU) approach to determine whether two boxes were the same. If the IOU between the predicted box and a true box is above our chosen threshold, we take those boxes to refer to the same object. Then we check if the predicted class is the same as the true class.

There are many parameters in the models that can be changed, including the IOU threshold considered in evaluation. We took a similar approach to classification evaluation, and allowed the pipeline to vary thresholds for objectness and labeling. Doing a grid search across these thresholds yielded several confusion matrices. An example result from this evaluation can be seen in Figure 4b. For this choice of object threshold, a large proportion of objects are not detected and those that are show varying levels of accuracy.

We point out that some class confusion may be immaterial to the partners’ ultimate intervention decisions. For example, consider an intersection where illegal left turns can pose a risk to opposing traffic. City planners would benefit from knowing whether large, heavy vehicles are making such illegal left turns, but it may be less important to distinguish between buses and trucks, as both pose comparable risks.
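The box-matching procedure described in this section (predictions visited in decreasing order of confidence and paired with ground-truth boxes by IOU) can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the project: boxes are assumed to be `(x1, y1, x2, y2)` tuples, each ground-truth box is assumed to be matched at most once, and each prediction is paired with its best-overlapping unmatched ground-truth box. The function names are hypothetical, and the class comparison step is omitted.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


def detection_precision_recall(predictions, truths, iou_threshold=0.25):
    """Greedy matching of predictions to ground truth.

    predictions: list of (box, confidence) pairs
    truths: list of ground-truth boxes
    Returns (precision, recall) for detection, ignoring class labels.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    ordered = sorted(predictions, key=lambda p: p[1], reverse=True)
    for box, _confidence in ordered:
        best_index, best_iou = None, 0.0
        for index, truth in enumerate(truths):
            if index in matched:
                continue
            overlap = iou(box, truth)
            if overlap > best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None and best_iou > iou_threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
    predictions = [
        ((0, 0, 10, 10), 0.9),     # exact match
        ((21, 21, 31, 31), 0.8),   # shifted by one pixel
        ((50, 50, 60, 60), 0.7),   # no corresponding object
    ]
    print(detection_precision_recall(predictions, truths, iou_threshold=0.25))
```

Sweeping the confidence cutoff applied to `predictions` before calling the function traces out a precision-recall curve, and raising `iou_threshold` makes the matching stricter.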
However, confusing a motorbike for a pedestrian may contribute to a misconceived understanding of ground truth which will lead to improper policy decisions. Indeed, in our own evaluation, we noticed that tuk-tuks and mini-buses (which are Jakarta-specific and not in the YOLOv3 training set) were generally correctly characterized as something close to a car, but motorcycles and bicycles were frequently confused. Therefore this implementation of object detection and classification needs improvement. Thankfully this is a well studied problem in image analysis and there are several options for doing so. One of the chief goals in the near future is to experiment with various alternative models to find one better suited for the Jakarta context.

(a) Example Precision-Recall plot for car detection and classification throughout all annotated videos, with IOU threshold set to 0.25. Note that with a confidence threshold of 0.5, we label 50% of the cars in our labeled videos correctly, with 70% precision. These results can be improved for instance by selecting only cameras with clearer perspectives.

(b) Confusion matrix, normalized so columns sum to 1. The objectness threshold is 0.4, while the label threshold is 0.1. One important finding is that while our algorithm does not have labels specific to Jakarta such as tuk-tuks and minibuses, we detect them roughly as well as cars or trucks, labeling them as cars, buses or trucks. If one is only interested in the behavior of the vehicles and not so much in their identification, it might not be necessary to fine-tune the model to detect these categories specifically. Other performance issues should be addressed by testing different object detection algorithms.
Figure 4: Evaluation of object detection and classification.

We also evaluated motion detection. For optimal settings, the average angle between detected and true motion is 11.0°. We can also use motion detection to effectively find vehicles moving in the wrong direction on a particular road, as we show in Figure 3.

We set out to provide value to Jakarta by demonstrating what current technology could achieve as they prepare to deploy a complete video analysis system. Our ultimate aspiration is that this project will provide a template for how data scientists, local governments, NGOs, and the private sector can come together to advance urban policy. More than half the world’s population now lives in cities, and this number continues to grow. As cities around the world grapple with rapidly growing populations, giving them the tools necessary to effectively manage transit will help guarantee their future safety and prosperity.

Starting in 2017, JSC has been building a big data infrastructure, and is looking to integrate the pipeline into its existing systems. This will consist of two distinct phases. The first phase will test the system by deploying it on a sample of CCTV cameras in Jakarta. Assuming these tests are successful, the system will then be integrated into every CCTV camera in the city. The second phase will focus on creating information systems that ensure that the results are disseminated to the relevant agencies in the Government of Jakarta.

The first phase will identify several roads in Jakarta that represent two categories: problematic and safe. Roads will be categorized by mining traffic data in collaboration with the Jakarta Transport Authority. After classifying roads, JSC will deploy the system on a sample of the CCTVs that monitor problematic roads, and then record the output (e.g. the number of detected cars, motorcycles, etc.). JSC will then validate and tune the classification, motion, and segmentation models.
Once the models are well calibrated and fully deployed in the initial sample, JSC, together with the Jakarta Transport Authority, will gather and interpret results, and then formulate and implement interventions on problematic roads. The effects of these interventions will be monitored, so the interventions can be continuously updated accordingly.

In the second phase, JSC will connect all of its CCTVs to the system. They will perform validation and verification, and make necessary model improvements. Once the models output information correctly and seamlessly, JSC will build information systems that support reports and/or a dashboard that will help various agencies in the Government of Jakarta understand model outputs, and hopefully improve decision making.

We hope our work illuminates the promise of using data to improve urban life around the globe. The code developed for this project is available on GitHub and we hope it proves valuable to anyone who wishes to develop or deploy a similar system and methods.
References
[1] World Health Organization. Global status report on road safety 2015, 2015. Accessed: 2018-10-02.
[2] OpenCV. Computer vision annotation tool (CVAT). https://github.com/opencv/cvat, 2018. Accessed: 2018-11-20.
[3] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[4] Bruce D Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 1981 DARPA Image Understanding Workshop, pages 121–130, 1981.
[5] Jianbo Shi and Carlo Tomasi. Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994.
[6] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[7] Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. In-place activated BatchNorm for memory-optimized training of DNNs. In