Policy Research Working Paper 9488

Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning

Sveta Milusheva
Robert Marty
Guadalupe Bedoya
Sarah Williams
Elizabeth Resor
Arianna Legovini

Development Economics
Development Impact Evaluation Group
December 2020

Abstract

With all the recent attention focused on big data, it is easy to overlook that basic vital statistics remain difficult to obtain in most of the world. This project set out to test whether an openly available dataset (Twitter) could be transformed into a resource for urban planning and development. The hypothesis is tested by creating road traffic crash location data, which are scarce in most resource-poor environments but essential for addressing the number one cause of mortality for children over age five and young adults. The research project scraped 874,588 traffic-related tweets in Nairobi, Kenya, applied a machine learning model to capture the occurrence of a crash, and developed an improved geoparsing algorithm to identify its location. The project geolocated 32,991 crash reports in Twitter for 2012–20 and clustered them into 22,872 unique crashes to produce one of the first crash maps for Nairobi. A motorcycle delivery service was dispatched in real-time to verify a subset of crashes, showing 92 percent accuracy. Using a spatial clustering algorithm, portions of the road network (less than 1 percent) were identified where 50 percent of the geolocated crashes occurred. Even with limitations in the representativeness of the data, the results can provide urban planners useful information to target road safety improvements where resources are limited.

This paper is a product of the Development Impact Evaluation Group, Development Economics.
It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at smilusheva@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning∗

Sveta Milusheva† Robert Marty† Guadalupe Bedoya† Sarah Williams‡ Elizabeth Resor§ Arianna Legovini†

JEL Classification: R41, R42, O18, C80
Keywords: Big Data, Machine Learning, Road Safety, Urban Mobility, SDGs

∗ We thank Robert Tenorio and Amy Dolinger for their field coordination and research support. We are also grateful to Andrew Muriithi, Purity Kimuru, Rodgers Avuya, Salome Omondi and Pheliciah Mwachofi for their field support. DIME Analytics provided technical support throughout the analysis with Luiza Andrade and Luis Eduardo San Martin conducting code review and reproducibility checks. We appreciate comments from anonymous reviewers and participants at the ACM COMPASS Conference and the Netmob Conference.
The research has been funded with UK aid from the UK government through the ieConnect for Impact program and the World Bank’s Knowledge for Change program.
† Development Impact Evaluation Department, World Bank, Washington DC
‡ School of Architecture and Planning, Massachusetts Institute of Technology, Cambridge MA
§ School of Information, University of California, Berkeley CA

Introduction

The World Bank has declared that data are the next deprivation to end; it argues that the lack of data causes many of the world’s poorest populations to be overlooked when resources are allocated to address their essential needs [1]. Data deprivation is a pressing challenge: as many as 74% of the global and 97% of the Sub-Saharan African population live in countries without adequate vital registration [2]; one-third of countries lack any poverty statistics [1]; and only 17% of the estimated road traffic deaths are reported in official figures of low-income countries [3]. Without data to inform national and urban policies, the gap between low- and high-income countries will worsen [4].

However, while official statistics are poor, data in the hands of private providers are plentiful, fueled by the rapid expansion of mobile phones and social media. Globally, phone penetration reached 67% in 2019 [5], and social media penetration is almost 50% [6]. This provides an opportunity for using crowdsourced data to study major urban and development policies [7–11].

In this project we test whether privately maintained data can be transformed into a resource to better understand development challenges. Private data have been used to characterize populations, from determining poverty to understanding public emotions [12–17]. Here, we use private data to describe the urban environment that affects those populations, specifically analyzing events reported on social media that affect people’s safety such as road traffic crashes, crime or floods.
We focus on road traffic crashes (RTCs). Although RTCs are the number one cause of death for children and young adults aged 5-29 years, the lack of adequate data on them is a recognized and unmet challenge [18]. The objective is to improve RTC data for urban planners so they can contribute to addressing the high toll of road deaths, estimated globally at 1.35 million a year [3]. Our case study is Kenya, a country with high road mortality, where the official figures are said to underestimate the number of fatalities by a factor of 4.5 [3].

The United Nations’ Sustainable Development Goal (SDG) 3 set a target to halve road mortality by 2020; progress has been slow, and the target was moved to 2030. The Stockholm Declaration by the Third Global Ministerial Conference on Road Safety “Achieving Global Goals 2030” reiterated the call for country investments in road safety, from legislation and regulation, safe urban and transport design, safe modes of transport and vehicles, to modern technologies for crash prevention, trauma care, and urban management. However, resource constraints make it unlikely that countries will be able to meet all of these goals. Instead, countries should strategically invest for the greatest impact. This requires knowing where and when crashes happen, so that resources can be targeted to risky locations and times.

Social media data, with all their biases, can contribute to filling some of the data gaps for urban analysis, planning and management [19]. In this study, we create an algorithm that classifies transport-related tweets into geolocated RTCs for Nairobi. This is done by building on existing literature to test two natural language processing algorithms to identify crash reports [20, 21], developing an improved geoparsing algorithm to extract data on crash time and location [22–28], and ground truthing the results. The paper also contributes to a broader literature that uses machine learning methods for road safety analysis [29–31].
This study innovates on three fronts and demonstrates the value of using social media to expand data availability. (1) Geospatial Twitter data analysis usually uses the approximately 1% of tweets that have a geolocation tag [32–34]; we improve on this by using a machine learning geoparsing algorithm to leverage the 99% of tweets that do not contain a geotag. (2) To our knowledge, there are no other studies that physically validate the locational accuracy of tweets in real time. Among verified tweets, 92% were found to be valid crashes, demonstrating the validity of crowdsourced crash data. (3) The work created an essential resource by generating one of the first real-time maps of RTCs in an African city (Nairobi). We identify 52,228 crash reports and geolocate those with enough information provided in the text (32,991 of them). In a context where there is no systematic georeferenced data on crashes to support policy planning, the process outlined here could be used to capture these data for cities all over the world that need this essential resource.

Overall, the method expands the coverage of road crashes that can be used to analyze road safety and to prioritize policy action around the locations where crashes occur more often. This is especially useful in country contexts where the only data available for analysis are aggregated statistics on total fatalities in the country, with no detailed breakdown of location or time. Crowdsourced data can serve as an additional input that helps policymakers better understand the situation. By using a clustering algorithm to identify and rank crash locations, we find that the top 15% of crash clusters (66 of 435) account for half of all crashes. Knowing that a small portion (<1%) of the road network hosts 50% of RTCs in the crowdsourced data can help reduce an intractable problem to a more manageable one.
This analysis shows the potential for using these data to complement road safety diagnostics and to guide investments and planning in road safety in Kenya and in other contexts, especially those with similar data deficiencies and with sufficient social media density like India and the Philippines [35]. The approach can be extended to other events reported on social media, whether related to disaster relief, crime, personal safety, urban mobility, or road maintenance. The work on disaster relief and response makes prominent use of geoparsing of tweets [36–43]. Geoparsing of tweets that lack geolocation information could enable more comprehensive crime analytics [44–46]. Improved algorithms can lead to faster and better geolocation of events, which would help urban planners and policy makers improve responses and better target interventions.

Method

The goals of this analysis are to create data on road crashes with times and locations and to understand how these incidents cluster in the city, which allows for the spatial prioritization of urban investments in road safety. The technical challenges this study addresses are: i) improve the protocols for geolocation, ii) apply AI methods to classify tweets reporting crashes and identify their location from multiple geographical references, and iii) cluster the crashes geographically and identify areas with many crashes. See the Supplemental Information (SI) for the detailed methodology. The components are as follows:

1. Scrape data. We scrape 874,588 tweets posted by Ma3Route, an existing urban mobility platform with 1.1 million followers, since its inception in May 2012 through July 2020 (see SI for examples of tweets and for a figure of the daily number of tweets across time).

2. Develop and augment a gazetteer. We build a gazetteer of landmarks for the five counties that constitute the Nairobi metro area using OpenStreetMap, Geonames and Google Places.
The gazetteer includes the landmark name, geocoordinates and type of landmark (e.g., school, bus stop). We augment it with consecutive combinations of 2 and 3 words (known as n-grams) and skip-grams of landmark names, alternate spellings and abbreviations, and landmark names split at select punctuation (e.g., slashes, parentheses, commas). We innovate by developing alternate names that exclude the landmark type from the name (e.g., excluding “Hotel” from the name).

3. Develop a truth dataset. We develop a truth dataset to train the algorithm. Taking all tweets for July 2017 - July 2018, we restrict tweets to the ones most likely related to a crash based on a broad list of words and their variations. Each tweet is manually coded, indicating (1) if the tweet reported a crash and (2) the approximate latitude and longitude of any reported crash whenever enough information is provided. A total of 9,480 tweets were coded, of which 69% (6,602) reported a crash and of these, 63% (4,192) identified an approximate location of the crash. On average, users posted to Twitter 10 crash reports per day that could be geolocated.

4. Identify RTCs and their location. We use a three-step process to convert unstructured crowdsourced text into a dataset. The first step is to identify relevant reports from hundreds of thousands of reports. The second is to extract necessary information from the relevant reports. The third is to consolidate unique record information from multiple reports of the same event. In Figure 1, we illustrate how the algorithm works to classify and geolocate RTCs. We use the tweet “Bad accident on Waiyaki Way next to Kianda heading towards ABC Place.”

Figure 1. Illustration of classification and geolocation algorithm developed for extracting data from crowdsourced information

(a) Classify relevant crowdsourced reports.
We restrict the analysis to tweets that contain keywords from a broad list of English and Kiswahili road safety terms such as “accident” or “overturn.” This approach follows previous research and allows for misspellings [20]. We use natural language processing to classify and exclude tweets that contain road safety keywords but discuss road safety conditions rather than specific crash events (e.g., “terrible drivers keep causing crashes”). We test two approaches that analyze the combination of words in a tweet: Naive Bayes and support vector machines (SVM).

(b) Geolocate reports. We extract all landmarks and roads that have an exact match between the gazetteer and the tweet. In Figure 1, “kianda” and “abc place” match several entries in the gazetteer. We also extract misspelled matches based on Levenshtein distance (varied by the length of the n-gram), matches based on the word following a preposition, and matches based on intersections between multiple roads. Existing geoparsers extract all possible location references without identifying the unique location that makes the data useful. We resolve two technical challenges to find the location of the crash:

i. When multiple locations are mentioned in a tweet, we use prepositions to sort locations into tiers, based on the probability of a location being correct after a particular preposition. For example, in Figure 1, “next to” is ranked as tier 1 while “toward” is ranked as tier 6, resulting in the correct geolocation for the crash at “kianda” and not “abc place”.

ii. When a name refers to multiple landmarks, we adopt a toponym resolution approach. In Figure 1, more than 6 landmarks across Nairobi have “kianda” in the name.
We resolve the toponym in three steps: (1) we search for landmarks that are within 500 m of a road if it is mentioned, (2) we use the centroid of the clustered location if 90% or more of the landmarks are in a 500 m radius, or (3) we rank the landmarks by the probability of being correct using the landmark type in the truth data (see SI for statistics on location type). In the example, we use “Waiyaki Way” to narrow down the landmarks “kianda” in a 500 m radius (from 6 to 3) and then use the centroid as the crash location.

We define a correct geoparse as one located within 500 m of the coordinates in the truth dataset. As a benchmark, we compare our algorithm to the Location Name Extraction tool (LNEx), which was shown to have better accuracy than other geoparsers [40]. As LNEx and other geoparsers are not designed to extract one unique location from text [26, 40, 47], we first judge performance by examining whether any location references are near the true coordinates. Next, we define the crash location as determined by LNEx to be the centroid of all locations it finds in the tweet and compare this with the unique location identified by our algorithm.

(c) Identify unique reports. To avoid over-counting, we develop a clustering algorithm that uses time and location to identify which tweets refer to the same crash. In Figure 1, five tweets report a crash within two hours of each other, referencing different landmarks that are all close together. To develop reasonable parameters for clustering, we manually identify tweets that report the same crash in the truth dataset based on the time, location and crash characteristics. The 4,192 crash reports are clustered into 2,648 unique crashes. For unique crash clusters, 97% of tweets reported landmarks within 500 m and within 4 hours of each other (see additional details in SI for how parameters were chosen).

Table 1. Geolocation Algorithm Results

                           Any Location Captured by     Crash Location Determined by
                           Algorithm Close to True      Algorithm Close to True
                           Crash Location               Crash Location
                           Recall      Precision        Recall      Precision
LNEx                       0.674       0.686            0.129       0.132
Alg., Raw Gaz              0.695       0.757            0.579       0.756
Alg., Aug Gaz              0.798       0.857            0.651       0.811
Alg., Aug Gaz [Cluster]                                 0.656       0.774

‘N Crashes’ refers to the number of correctly identified crashes. ‘Raw Gaz’ refers to the raw gazetteer (i.e., the dictionary of landmarks with original names) and ‘Aug Gaz’ refers to the augmented gazetteer. We use our raw gazetteer as an input into LNEx, which implements its own augmentation process. For LNEx, the crash location is determined by taking the centroid of all locations captured by the algorithm. Locations are considered close if they are within 500 meters of each other.

(d) Ground truth. To ensure that the crowdsourced data are reliable and provide correct information, we conduct a ground-truthing exercise to validate the quality of the data and the performance of the underlying algorithm. We processed tweets in real-time and dispatched a motorcycle delivery service (Sendy) to the site of the crash within minutes. The Sendy driver was tasked with verifying and reporting whether a crash actually happened in that location. If a driver could not see the crash, they were instructed to ask a bystander whether a crash had occurred but was cleared or whether a crash occurred nearby. Drivers were able to arrive at the crash location quickly; the median time between being alerted of a crash and arriving at the scene was 26 minutes.

Results

The methods laid out here created a georeferenced RTC dataset that was previously unattainable and produced one of the first real-time maps of RTCs in Nairobi. We classify 52,228 tweets as crash-related out of a universe of 874,588 tweets during 2012–2020 (Panel A of Figure 2).
This is based on the SVM algorithm, which we find performs better than the Naive Bayes algorithm according to the F1 statistic (see Table S4 in the SI). We geolocate 32,991 time-stamped crash tweets from August 2012 to July 2020 and cluster them into 22,872 unique geolocated crashes (panels B and C of Figure 2 show the unique crashes generated by Twitter daily using the algorithm and clustering). In our truth dataset, where we manually coded each crash-related tweet, we found that 63% of tweets contain enough information to be geolocated. Assuming the same proportion of tweets contain enough information to be geolocated in the full dataset, we would expect 32,903 tweets with enough location information. This aligns almost perfectly with the number of tweets that the algorithm is able to geolocate.

Figure 2. Crowdsourced crash reports from Twitter data that our algorithm has geolocated and clustered into unique crashes for the city of Nairobi between 2012 and 2020. Road data come from OpenStreetMap.

The ground-truthing exercise confirms the validity of the crowdsourced data. We find that of the 73 crash-related tweets physically verified, 92% correctly corresponded to a crash near the estimated location: in 32.8% of cases the driver witnessed the crash scene, in 57.5% the driver did not see the crash but was told by a bystander that a crash occurred and was recently cleared, and in 1.4% the driver reported that the crash did not occur at the specified location but nearby.

Furthermore, using our truth dataset to benchmark shows that our algorithm performs substantially better than the current geoparsing standard. Our algorithm’s recall rate of 65% is a five-fold improvement in performance compared to the LNEx algorithm (13% recall) in identifying the unique location of a crash (Table 1). This is largely because LNEx is not designed to identify a unique location when multiple locations are mentioned.
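The benchmark above counts a geoparse as correct when it falls within 500 m of the hand-coded coordinates. The following is a minimal sketch of how such scoring could be computed, not the authors' code. The function names, the dictionary layout, and the exact definitions of recall (correct geoparses over tweets with a hand-coded location) and precision (correct geoparses over tweets for which a location was returned) are assumptions made for illustration; the paper's SI gives the actual definitions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def score_geoparser(truth, predicted, radius_m=500.0):
    """Return (recall, precision) for a geoparser.

    truth:     {tweet_id: (lat, lon)} hand-coded crash locations (hypothetical layout)
    predicted: {tweet_id: (lat, lon)} locations returned by the geoparser
    A prediction is correct if it lies within radius_m of the truth point.
    Assumed definitions: recall = correct / len(truth),
    precision = correct / len(predicted).
    """
    correct = sum(
        1
        for tweet_id, (lat, lon) in predicted.items()
        if tweet_id in truth
        and haversine_m(lat, lon, *truth[tweet_id]) <= radius_m
    )
    recall = correct / len(truth) if truth else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision
```

Under these assumed definitions, a geoparser that returns a location only when it is fairly sure trades recall for precision, which is the pattern visible in the "Crash Location" columns of Table 1.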
Our algorithm performs 25% better than LNEx even when comparing whether any location extracted from the tweet is near the true location.

Analyzing the crash data produced using our algorithm and focusing on the truth dataset within the city limits of Nairobi, we find that all crashes from July 2017 to July 2018 can be found in 435 clusters, each with a maximum diameter of 300 m. Of these clusters, 67% have two or more crashes and there are 56 clusters with 10 or more crashes. Additionally, 66 crash clusters represent over 50% of all the crashes. When looking at the 7.5 years of crowdsourced data for the city of Nairobi, the number of crash clusters does not grow linearly, implying that the locations where crashes occur and are reported in Twitter are consistent across years. Only 14% of crash locations have only a single crash, and there are 443 crash clusters with 10 or more crashes. We see the concentration of crashes even more when we note that only 9% of crash clusters (133 of 1,375) represent 50% of the crashes reported (Figure 3 shows crash heatmaps for the truth dataset from July 2017 to July 2018 and for 2012-2020).

Figure 3. Heatmap of crashes. Data in panel a are from July 2017 - July 2018, where we use the manually coded Twitter dataset. Data in panel b are for August 2012 - July 2020. Road data come from OpenStreetMap.

Discussion

Cities are constantly evolving, and understanding urban mobility is critical to creating urban designs that help to manage risks for pedestrians and vehicles. Severe data limitations hinder the development of policy interventions needed to manage risks, especially in low- and middle-income resource-constrained countries. Closing the data deprivation gap can help avert divergence in socioeconomic conditions between data-poor and -rich countries.
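The concentration figures reported in the Results (for example, 66 of 435 clusters holding over half of the crashes) amount to a cumulative-share calculation over per-cluster crash counts. A minimal sketch of that calculation follows; the function name and the input list are hypothetical, and the counts in the usage note are made up for illustration rather than taken from the study.

```python
def clusters_covering_share(crash_counts, share=0.5):
    """Smallest number of clusters, taken from largest to smallest,
    whose crashes add up to at least `share` of all crashes.

    crash_counts: iterable with the number of crashes in each cluster.
    """
    counts = sorted(crash_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    for k, n in enumerate(counts, start=1):
        running += n
        if running >= share * total:
            return k
    return len(counts)
```

For instance, with illustrative counts `[40, 25, 10, 10, 5, 5, 3, 2]` the function returns 2, meaning the two largest of eight clusters (25%) already hold at least half of the 100 crashes. Dividing the result by the number of clusters gives the kind of percentage quoted in the text.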
By focusing on RTCs, the number one cause of death among young people, we demonstrate that social media could be an inexpensive way to produce otherwise non-existent RTC data in resource-poor contexts that can support government analyses of road safety and potentially inform policy. This tool could be especially powerful when combined with investments in building a digital administrative dataset that would provide information on the crashes attended by police.

The answer to the seemingly simple question of where and when crashes occur has profound implications for public policy response that can save lives. And while official data deprivation can be an impediment to economic development, data generated by private operators can be transformed and placed in the hands of policy makers as a resource for policy making. By expanding the amount of data, we can generate more input to help resource-constrained countries prioritize policy action where it is most needed.

This example of geolocating crash data by mining Twitter data can help to guide infrastructure redesign or enforcement policies to reduce RTCs. Nairobi comprises an extensive road network of 6,200 km; with the city’s limited resources, addressing road safety across the whole network is difficult. By using this type of geolocated data, urban planners and policy makers can narrow down the problem to the areas with the highest number of crashes. This has been proven to work in developed countries where targeting risky locations led to reductions in the concentration of crashes [48]. As shown in the results, crashes reported on Twitter are highly concentrated, with the top 15% of locations spread across 20 km of road having 50% of the crashes reported on Twitter.

It should be noted that there are some limitations to the approach. The data generated are limited by the coverage of the crowdsourced data.
Users are more active on social media at particular times, and it is necessary to possess a smartphone and have access to the internet to be able to use the service. This can lead to bias in the reports generated via the crowdsourced data. Only 7.5% of tweets are sent between the hours of 9 p.m. and 6 a.m., and as a result only 12% of the crash reports from Twitter are during this time. There could also be geographic bias if there are areas of the city where people with smartphones are more likely to be present or passing by, and therefore more likely to report.

Our real-time motorcycle validation exercise demonstrates the internal validity of the crowdsourced data and the improved algorithm. External validity is more difficult to assess because we do not know what the universe of crashes is. Additionally, we do not know the severity of the crashes reported on Twitter. Therefore, we have no way of knowing if the areas where crashes happen are the most dangerous, which is what policy makers likely would want to target. These caveats should be considered by policy makers when using crowdsourced data to inform policies and targeting.

Despite the limitations, the improved geoparsing algorithm discussed in this paper can begin filling some of the gaps in data in low-capacity and data-scarce settings. And while the crash cluster areas identified by the algorithm may not be the most dangerous or may not represent all crash areas, they nevertheless highlight problem areas. All crashes, minor or severe, have important economic consequences in terms of property damage and lost time and productivity due to the traffic generated (which is one of the reasons the crash is likely reported on Twitter). Therefore, these data can be used to target areas for design solutions where we are seeing high numbers of crashes consistently over time.
In settings where there are limited or non-existent administrative records and, therefore, a lack of any geolocated data, this tool can produce information in real-time for one of the most pressing challenges in developing countries. Furthermore, by developing tools that generate time-stamped geolocated data and statistics from crowdsourcing on different “events” that are reported on social media, we can hope to expand data availability across other contexts and across issues beyond RTCs. For example, real-time traffic applications like RIDLR in India can be used to expand data on road safety. These improved tools can also help geolocate victims during a natural disaster or alert disaster management teams to the location of unsafe buildings or areas needing immediate attention. They can support law enforcement or communities to locate and respond to crimes, cases of violence against women, or police violence. Improved identification of the time and location of events can help to automate and accelerate policy response across a wide set of issues, potentially leading to better policy outcomes.

References

1. Serajuddin U, Uematsu H, Wieser C, Yoshida N, Dabalen A. Data deprivation: Another deprivation to end. The World Bank. 2015.
2. Notzon F, Nichols EK. Global Program for Civil Registration and Vital Statistics (CRVS) Improvement; 2015.
3. WHO. Global status report on road safety 2018. World Health Organization. 2018.
4. IEAG. A World that Counts–Mobilising the Data Revolution for Sustainable Development. Independent Expert Advisory Group on a Data Revolution for Sustainable Development. 2014.
5. GSMA Intelligence. The Mobile Economy 2020. London: GSM Association. 2020.
6. Kemp S. Digital 2020: Global Digital Overview. Retrieved from Datareportal: https://datareportal.com/reports/digital-2020-global-digital-overview. 2020.
7. Batty M. Big data, smart cities and city planning. Dialogues in Human Geography. 2013;3(3):274–279.
8. Miller G.
Social scientists wade into the tweet stream. Science. 2011;333(6051):1814–1815.
9. Kitchin R. The real-time city? Big data and smart urbanism. GeoJournal. 2014;79(1):1–14.
10. Einav L, Levin J. Economics in the age of big data. Science. 2014;346(6210).
11. Hao J, Zhu J, Zhong R. The rise of big data on urban studies and planning practices in China: Review and open research issues. Journal of Urban Management. 2015;4(2):92–124.
12. Blumenstock J, Cadamuro G, On R. Predicting poverty and wealth from mobile phone metadata. Science. 2015;350(6264):1073–1076.
13. Kosinski M, Stillwell D, Graepel T. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences. 2013;110(15):5802–5805.
14. Resch B, Summa A, Zeile P, Strube M. Citizen-Centric Urban Planning through Extracting Emotion Information from Twitter in an Interdisciplinary Space-Time-Linguistics Algorithm. Urban Planning. 2016;1(2):114–127. doi:10.17645/up.v1i2.617.
15. Jaidka K, Giorgi S, Schwartz HA, Kern ML, Ungar LH, Eichstaedt JC. Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods. Proceedings of the National Academy of Sciences. 2020;117(19):10165–10171.
16. Steiger E, Westerholt R, Resch B, Zipf A. Twitter as an indicator for whereabouts of people? Correlating Twitter with UK census data. Computers, Environment and Urban Systems. 2015;54:255–265. doi:10.1016/j.compenvurbsys.2015.09.007.
17. Wang Q, Phillips NE, Small ML, Sampson RJ. Urban mobility and neighborhood isolation in America’s 50 largest cities. Proceedings of the National Academy of Sciences. 2018;115(30):7735–7740.
18. WHO. Data systems: A road safety manual for decision-makers and practitioners. World Health Organization. 2010.
19. Williams S. Data Action: Using Data for Public Good. Cambridge, MA: MIT Press; 2020.
20. Gu Y, Qian ZS, Chen F.
From Twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies. 2016;67:321–342. doi:10.1016/j.trc.2016.02.011.
21. Zhang Z, He Q, Gao J, Ni M. A deep learning approach for detecting traffic accidents from social media data. Transportation Research Part C: Emerging Technologies. 2018;86:580–596.
22. Finkel JR, Grenager T, Manning C. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05); 2005.
23. Bender O, Och FJ, Ney H. Maximum Entropy Models for Named Entity Recognition. USA: Association for Computational Linguistics; 2003. Available from: https://doi.org/10.3115/1119176.1119196.
24. Bhargava R, Zuckerman E, Beck L. CLIFF-CLAVIN: Determining Geographic Focus for News Articles; 2014. NewsKDD: Data Science for News Publishing.
25. Ritter A, Clark S, Mausam, Etzioni O. Named Entity Recognition in Tweets: An Experimental Study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing; 2011.
26. Gelernter J, Balaji S. An algorithm for local geoparsing of microtext. GeoInformatica. 2013;17(4):635–667. doi:10.1007/s10707-012-0173-8.
27. Malmasi S, Dras M. Location Mention Detection in Tweets and Microblogs. In: Hasida K, Purwarianti A, editors. Computational Linguistics. Singapore: Springer; 2016. p. 123–134.
28. Middleton SE, Middleton L, Modafferi S. Real-Time Crisis Mapping of Natural Disasters Using Social Media. IEEE Intelligent Systems. 2014;29(2):9–17. doi:10.1109/MIS.2013.126.
29. Zeng Q, Huang H, Pei X, Wong S. Modeling nonlinear relationship between crash frequency by severity and contributing factors by neural networks. Analytic Methods in Accident Research. 2016;10:12–25.
30. Zeng Q, Huang H, Pei X, Wong S, Gao M.
Rule extraction from an optimized neural network for traffic crash frequency modeling. Accident Analysis & Prevention. 2016;97:87–95. 31. Wahab L, Jiang H. A comparative study on machine learning based algorithms for prediction of motorcycle crash severity. PLOS ONE. 2019;14(4):1–17. doi:10.1371/journal.pone.0214966. 32. Salas A, Georgakis P, Petalas Y. Incident detection using data from social media. In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC); 2017. p. 751–755. 33. Mai E, Hranac R. Twitter Interactions as a Data Source for Transportation Incidents. In: Transportation Research Board 2013 Annual Meeting; 2013. 34. Sloan L, Morgan J. Who tweets with their location? Understanding the relationship between demographic characteristics and the use of geoservices and geotagging on Twitter. PloS one. 2015;10(11):e0142209. 35. Gatica-Perez D, Santani D, Isaac-Biel J, Phan TT. Social Multimedia, Diversity, and Global South Cities: A Double Blind Side. In: Proceedings of the 1st International Workshop on Fairness, Accountability, and Transparency in MultiMedia. ACM; 2019. p. 4–10. 13 36. Meier P. Digital humanitarians: How big data is changing the face of humanitarian response. Routledge; 2015. 37. Dhavase N, Bagade AM. Location identification for crime disaster events by geoparsing Twitter. In: International Conference for Convergence for Technology-2014; 2014. p. 1–3. 38. Aggarwal CC, Zhai CX. Mining Text Data. Boston, MA: Springer; 2012. 39. Yin J, Karimi S, Lampert A, Cameron MA, Robinson B, Power R. Using Social Media to Enhance Emergency Situation Awareness: Extended Abstract. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI); 2015. 40. Al-Olimat H, Thirunarayan K, Shalin V, Sheth A. Location Name Extraction from Targeted Text Streams using Gazetteer-based Statistical Language Models. In: Proceedings of the 27th International Conference on Computational Linguistics; 2018. 
41. Premamayudu B, Subbarao P, Koduganti VR. Identification of Natural Disaster Affected Area Precise Location Based on Tweets. International Journal of Innovative Technology and Exploring Engineering. 2019;8(6). 42. Sangameswar MV, Nagabhushana Rao M, Satyanarayana S. An algorithm for identification of natural disaster affected area. Journal of Big Data. 2017;4(39). 43. de Bruijn JA, de Moel H, Jongman B, de Ruiter MC, Wagemaker J, Aerts J. A global database of historic and real-time flood events based on social media. Scientific Data. 2019;6(311). 44. Ristea A, Boni MA, Resch B, Gerber MS, Leitner M. Spatial crime distribution and prediction for sporting events using social media. International Journal of Geographical Information Science. 2020;0(0):1–32. doi:10.1080/13658816.2020.1719495. 45. Gerber MS. Predicting crime using Twitter and kernel density estimation. Decision Support Systems. 2014;61:115 – 125. doi:https://doi.org/10.1016/j.dss.2014.02.003. e-Mauroux P. CrimeTelescope: crime hotspot prediction 46. Yang D, Heaney T, Tonon A, Wang L, Cudr´ based on urban and social media data fusion. World Wide Web. 2018;21(5):1323–1347. 47. Karimzadeh M, Pezanowski S, MacEachren AM, Wallgr[U+FFFD]n JO. GeoTxt: A scalable geoparsing system for unstructured text geolocation. Transactions in GIS. 2019;23(1):118–136. doi:10.1111/tgis.12510. 48. Austroads. Guide to roadsafety part 8: Treatment of crash locations; 2015. 14 Applying Machine Learning and Geolocation Techniques to Social Media Data (Twitter) to Develop a Resource for Urban Planning Supplementary Information In recent years, social media, and especially Twitter, has emerged as a source of real-time in- formation. Therefore, it is natural that in a context where there is a dearth of official data on a topic such as road safety, we turn to crowdsourcing through Twitter to produce more compre- hensive information on road traffic crashes. 
As people move around cities plagued by congestion, they have started to rely on social media and citizen reporting to avoid major traffic jams and shorten their commutes. Given the relationship between road traffic crashes (RTCs) and congestion, platforms that crowdsource and broadcast traffic updates have the additional benefit of often reporting on RTCs. This makes it possible to use crowdsourced data to identify when and where crashes are occurring, which can supplement and improve on existing official statistics.

While only around 1% of tweets contain geo-metadata, a growing literature has developed geoparsers, or algorithms that extract location names from text. Tweets present a unique challenge to geoparsers. State-of-the-art geoparsers, such as OpenCalais and Stanford Named Entity Recognition, rely on grammar rules to identify location mentions; however, tweets often do not follow grammatical capitalization rules and use clipped, ungrammatical sentences [1, 2]. New algorithms have been developed to geoparse tweets. One such algorithm accounts for tweets that contain place references that are abbreviated, misspelled or highly localized [2]. Others develop gazetteers (location dictionaries) from sources such as OpenStreetMap and GeoNames and search tweets for names in the gazetteer, employing different approaches to account for misspellings or for tweets that use shorter names than those in the gazetteer [3, 4, 5]. Reference [5] provides a review of existing approaches.

Here we provide more information on the specific data that we used, how they were processed and the algorithms that were developed. These processes can then be implemented in other contexts where crowdsourced data on RTCs are available.

Twitter Data

We generate crash data from social media.
The main source of crowdsourced data comes from Ma3Route, a mobile/web/SMS platform that crowdsources transport data and provides users with information on traffic, matatu (informal bus) directions, driving reports and crashes for Kenya. As of early 2019, Ma3Route had 1.1 million followers on Twitter and around 400,000 subscribed users on their app. When users post a traffic report on the app, Ma3Route displays the report on their app and posts the report to Twitter. We scraped all tweets posted by Ma3Route from May 2012, when the Twitter feed was started, onward. Figure S1 shows the number of tweets across time (footnote 1). The full dataset of tweets that we use consists of 874,588 tweets scraped between May 2012 and July 2020. See Table S1 for examples of tweets.

Figure S1: Ma3Route Tweet Trends

Table S1: Example tweets from Ma3Route
 1  accident on waiyaki way just before deloitte
 2  accident just before roysambu footbridge on thika road inboud tailback almost at githurai
 3  accident at junction of ole dume and arwing kodhek
 4  accident at tajmall towards the roundabout heavy traffic from doni
 5  there is an accident at the pangani underpass heading to either muranga road or forest road involving two personal cars and a matatu mini bus this is causing a bit of snurl up cc
 6  jogoo road traffic small accident just before donholm
 7  a heavy truck has rolled at karai naivasha loaded with what seems to be bags of maize such trucks are supposed to use mai mahiu route how did it end up there
 8  prepare for snurl up jogoo road just a minor incident apo hamza
 9  bad accident involving 6 matatus and a lorry on thika road near till station
10  an accident has occurred kenyatta road involving a lorry that has overturned and several vehicles
User mentions have been removed.

Footnote 1: Ma3Route was most popular in 2015, receiving 700-1,000 traffic reports a day, but has since declined in popularity, receiving an average of around 300 traffic reports daily.

We explored additional Twitter handles that focus on traffic and road safety in Kenya, such as RoadAlertsKE, KenyanTraffic and ThikaTowntoday. The majority of tweets from these other handles are already tweeted out by Ma3Route; therefore, including them does not add many new tweets to the dataset. An additional source of data is tweets that mention Ma3Route but are not necessarily posted by Ma3Route. While these tweets are not included in the current analysis, they can easily be incorporated to expand the data set and generate additional crash reports. We have already done this for the data set of crashes that we are producing for the Government of Kenya.

Building a Truth Data set of Tweets

We build a truth data set of Ma3Route tweets in which each tweet is labeled as to whether it refers to a specific traffic crash and, if it does, is geocoded. We code all potentially crash-related tweets from July 2017 to July 2018. We define a tweet as potentially crash-related if one of the following words appears in the tweet:

accident, accidents, ajali, axident, collision, crash, crashes, crashs, crush, crushed, damage, disaster, emergency, fatal, fatality, fender bender, fender-bender, hazard, hit, hit-and-run, incident, incidents, injuries, injury, magari zmegongana, mishap, overturn, overturned, ovrturn, ovrturned, pileup, rammed, read end, rear ended, roll, rolled, smash, smashed, wreck, wreckage, zilicrash, zimecrash

To account for misspellings of select words, we also include tweets that contain a word with a Levenshtein distance of two or less to “accident” or “incident”, or a Levenshtein distance of one to “crash” or “crashed”.

Six coders were trained to process the 9,480 tweets defined as potentially crash related.
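
The keyword and edit-distance screen that defines this candidate set can be sketched as follows. This is a minimal illustration, not the authors' code: the keyword list and distance thresholds are taken from the text, while whole-word matching, lowercasing and the function names (`levenshtein`, `is_potentially_crash_related`) are assumptions made for the example.

```python
import re

# Keywords copied from the list in the text.
CRASH_KEYWORDS = [
    "accident", "accidents", "ajali", "axident", "collision", "crash",
    "crashes", "crashs", "crush", "crushed", "damage", "disaster",
    "emergency", "fatal", "fatality", "fender bender", "fender-bender",
    "hazard", "hit", "hit-and-run", "incident", "incidents", "injuries",
    "injury", "magari zmegongana", "mishap", "overturn", "overturned",
    "ovrturn", "ovrturned", "pileup", "rammed", "read end", "rear ended",
    "roll", "rolled", "smash", "smashed", "wreck", "wreckage", "zilicrash",
    "zimecrash",
]
# Assumption: keywords are matched as whole words or phrases.
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CRASH_KEYWORDS) + r")\b"
)

# Maximum Levenshtein distance allowed for each target word (from the text).
MAX_EDIT_DISTANCE = {"accident": 2, "incident": 2, "crash": 1, "crashed": 1}


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_potentially_crash_related(tweet: str) -> bool:
    """True if the tweet has a crash keyword or a near-misspelling of one."""
    text = tweet.lower()
    if KEYWORD_PATTERN.search(text):
        return True
    for token in re.findall(r"[a-z]+", text):
        for target, limit in MAX_EDIT_DISTANCE.items():
            if levenshtein(token, target) <= limit:
                return True
    return False
```

For example, the first tweet in Table S1 passes through the keyword match, while a hypothetical misspelling such as "acident at pangani underpass" passes through the edit-distance rule.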
Coders were instructed to label a tweet as reporting a crash if the tweet referred to one or more specific crashes; general comments about crashes were labeled as not reporting a crash. If coders labeled a tweet as reporting a crash, they were instructed to geocode the location of the crash from the tweet text where possible, to record the street names and landmarks used to geocode the crash, and to provide the approximate coordinates of the crash. Each tweet was labeled and geocoded by two coders; differences were resolved by one of the authors. (We consider geocodes different if they were more than 100 m apart.) Of the 9,480 tweets, 6,602 (69%) reported a crash and of these, 4,192 (63%) identified an approximate location of the crash.

Augmenting a Gazetteer

The primary goal of the algorithm to augment the gazetteer is to generate alternate names of landmarks that users may use instead of the original name in the gazetteer. Alternate names are generated in three steps: (1) splitting landmark names at certain punctuation (e.g., slashes), (2) creating n-grams and skip-grams of landmark names and (3) in select cases, removing the landmark type from the end of the name (e.g., removing ‘restaurant’ from ‘McDonald’s restaurant’). The algorithm also removes landmark names that are common words and may often be used without referring to a landmark. In addition, the algorithm removes landmarks that do not refer to a specific location, such as roads.

Algorithm: Augment gazetteer

Input: Landmark gazetteer, where each landmark entry includes (1) name, (2) types and (3) coordinates
Output: Augmented landmark gazetteer

A. Split landmarks at select punctuation
   1. If a landmark name has a slash, open parenthesis, dash or comma, add landmarks to the gazetteer that separate the name at that character.

B. Clean landmark names
   1. Make everything lowercase and keep only alphanumeric characters (e.g., remove punctuation).

C. Remove certain landmarks
   1. Remove landmarks that are just one character in length.
   2. Remove landmarks that have certain types (e.g., where the type indicates that the landmark actually represents a large area). We remove landmarks with the type route, road, political, locality or neighborhood, except if the landmark also contains “flyover” or “roundabout” in the name (note 1).
   Note 1: We treat flyovers and roundabouts as landmarks, even though they are roads, as they represent a unique location.

D. Create n-grams and skip-grams (note 2)
   1. Generate 2-3 n-grams and add them to the gazetteer.
   2. Generate 2-3 skip-grams, skipping 1-4 words, restricted so that the first and last words match those of the original name, and add them to the gazetteer (note 3).
   Note 2: Other geoparsers, such as LNEx, only add an n-gram or skip-gram if the name does not already exist in the gazetteer. Our algorithm differs: we add all n-grams and skip-grams. However, in the algorithm to locate events, we prefer locations where the associated landmark name was not derived from an n-gram or skip-gram. We still consider the derived version, because the non-derived landmark location may be removed from consideration if it is not near a mentioned road.
   Note 3: For example, from the original landmark ‘Prestige Plaza Shopping Mall’, this generates ‘Prestige Mall’, ‘Prestige Plaza Mall’ and ‘Prestige Shopping Mall’.

E. Create parallel landmarks
   1. If a landmark name begins or ends with a certain word or phrase, remove the word or phrase:
      (a) If it begins with a stopword or preposition, create a parallel landmark with the word removed.
      (b) If it ends with bar, shops, restaurant, hotel, stage, bus stop or bus station, create a parallel landmark with the word removed.
   2. If a landmark name contains a certain word or phrase, swap it with another:
      (a) Make stage, bus stop and bus station interchangeable. For example, from “X stage”, create “X bus stop” and “X bus station”.
   3. Add different spellings of words:
      (a) British/American spellings (e.g., center vs centre, theater vs theatre)
      (b) Common shorter, longer or different forms (train vs railway, rail vs railway)
   4. Add types:
      (a) If a landmark name ends with stage, bus stop or bus station, add “stage” as a type (we do this because we give preference to certain types).
   5. Remove parallel landmarks that are only 1-2 characters long, and add the rest to the gazetteer.

F. Remove landmarks
   1. If the name contains a stop word and is two words or fewer, remove it.
   2. By words the name contains, begins with or ends with:
      (a) If the name contains road or rd, remove it.
      (b) If the name begins with a stop word or preposition, remove it.
      (c) If the name ends with a road word (street, st, avenue, ave), remove it.
   3. Remove common English words:
      (a) Remove one-word landmarks that are also English words (spelled correctly according to an English spell checker) (note 4) but are not nouns (note 5) and are not categorized as a bus/transit station (note 6).
   Note 4: We use Hunspell, a commonly used spellchecker.
   Note 5: We use spaCy, an open source natural language processing library, to determine the part of speech of each landmark.
   Note 6: We keep bus/transit stations as users often reference matatu stages when describing crash locations.

Tweet Classification - Identifying relevant crowdsourced reports

We first developed an algorithm to identify whether a tweet is crash related or not, using the truth data set to train the algorithm. We extract features from tweets in the form of n-grams. We employ a grid search, tuning the models by testing all combinations of multiple parameters.
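
Looking back at the gazetteer augmentation algorithm, steps B, D and E.1(b) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, and "skip 1-4" is interpreted here as the total number of words dropped between the first and last word, an assumption chosen because it reproduces the ‘Prestige Plaza Shopping Mall’ example given for step D.

```python
import re
from itertools import combinations

# Type words removed from the end of a name (step E.1(b) in the text).
TRAILING_TYPES = ("bus station", "bus stop", "bar", "shops",
                  "restaurant", "hotel", "stage")


def clean(name):
    """Step B: lowercase and keep only alphanumeric characters and spaces."""
    kept = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(kept.split())


def ngrams(words, sizes=(2, 3)):
    """Step D.1: contiguous 2- and 3-word sequences."""
    return [" ".join(words[i:i + n])
            for n in sizes
            for i in range(len(words) - n + 1)]


def skipgrams(words, sizes=(2, 3), min_skip=1, max_skip=4):
    """Step D.2: keep the first and last word, drop some words in between.

    Assumption: min_skip/max_skip bound the total number of dropped words.
    """
    if len(words) < 3:
        return []
    first, middle, last = words[0], words[1:-1], words[-1]
    out = []
    for n in sizes:
        for kept in combinations(range(len(middle)), n - 2):
            skipped = len(middle) - len(kept)
            if min_skip <= skipped <= max_skip:
                out.append(" ".join([first] + [middle[k] for k in kept] + [last]))
    return out


def strip_trailing_type(name):
    """Step E.1(b): parallel name without a trailing type word, else None."""
    for type_word in TRAILING_TYPES:
        if name.endswith(" " + type_word):
            return name[: -len(type_word) - 1]
    return None


words = clean("Prestige Plaza Shopping Mall").split()
print(skipgrams(words))
# ['prestige mall', 'prestige plaza mall', 'prestige shopping mall']
```

Each derived name would be added to the gazetteer with the coordinates of the original entry, flagged as derived so that the location algorithm can prefer original names.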
The three main parameters we test are: (1) extracting 1-grams, 1-2 grams or 1-3 grams; (2) not removing any features, or removing features that occur in less than/more than 0.01%/99.9%, 1%/99% or 5%/95% of tweets; and (3) defining features as the number of occurrences of the n-gram in the tweet or as the Term Frequency - Inverse Document Frequency (TF-IDF) of the n-gram (footnote 2). For the Support Vector Machine (SVM), we also vary the regularization parameter, which controls how the algorithm weighs misclassification versus simplicity, using 0.5, 1, 2, 10, 100 and 1000.

Footnote 2: TF-IDF reflects how important a word or n-gram is to a tweet within the full set of tweets; for example, words such as ‘a’ or ‘the’ that appear frequently will be given less weight. It is calculated as

$$\text{TF-IDF} = \log\left(\frac{N_{\text{tweets}}}{N_{\text{tweets with n-gram}}}\right) \times \frac{N_{\text{times n-gram appears in a tweet}}}{N_{\text{n-grams in a tweet}}}$$

Table S2: Example Tweet and Augmented Tweet
Original tweet:   accident past garden city near thika rd and kamiti rd junction
Augmented tweet:  accident past #landmark-name# near #road-name# and #road-name# junction

An additional parameter we test is whether to use the original tweet text or, following [6], to replace landmark and road names with generalized names (just indicating the presence of a landmark or road). Generalizing landmark and road names helps to reduce the dimensionality of the feature space. Table S2 demonstrates how a particular tweet is transformed into one with general landmark and road names. Table S3 shows examples of the features extracted from regular and augmented tweets where landmarks and roads have been replaced. This augmentation assumes that the occurrence of any road or landmark name contributes equally to the probability that a tweet is crash related.

Table S3: Features of Tweets
N-gram                  Using Original Tweet   Using Augmented Tweet
accident                1                      1
past                    1                      1
garden                  1                      0
city                    1                      0
near                    1                      1
thika                   1                      0
rd                      2                      0
and                     1                      1
kamiti                  1                      0
junction                1                      1
accident past           1                      1
past garden             1                      0
garden city             1                      0
city near               1                      0
near thika              1                      0
thika rd                1                      0
rd and                  1                      0
and kamiti              1                      0
kamiti rd               1                      0
rd junction             1                      0
#landmark-name#         0                      1
#road-name#             0                      2
past #landmark-name#    0                      1
near #road-name#        0                      1
and #road-name#         0                      1
Features defined using the number of occurrences of the n-gram in the tweet.

We test two methods for determining whether a tweet reports a crash: Naive Bayes and support vector machines. Both techniques are commonly used in text classification for their ability to handle high dimensionality, e.g. when the number of features is greater than the number of observations [7, 8]. The Naive Bayes model is estimated as:

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) \qquad (1)$$

where $y$ is whether the tweet is classified as crash related or not and $x_i$ are all the n-grams that occur in a tweet. The linear SVM solves the minimization problem:

$$\min \; C \sum_{i}^{N} \left(1 - y_i f(x_i)\right)^2 + \|w\|^2 \qquad (2)$$

where $C$ is a regularization parameter and $\|w\|^2$ is a penalty function. Here, $y$ equals 1 when the tweet references a crash and -1 when it does not. We use a squared hinge loss function (L2).

We implement k-fold cross-validation on 4 folds, training the model on 75% of the truth data and testing on 25% of the data within each fold. Table S4 shows results for select parameters. While the Naive Bayes algorithm performs slightly better on precision, the SVM has higher recall and generally performs better when 2- and 3-grams are included. Overall, the F1 statistic, which provides a balance between precision and recall, is best for the SVM at 0.95 using 2- and 3-grams.
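
The decision rule in Equation (1) can be sketched as a small multinomial Naive Bayes classifier over n-gram counts. This is a minimal illustration under stated assumptions: add-one (Laplace) smoothing is not specified in the text and is assumed here, the class and function names are hypothetical, and the six training tweets are a toy sample (three from Table S1, three invented non-crash reports), not the truth data set.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(text, max_n=2):
    """Count all 1..max_n word n-grams in a whitespace-tokenized tweet."""
    words = text.split()
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return grams


class NaiveBayes:
    def fit(self, texts, labels, max_n=2):
        self.max_n = max_n
        self.class_counts = Counter(labels)          # used for P(y)
        self.feature_counts = defaultdict(Counter)   # used for P(x_i | y)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = ngram_counts(text, max_n)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = ngram_counts(text, self.max_n)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Work in logs: log P(y) + sum_i log P(x_i | y).
            score = math.log(n_docs / total_docs)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for gram, count in grams.items():
                if gram in self.vocab:  # n-grams never seen in training are ignored
                    # Assumption: add-one smoothing avoids zero probabilities.
                    p = (self.feature_counts[label][gram] + 1) / denom
                    score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training sample (the "other" tweets are hypothetical).
texts = [
    "accident on waiyaki way just before deloitte",
    "accident at junction of ole dume and arwing kodhek",
    "bad accident involving 6 matatus and a lorry on thika road near till station",
    "heavy traffic on jogoo road towards town",
    "traffic is flowing on thika road",
    "slow traffic on waiyaki way this morning",
]
labels = ["crash", "crash", "crash", "other", "other", "other"]
model = NaiveBayes().fit(texts, labels)
```

With this toy sample, `model.predict("accident on thika road near garden city")` returns "crash" and `model.predict("slow traffic on jogoo road")` returns "other". In practice the same features would be evaluated with the 4-fold cross-validation described above.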
Given that the overarching goal is to produce a data set of geolocated crashes based on the tweets, better recall is more important than higher precision. Even if a larger set of tweets is misclassified as crash related, these general tweets are unlikely to be geolocated at the second stage, since they do not discuss a particular crash at a given location. We therefore want to capture as many of the tweets reporting crashes as possible at this stage, even if it means capturing slightly more tweets that are not reporting a crash. The SVM algorithm also has a very high accuracy of 0.93.

Table S4: Tweet Classification Results
Model         Precision   Recall   F1      Accuracy   N-Grams
Naive Bayes   0.938       0.947    0.942   0.919      1
Naive Bayes   0.945       0.949    0.947   0.926      2
Naive Bayes   0.945       0.949    0.947   0.926      3
SVM           0.935       0.963    0.948   0.927      1
SVM           0.94        0.966    0.953   0.934      2
SVM           0.939       0.967    0.953   0.934      3
The table shows best results for both SVM and Naive Bayes. For these results, both models use the original tweet and no features are removed. The Naive Bayes models do not use TF-IDF, while the SVM models do.

Preparation for Geolocation

Before the geolocation algorithm can be used, two additional pieces need to be prepared. The first is identifying the types of landmarks that are more commonly mentioned as the location of a crash in a tweet. When multiple landmarks share the same name, the landmark more likely to be a crash location should be chosen. The second is identifying the correct location when multiple locations are mentioned in the tweet. We can use the typical grammatical structure of a tweet to identify prepositions that tend to precede the correct crash location, as opposed to prepositions more likely to be used with locations that are not close to the crash. Ranking prepositions based on these probabilities makes it possible to choose the correct location from the possible locations mentioned.

Determining Landmark Types More Commonly Used as the Crash Location

When a landmark name is mapped to multiple locations, the algorithm gives preference to certain landmark types. To determine which types to prefer, we examine which landmark types are more commonly associated with the correct location. We consider cases where (1) one landmark is used to identify the crash location and (2) the landmark name is mapped to locations both near and far from the crash location. We compute the proportion of times a type is near and far from a crash location and divide the proportion near by the proportion far to gauge how much more likely a location of that type is to be near the crash location than far from it. Figure S2 shows results. Among tweets considered, a landmark location that is a bus stop is near the correct location 17% of the time and is far from the correct location less than 1% of the time, so a bus stop is close to the correct location 22 times more frequently than far from it. In the algorithm, we use the top 6 landmark types (all being 2.5 or more times as likely to be near the correct location as far from it) to prioritize landmarks: bus stop, parking, mall, cafe, transit station and bus station.

Figure S2: Landmark types typically near or far from the crash location when a landmark name is mapped to multiple locations

Determining Preposition Phrase Tiers

The truth dataset indicates the landmark used to geocode the crash. We examine the phrases that precede the landmark. Figure S3 shows the top phrases. The phrase “at” precedes the correct landmark in 42% of tweets, and in roughly half of these cases “accident at” precedes the landmark. We use these phrases to guide decision making when more than one landmark is mentioned. For this, we take all phrases that precede the correct landmark at least 20 times.
We then identify cases where two of these phrases appear in a tweet and one of the phrases precedes the correct landmark; we then calculate the proportion of times each phrase precedes the correct landmark when the other phrase is also in the tweet. Figure S4 shows results. While ‘at’ is the most common word that precedes a landmark, other phrases are more predictive of the correct landmark. For example, when both ‘at’ and ‘near’ appear in the tweet (and one of them precedes the correct landmark), the correct landmark is preceded by ‘at’ only 6% of the time. We use information from these phrase pairings to divide phrases into “tiers”; if two landmarks are found in a tweet, the landmark preceded by the phrase from the lower-numbered tier is used (Tier 1 has the highest priority). We develop 6 tiers:

1. Tier 1: Across phrase-pairs, these phrases precede the correct landmark more often than the other phrase in all cases (for example, when ‘just after’ and phrases such as ‘at’, ‘on’ or ‘in’ are also in the tweet, ‘just after’ precedes the correct landmark more often than all other phrases).
2. Tier 2: These phrases precede the correct landmark more often than the other phrase in over 90% of cases (but less than 100%).
3. Tier 3: Across phrase-pairs where one of the phrases is “at”, these phrases precede the correct landmark more often than “at”.
4. Tier 4: The phrase “at”.
5. Tier 5: Remaining phrases where, across phrase-pairs, the phrase precedes the correct landmark more often than over half of the other phrases.
6. Tier 6: Remaining phrases where, across phrase-pairs, the phrase precedes the correct landmark more often than at least one other phrase.

We modify this list to account for different spellings of certain phrases (e.g., adding “btw” with “between”), and whenever a phrase has the form “accident [word]”, we generalize it to “[crash word] [word]”, where [crash word] is any word such as accident, crash, hit, wreck, etc. This gives the following phrase tiers:

1. Tier 1: [crash word] after, [crash word] near, [crash word] outside, [crash word] past, around, hapo, just after, just before, just past, near, next to, opposite, outside, past, you approach, apa, apo, hapa, right after, right before, right past, just before you reach
2. Tier 2: [crash word] at, before
3. Tier 3: after
4. Tier 4: at, happened at, at the, pale
5. Tier 5: between, from, btw, btwn
6. Tier 6: along, approach, in, on, opp, to, towards, toward

Figure S3: Top words that precede the landmark that correctly identifies the crash location.

Figure S4: Likelihood of different words preceding the correct landmark

Locating Crash Events

As the example tweets from @Ma3Route in Table S1 demonstrate, the geoparser has to handle different tweets in different ways. For example, tweet 1 is simple, including the name of one road and one landmark. Tweet 3 is short and clear as well; however, it identifies the crash location by a junction instead of a landmark. Tweet 8 uses the Swahili word “apo”, which commonly precedes a landmark name. Tweet 2 includes both the location of the crash and the location where traffic starts. This section outlines in detail the different components of the geolocation algorithm, which are meant to handle these different situations.

The algorithm to locate an event from text starts by cleaning the text and extracting location names of landmarks, roads and areas (e.g., neighborhoods) from the text. Next, the algorithm restricts the location names and locations to consider; for example, if two landmark names are found, and one is contained within the other, we only keep the longer one; in addition, where possible, we restrict locations to those near mentioned roads. The algorithm then chooses the location names that reference the event location, prioritizing location names primarily by the words that precede them (e.g., “just after [location]” is used over “toward [location]”).
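
The tier-based prioritization can be sketched as follows. This is a simplified illustration, not the authors' implementation: the tier lists are copied from the text, but the four crash words are only the examples the text names, the function name `select_landmarks` is hypothetical, and the tweet is assumed to be already cleaned (lowercase, single spaces) with candidate landmark names already extracted.

```python
# Example crash words named in the text; the full list is not given there.
CRASH_WORDS = ["accident", "crash", "hit", "wreck"]


def _expand(phrases):
    """Replace the '[crash word]' placeholder with each crash word."""
    out = []
    prefix = "[crash word] "
    for phrase in phrases:
        if phrase.startswith(prefix):
            out.extend(f"{w} {phrase[len(prefix):]}" for w in CRASH_WORDS)
        else:
            out.append(phrase)
    return out


PHRASE_TIERS = [_expand(tier) for tier in [
    ["[crash word] after", "[crash word] near", "[crash word] outside",
     "[crash word] past", "around", "hapo", "just after", "just before",
     "just past", "near", "next to", "opposite", "outside", "past",
     "you approach", "apa", "apo", "hapa", "right after", "right before",
     "right past", "just before you reach"],
    ["[crash word] at", "before"],
    ["after"],
    ["at", "happened at", "at the", "pale"],
    ["between", "from", "btw", "btwn"],
    ["along", "approach", "in", "on", "opp", "to", "towards", "toward"],
]]


def select_landmarks(text, landmark_names):
    """Return (tier, names) for the best tier whose phrase precedes a name.

    Tiers are checked in order, so a landmark after 'just before' (Tier 1)
    wins over a landmark after 'at' (Tier 4). Returns (None, []) when no
    tier phrase directly precedes any candidate name.
    """
    padded = f" {text} "
    for tier_number, phrases in enumerate(PHRASE_TIERS, start=1):
        hits = [name for name in landmark_names
                if any(f" {p} {name} " in padded for p in phrases)]
        if hits:
            return tier_number, hits
    return None, []


tweet = ("accident just before roysambu footbridge on thika road "
         "inboud tailback almost at githurai")
print(select_landmarks(tweet, ["roysambu footbridge", "githurai"]))
# (1, ['roysambu footbridge'])
```

In tweet 2 of Table S1, "githurai" follows the Tier 4 phrase "at", but "roysambu footbridge" follows the Tier 1 phrase "just before" and is therefore selected. When no phrase matches, the full algorithm falls back on further checks, such as proximity to an event word.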
If the chosen location is not near a mentioned road, we search for landmarks that have a similar name but are near a mentioned road. Next, we snap the location to the road network. Finally, the algorithm implements select checks to determine whether no location should be output; for example, if a road is mentioned but the chosen location is not within 500 m of any mentioned road, the algorithm does not output a location. The algorithm is described in detail below.

Algorithm: Locate crash/event locations

Input:
   Text
   Landmark gazetteer
   Roads
   Areas (e.g., neighborhoods)
   List of event words (e.g., crash, accident, wreck, etc.)
   Prepositions, grouped by tier
   Types, grouped by tier
Output: Coordinates of event

A. Clean tweets
   1. Replace @ with “at” only when it is not preceded by “via” and is not the last word in the tweet (note 7).
   2. Remove select stopwords (note 8).
   3. Mask common phrases that contain a location but refer to something else, such as “[city] bus” (note 9).
   4. Remove hyperlinks and keep only alphanumeric characters (e.g., remove punctuation).
   Note 7: We found that @[word] often referred to a Twitter handle when preceded by “via” or when it was the last word in a tweet; otherwise, users were more likely to use “@” as a shorthand for “at”. Distinguishing these cases is important as we rely on prepositions to prioritize landmark references.
   Note 8: We only remove “a” and “the”; other stopwords may be part of a landmark name (e.g., the stopword “and” appears in the restaurant “nice and lovely”). We remove these stopwords as we later determine whether a preposition precedes a landmark, and we consider [preposition] [landmark name] to be equivalent to [preposition] [stopword] [landmark name].
   Note 9: In Nairobi, we found that matatus (minibuses) were often referred to by the location they traveled to; consequently, we mask phrases such as “githurai bus”, “rongai matatu”, “machakos minibus”, etc. In masking, we replace each word in the phrase with a random sequence of characters. Doing this preserves the fact that a word appears at that position in the tweet, which may affect procedures such as determining the landmark closest to an event word.

B. Extract locations
   1. Extract exact matches of landmarks, roads and areas.
   2. Extract fuzzy matches of landmarks, roads and areas:
      (a) Break tweets into 1-3 grams.
      (b) For each n-gram, check the Levenshtein distance to gazetteer entries. If the word or phrase is 0-4 characters long, ignore it; if 5-10, allow a Levenshtein distance of 1; if above 10, allow a Levenshtein distance of 2.
   3. Extract landmarks after prepositions. For each preposition in the tweet (note 10):
      (a) Take the word after the preposition and extract all landmarks that start with that word.
      (b) Go to the next word in the tweet and further restrict landmarks to those that contain that word. Repeat until doing so would remove all landmarks considered (note 11).
      (c) Among extracted landmarks, determine which landmark has the smallest number of words and only keep landmarks with that number of words (note 12).
   Note 10: This procedure will often capture the same landmarks as the preceding steps; however, it helps to capture other landmarks where the process for augmenting the gazetteer did not generate the landmark name contained in the tweet.
   Note 11: For example, in the tweet “accident at garden city toward town”, the algorithm searches for landmarks after ‘at’. It first finds all landmarks that contain ‘garden’, then narrows these down to those with both ‘garden’ and ‘city’. No landmark contains ‘garden’, ‘city’ and ‘toward’, so the algorithm stops and considers landmarks with ‘garden’ and ‘city’.
   Note 12: For example, if ‘garden city’, ‘garden mall’, ‘garden city mall’ and ‘airtel money agent rock city gardens’ were extracted, the algorithm keeps ‘garden city’ and ‘garden mall’.

C. Extract point locations from roads
   1. For each road found, check if the length of the diagonal of the bounding box is less than 500 m; if it is, take the centroid and consider this location to be a landmark (note 13).
   2. If two or more roads are mentioned, find intersections between each road pair. If two roads intersect at multiple locations, only add the intersection if these locations are within 1 km.
   Note 13: These cases are often flyovers and roundabouts.

D. Restrict landmarks to consider
   1. If the names of a landmark and a road overlap, keep the road and remove the landmark (if a landmark and an area overlap, we keep both).
   2. If the names of an exact and a fuzzy (misspelled) landmark overlap, keep the exact landmark.
   3. If a landmark name is contained within another, keep the longer name.

E. Remove landmarks
   1. By roads, areas and tier 1 landmarks:
      (a) If a road is mentioned, for each landmark name check if any landmarks with that name are near (within 500 m of) the road. If so, restrict the landmarks in the gazetteer to those that are near the road. If no landmarks are near the road, do not subset and keep the landmark name (note 14).
      (b) If an area is mentioned (e.g., a neighborhood), follow the same steps as above for each landmark.
      (c) If a landmark is mentioned after a tier 1 preposition (e.g., “next to”, “just after”), follow the same steps as above for each other landmark, checking the distance from the other landmarks to the locations of the landmark after the tier 1 preposition (note 15).
   2. Dominant cluster and “general” landmarks:
      (a) For each landmark name, check if the locations form a dominant cluster.
          i. If they do:
             A. Keep the landmarks in the cluster and remove the others.
          ii. If they don’t:
             A. Keep landmarks of commonly referenced types (e.g., matatu stages); if a landmark does not contain a common type, don’t subset. For this we use the analysis described earlier on determining landmark types more commonly used as crash locations.
             B. Re-check which landmarks don’t form a cluster; among these, keep landmarks if the name of the landmark was not derived from an n-gram or skip-gram (i.e., matches the original name) (note 16).
      (b) Remove a landmark name if it does not form a cluster, except if the name follows a tier 1 preposition. (If it follows a tier 1 preposition, it is likely the correct landmark name even though the exact location cannot be found; if it does not follow a tier 1 preposition, it is more likely to be a spurious landmark.)
   Note 14: We keep the landmark because during a later step we check for similarly named landmarks near the road, and because the extracted road may be incorrect.
   Note 15: This is helpful in case the landmark near a tier 1 preposition doesn’t form a dominant cluster, but a dominant cluster is formed from another landmark mentioned.
   Note 16: For example, if there are 3 landmarks of “garden city”, where the original names were garden city, garden city mall and garden city bank, keep “garden city”; if no name matches the original name, keep all landmarks.

F. Select landmark names or intersections
   1. If multiple location names are found (e.g., multiple landmark names, multiple intersections):
      (a) Loop through preposition tiers. Within each tier, check the following, stopping once a location name has been found.
          i. Check if a landmark name comes after the preposition.
          ii. Check if one of the road names used to construct an intersection comes after the preposition.
      (b) If no location name has been found, loop through the preposition tiers again and check whether [landmark name] [3 or fewer words] [preposition name] occurs; if so, keep the landmark name(s) with the fewest words between name and preposition.
      (c) If one intersection is found (e.g., if 3 or more roads are found and only one pair of roads intersects), use the intersection location.
      (d) Use the landmark closest to an event word (fewest words in between).
   2. If a landmark name was chosen (i.e., not an intersection):
      (a) If multiple landmark names were selected (note 17):
          i. If a road is mentioned, choose landmarks within 500 m of the mentioned road; if none are near the road, don’t subset.
          ii. Choose the landmark closest to the event word (this could still result in multiple landmarks).
      (b) If the landmark name is mapped to multiple locations:
          i. Select locations within 500 m of the mentioned road; if none are near the road, don’t subset.
   Note 17: For example, two landmarks in front of different tier 1 prepositions.

G. [If the landmark location is not near any mentioned road] Broaden search to find similarly named landmarks near the road
   1. Start with all landmarks that are near any mentioned road and subset to those that contain the landmark name. Take the next word in the tweet and subset to landmarks that contain this word. Repeat the process until doing so would cause no landmarks to be found. Among these locations:
      (a) If a dominant cluster exists, use this location.
      (b) If no dominant cluster exists, further subset locations to those where the landmark word in the tweet is at the beginning of the landmarks found. If a dominant cluster is found, use this location.
          i. If no location is found in the previous step, repeat, but check words in the tweet preceding the landmark name.

H. Snap to road
   1. If a road is mentioned, snap the location to the road.
   2. If no road is mentioned, snap to the nearest road if a road is within 500 m.

I. Final checks to determine whether the location should be used
   1. If a road is mentioned and the location chosen is more than 500 m from any mentioned road, no location is output by the algorithm.
   2. If multiple landmarks are mentioned, the landmark closest to the crash word was used (note 18) and that landmark is more than two words away from the crash word, no location is output by the algorithm.
      Note 18: This would happen when no tier 1-6 phrase precedes a landmark.
   3.
If multiple landmarks are mentioned, a tier 5 or 6 phrase precedes the chosen landmark and the landmark is more than two words away from the crash word, no location is outputted by the algorithm S19 Geoparse Tweets - Full Results Table S5 shows full results of the geoparsing algorithm. In particular, the table shows the value added of different data sources to build the landmark gazetteer; we run the algorithm using the augmented gazetteer generated from Geonames, Google and OpenStreetMap separately. Results highlight that the algorithm mainly relies on landmarks scraped from Google maps; recall and precision are only slightly worse using Google alone compared to combining all sources. Geonames performs poorly and OpenStreetMap performs better but still worse than Google, achieving about 0.2 and 0.1 worse recall and precision respectively compared to Google when judging whether the algorithm captures the true crash location. Table S5: Tweet Geoparse Results Any Location Captured by Crash Location Determined Algorithm Cluster Algorithm Close to by Algorithm Close to Contains True Crash Location True Crash Location True Crash Loction Recall Precision Recall Precision Recall Precision LNEx LNEx Aug Gaz 0.674 0.686 0.129 0.132 0.175 0.125 Algorithm - by Source Aug Gaz - Geonames 0.124 0.326 0.112 0.455 0.124 0.446 Aug Gaz - Google 0.79 0.853 0.645 0.811 0.653 0.777 Aug Gaz - OSM 0.52 0.691 0.431 0.728 0.446 0.691 Algorithm - All Sources Raw Gaz 0.695 0.757 0.579 0.756 0.591 0.72 Aug Gaz 0.798 0.857 0.651 0.811 0.656 0.774 S20 Choosing Parameters for Clustering Crash Reports into Unique Crashes Multiple people often tweet about the same crash. In order to cluster crash reports to unique crashes, we cluster by the kilometer and time distance between reports. To determine optimal kilometer and time parameters, a team manually determined which crash reports refer to the same crash. The dataset was double coded by different team members, resulting in two “truth” datasets. 
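The idea of clustering reports by distance and time can be illustrated with a minimal Python sketch. The linkage rule here is an assumption, since this section does not spell out how reports are merged: the sketch links any two reports that fall within both thresholds and treats each connected group as one crash. The function names, the (latitude, longitude, time) tuple layout and the sample coordinates are hypothetical, and the thresholds are left as parameters.

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def cluster_reports(reports, max_km, max_hours):
    """Group crash reports into crashes (assumed rule: connected components).

    reports: list of (lat, lon, datetime) tuples (hypothetical layout).
    Returns one integer cluster label per report.
    """
    parent = list(range(len(reports)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            lat1, lon1, t1 = reports[i]
            lat2, lon2, t2 = reports[j]
            hours_apart = abs((t1 - t2).total_seconds()) / 3600
            if (hours_apart <= max_hours
                    and haversine_km(lat1, lon1, lat2, lon2) <= max_km):
                parent[find(i)] = find(j)  # link the two reports

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(reports))]


# Made-up reports for illustration only.
sample_reports = [
    (-1.2921, 36.8219, datetime(2019, 5, 1, 8, 0)),
    (-1.2925, 36.8222, datetime(2019, 5, 1, 8, 40)),  # a few dozen meters away, 40 min later
    (-1.2921, 36.8219, datetime(2019, 5, 1, 18, 0)),  # same spot, 10 hours later
    (-1.2500, 36.9000, datetime(2019, 5, 1, 8, 10)),  # several kilometers away
]
sample_labels = cluster_reports(sample_reports, max_km=0.5, max_hours=4)
```

With thresholds of 0.5 km and 4 hours, the first two sample reports merge into one crash and the other two remain separate. Widening the time threshold to 12 hours would also pull the third report into the first crash, which is the kind of sensitivity the parameter choice has to balance.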
To judge whether crash reports refer to the same crash, the team used the location of the crash and the time of the tweet, and looked for details about the crash in the tweet itself (e.g., extent of injuries, types and numbers of vehicles, etc.). Table S6 shows summary statistics of the maximum distance and time between any two crash reports in the same clustered or individual crash. Before calculating the statistics, outliers were removed (we define an outlier as a crash cluster where reported crashes occurred over 24 hours or over 5 km from each other). Across both truth datasets, around 52% of tweets were clustered with another tweet, meaning that 48% of tweets are the only tweet reporting a given crash.

Table S6: Clustered Tweets Truth Data Summary Statistics

Variable         Min   Quartile 1   Median   Mean    Quartile 3   Max
Truth Dataset 1
  Hours Diff     0     0.133        0.55     1.68    1.693        23.776
  KMs Diff       0     0            0.013    0.213   0.138        3.328
  N Tweets       2     2            2        3.324   3            44
Truth Dataset 2
  Hours Diff     0     0.161        0.597    1.715   2.144        23.417
  KMs Diff       0     0            0.024    0.276   0.199        3.13
  N Tweets       2     2            2        3.275   3            19

We examine two common metrics for evaluating clustering performance: the adjusted Rand index and the Jaccard coefficient [9]. When using our algorithm to cluster crash reports, we test all combinations of 0.1, 0.5, 1, 2 and 3 kilometers and 1, 2, 4, 12 and 24 hours. For truth dataset 1, both the adjusted Rand index and the Jaccard coefficient show that 12 hours and 500 m lead to the best results, while truth dataset 2 shows 2 hours and 500 m (see Figure S5). The difference in results between the truth datasets likely results from the exercise being partially subjective, particularly when limited or no crash details are provided in the tweet text. We opt for using thresholds of 500 m and 4 hours. The 4 hour threshold lies between the optimal values from the two truth datasets.

Figure S5: Cluster Evaluation Results

References

[1] Ritter A, Clark S, Mausam, Etzioni O. Named Entity Recognition in Tweets: An Experimental Study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing; 2011.
[2] Gelernter J, Balaji S. An algorithm for local geoparsing of microtext. GeoInformatica. 2013;17(4):635–667. doi:10.1007/s10707-012-0173-8.
[3] Malmasi S, Dras M. Location Mention Detection in Tweets and Microblogs. In: Hasida K, Purwarianti A, editors. Computational Linguistics. Singapore: Springer; 2016. p. 123–134.
[4] Middleton SE, Middleton L, Modafferi S. Real-Time Crisis Mapping of Natural Disasters Using Social Media. IEEE Intelligent Systems. 2014;29(2):9–17. doi:10.1109/MIS.2013.126.
[5] Al-Olimat HS, Thirunarayan K, Shalin V, Sheth A. Location name extraction from targeted text streams using Gazetteer-based statistical language models. arXiv preprint. 2017;11(17).
[6] Gu Y, Qian ZS, Chen F. From Twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies. 2016;67:321–342. doi:10.1016/j.trc.2016.02.011.
[7] Joachims T. Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec C, Rouveirol C, editors. Machine Learning: ECML-98. Berlin, Heidelberg: Springer Berlin Heidelberg; 1998. p. 137–142.
[8] Aggarwal CC, Zhai CX. Mining Text Data. Boston, MA: Springer; 2012.
[9] Santos JM, Embrechts M. On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification. In: Alippi C, Polycarpou M, Panayiotou C, Ellinas G, editors. Artificial Neural Networks – ICANN 2009. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. p. 175–184.
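As an illustration of the two metrics used to choose the clustering thresholds, the adjusted Rand index and the Jaccard coefficient can both be computed from pair counts: for every pair of crash reports, check whether the truth labeling and the algorithm's labeling each place the pair in the same crash. The Python sketch below uses the standard pair-counting definitions; the function names and the toy label lists are hypothetical and are not taken from the paper's data or code.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count report pairs by agreement between two cluster labelings.

    Returns (both, truth_only, pred_only, neither): the number of pairs that
    share a cluster in both labelings, only in truth, only in pred, or in neither.
    """
    both = truth_only = pred_only = neither = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            both += 1
        elif same_truth:
            truth_only += 1
        elif same_pred:
            pred_only += 1
        else:
            neither += 1
    return both, truth_only, pred_only, neither


def jaccard_coefficient(truth, pred):
    """Pairs co-clustered in both labelings over pairs co-clustered in either."""
    both, truth_only, pred_only, _ = pair_counts(truth, pred)
    denom = both + truth_only + pred_only
    return both / denom if denom else 1.0


def adjusted_rand_index(truth, pred):
    """Rand index corrected for chance agreement (pair-counting form)."""
    both, truth_only, pred_only, neither = pair_counts(truth, pred)
    n_pairs = both + truth_only + pred_only + neither
    if n_pairs == 0:
        return 1.0
    expected = (both + truth_only) * (both + pred_only) / n_pairs
    maximum = ((both + truth_only) + (both + pred_only)) / 2
    if maximum == expected:
        return 1.0
    return (both - expected) / (maximum - expected)


# Toy labels for five reports (made up for illustration).
truth_labels = [0, 0, 1, 1, 2]
pred_labels = [0, 0, 1, 2, 2]
toy_jaccard = jaccard_coefficient(truth_labels, pred_labels)
toy_ari = adjusted_rand_index(truth_labels, pred_labels)
```

In a parameter search like the one described above, each distance and time combination would yield one predicted labeling, and the combination with the highest score against a truth dataset would be preferred.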