Policy Research Working Paper                    9488




 Applying Machine Learning and Geolocation
  Techniques to Social Media Data (Twitter)
  to Develop a Resource for Urban Planning
                               Sveta Milusheva
                                Robert Marty
                              Guadalupe Bedoya
                               Sarah Williams
                               Elizabeth Resor
                              Arianna Legovini




Development Economics
Development Impact Evaluation Group
December 2020


  Abstract
 With all the recent attention focused on big data, it is easy to overlook that basic vital statistics remain difficult to obtain
 in most of the world. This project set out to test whether an openly available dataset (Twitter) could be transformed
 into a resource for urban planning and development. The hypothesis is tested by creating road traffic crash location
 data, which are scarce in most resource-poor environments but essential for addressing the number one cause of mortality
 for children over age five and young adults. The research project scraped 874,588 traffic-related tweets in
 Nairobi, Kenya, applied a machine learning model to capture the occurrence of a crash, and developed an improved
 geoparsing algorithm to identify its location. The project geolocated 32,991 crash reports in Twitter for 2012–20
 and clustered them into 22,872 unique crashes to produce one of the first crash maps for Nairobi. A motorcycle delivery
 service was dispatched in real-time to verify a subset of crashes, showing 92 percent accuracy. Using a spatial
 clustering algorithm, portions of the road network (less than 1 percent) were identified where 50 percent of the
 geolocated crashes occurred. Even with limitations in the representativeness of the data, the results can provide urban
 planners useful information to target road safety improvements where resources are limited.




 This paper is a product of the Development Impact Evaluation Group, Development Economics. It is part of a larger
 effort by the World Bank to provide open access to its research and make a contribution to development policy discussions
 around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The
 authors may be contacted at smilusheva@worldbank.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
  Applying Machine Learning and Geolocation Techniques to Social Media
        Data (Twitter) to Develop a Resource for Urban Planning∗


                                                   Sveta Milusheva†
                                                     Robert Marty†
                                                  Guadalupe Bedoya†
                                                    Sarah Williams‡
                                                   Elizabeth Resor§
                                                   Arianna Legovini†




JEL Classification: R41, R42, O18, C80

Keywords: Big Data, Machine Learning, Road Safety, Urban Mobility, SDGs


    ∗ We thank Robert Tenorio and Amy Dolinger for their field coordination and research support. We are also grateful to
Andrew Muriithi, Purity Kimuru, Rodgers Avuya, Salome Omondi and Pheliciah Mwachofi for their field support. DIME
Analytics provided technical support throughout the analysis with Luiza Andrade and Luis Eduardo San Martin conducting code
review and reproducibility checks. We appreciate comments from anonymous reviewers and participants at the ACM COMPASS
Conference and the Netmob Conference. The research has been funded with UK aid from the UK government through the
ieConnect for Impact program and the World Bank’s Knowledge for Change program.
    † Development Impact Evaluation Department, World Bank, Washington DC
    ‡ School of Architecture and Planning, Massachusetts Institute of Technology, Cambridge MA
    § School of Information, University of California, Berkeley CA




Introduction

The World Bank has declared data deprivation the next deprivation to end, arguing that the lack of data

causes many of the world’s poorest populations to be overlooked when resources are allocated to address

their essential needs [1]. Data deprivation is a pressing challenge with as many as 74% of the global and 97%

of the Sub-Saharan African population living in countries without adequate vital registration [2]; one-third of

countries lack any poverty statistics [1]; and only 17% of the estimated road traffic deaths are reported in

official figures of low-income countries [3]. Without data to inform national and urban policies, the gap

between low- and high-income countries will worsen [4]. However, while official statistics are poor, data in the

hands of private providers are plentiful, populated by the rapid expansion of mobile phones and social media.

Globally, phone penetration reached 67% in 2019 [5], and social media penetration is almost 50% [6]. This

provides an opportunity for using crowdsourced data to study major urban and development policies [7–11].

   In this project we test whether privately maintained data can be transformed into a

resource to better understand development challenges. Private data have been used to characterize

populations from determining poverty to understanding public emotions [12–17]. Here, we use private data

to describe the urban environment that affects those populations, specifically analyzing events reported on

social media that affect people’s safety such as road traffic crashes, crime or floods. We focus on road traffic

crashes (RTCs). Despite being the number one cause of death for children and young adults aged 5-29 years,

the lack of adequate data on RTCs is a recognized and unmet challenge [18]. The objective is to improve

RTC data for urban planners so they can contribute to addressing the high toll of road deaths, estimated

globally at 1.35 million a year [3]. Our case study is Kenya, a country with high road mortality, where the

official figures are said to underestimate the number of fatalities by a factor of 4.5 [3].

   The United Nations’ Sustainable Development Goal (SDG) 3 sets a target to halve road mortality by

2020; progress has been slow, and the target was moved to 2030. The Stockholm Declaration by the Third Global

Ministerial Conference on Road Safety “Achieving Global Goals 2030” reiterated the call for country

investments in road safety, from legislation and regulation, safe urban and transport design, safe modes of

transport and vehicles, to modern technologies for crash prevention, trauma care, and urban management.

However, resource constraints make it unlikely that countries will be able to meet all of these goals. Instead,

countries should strategically invest for the greatest impact. This requires knowing where and when crashes

happen, so that resources can be targeted to risky locations and times.

   Social media data, with all their biases, can contribute to filling some of the data gaps for urban analysis,

planning and management [19]. In this study, we create an algorithm that classifies transport-related tweets

into geolocated RTCs for Nairobi. This is done by building on existing literature to test two natural language



processing algorithms to identify crash reports [20, 21], developing an improved geoparsing algorithm to

extract data on crash time and location [22–28], and ground truthing the results. The paper also contributes

to a broader literature that uses machine learning methods for road safety analysis [29–31].

   This study innovates on three fronts and demonstrates the value of using social media to expand data

availability. (1) Geospatial Twitter data analysis usually uses the approximately 1% of tweets that have a

geolocation tag [32–34]; we improve this by using a machine learning geoparsing algorithm to leverage the

99% of tweets that do not contain a geotag. (2) To our knowledge there are no other studies that physically

validate the locational accuracy of tweets in real time. Among verified tweets, 92% were found to be valid

crashes, demonstrating the validity of crowdsourced crash data. (3) The work created an essential resource

by generating one of the first real-time maps of RTCs in an African city (Nairobi). We identify 52,228 crash

reports and geolocate those with enough information provided in the text (32,991 of them). In a context

where there is no systematic georeferenced data on crashes to support policy planning, the process outlined

here could be used to capture these data for cities all over the world that need this essential resource.

   Overall, the method expands the coverage of road crashes that can be used to analyze road safety and to

prioritize policy action around the locations where crashes occur more often. This is especially useful in

country contexts where the only data available for analysis are aggregated statistics on total fatalities in the

country, with no detailed breakdown of location or time. Crowdsourced data can help act as an additional

input that can be used by policymakers in better understanding the situation. By using a clustering

algorithm to identify and rank crash locations, we find that the top 15% of crash clusters (66 of 435) account

for half of all crashes. Knowing that a small portion (<1%) of the road network hosts 50% of RTCs in the

crowdsourced data can help reduce an intractable problem to a more manageable one. This analysis shows

the potential for using these data to complement road safety diagnostics and to guide investments and

planning in road safety in Kenya and in other contexts, especially those with similar data deficiencies and

with sufficient social media density like India and the Philippines [35].

   The approach can be extended to other events reported on social media, whether related to disaster relief,

crime, personal safety, urban mobility, or road maintenance. The work on disaster relief and response makes

prominent use of geoparsing of tweets [36–43]. Geoparsing of tweets that lack geolocation information could

enable more comprehensive crime analytics [44–46]. Improved algorithms can lead to faster and better

geolocation of events, which would help urban planners and policy makers improve responses and better

target interventions.




Method

The goals of this analysis are to create data on road crashes with times and locations and understand how

these incidents cluster in the city, which allows for the spatial prioritization of urban investments in road

safety. The technical challenges this study addresses are: i) improving the protocols for geolocation, ii) applying

AI methods to classify tweets reporting crashes and identify their location from multiple geographical

references, and iii) clustering the crashes geographically and identifying areas with many crashes. See the Supplemental

Information (SI) for the detailed methodology. The components are as follows:


  1. Scrape data. We scrape 874,588 tweets posted by Ma3Route, an existing urban mobility platform

     with 1.1 million followers, since its inception in May 2012 through July 2020 (see SI for examples of

     tweets and for a figure of the daily number of tweets across time).

  2. Develop and augment a gazetteer. We build a gazetteer of landmarks for the five counties that

     constitute the Nairobi metro area using: OpenStreetMap, Geonames and Google Places. The gazetteer

     includes the landmark name, geocoordinates and type of landmark (e.g., school, bus stop). We use

     consecutive combinations of 2 and 3 words (known as n-grams) and skip-grams of landmarks in the

     gazetteer, alternate spellings and abbreviations, and splitting of landmarks with select punctuation

     (e.g., slashes, parentheses, commas). We innovate by developing alternate names that exclude the

     landmark type from the name (e.g., excluding “Hotel” from the name).

  3. Develop a truth dataset. We develop a truth dataset to train the algorithm. Taking all tweets for

     July 2017 - July 2018, we restrict tweets to the ones most likely related to a crash based on a broad list

     of words and their variations. Each tweet is manually coded, indicating (1) if the tweet reported a

     crash and (2) the approximate latitude and longitude of any reported crash whenever enough

     information is provided. A total of 9,480 tweets were coded, of which 69% (6,602) reported a crash and

     of these, 63% (4,192) identified an approximate location of the crash. On average, users posted 10

     crash reports that could be geolocated to Twitter daily.

  4. Identify RTCs and their location. We use a three-step process to convert unstructured

     crowdsourced text into a dataset. The first is to identify relevant reports from hundreds of thousands

     of reports. The second is to extract necessary information from the relevant reports. The third is to

     consolidate unique record information from multiple reports of the same event. In Figure 1, we

     illustrate how the algorithm works to classify and geolocate RTCs. We use the tweet “Bad accident on

     Waiyaki Way next to Kianda heading towards ABC Place.”



Figure 1. Illustration of classification and geolocation algorithm developed for extracting
data from crowdsourced information


     (a) Classify relevant crowdsourced reports. We restrict the analysis to tweets that contain

         keywords from a broad list of English and Kiswahili road safety terms such as “accident” or

         “overturn.” This approach follows previous research and allows for misspellings [20]. We use

         natural language processing to classify and exclude tweets that contain road safety keywords but

         discuss road safety conditions rather than specific crash events (e.g., “terrible drivers keep causing

         crashes”). We test two approaches that analyze the combination of words in a tweet: Naive Bayes

         and support vector machines (SVM).




(b) Geolocate reports. We extract all landmarks and roads that have an exact match between the

   gazetteer and the tweet. In Figure 1, “kianda” and “abc place” match several entries in the

   gazetteer. We extract misspelled matches based on Levenshtein distance varied by length of the

   n-gram, matches based on the word following a preposition, and matches based on intersections

   between multiple roads.

   Existing geoparsers extract all possible location references without identifying the unique location

   that makes the data useful. We resolve two technical challenges to find the location of the crash:

     i. When multiple locations are mentioned in the tweets, we use prepositions to sort locations

       into tiers, based on the probability of a location being correct after a particular preposition.

       For example, in Figure 1, “next to” is ranked as tier 1 while “toward” is ranked as tier 6,

       resulting in the correct geolocation for the crash at “kianda” and not “abc place”.

     ii. When a name refers to multiple landmarks, we adopt a toponym resolution approach. In

       Figure 1, more than 6 landmarks across Nairobi have “kianda” in the name. We resolve the

       toponym in three steps: (1) we search for landmarks that are within 500 m of a road if it is

       mentioned, (2) we use the centroid of the clustered location if 90% or more of the landmarks

       are in a 500 m radius, or (3) we rank the landmarks by the probability of being correct using

       the landmark type in the truth data (see SI for statistics on location type). In the example,

       we use “Waiyaki Way” to narrow down the landmarks “kianda” in a 500 m radius (from 6 to

       3) and then use the centroid as the crash location.


       We define a correct geoparse as one located within 500 m of the coordinates in the truth

       dataset. As a benchmark, we compare our algorithm to the Location Name Extraction tool

       (LNEx), which was shown to have better accuracy than other geoparsers [40]. As LNEx and

       other geoparsers are not designed to extract one unique location from text [26, 40, 47], we first

       judge performance by examining whether any location references are near the true coordinates.

       Next, we define the crash location as determined by LNEx to be the centroid of all locations

       it finds in the tweet and compare this with the unique location identified by our algorithm.

(c) Identify unique reports. To avoid over-counting, we develop a clustering algorithm that uses

   time and location to identify which tweets refer to the same crash. In Figure 1, five tweets report

   a crash within two hours of each other, referencing different landmarks that are all close together.

   To develop reasonable parameters for clustering, we manually identify tweets that report the same

   crash in the truth dataset based on the time, location and crash characteristics. The 4,192 crash

   reports are clustered into 2,648 unique crashes. For unique crash clusters, 97% of tweets reported

   landmarks within 500 m and within 4 hours of each other (see additional details in SI for how

   parameters were chosen).


                                Table 1. Geolocation Algorithm Results
                                          Any location captured by      Crash location determined by
                                          algorithm close to            algorithm close to
                                          true crash location           true crash location
                                          Recall       Precision        Recall       Precision
                 LNEx                     0.674        0.686            0.129        0.132
                 Alg., Raw Gaz            0.695        0.757            0.579        0.756
                 Alg., Aug Gaz            0.798        0.857            0.651        0.811
                 Alg., Aug Gaz [Cluster]                                0.656        0.774
                 ‘N Crashes’ refers to the number of correctly identified crashes. ‘Raw
                 Gaz’ refers to the raw gazetteer (i.e., dictionary of landmarks with original
                 names) and ‘Aug Gaz’ refers to the augmented gazetteer. We use our raw
                 gazetteer as an input into LNEx, which implements its own augmentation
                 process. For LNEx, the crash location is determined by taking the centroid
                 of all locations captured by the algorithm. Locations are considered close
                 if they are within 500 meters of each other.


      (d) Ground truth. To ensure that the crowdsourced data are reliable and provide correct

          information, we conduct a ground-truthing exercise to validate the quality of the data and the

          performance of the underlying algorithm. We processed tweets in real-time and dispatched a

          motorcycle delivery service (Sendy) to the site of the crash within minutes. The Sendy driver was

          tasked with verifying and reporting whether a crash actually happened in that location. If a

          driver could not see the crash, they were instructed to ask a bystander whether a crash had

          occurred but was cleared or whether a crash occurred nearby. Drivers were able to arrive at the

          crash location quickly; the median time between being alerted of a crash and arriving at the scene

          was 26 minutes.
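
   The idea behind step 4(c), grouping reports that are close in both space and time into one crash, can be
   illustrated with a minimal sketch. This is not the clustering algorithm used in the paper (its details and
   parameter choices are in the SI); it is a simple greedy grouping written for illustration. The names
   Report, haversine_m and cluster_reports are hypothetical, the default thresholds of 500 m and 4 hours are
   borrowed from the statistic quoted in step 4(c), and the sample coordinates and times are made up.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


@dataclass
class Report:
    time: datetime
    lat: float
    lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cluster_reports(reports, max_dist_m=500.0, max_gap=timedelta(hours=4)):
    """Greedy grouping (an assumption, not the paper's method): a report joins
    the first cluster whose earliest report is within max_dist_m and max_gap;
    otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of Report, earliest first
    for r in sorted(reports, key=lambda rep: rep.time):
        for c in clusters:
            seed = c[0]
            close = haversine_m(seed.lat, seed.lon, r.lat, r.lon) <= max_dist_m
            recent = (r.time - seed.time) <= max_gap
            if close and recent:
                c.append(r)
                break
        else:
            # No existing cluster matched, so this report opens a new one.
            clusters.append([r])
    return clusters


# Made-up coordinates and times, for illustration only.
reports = [
    Report(datetime(2018, 3, 1, 8, 0), -1.2635, 36.7700),
    Report(datetime(2018, 3, 1, 8, 45), -1.2640, 36.7702),  # about 60 m away, 45 min later
    Report(datetime(2018, 3, 1, 9, 0), -1.3000, 36.8200),   # several km away
    Report(datetime(2018, 3, 1, 14, 0), -1.2635, 36.7700),  # same spot, 6 h later
]
clusters = cluster_reports(reports)
print([len(c) for c in clusters])  # [2, 1, 1]
```

   Comparing each new report only against the earliest report in a cluster keeps the sketch short; a
   production version would more likely compare against all members or use a density-based method.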



Results

The methods laid out here created a georeferenced RTC dataset that was previously unattainable and

produced one of the first real-time maps of RTCs in Nairobi. We classify 52,228 tweets as crash-related out

of a universe of 874,588 tweets during 2012 - 2020 (Panel A of Figure 2). This is based on the SVM

algorithm, which we find performs better than the Naive Bayes algorithm according to the F1 statistic (see

Table S4 in the SI). We geolocate 32,991 time-stamped crash tweets from August 2012 to July 2020 and

cluster them into 22,872 unique geolocated crashes (panels B and C of Figure 2 show the unique crashes

generated by Twitter daily using the algorithm and clustering). In our truth dataset, where we manually



coded each crash-related tweet, we found that 63% of tweets contain enough information in order to be

geolocated. Assuming the same proportion of tweets contain enough information to be geolocated in the full

dataset, we would expect 32,903 tweets with enough location information. This aligns almost perfectly with

the number of tweets that the algorithm is able to geolocate.
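
The expectation quoted above is a one-line calculation: 63% of the 52,228 crash tweets. The short check
below reproduces it from the figures reported in the text; the variable names are ours, and the rounded
63% share is used because that is the figure the text applies.

```python
crash_tweets = 52_228    # tweets classified as crash reports (from the text)
share_locatable = 0.63   # share with enough location detail in the truth dataset
geolocated = 32_991      # crash tweets the algorithm actually geolocated

expected = crash_tweets * share_locatable   # about 32,903.6
gap = geolocated - expected                 # how far the algorithm is from the expectation
relative_gap = gap / expected

print(f"expected: {int(expected):,}")
print(f"gap: {gap:+.0f} tweets ({relative_gap:+.2%})")
```

The gap is under 100 tweets, well below 1% of the expected count, which is the sense in which the two
numbers align.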




Figure 2. Crowdsourced crash reports from Twitter data that our algorithm has geolocated
and clustered into unique crashes for the city of Nairobi between 2012 and 2020. Road data
comes from OpenStreetMap.


   The ground-truthing exercise confirms the validity of the crowdsourced data. We find that of the 73

crash-related tweets physically verified, 92% correctly corresponded to a crash near the estimated location:

in 32.8% of cases the driver witnessed the crash scene, in 57.5% the driver did not see the crash but was told by a

bystander that a crash had occurred and was recently cleared, and in 1.4% the driver reported that the crash did not occur at the specified location

but nearby. Furthermore, using our truth dataset to benchmark shows that our algorithm performs

significantly better than the current geoparsing standard. Our algorithm’s recall rate of 65% is a five-fold

improvement in performance compared to the LNEx algorithm (13% recall) in identifying the unique location

of a crash (Table 1). This is largely because LNEx is not designed to identify a unique location when



multiple locations are mentioned. Even when comparing whether any location extracted from the tweet is near

the true location, our algorithm’s precision is 25% higher than that of LNEx (0.857 versus 0.686 in Table 1).
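
The comparisons with LNEx in this paragraph can be recomputed directly from Table 1. The snippet below
is a small check using the table's reported values; the dictionary layout and the function name ratio are
ours.

```python
# (recall, precision) pairs copied from Table 1.
table1 = {
    "LNEx":          {"any": (0.674, 0.686), "unique": (0.129, 0.132)},
    "Alg., Aug Gaz": {"any": (0.798, 0.857), "unique": (0.651, 0.811)},
}


def ratio(task, metric):
    """Augmented-gazetteer algorithm score divided by the LNEx score."""
    idx = {"recall": 0, "precision": 1}[metric]
    return table1["Alg., Aug Gaz"][task][idx] / table1["LNEx"][task][idx]


for task in ("unique", "any"):
    for metric in ("recall", "precision"):
        print(f"{task:6s} {metric:9s} ratio = {ratio(task, metric):.2f}")
```

For the unique crash location the recall ratio is about 5, matching the five-fold improvement; for any
location near the truth, precision is about 25% higher and recall about 18% higher.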

   Analyzing the crash data produced using our algorithm and focusing on the truth dataset within the city

limits of Nairobi, we find that all crashes from July 2017 to July 2018 can be found in 435 clusters, each with

a maximum diameter of 300 m. Of these clusters, 67% have two or more crashes and there are 56 clusters

with 10 or more crashes. Additionally, 66 crash clusters represent over 50% of all the crashes. When looking

at the 7.5 years of crowdsourced data for the city of Nairobi, the number of crash clusters does not grow

linearly, implying that the locations where crashes occur and are reported in Twitter are consistent across

years. Only 14% of crash locations have only a single crash, and there are 443 crash clusters with 10 or more

crashes. We see the concentration of crashes even more when we note that only 9% of crash clusters (133 of

1,375) represent 50% of the crashes reported (Figure 3 shows crash heatmaps for the truth dataset from July

2017 to July 2018 and for 2012-2020).
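
The concentration statistics above (for example, 66 of 435 clusters holding half of the crashes) answer
the question: how many of the largest clusters are needed to cover a given share of crashes? A minimal
sketch of that calculation follows. The function name clusters_covering is hypothetical and the
per-cluster counts are made up, since the per-cluster data are not listed in the text.

```python
def clusters_covering(counts, share=0.5):
    """Smallest number of clusters, taken from largest to smallest, whose
    crash counts sum to at least `share` of all crashes."""
    total = sum(counts)
    running = 0
    for k, c in enumerate(sorted(counts, reverse=True), start=1):
        running += c
        if running >= share * total:
            return k
    return len(counts)


# Made-up crash counts per cluster, for illustration only (total = 100).
counts = [40, 25, 10, 5, 5, 5, 4, 3, 2, 1]
k = clusters_covering(counts)
print(f"{k} of {len(counts)} clusters ({k / len(counts):.0%}) hold at least half of the crashes")

# The share reported in the text for the truth dataset: 66 of 435 clusters.
print(f"{66 / 435:.1%}")
```

Ranking clusters by size before accumulating is what makes the result the minimum number of locations a
planner would need to target to reach the chosen share.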




Figure 3. Heatmap of crashes. Data in panel a are from July 2017 - July 2018, where we use the manually
coded Twitter dataset. Data in panel b are for August 2012 - July 2020. Road data comes from
OpenStreetMap.




Discussion

Cities are constantly evolving and understanding urban mobility is critical to creating urban designs that

help to manage risks for pedestrians and vehicles. Severe data limitations hinder the development of policy

interventions needed to manage risks, especially in low- and middle-income resource-constrained countries.

Closing the data deprivation gap can help avert divergence in socioeconomic conditions between data-poor

and -rich countries. By focusing on RTCs, the number one cause of death among young people, we

demonstrate that social media could be an inexpensive way to produce non-existent RTC data in

resource-poor contexts that can support government analyses of road safety and potentially inform policy.


This tool could be especially powerful when combined with investments in building a digital administrative

dataset that would provide information on the crashes attended by police. The answer to the seemingly

simple question of where and when crashes occur has profound implications for public policy response that

can save lives. And while official data deprivation can be an impediment to economic development, data

generated by private operators can be transformed and placed in the hands of policy makers as a resource for

policy making. By expanding the amount of data, we can generate more input to help resource-constrained

countries prioritize policy action where it is most needed.

   Geolocating crashes by mining Twitter data, as in this example, can help to guide infrastructure redesign

or enforcement policies to reduce RTCs. Nairobi comprises an extensive road network of 6,200 km; with the

city’s limited resources, addressing road safety across the whole network is difficult. By using this type of

geolocated data, urban planners and policy makers can narrow down the problem to the areas with the

highest number of crashes. This has been proven to work in developed countries where targeting risky

locations led to reductions in the concentration of crashes [48]. As shown in the results, crashes reported on

Twitter are highly concentrated, with the top 15% of locations spread across 20 km of road having 50% of

the crashes reported on Twitter.

   It should be noted that there are some limitations to the approach. The data generated are limited by the

coverage of the crowdsourced data. Users are more active on social media at particular times, and it is

necessary to possess a smartphone and have access to internet to be able to use the service. This can lead to

bias in the reports generated via the crowdsourced data. Only 7.5% of tweets are sent between the hours of 9

p.m. and 6 a.m., and as a result only 12% of the crash reports from Twitter are during this time. There

could also be geographic bias if there are areas of the city where people with smartphones are more likely to

be present or passing by, and therefore more likely to report. Our real-time motorcycle validation exercise

demonstrates the internal validity of the crowdsourced data and the improved algorithm. External validity is

more difficult to assess because we do not know what the universe of crashes is. Additionally, we do not

know the severity of the crashes reported on Twitter. Therefore, we have no way of knowing if the areas

where crashes happen are the most dangerous, which is what policy makers likely would want to target.

These caveats should be considered by policy makers when using crowdsourced data to inform policies and

targeting.

   Despite the limitations, our improved geoparsing algorithm discussed in this paper can begin filling some

of the gaps in data in low-capacity and data-scarce settings. And while the crash cluster areas identified by

the algorithm may not be the most dangerous or may not represent all crash areas, they nevertheless

highlight problem areas. All crashes, minor or severe, have important economic consequences in terms of

property damage and lost time and productivity due to the traffic generated (which is one of the reasons the


crash is likely reported on Twitter). Therefore, these data can be used to target areas for design solutions

where we are seeing high numbers of crashes consistently over time. In settings where there are limited or

non-existent administrative records and, therefore, lack of any geolocated data, this tool can produce

information in real-time for one of the most pressing challenges in developing countries.

   Furthermore, by developing tools that generate time-stamped geolocated data and statistics from

crowdsourcing on different “events” that are reported on social media, we can hope to expand data

availability across other contexts and across issues beyond RTCs. For example, real-time traffic applications

like RIDLR in India can be used to expand data on road safety. These improved tools can also help geolocate

victims during a natural disaster or alert disaster management teams to the location of unsafe buildings or

areas needing immediate attention. They can support law enforcement or communities to locate and respond

to crimes, cases of violence against women, or police violence. Improved identification of the time and

location of events can help to automate and accelerate policy response across a wide set of issues, potentially

leading to better policy outcomes.



References

   1. Serajuddin U, Uematsu H, Wieser C, Yoshida N, Dabalen A. Data deprivation: Another deprivation

      to end. The World Bank. 2015.

   2. Notzon F, Nichols EK. Global Program for Civil Registration and Vital Statistics (CRVS)

      Improvement; 2015.

   3. WHO. Global status report on road safety 2018. World Health Organization. 2018.

   4. IEAG. A World that Counts: Mobilising the Data Revolution for Sustainable Development.

      Independent Expert Advisory Group on a Data Revolution for Sustainable Development. 2014.

   5. GSMA Intelligence. The Mobile Economy 2020. London: GSM Association. 2020.

   6. Kemp S. Digital 2020: Global Digital Overview. Retrieved from Datareportal:

      https://datareportal.com/reports/digital-2020-global-digital-overview. 2020.

   7. Batty M. Big data, smart cities and city planning. Dialogues in Human Geography. 2013;3(3):274–279.

   8. Miller G. Social scientists wade into the tweet stream. Science. 2011;333(6051):1814–1815.

   9. Kitchin R. The real-time city? Big data and smart urbanism. GeoJournal. 2014;79(1):1–14.

  10. Einav L, Levin J. Economics in the age of big data. Science. 2014;346(6210).


11. Hao J, Zhu J, Zhong R. The rise of big data on urban studies and planning practices in China: Review

    and open research issues. Journal of Urban Management. 2015;4(2):92–124.

12. Blumenstock J, Cadamuro G, On R. Predicting poverty and wealth from mobile phone metadata.

    Science. 2015;350(6264):1073–1076.

13. Kosinski M, Stillwell D, Graepel T. Private traits and attributes are predictable from digital records of

    human behavior. Proceedings of the national academy of sciences. 2013;110(15):5802–5805.

14. Resch B, Summa A, Zeile P, Strube M. Citizen-Centric Urban Planning through Extracting Emotion

    Information from Twitter in an Interdisciplinary Space-Time-Linguistics Algorithm. Urban Planning.

    2016;1(2):114–127. doi:https://doi.org/10.17645/up.v1i2.617.

15. Jaidka K, Giorgi S, Schwartz HA, Kern ML, Ungar LH, Eichstaedt JC. Estimating geographic

    subjective well-being from Twitter: A comparison of dictionary and data-driven language methods.

    Proceedings of the National Academy of Sciences. 2020;117(19):10165–10171.

16. Steiger E, Westerholt R, Resch B, Zipf A. Twitter as an indicator for whereabouts of people?

    Correlating Twitter with UK census data. Computers, Environment and Urban Systems. 2015;54:255 –

    265. doi:https://doi.org/10.1016/j.compenvurbsys.2015.09.007.

17. Wang Q, Phillips NE, Small ML, Sampson RJ. Urban mobility and neighborhood isolation in

    America’s 50 largest cities. Proceedings of the National Academy of Sciences. 2018;115(30):7735–7740.

18. WHO. Data systems: A road safety manual for decision-makers and practitioners. World Health

    Organization. 2010;.

19. Williams S. Data Action: Using Data for Public Good. Cambridge, MA: MIT Press; 2020.

20. Gu Y, Qian ZS, Chen F. From Twitter to detector: Real-time traffic incident detection using social

    media data. Transportation Research Part C: Emerging Technologies. 2016;67:321 – 342.

    doi:https://doi.org/10.1016/j.trc.2016.02.011.

21. Zhang Z, He Q, Gao J, Ni M. A deep learning approach for detecting traffic accidents from social

    media data. Transportation research part C: emerging technologies. 2018;86:580–596.

22. Finkel JR, Grenager T, Manning C. Incorporating Non-local Information into Information Extraction

    Systems by Gibbs Sampling. In: Proceedings of the 43rd Annual Meeting of the Association for

    Computational Linguistics (ACL’05); 2005.



23. Bender O, Och FJ, Ney H. Maximum Entropy Models for Named Entity Recognition. USA:

   Association for Computational Linguistics; 2003. Available from:

    https://doi.org/10.3115/1119176.1119196.

24. Bhargava R, Zuckerman E, Beck L. CLIFF-CLAVIN: Determining Geographic Focus for News Articles;

   2014. NewsKDD: Data Science for News Publishing.

25. Ritter A, Clark S, Mausam, Etzioni O. Named Entity Recognition in Tweets: An Experimental Study.

    In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing; 2011.

26. Gelernter J, Balaji S. An algorithm for local geoparsing of microtext. GeoInformatica.

   2013;17(4):635–667. doi:10.1007/s10707-012-0173-8.

27. Malmasi S, Dras M. Location Mention Detection in Tweets and Microblogs. In: Hasida K,

    Purwarianti A, editors. Computational Linguistics. Singapore: Springer; 2016. p. 123–134.

28. Middleton SE, Middleton L, Modafferi S. Real-Time Crisis Mapping of Natural Disasters Using Social

    Media. IEEE Intelligent Systems. 2014;29(2):9–17. doi:10.1109/MIS.2013.126.

29. Zeng Q, Huang H, Pei X, Wong S. Modeling nonlinear relationship between crash frequency by severity

    and contributing factors by neural networks. Analytic methods in accident research. 2016;10:12–25.

30. Zeng Q, Huang H, Pei X, Wong S, Gao M. Rule extraction from an optimized neural network for

    traffic crash frequency modeling. Accident Analysis & Prevention. 2016;97:87–95.

31. Wahab L, Jiang H. A comparative study on machine learning based algorithms for prediction of

    motorcycle crash severity. PLOS ONE. 2019;14(4):1–17. doi:10.1371/journal.pone.0214966.

32. Salas A, Georgakis P, Petalas Y. Incident detection using data from social media. In: 2017 IEEE 20th

    International Conference on Intelligent Transportation Systems (ITSC); 2017. p. 751–755.

33. Mai E, Hranac R. Twitter Interactions as a Data Source for Transportation Incidents. In:

   Transportation Research Board 2013 Annual Meeting; 2013.

34. Sloan L, Morgan J. Who tweets with their location? Understanding the relationship between

    demographic characteristics and the use of geoservices and geotagging on Twitter. PloS one.

   2015;10(11):e0142209.

35. Gatica-Perez D, Santani D, Isaac-Biel J, Phan TT. Social Multimedia, Diversity, and Global South

    Cities: A Double Blind Side. In: Proceedings of the 1st International Workshop on Fairness,

   Accountability, and Transparency in MultiMedia. ACM; 2019. p. 4–10.


36. Meier P. Digital humanitarians: How big data is changing the face of humanitarian response.

    Routledge; 2015.

37. Dhavase N, Bagade AM. Location identification for crime disaster events by geoparsing Twitter. In:

    International Conference for Convergence for Technology-2014; 2014. p. 1–3.

38. Aggarwal CC, Zhai CX. Mining Text Data. Boston, MA: Springer; 2012.

39. Yin J, Karimi S, Lampert A, Cameron MA, Robinson B, Power R. Using Social Media to Enhance

    Emergency Situation Awareness: Extended Abstract. In: Proceedings of the Twenty-Fourth

    International Joint Conference on Artificial Intelligence (IJCAI); 2015.

40. Al-Olimat H, Thirunarayan K, Shalin V, Sheth A. Location Name Extraction from Targeted Text

    Streams using Gazetteer-based Statistical Language Models. In: Proceedings of the 27th International

    Conference on Computational Linguistics; 2018.

41. Premamayudu B, Subbarao P, Koduganti VR. Identification of Natural Disaster Affected Area Precise

    Location Based on Tweets. International Journal of Innovative Technology and Exploring Engineering.

   2019;8(6).

42. Sangameswar MV, Nagabhushana Rao M, Satyanarayana S. An algorithm for identification of natural

    disaster affected area. Journal of Big Data. 2017;4(39).

43. de Bruijn JA, de Moel H, Jongman B, de Ruiter MC, Wagemaker J, Aerts J. A global database of

    historic and real-time flood events based on social media. Scientific Data. 2019;6(311).

44. Ristea A, Boni MA, Resch B, Gerber MS, Leitner M. Spatial crime distribution and prediction for

    sporting events using social media. International Journal of Geographical Information Science.

   2020;0(0):1–32. doi:10.1080/13658816.2020.1719495.

45. Gerber MS. Predicting crime using Twitter and kernel density estimation. Decision Support Systems.

   2014;61:115 – 125. doi:https://doi.org/10.1016/j.dss.2014.02.003.

46. Yang D, Heaney T, Tonon A, Wang L, Cudré-Mauroux P. CrimeTelescope: crime hotspot prediction

    based on urban and social media data fusion. World Wide Web. 2018;21(5):1323–1347.

47. Karimzadeh M, Pezanowski S, MacEachren AM, Wallgrün JO. GeoTxt: A scalable

    geoparsing system for unstructured text geolocation. Transactions in GIS. 2019;23(1):118–136.

    doi:10.1111/tgis.12510.

48. Austroads. Guide to road safety part 8: Treatment of crash locations; 2015.


Applying Machine Learning and Geolocation
Techniques to Social Media Data (Twitter) to
   Develop a Resource for Urban Planning

        Supplementary Information
   In recent years, social media, and especially Twitter, has emerged as a source of real-time information. Therefore, it is natural that in a context where there is a dearth of official data on a topic such as road safety, we turn to crowdsourcing through Twitter to produce more comprehensive information on road traffic crashes. As people move around cities that have been plagued by congestion, they have started to rely on social media and citizen reporting to help them avoid major traffic jams and decrease their commutes. Given the relationship between RTCs and congestion, platforms that crowdsource and broadcast traffic updates have the additional benefit of often reporting on RTCs. This makes it possible to use crowdsourced data to identify when and where crashes are occurring, which can be used to supplement and improve on existing official statistics.

   While only around 1% of tweets contain geo-metadata, a growing literature has developed geoparsers, or algorithms that extract location names from text. Tweets present a unique challenge to geoparsers. State-of-the-art geoparsers, such as OpenCalais and Stanford Named Entity Recognition, rely on grammar rules to identify location mentions; however, tweets often do not follow grammatical capitalization rules and use clipped, ungrammatical sentences [1, 2]. New algorithms have been developed to geoparse tweets. These include an algorithm that accounts for tweets containing place references that are abbreviated, misspelled or highly localized [2]. Others develop gazetteers (location dictionaries) from sources such as OpenStreetMap and GeoNames and search tweets for names within the gazetteer, employing different approaches to account for misspellings or for tweets that use shorter names than those in the gazetteer [3, 4, 5]. Reference [5] provides a review of existing approaches.

   Here we provide more information on the speciﬁc data that we used, how they were processed

and the algorithms that were developed. These processes can then be implemented in diﬀerent

contexts where crowdsourced data on RTCs are available.


Twitter Data

We generate crash data from social media. The main source of crowdsourced data comes from

Ma3Route, a mobile/web/SMS platform that crowdsources transport data and provides users with

information on traﬃc, matatu (informal bus) directions, driving reports and crashes for Kenya. As

of early 2019, Ma3Route had 1.1 million followers on Twitter and around 400,000 subscribed users

on their app. When users post a traﬃc report on the app, Ma3Route displays the report on their

app and posts the report to Twitter. We scraped all tweets posted by Ma3Route from May 2012,

when the Twitter feed was started, onward. Figure S1 shows the number of tweets across time.1

The full dataset of tweets that we use consists of 874,588 tweets scraped between May 2012 and

July 2020. See Table S1 for examples of tweets.




                                    Figure S1: Ma3Route Tweet Trends




                               Table S1: Example tweets from Ma3Route
             1    accident on waiyaki way just before deloitte
             2    accident just before roysambu footbridge on thika road inboud tailback
                  almost at githurai
             3    accident at junction of ole dume and arwing kodhek
             4    accident at tajmall towards the roundabout heavy traﬃc from doni
             5    there is an accident at the pangani underpass heading to either muranga
                  road or forest road involving two personal cars and a matatu mini bus
                  this is causing a bit of snurl up cc
             6    jogoo road traﬃc small accident just before donholm
             7    a heavy truck has rolled at karai naivasha loaded with what seems to be
                  bags of maize such trucks are supposed to use mai mahiu route how did
                  it end up there
             8    prepare for snurl up jogoo road just a minor incident apo hamza
             9    bad accident involving 6 matatus and a lorry on thika road near till
                  station
             10   an accident has occurred kenyatta road involving a lorry that has over-
                  turned and several vehicles
User mentions have been removed.
   1 Ma3Route was most popular in 2015, receiving 700-1,000 traffic reports a day, but has since declined in popularity, receiving an average of around 300 traffic reports daily.




   We explored additional Twitter handles that focus on traffic and road safety in Kenya, such as RoadAlertsKE, KenyanTraffic and ThikaTowntoday. The majority of tweets from these other handles are already tweeted out by Ma3Route; therefore, including these additional handles does not produce many new tweets to incorporate into the dataset. An additional source of data is tweets that mention Ma3Route but are not necessarily posted by Ma3Route. While these tweets are not included in the current analysis, they can easily be incorporated to expand the data set used to generate crash reports. We have already done this for the data set of crashes that we are producing for the Government of Kenya.


Building a Truth Data set of Tweets

We build a truth data set of Ma3Route tweets where tweets are labeled as to whether they refer to

a speciﬁc traﬃc crash and, if they do, are geocoded. We code all potentially crash related tweets

from July 2017 to July 2018. We deﬁne a tweet as potentially crash-related if one of the following

words appeared in the tweet:



    accident, accidents, ajali, axident, collision, crash, crashes, crashs, crush, crushed, damage, disaster, emergency, fatal, fatality, fender bender, fender-bender, hazard, hit, hit-and-run, incident, incidents, injuries, injury, magari zmegongana, mishap, overturn, overturned, ovrturn, ovrturned, pileup, rammed, read end, rear ended, roll, rolled, smash, smashed, wreck, wreckage, zilicrash, zimecrash



To account for misspellings of select words, we also include tweets if they contained a word that

had a Levenshtein distance of two or less to “accident” or “incident” or a Levenshtein distance of

one to “crash” or “crashed”.
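The keyword and edit-distance filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the keyword set is abbreviated, and the function names and the tokenization rule are assumptions made for the example.

```python
import re

# Abbreviated, illustrative subset of the keyword list above.
KEYWORDS = {"accident", "accidents", "ajali", "collision", "crash",
            "overturned", "rammed", "pileup"}

# Maximum Levenshtein distance allowed for each fuzzy-matched word.
FUZZY_TARGETS = {"accident": 2, "incident": 2, "crash": 1, "crashed": 1}


def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_potentially_crash_related(tweet):
    """True if any word is a keyword or a near-misspelling of a target."""
    words = re.findall(r"[a-z\-]+", tweet.lower())  # assumed tokenization
    for word in words:
        if word in KEYWORDS:
            return True
        if any(levenshtein(word, target) <= limit
               for target, limit in FUZZY_TARGETS.items()):
            return True
    return False
```

Multi-word keywords such as "fender bender" would need a phrase-level check, which this sketch omits.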

   Six coders were trained to process the 9,480 tweets deﬁned as potentially crash related. Coders

were instructed to label a tweet as reporting a crash if the tweet referred to one or more speciﬁc

crashes; general comments about crashes were labeled as not reporting a crash. If the coder labeled

the tweet as reporting a crash, they were instructed to geocode the location of the crash based on

the tweet text if they were able. Coders were instructed to record the street names and landmarks

used to geocode the crash. In addition, they provided the approximate coordinates of the crashes.

Each tweet was labeled and geocoded by two coders; diﬀerences were resolved by one of the authors.

(We consider geocodes diﬀerent if they were more than 100 m apart.)

   Of the 9,480 tweets, 6,602 (69%) reported a crash and of these, 4,192 (63%) identiﬁed an

approximate location of the crash.


Augmenting a Gazetteer

The primary goal of the algorithm to augment the gazetteer is to generate alternate names of landmarks that users may use instead of the original name in the gazetteer. Alternate names are generated in three steps: (1) splitting landmark names at certain punctuation (e.g., slashes), (2) creating n-grams and skip-grams of landmark names and (3) in select cases, removing the landmark type from the end of the name (e.g., removing ‘restaurant’ from ‘McDonald’s restaurant’). The algorithm also removes landmark names that are common words and may often be used in contexts where they do not refer to a landmark. In addition, the algorithm removes landmarks that do not refer to a specific location, such as roads.




 Algorithm: Augment gazetteer
 Input:  Landmark gazetteer, where each landmark entry includes (1) name, (2) types and (3) coordinates
 Output: Augmented landmark gazetteer

A. Split landmarks at select punctuation
 1. If a landmark name has a slash, open parenthesis, dash or comma, add landmarks to the gazetteer that separate the name at that character.

B. Clean landmark names
 1. Convert everything to lowercase and keep only alphanumeric characters (e.g., remove punctuation).

C. Remove certain landmarks
 1. Remove landmarks that are just one character in length.
 2. Remove landmarks that have certain types (e.g., where the type indicates that the landmark actually represents a large area). We remove landmarks with the type route, road, political, locality or neighborhood, except if the landmark also contains “flyover” or “roundabout” in the name (note 1).

D. Create n-grams and skip-grams (note 2)
 1. Generate 2-3 n-grams and add to gazetteer.
 2. Generate 2-3 skip-grams, skip 1-4, restrict so that the first and last word match, and add to gazetteer (note 3).

E. Create parallel landmarks
 1. If a landmark name begins/ends with a certain word/phrase, remove the word or phrase.
    (a) If it begins with a stopword or preposition, create a parallel landmark with the word removed.
    (b) If it ends with bar, shops, restaurant, hotel, stage, bus stop or bus station, create a parallel landmark with the word removed.
 2. If a landmark name contains a certain word/phrase, swap it with another.
    (a) Make stage, bus stop and bus station interchangeable. So if someone says “X stage”, create “X bus stop” and “X bus station”.
 3. Different spellings of words
    (a) British/American spellings (e.g., center vs centre, theater vs theatre)
    (b) Common shorter/longer/different forms (train vs railway, rail vs railway)
 4. Add types
    (a) If the landmark ends with stage, bus stop or bus station, add “stage” as a type (we do this because we preference certain types).
 5. Remove parallel landmarks that are only 1-2 characters long, and add the rest to the gazetteer.

F. Remove landmarks
 1. If the landmark has a stop word and is 2 or fewer words, remove it.
 2. If the landmark contains/begins with/ends with:
    (a) If the landmark contains road or rd, remove it.
    (b) If the landmark begins with a stop word or preposition, remove it.
    (c) If the landmark ends with a road word (street, st, avenue, ave), remove it.
 3. Remove common English words
    (a) Remove one-word landmarks that are also English words (spelled correctly according to an English spell checker; note 4) but are not nouns (note 5) or categorized as a bus/transit station (note 6).

Notes
 1. We treat flyovers and roundabouts as landmarks, even though they are roads, as they represent a unique location.
 2. Other geoparsers such as LNEx only add the n-grams and skip-grams if the name does not already exist in the gazetteer. Our algorithm differs, and we add all n-grams and skip-grams. However, in the algorithm to locate events, we preference locations where the landmark name associated with the location was not a derived n/skip-gram, but still consider the n-gram/skip-gram version, as the non-derived landmark location may be removed from consideration if it is not near a mentioned road.
 3. For example, from the original landmark ‘Prestige Plaza Shopping Mall’, this generates ‘Prestige Mall’, ‘Prestige Plaza Mall’ and ‘Prestige Shopping Mall’.
 4. We use Hunspell, a commonly used spellchecker.
 5. We use spaCy, an open source natural language processing library, to determine the part of speech of each landmark.
 6. We keep bus/transit stations as users often reference matatu stages when describing crash locations.
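Step D of the algorithm can be illustrated with a short sketch. The function names are hypothetical, and the reading of "skip 1-4" is an assumption: here a skip-gram keeps the first and last word of the name, skips at least one word, and never skips more than four words in a single gap.

```python
from itertools import combinations


def ngrams(words, sizes=(2, 3)):
    """Contiguous word sequences of the given sizes."""
    return [" ".join(words[i:i + n])
            for n in sizes
            for i in range(len(words) - n + 1)]


def skipgrams(words, sizes=(2, 3), max_skip=4):
    """Word sequences that keep the first and last word and skip words between."""
    if len(words) < 3:
        return []  # nothing can be skipped in a one- or two-word name
    last = len(words) - 1
    result = []
    for n in sizes:
        # Choose the n - 2 interior words to keep alongside first and last.
        for middle in combinations(range(1, last), n - 2):
            idx = (0, *middle, last)
            gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
            if 1 <= max(gaps) <= max_skip:  # assumed meaning of "skip 1-4"
                result.append(" ".join(words[i] for i in idx))
    return result


name = "prestige plaza shopping mall".split()
print(skipgrams(name))
```

For the landmark used as an example in the algorithm notes, this sketch yields 'prestige mall', 'prestige plaza mall' and 'prestige shopping mall'.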




Tweet Classiﬁcation - Identifying relevant crowdsourced reports

We first developed an algorithm to identify whether a tweet is crash related or not, using the truth data set to train the algorithm. We extract n-grams from the tweets as features. We employ a grid search, tuning the models by testing all combinations of multiple parameters. The three main parameters we test are: (1) extracting 1-grams, 1-2 grams or 1-3 grams, (2) not removing any features, or removing features that occur in less than/more than 0.01%/99.9%, 1%/99% or 5%/95% of tweets, and (3) defining features as the number of occurrences of the n-gram in the tweet or using the Term Frequency - Inverse Document Frequency (TF-IDF) of the n-gram (footnote 2). For the Support Vector Machine, we also vary the regularization parameter, which controls how the algorithm weighs misclassification versus simplicity, using 0.5, 1, 2, 10, 100 and 1000.
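To make the size of the search space concrete, the grid described above can be enumerated as in the sketch below. The dictionary keys and the function name are illustrative assumptions, not names from the authors' code; the values come from the text.

```python
from itertools import product

NGRAM_RANGES = [(1, 1), (1, 2), (1, 3)]
# None means no features are removed; otherwise (min share, max share) of tweets.
FREQUENCY_CUTOFFS = [None, (0.0001, 0.999), (0.01, 0.99), (0.05, 0.95)]
WEIGHTINGS = ["count", "tfidf"]
SVM_C_VALUES = [0.5, 1, 2, 10, 100, 1000]


def parameter_grid(model):
    """List every parameter combination for 'naive_bayes' or 'svm'."""
    base = list(product(NGRAM_RANGES, FREQUENCY_CUTOFFS, WEIGHTINGS))
    if model == "svm":
        # The SVM additionally varies the regularization parameter C.
        return [{"ngram_range": n, "cutoffs": f, "weighting": w, "C": c}
                for (n, f, w), c in product(base, SVM_C_VALUES)]
    return [{"ngram_range": n, "cutoffs": f, "weighting": w}
            for n, f, w in base]


print(len(parameter_grid("naive_bayes")), len(parameter_grid("svm")))
```

Under these assumptions there are 24 combinations for Naive Bayes and 144 for the SVM, before also varying whether the original or the generalized tweet text is used.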



                           Table S2: Example Tweet and Augmented Tweet
           accident past garden city near thika rd and kamiti rd junction
           accident past #landmark-name# near #road-name# and #road-name# junction


       An additional parameter we test is using the original tweet text and, following [6], replacing

landmark names and road networks with generalized names (just indicating the presence of a

landmark or road). Generalizing landmark and road names helps to reduce the dimensionality

of the feature space. Table S2 demonstrates how a particular tweet is transformed into one with

general landmark and road names. Table S3 shows examples of the features extracted in regular and

augmented tweets where landmarks and roads have been replaced. This augmentation assumes that

the occurrence of a road or landmark name contributes equally to the probability of a crash-related

tweet.
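The replacement of landmark and road names by generic placeholders can be sketched as follows. The function name and the small name sets are hypothetical; in practice the names would come from the gazetteer and the road network.

```python
import re


def generalize_tweet(tweet, landmarks, roads):
    """Replace known landmark and road names with generic placeholders."""
    placeholder = {name: "#landmark-name#" for name in landmarks}
    placeholder.update({name: "#road-name#" for name in roads})
    if not placeholder:
        return tweet
    # Try longer names first so that 'thika rd' wins over a shorter overlap.
    names = sorted(placeholder, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: placeholder[m.group(0)], tweet)


tweet = "accident past garden city near thika rd and kamiti rd junction"
print(generalize_tweet(tweet, {"garden city"}, {"thika rd", "kamiti rd"}))
```

Applied to the example in Table S2, this produces the augmented tweet shown there.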
   2 TF-IDF reflects how important a word or n-gram is to a tweet within the full set of tweets; for example, words such as ‘a’ or ‘the’ that appear frequently will be given less weight. It is calculated as

$$\log\left(\frac{N_{\text{tweets}}}{N_{\text{tweets with n-gram}}}\right) \times \frac{N_{\text{times n-gram appears in the tweet}}}{N_{\text{n-grams in the tweet}}}$$
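As a worked illustration of the TF-IDF formula in the footnote, the sketch below computes the weight of one n-gram in one tweet. The function name and the toy corpus are assumptions for the example, and tweets are represented as lists of already-extracted n-grams.

```python
import math


def tf_idf(ngram, tweet, corpus):
    """TF-IDF of an n-gram in one tweet, given all tweets in the corpus."""
    n_with = sum(1 for other in corpus if ngram in other)
    if n_with == 0 or not tweet:
        return 0.0
    idf = math.log(len(corpus) / n_with)   # log(N tweets / N tweets with n-gram)
    tf = tweet.count(ngram) / len(tweet)   # share of the tweet's n-grams
    return idf * tf


corpus = [
    ["accident", "on", "thika", "road"],
    ["traffic", "on", "thika", "road"],
    ["accident", "at", "junction"],
]
print(tf_idf("junction", corpus[2], corpus))
```

An n-gram that appears in every tweet receives a weight of zero, since the logarithm of 1 is zero.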




                                   Table S3: Features of Tweets
                         N-gram                      Using      Using
                                                   Original Augmented
                                                    Tweet       Tweet
                         accident                       1         1
                         past                           1         1
                         garden                         1         0
                         city                           1         0
                         near                           1         1
                         thika                          1         0
                         rd                             2         0
                         and                            1         1
                         kamiti                         1         0
                         junction                       1         1
                         accident past                  1         1
                         past garden                    1         0
                         garden city                    1         0
                         city near                      1         0
                         near thika                     1         0
                         thika rd                       1         0
                         rd and                         1         0
                         and kamiti                     1         0
                         kamiti rd                      1         0
                         rd junction                    1         0
                         #landmark-name#                0         1
                         #road-name#                    0         2
                         past #landmark-name#           0         1
                         near #road-name#               0         1
                         and #road-name#                0         1
                         Features defined using the number of occurrences
                         of the n-gram in the tweet.
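The count features in Table S3 can be reproduced with a few lines of code. This is a sketch under the assumption that tweets are tokenized on whitespace; the function name is illustrative.

```python
from collections import Counter


def ngram_counts(tweet, n_max=2):
    """Count every 1-gram up to n_max-gram in a whitespace-tokenized tweet."""
    tokens = tweet.split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


original = "accident past garden city near thika rd and kamiti rd junction"
augmented = "accident past #landmark-name# near #road-name# and #road-name# junction"
print(ngram_counts(original)["rd"], ngram_counts(augmented)["#road-name#"])
```

Both values are 2, because "rd" appears twice in the original tweet and the road placeholder appears twice in the augmented tweet.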


   We test two methods for determining whether a tweet reports a crash: Naive Bayes and support

vector machines. Both techniques are commonly used in text classiﬁcation for their ability to handle

high dimensionality, e.g. when the number of features is greater than the number of observations

[7, 8]. The Naive Bayes model is estimated as:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y) \qquad (1)$$

where $y$ is whether the tweet is classified as crash related or not and $x_i$ are all the n-grams that occur in a tweet.
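A minimal multinomial Naive Bayes classifier following equation (1) is sketched below. It works in log space to avoid underflow and uses add-one (Laplace) smoothing, which is an assumption of this example and is not stated in the text; the training tweets are invented.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """docs: list of token lists; labels: class label for each doc."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_nb(model, tokens, alpha=1.0):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((token_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [["accident", "at", "junction"],
        ["accident", "on", "thika", "road"],
        ["heavy", "traffic", "on", "thika", "road"],
        ["traffic", "clear", "on", "waiyaki", "way"]]
model = train_nb(docs, [1, 1, 0, 0])
print(predict_nb(model, ["accident", "near", "junction"]))
```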

   The linear SVM solves the minimization problem:

$$\min \; C \sum_{i}^{N} \left(1 - y_i f(x_i)\right)^2 + \lVert w \rVert^2 \qquad (2)$$

where $C$ is a regularization parameter and $\lVert w \rVert^2$ is a penalty function. Here, $y_i$ equals 1 when the tweet references a crash and -1 when it does not. We use a squared hinge loss function (L2).
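The objective in equation (2) can be evaluated directly for a given weight vector. The sketch below assumes a linear decision function f(x) = w·x + b and uses the standard squared hinge loss, which clips each term at zero so that correctly classified points beyond the margin contribute nothing; that clipping is an assumption of this example rather than something printed in the equation.

```python
def svm_objective(w, b, X, y, C=1.0):
    """C * sum of squared hinge losses + squared L2 norm of w."""
    loss = 0.0
    for x_i, y_i in zip(X, y):
        f = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        loss += max(0.0, 1.0 - y_i * f) ** 2
    penalty = sum(w_j * w_j for w_j in w)
    return C * loss + penalty


X = [[1.0, 0.0], [0.0, 1.0]]   # toy feature vectors
y = [1, -1]                    # 1 = crash, -1 = not a crash
print(svm_objective([1.0, -1.0], 0.0, X, y, C=10))
print(svm_objective([0.0, 0.0], 0.0, X, y, C=10))
```

A larger C puts more weight on misclassified training tweets relative to the penalty on the size of w, which is the trade-off the grid search over C explores.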

   We implement k-fold cross-validation with 4 folds, training the model on 75% of the truth data and testing on the remaining 25% within each fold. Table S4 shows results for select parameters. While the Naive Bayes algorithm performs slightly better on precision, the SVM has higher recall and generally performs better when 2-grams and 3-grams are included. Overall, the F1 statistic, which balances precision and recall, is best for the SVM at 0.95 when 2-grams and 3-grams are included. Given that the overarching goal is to produce a data set of geolocated crashes based on the tweets, better recall is more important than higher precision. The reason is that even if a larger set of general tweets is misclassified as crash related, these tweets are unlikely to be geolocated at the second stage, since they do not discuss a particular crash with a given location. We therefore want to capture as many of the tweets that report crashes as possible at this stage, even if it means capturing slightly more tweets that do not report a crash. The SVM algorithm also has a very high accuracy of 0.93.



                               Table S4: Tweet Classiﬁcation Results
                       Precision Recall      F1      Accuracy N-Grams
                                     Naive Bayes
                        0.938       0.947   0.942     0.919          1
                        0.945       0.949   0.947     0.926          2
                        0.945       0.949   0.947     0.926          3
                                        SVM
                        0.935       0.963   0.948     0.927          1
                         0.94       0.966   0.953     0.934          2
                        0.939       0.967   0.953     0.934          3

                   The table shows best results for both SVM and Naive Bayes.
                   For these results, both models use the original tweet and no
                   features are removed. The Naive Bayes models do not use
                   TF-IDF, while the SVM models do.
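The evaluation procedure can be sketched as follows: a 4-fold split in which each fold trains on 75% of the data and tests on 25%, together with the precision, recall and F1 definitions used in Table S4. The function names and the interleaved fold assignment are assumptions made for illustration.

```python
def k_fold_indices(n, k=4):
    """Yield (train, test) index lists; each index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def precision_recall_f1(y_true, y_pred):
    """Metrics for the positive class (1 = tweet reports a crash)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Recompute the F1 for the SVM row with precision 0.94 and recall 0.966.
print(round(f1_score(0.94, 0.966), 3))
```

Recomputing F1 from the rounded precision and recall in Table S4 can differ from the reported value in the third decimal place, since the table entries are themselves rounded.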




Preparation for Geolocation

Prior to being able to use the geolocation algorithm, two additional pieces need to be prepared. One

relates to identifying types of landmarks that are more common to be mentioned as the location of

a crash in a tweet. In the situation where there might be multiple landmarks with the same name,

the more likely landmark for a crash is the one that should be chosen for the location. The second

relates to identifying the correct location when multiple locations are mentioned in the tweet. We

can use the typical grammatical structure of a tweet to identify prepositions that are used prior to

the correct location of a crash compared to ones that are more likely to be used with locations that

are not close to the crash. Ranking prepositions based on these probabilities makes it possible to

choose the correct location from the possible locations mentioned.


Determining Landmark Types More Commonly Used as the Crash Location

When a landmark name is mapped to multiple locations, the algorithm preferences certain landmark types. To determine which landmarks to preference, we examine which landmark types are more commonly associated with the correct location. We consider cases where (1) one landmark is used to identify the crash location and (2) the landmark name is mapped to locations both near and far from the crash location. We compute the proportion of times each type is near and far from the crash location and divide the proportion near by the proportion far; this ratio indicates how much more likely a candidate of that type is to be near the crash location than far from it.

   Figure S2 shows results. Among tweets considered, a landmark location that is a bus stop is

near the correct location 17% of the time and is far from the correct location less than 1% of the

time, leading to a bus stop being close to the correct location 22 times more frequently than far

from the correct location.
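The near-to-far ratio can be computed as in the sketch below. The input format, the function name and the counts are invented for illustration; the counts were chosen so that bus stops make up 17% of near candidates and just under 1% of far candidates, in line with the figures quoted above.

```python
import math
from collections import Counter


def near_far_ratio(candidates):
    """candidates: (landmark_type, is_near) pairs for candidate locations."""
    candidates = list(candidates)
    near = Counter(t for t, is_near in candidates if is_near)
    far = Counter(t for t, is_near in candidates if not is_near)
    n_near, n_far = sum(near.values()), sum(far.values())
    ratios = {}
    for landmark_type in set(near) | set(far):
        share_near = near[landmark_type] / n_near if n_near else 0.0
        share_far = far[landmark_type] / n_far if n_far else 0.0
        ratios[landmark_type] = share_near / share_far if share_far else math.inf
    return ratios


# Hypothetical counts, not the paper's data.
data = ([("bus_stop", True)] * 17 + [("other", True)] * 83
        + [("bus_stop", False)] * 1 + [("other", False)] * 129)
print(near_far_ratio(data)["bus_stop"])
```

With these invented counts the bus stop ratio is about 22, and types with a ratio above a chosen cutoff would be the ones to preference.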

   In the algorithm, we use the top 6 landmark types (each at least 2.5 times more likely to be near the correct location than far from it) to preference landmarks: bus stop, parking, mall, cafe, transit station and bus station.




Figure S2: Landmark types typically near or far from the crash location when a landmark name is

mapped to multiple locations


Determining Preposition Phrase Tiers

The truth dataset indicates the landmark used to geocode the crash. We examine the phrases that

precede the landmark. Figure S3 shows the top phrases. The phrase “at” precedes the correct

landmark in 42% of tweets and in roughly half these cases “accident at” precedes the landmark.

   We examine the phrases that precede the landmark to guide decision making when more than

one landmark is mentioned. For this, we take all phrases that precede the correct landmark at least

20 times. We then identify cases where two of these phrases appear in a tweet and one of the phrases

precedes the correct landmark; we then calculate the proportion of times each phrase precedes the

correct landmark when the other phrase is also in the tweet. Figure S4 shows results. While ‘at’

is the most common word that precedes a landmark, other phrases that precede landmarks are

more predictive of the correct landmark. For example, when both ‘at’ and ‘near’ appear in the

tweet (and one of them precedes the correct landmark), the landmark is preceded by ‘at’ only 6%

of the time. We use information from these phrase-pairings to divide phrases into “tiers”; if two

landmarks are found in a tweet, we use the landmark whose preceding phrase belongs to the

lower-numbered tier. We develop 6 tiers:


  1. Tier 1: Across phrase-pairs, these phrases precede the correct landmark more than the other

     phrase in all cases (for example, when ‘just after’ and a phrase such as ‘at’, ‘on’, or ‘in’ are

     both in the tweet, ‘just after’ precedes the correct landmark more often than the other phrase).

  2. Tier 2: These phrases precede the correct landmark more than the other phrase in over 90%

     of cases (but less than 100%).

  3. Tier 3: Across phrase-pairs where one of the phrases is “at”, these phrases precede the

     correct landmark more times than “at.”

  4. Tier 4: The phrase “at”

  5. Tier 5: Remaining phrases where, across phrase-pairs, the phrase precedes the correct land-

     mark more often than over half of the other phrases.

  6. Tier 6: Remaining phrases where, across phrase-pairs, the phrase precedes the correct land-

     mark more than at least one other phrase.
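The pairwise comparison behind these tier definitions can be sketched as follows. This is an illustrative reading under stated assumptions: each labelled tweet is represented by the set of candidate phrases it contains and the phrase that precedes the correct landmark. The function name `pairwise_win_rates` and the toy tweets are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def pairwise_win_rates(tweets):
    """tweets: list of (phrases_in_tweet, phrase_before_correct_landmark).

    Returns {(a, b): share of tweets containing both a and b, with one of
    them preceding the correct landmark, in which a is that phrase}.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for phrases, winner in tweets:
        for a, b in permutations(set(phrases), 2):
            if winner in (a, b):
                totals[(a, b)] += 1
                if winner == a:
                    wins[(a, b)] += 1
    return {pair: wins[pair] / n for pair, n in totals.items()}

# Toy labelled tweets, invented for illustration.
tweets = [
    ({"at", "near"}, "near"),
    ({"at", "near"}, "near"),
    ({"at", "near"}, "at"),
    ({"at", "on"}, "at"),
]
rates = pairwise_win_rates(tweets)
```

A phrase whose rate exceeds one half against every other phrase would fall in Tier 1 under the definitions above.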

We modify this list to account for different spellings of certain phrases (e.g., adding “btw” alongside

“between”), and whenever a phrase has the form “accident [word]”, we generalize it to “[crash

word] [word]”, where a crash word is any word such as accident, crash, hit, wreck, etc. This

yields the following phrase tiers:

  1. Tier 1: [crash word] after, [crash word] near, [crash word] outside, [crash word] past, around,

     hapo, just after, just before, just past, near, next to, “opposite”, outside, past, you approach,

     apa, apo, hapa, right after, right before, right past, just before you reach

  2. Tier 2: [crash word] at, before

  3. Tier 3: after

  4. Tier 4: at, happened at, at the, pale

  5. Tier 5: between, from, btw, btwn

  6. Tier 6: along, approach, in, on, opp, to, towards, toward
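A tier lookup of this kind might be sketched as below. The table holds only a subset of the phrases listed above, and the rule of taking the best tier among all matching word sequences that end right before the landmark is an assumption made for illustration, as are the names `CRASH_WORDS`, `PHRASE_TIER` and `phrase_tier`.

```python
# Example crash words taken from the text; the real list is longer.
CRASH_WORDS = {"accident", "crash", "hit", "wreck"}

# Subset of the phrase tiers listed above, for illustration only.
TIERS = {
    1: ["[crash word] near", "just after", "near", "next to", "outside"],
    2: ["[crash word] at", "before"],
    3: ["after"],
    4: ["at", "happened at", "at the"],
    5: ["between", "from", "btw", "btwn"],
    6: ["along", "in", "on", "to", "towards", "toward"],
}
PHRASE_TIER = {p: tier for tier, phrases in TIERS.items() for p in phrases}

def phrase_tier(words_before_landmark, max_len=4):
    """Return the lowest-numbered tier matching the words that end right
    before the landmark, or None if no listed phrase matches."""
    # Generalize "accident [word]" to "[crash word] [word]".
    words = ["[crash word]" if w in CRASH_WORDS else w for w in words_before_landmark]
    tiers = []
    for n in range(1, min(max_len, len(words)) + 1):
        phrase = " ".join(words[-n:])
        if phrase in PHRASE_TIER:
            tiers.append(PHRASE_TIER[phrase])
    return min(tiers) if tiers else None
```

For instance, `phrase_tier(["accident", "at"])` matches both “at” (Tier 4) and “[crash word] at” (Tier 2) and returns the better of the two.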




Figure S3: Top words that precede the landmark that correctly identifies the crash location.




Figure S4: Likelihood of different words preceding the correct landmark




Locating Crash Events

As demonstrated by the example tweets from @Ma3Route in Table S1, the geoparser has to handle

different tweets in different ways. For example, tweet 1 is simple, including the name of one road

and one landmark. Tweet 3 is short and clear as well; however, it identifies the crash location by a

junction instead of a landmark. Tweet 8 uses the Swahili word “apo”, which commonly precedes

a landmark name. Tweet 2 includes both the location of the crash and the location where traffic

starts. This section outlines in detail the different components of the geolocation algorithm, which

are meant to handle these different situations.

   The algorithm to locate an event location from text starts by cleaning the text and extracting

location names of landmarks, roads and areas (e.g., neighborhoods) from the text. Next, the

algorithm restricts location names and their locations to consider; for example, if two landmark

names are found, and one is contained within the other, we only keep the longer one; in addition,

where possible, we restrict locations to those near mentioned roads. The algorithm then chooses

the location names that reference the event location, prioritizing location names primarily by the

words that precede them (e.g., “just after [location]” is used over “toward [location]”). If the chosen

location is not near a mentioned road, we search for landmarks that have a similar name but are

near a mentioned road. Next, we snap the location to the road network. Finally, the algorithm

applies a set of final checks to determine whether any location should be output; for example, if

a road is mentioned but the chosen location is not within 500 m of any mentioned road, the

algorithm does not output a location. The algorithm is described in detail below.




 Algorithm Locate crash/event locations
 Input    Text
          Landmark gazetteer
          Roads
          Areas (e.g., neighborhoods)
          List of event words (e.g., crash, accident, wreck, etc.)
          Prepositions, grouped by tier
          Types, grouped by tier
 Output   Coordinates of event

A. Clean Tweets
  1. Replace “@” with “at”, except when it is preceded by “via” or
     is the last word in the tweet (note 7).
  2. Remove select stopwords (note 8).
  3. Mask common phrases that contain a location but refer to
     something else, such as “[city] bus” (note 9).
  4. Remove hyperlinks and keep only alphanumeric characters
     (e.g., remove punctuation).

B. Extract Locations
  1. Extract exact matches of landmarks, roads and areas.
  2. Extract fuzzy matches of landmarks, roads and areas.
     (a) Break tweets into 1-3 grams.
     (b) For each n-gram, check the Levenshtein distance to gazetteer
         entries. If the word/phrase is 0-4 characters long, ignore
         it; if 5-10, allow a Levenshtein distance of 1; if above 10,
         allow a Levenshtein distance of 2.
  3. Extract landmarks after prepositions. For each preposition in
     the tweet (note 10):
     (a) Take the word after the preposition and extract all
         landmarks that start with that word.
     (b) Go to the next word in the tweet and further restrict
         landmarks to those that contain that word. Repeat until
         doing so would remove all landmarks considered (note 11).
     (c) Among extracted landmarks, determine which landmark has
         the smallest number of words and only keep landmarks with
         that number of words (note 12).

C. Extract point locations from roads
  1. For each road found, check if the length of the diagonal of
     its bounding box is less than 500 m; if it is, take the
     centroid and consider this location to be a landmark (note 13).
  2. If two or more roads are mentioned, find intersections
     between each road pair. If two roads intersect at multiple
     locations, only add the intersection if these locations are
     within 1 km.

 Notes
  7 We found that @[word] often referred to a Twitter handle when
    preceded by “via” or when it was the last word in a tweet;
    otherwise, users were more likely to use “@” as a shorthand
    for “at.” Distinguishing these cases is important as we rely on
    prepositions to prioritize landmark references.
  8 We only remove “a” and “the”; other stopwords may be part of a
    landmark name (e.g., the stopword “and” appears in the
    restaurant “nice and lovely”). We remove these stopwords as we
    later determine whether a preposition precedes a landmark, and
    we consider [preposition] [landmark name] to be equivalent to
    [preposition] [stopword] [landmark name].
  9 In Nairobi, we found that matatus (minibuses) were often
    referred to by the location they traveled to; consequently, we
    mask phrases such as “githurai bus”, “rongai matatu”,
    “machakos minibus”, etc. In masking, we replace each word in
    the phrase with a random sequence of characters. Doing this
    preserves the fact that a word appears at that position in the
    tweet, which may affect procedures such as determining the
    landmark closest to an event word.
 10 This procedure will often capture the same landmarks as captured
    in the preceding steps; however, it helps to capture other
    landmarks where the process for augmenting the gazetteers did
    not generate the landmark name contained in the tweet.
 11 For example, in the tweet “accident at garden city toward
    town”, the algorithm searches for landmarks after ‘at’. It
    first finds all landmarks that contain ‘garden’, then it narrows
    down these landmarks to those with both ‘garden’ and ‘city’. No
    landmark contains ‘garden’, ‘city’ and ‘toward’, so the
    algorithm stops and considers landmarks with ‘garden’ and
    ‘city’.
 12 For example, if ‘garden city’, ‘garden mall’, ‘garden city
    mall’ and ‘airtel money agent rock city gardens’ were extracted,
    the algorithm keeps ‘garden city’ and ‘garden mall’.
 13 These cases are often flyovers and roundabouts.
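A minimal sketch of the cleaning rules in step A and the fuzzy matching in step B.2 follows. It is not the authors' implementation: masking (step A.3) is omitted, phrase length is assumed to count spaces, fuzzy matches are assumed to exclude exact matches (which step B.1 handles), and the names `clean_tweet`, `levenshtein`, `allowed_distance` and `fuzzy_matches` are hypothetical.

```python
import re

def clean_tweet(text):
    """Apply steps A.1, A.2 and A.4 (masking, A.3, is omitted here)."""
    text = re.sub(r"https?://\S+", " ", text.lower())      # drop hyperlinks
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.startswith("@"):
            is_last = i == len(tokens) - 1
            after_via = i > 0 and tokens[i - 1] == "via"
            if is_last or after_via:
                out.append(tok)                  # likely a Twitter handle
            else:
                out.extend(["at", tok[1:]])      # "@" used as shorthand for "at"
        else:
            out.append(tok)
    text = re.sub(r"[^a-z0-9 ]", " ", " ".join(out))       # alphanumerics only
    return " ".join(w for w in text.split() if w not in {"a", "the"})

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def allowed_distance(phrase):
    """Length-based tolerance from step B.2(b); None means 'ignore'."""
    n = len(phrase)
    if n <= 4:
        return None
    return 1 if n <= 10 else 2

def fuzzy_matches(text, gazetteer):
    """Return gazetteer entries within the allowed distance of any 1-3 gram."""
    words = text.split()
    hits = set()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            limit = allowed_distance(gram)
            if limit is None:
                continue
            hits.update(g for g in gazetteer if 0 < levenshtein(gram, g) <= limit)
    return hits
```

Replacing “@” before stripping punctuation matters here, since the alphanumeric filter would otherwise discard the “@” that signals a location.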
D. Restrict landmarks to consider
 1. If the name of a landmark and a road overlap, keep the
    road and remove the landmark (if a landmark and an area
    overlap, we keep both).
 2. If the name of an exact and fuzzy (misspelled) landmark
    overlap, keep the exact landmark
 3. If a landmark name is contained within another, keep the
    longer name.
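The three rules in step D can be sketched as follows, assuming each match is stored with the token span it occupies in the tweet and that "overlap" means sharing at least one token. The dictionary layout and the names `overlaps`, `inside` and `restrict_matches` are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) token spans share a token (end exclusive)."""
    return a[0] < b[1] and b[0] < a[1]

def inside(a, b):
    """True if span a lies within span b and is strictly shorter."""
    return b[0] <= a[0] and a[1] <= b[1] and a != b

def restrict_matches(matches):
    """matches: list of dicts with keys
         name  - matched location name
         span  - (start, end) token positions in the tweet
         kind  - 'landmark', 'road' or 'area'
         exact - True for exact matches, False for fuzzy ones
    """
    keep = []
    for m in matches:
        drop = False
        for o in matches:
            if o is m or not overlaps(m["span"], o["span"]):
                continue
            if m["kind"] == "landmark" and o["kind"] == "road":
                drop = True                    # D.1: road beats landmark
            elif m["kind"] == o["kind"] == "landmark":
                if not m["exact"] and o["exact"]:
                    drop = True                # D.2: exact beats fuzzy
                elif m["exact"] == o["exact"] and inside(m["span"], o["span"]):
                    drop = True                # D.3: longer name wins
        if not drop:
            keep.append(m)
    return keep
```

Landmark and area overlaps are deliberately left alone, matching the parenthetical in step D.1.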

E. Remove landmarks
 1. By roads, areas and tier 1 landmarks
    (a) If a road is mentioned, for each landmark name check
        if any landmarks with that name are near (within
        500 m of) a mentioned road. If this is the case,
        restrict the landmarks in the gazetteer to those that
        are near the road. If no landmarks are near the road,
        do not subset and keep the landmark name (note 14).
    (b) If an area is mentioned (e.g., a neighborhood), follow
        the same steps as above for each landmark.
    (c) If a landmark is mentioned after a tier 1 preposition
        (e.g., “next to”, “just after”), follow the same steps
        as above for each other landmark, checking the distance
        from the other landmarks to the locations of the
        landmark after the tier 1 preposition (note 15).
 2. Dominant cluster and “general” landmarks
    (a) For each landmark name, check if the locations form
        a dominant cluster.
          i. If they do:
            A. Keep the landmarks in the cluster and remove
               the others.
         ii. If they don’t:
            A. Keep landmarks of commonly referenced types
               (e.g., matatu stages); if a landmark does not
               contain a common type, don’t subset. For this
               we use the analysis described earlier on
               determining landmark types more commonly used
               as crash locations.
            B. Re-check which landmarks don’t form a cluster;
               among these, keep landmarks if the name of the
               landmark was not derived from an n/skip-gram
               (i.e., it matches the original name) (note 16).
    (b) Remove a landmark name if it does not form a cluster,
        except if the name follows a tier 1 preposition. (If
        it follows a tier 1 preposition, it is likely the
        correct landmark name even though the exact location
        cannot be found; if it does not follow a tier 1
        preposition, it is more likely to be a spurious
        landmark.)

 Notes
 14 We keep the landmark for now because a later step checks
    for similarly named landmarks near the road, and because
    the extracted road may be incorrect.
 15 Helpful in case the landmark near a tier 1 preposition
    doesn’t form a dominant cluster, but a dominant cluster is
    formed from another landmark mentioned.
 16 For example, if there are 3 landmarks of “garden city”,
    where the original names were garden city, garden city
    mall and garden city bank, keep “garden city”; if no name
    matches the original name, keep all landmarks.
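Step E.1(a), restricting candidate locations to those near a mentioned road while falling back to all candidates if none qualify, might look like the sketch below. It approximates the road by its vertices and uses great-circle distance; a fuller implementation would measure distance to the road's line geometry. The function names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def near_road(point, road_vertices, max_m=500):
    """Approximation: a point is 'near' if within max_m of any road vertex."""
    return any(haversine_m(point, v) <= max_m for v in road_vertices)

def restrict_to_road(candidates, road_vertices, max_m=500):
    """Keep candidates near the road; if none are near, do not subset."""
    near = [c for c in candidates if near_road(c, road_vertices, max_m)]
    return near if near else candidates

# Hypothetical road running east-west, with vertices roughly 220 m apart.
road = [(-1.22, 36.88 + 0.002 * i) for i in range(11)]
candidates = [(-1.221, 36.89), (-1.30, 36.80)]
kept = restrict_to_road(candidates, road)
```

The fallback in `restrict_to_road` mirrors the "do not subset" rule, which keeps a landmark name alive for the later search for similarly named landmarks.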

F. Select landmark names or intersections
 1. If there are multiple location names found (e.g., multiple
    landmark names, multiple intersections):
    (a) Loop through preposition tiers. Within each tier,
        check the following, stopping once a location name
        has been found.
          i. Check if a landmark name comes after the
             preposition.
         ii. Check if one of the road names used to construct
             an intersection comes after the preposition.
    (b) If no location name has been found, loop through the
        preposition tiers again and check whether [landmark
        name] [3 or fewer words] [preposition] occurs; if so,
        keep the landmark name(s) with the fewest words
        between name and preposition.
    (c) If one intersection is found (e.g., if 3 or more roads
        are found and only one pair of roads intersects), use
        the intersection location.
    (d) Otherwise, use the landmark closest to an event word
        (fewest words in between).
 2. If a landmark name was chosen (i.e., not an intersection):
    (a) If multiple landmark names were selected (note 17):
          i. If a road is mentioned, choose landmarks within
             500 m of the mentioned road; if none are near the
             road, don’t subset.
         ii. Choose the landmark closest to the event word
             (this could still result in multiple landmarks).
    (b) If the landmark name is mapped to multiple locations:
          i. Select locations within 500 m of the mentioned
             road; if none are near the road, don’t subset.

 Notes
 17 For example, two landmarks in front of different tier 1
    prepositions.
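The core of step F.1, looping through tiers and falling back to proximity to an event word, can be sketched as below. Steps F.1(b) and F.1(c) are omitted, and the candidate layout and the name `select_location_names` are assumptions for illustration.

```python
def select_location_names(candidates, n_tiers=6):
    """candidates: non-empty list of dicts with keys
         name            - location name found in the tweet
         tier            - tier of the phrase preceding it, or None
         words_to_event  - words between the name and the nearest event word
    Returns the chosen name(s); more than one name can survive.
    """
    # F.1(a): take the names preceded by a phrase from the best tier present.
    for tier in range(1, n_tiers + 1):
        chosen = [c["name"] for c in candidates if c["tier"] == tier]
        if chosen:
            return chosen
    # F.1(d): no tiered phrase precedes any name, so use proximity instead.
    fewest = min(c["words_to_event"] for c in candidates)
    return [c["name"] for c in candidates if c["words_to_event"] == fewest]

# "accident at garden city toward town": "[crash word] at" is Tier 2,
# "toward" is Tier 6, so "garden city" is selected.
candidates = [
    {"name": "garden city", "tier": 2, "words_to_event": 1},
    {"name": "town", "tier": 6, "words_to_event": 4},
]
chosen = select_location_names(candidates)
```

Returning a list rather than a single name reflects step F.2, which handles the case where several names survive.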

G. [If landmark location is not near any mentioned
road] Broaden search to find similarly named
landmarks near the road
 1. Start with all landmarks that are near any mentioned
    road and subset to those that contain the landmark name.
    Take the next word in the tweet and subset to landmarks
    that contain this word. Repeat the process until doing so
    would cause no landmarks to be found. Among these
    locations:
    (a) If a dominant cluster exists, use this location.
    (b) If no dominant cluster exists, further subset locations
        to those where the landmark word in the tweet is at
        the beginning of the landmarks found. If a dominant
        cluster is found, use this location.
          i. If no location is found in the previous step,
             repeat, but check words in the tweet preceding
             the landmark name.

H. Snap to Road
 1. If a road is mentioned, snap location to road
 2. If no road is mentioned, snap to nearest road if road
    within 500 m.

I. Final checks to determine whether location should
be used
 1. If a road is mentioned and the location chosen is more
    than 500 m from any mentioned road, no location is
    output by the algorithm.
 2. If multiple landmarks are mentioned, the landmark closest
    to the crash word is used (note 18), and that landmark is
    more than two words away from the crash word, no location
    is output by the algorithm.
 3. If multiple landmarks are mentioned, a tier 5 or 6 phrase
    precedes the chosen landmark, and the landmark is more
    than two words away from the crash word, no location is
    output by the algorithm.

 Notes
 18 This would happen when no tier 1-6 phrase precedes a
    landmark.
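The three final checks can be expressed compactly, as in the sketch below. The function signature is hypothetical: it assumes the distance to the nearest mentioned road, the number of landmarks mentioned, the tier of the phrase preceding the chosen landmark and the word distance to the crash word have already been computed.

```python
def passes_final_checks(dist_to_road_m, n_landmarks, tier, words_to_event):
    """Return True if the chosen location should be output.

    dist_to_road_m - metres to the nearest mentioned road, or None if no
                     road was mentioned in the tweet
    n_landmarks    - number of landmarks mentioned in the tweet
    tier           - tier of the phrase preceding the chosen landmark, or
                     None if it was chosen by proximity to the crash word
    words_to_event - words between the chosen landmark and the crash word
    """
    # Check 1: chosen location too far from every mentioned road.
    if dist_to_road_m is not None and dist_to_road_m > 500:
        return False
    # Checks 2 and 3: weak evidence (no tiered phrase, or a tier 5/6 phrase)
    # combined with a landmark far from the crash word.
    if n_landmarks > 1 and words_to_event > 2 and (tier is None or tier in (5, 6)):
        return False
    return True
```

Returning no location in these cases trades some recall for precision, since weakly supported locations are dropped rather than guessed.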




Geoparse Tweets - Full Results

Table S5 shows full results of the geoparsing algorithm. In particular, the table shows the value

added by the different data sources used to build the landmark gazetteer; we run the algorithm using

the augmented gazetteer generated from Geonames, Google and OpenStreetMap separately. Results

highlight that the algorithm mainly relies on landmarks scraped from Google Maps; recall and

precision are only slightly worse using Google alone compared to combining all sources. Geonames

performs poorly, and OpenStreetMap performs better but still worse than Google, with recall about

0.2 lower and precision about 0.1 lower than Google when judging whether the algorithm captures

the true crash location.



                                 Table S5: Tweet Geoparse Results
                         Any Location Captured by    Crash Location Determined   Algorithm Cluster
                            Algorithm Close to         by Algorithm Close to          Contains
                           True Crash Location          True Crash Location      True Crash Location
                         Recall      Precision       Recall       Precision      Recall   Precision
 LNEx
   LNEx Aug Gaz          0.674         0.686         0.129         0.132         0.175     0.125
 Algorithm - by Source
   Aug Gaz - Geonames    0.124         0.326         0.112         0.455         0.124     0.446
   Aug Gaz - Google       0.79         0.853         0.645         0.811         0.653     0.777
   Aug Gaz - OSM         0.52          0.691         0.431         0.728         0.446     0.691
 Algorithm - All Sources
   Raw Gaz               0.695         0.757         0.579         0.756         0.591     0.72
   Aug Gaz               0.798         0.857         0.651         0.811         0.656     0.774




Choosing Parameters for Clustering Crash Reports into Unique

Crashes

Multiple people often tweet about the same crash. In order to cluster crash reports into unique

crashes, we cluster by the distance (in kilometers) and the time between reports. To determine optimal

kilometer and time parameters, a team manually determined which crash reports refer to the same

crash. The dataset was double coded by different team members, resulting in two “truth” datasets.

To judge whether crash reports refer to the same crash, the team used the location of the crash, the

time of the tweet and details about the crash in the tweet itself (e.g., extent of injuries,

types and numbers of vehicles, etc.).

   Table S6 shows summary statistics of the maximum distance and time between any two

crash reports in the same clustered or individual crash. Before calculating the statistics, outliers

were removed (we define an outlier as a crash cluster where reported crashes occurred over 24 hours

or over 5 km from each other). Across both truth datasets, around 52% of tweets were clustered

with another tweet, meaning that 48% of tweets are the only tweet reporting a given crash.



                   Table S6: Clustered Tweets Truth Data Summary Statistics
            Variable      Min   Quartile 1   Median   Mean    Quartile 3   Max
                                   Truth Dataset 1
            Hours Diff    0     0.133        0.55     1.68    1.693        23.776
            KMs Diff      0     0            0.013    0.213   0.138        3.328
            N Tweets      2     2            2        3.324   3            44
                                   Truth Dataset 2
            Hours Diff    0     0.161        0.597    1.715   2.144        23.417
            KMs Diff      0     0            0.024    0.276   0.199        3.13
            N Tweets      2     2            2        3.275   3            19


   We examine two common metrics for evaluating clustering performance: the adjusted Rand

index and the Jaccard coefficient [9]. When using our algorithm to cluster crash reports, we test all

combinations of 0.1, 0.5, 1, 2 and 3 kilometers and 1, 2, 4, 12 and 24 hours. For truth dataset 1,

both the adjusted Rand index and the Jaccard coefficient show that 12 hours and 500 m lead to the

best results, while truth dataset 2 shows 2 hours and 500 m (see Figure S5). The difference in results

between the truth datasets likely stems from the exercise being partially subjective, particularly

when limited or no crash details are provided in the tweet text.
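Both metrics compare two labellings by looking at pairs of crash reports. The sketch below uses the standard pair-counting definitions; it is a generic illustration rather than the evaluation code used for Figure S5, and it assumes at least two reports. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb

def pair_counts(truth, pred):
    """Count report pairs by whether they share a cluster in each labelling."""
    both = truth_only = pred_only = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        both += same_t and same_p
        truth_only += same_t and not same_p
        pred_only += same_p and not same_t
    return both, truth_only, pred_only

def jaccard(truth, pred):
    """Pairs together in both labellings / pairs together in at least one."""
    a, b, c = pair_counts(truth, pred)
    return a / (a + b + c) if a + b + c else 1.0

def adjusted_rand(truth, pred):
    """Adjusted Rand index from the contingency table of the two labellings."""
    n = len(truth)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(v, 2) for v in Counter(truth).values())
    sum_b = sum(comb(v, 2) for v in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Toy labellings of five crash reports, invented for illustration.
truth = [0, 0, 1, 1, 2]
pred = [0, 0, 1, 2, 2]
```

Each distance and time combination would be scored by clustering the reports with those thresholds and comparing the resulting labels against a truth dataset.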

   We opt for using thresholds of 500 m and 4 hours. The 4-hour threshold lies between the

optimal values from the two truth datasets (12 hours and 2 hours).
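To illustrate how such thresholds can be applied, the sketch below links any two reports that fall within 500 m and 4 hours of each other and treats connected groups as one crash. The linking rule (single linkage via union-find) is an assumption made for illustration, since this section does not spell out the clustering procedure; the function names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def cluster_reports(reports, max_km=0.5, max_hours=4):
    """reports: list of (lat, lon, hours) tuples, with time in hours.

    Returns one cluster label per report; reports sharing a label are
    treated as the same crash.
    """
    parent = list(range(len(reports)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            lat1, lon1, t1 = reports[i]
            lat2, lon2, t2 = reports[j]
            close_in_time = abs(t1 - t2) <= max_hours
            close_in_space = haversine_km((lat1, lon1), (lat2, lon2)) <= max_km
            if close_in_time and close_in_space:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(reports))]

# Hypothetical reports: two about the same crash, one at the same place
# twelve hours later, and one several kilometres away.
reports = [
    (-1.2200, 36.8900, 8.0),
    (-1.2205, 36.8902, 8.5),
    (-1.2200, 36.8900, 20.0),
    (-1.3000, 36.8000, 8.2),
]
labels = cluster_reports(reports)
```

With single linkage, a chain of reports each within the thresholds of the next ends up in one cluster, so the span of a cluster can exceed 4 hours or 500 m.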




                              Figure S5: Cluster Evaluation Results




References

[1] Ritter A, Clark S, Mausam, Etzioni O. Named Entity Recognition in Tweets: An Experimental

   Study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language

   Processing; 2011.

[2] Gelernter J, Balaji S.   An algorithm for local geoparsing of microtext.    GeoInformatica.

   2013;17(4):635–667. doi:10.1007/s10707-012-0173-8.

[3] Malmasi S, Dras M. Location Mention Detection in Tweets and Microblogs. In: Hasida K,

   Purwarianti A, editors. Computational Linguistics. Singapore: Springer; 2016. p. 123–134.

[4] Middleton SE, Middleton L, Modafferi S. Real-Time Crisis Mapping of Natural Disasters Using

   Social Media. IEEE Intelligent Systems. 2014;29(2):9–17. doi:10.1109/MIS.2013.126.

[5] Al-Olimat HS, Thirunarayan K, Shalin V, Sheth A. Location name extraction from targeted

   text streams using Gazetteer-based statistical language models. Arxiv preprint. 2017;11(17).

[6] Gu Y, Qian ZS, Chen F. From Twitter to detector: Real-time traffic incident detection using

   social media data. Transportation Research Part C: Emerging Technologies. 2016;67:321–342.

   doi:10.1016/j.trc.2016.02.011.

[7] Joachims T. Text categorization with Support Vector Machines: Learning with many relevant

   features. In: Nédellec C, Rouveirol C, editors. Machine Learning: ECML-98. Berlin, Heidelberg:

   Springer Berlin Heidelberg; 1998. p. 137–142.

[8] Aggarwal CC, Zhai CX. Mining Text Data. Boston, MA: Springer; 2012.

[9] Santos JM, Embrechts M. On the Use of the Adjusted Rand Index as a Metric for Evaluating

   Supervised Classification. In: Alippi C, Polycarpou M, Panayiotou C, Ellinas G, editors. Artificial

   Neural Networks – ICANN 2009. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. p.

   175–184.



