Estimating House Prices in Emerging Markets and Developing Economies: A Big Data Approach

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


Introduction
Private property price levels and their movement over time have critical implications for the economy of most countries, but they also play a fundamental role in household (HH) wealth. Buying property is undoubtedly the most significant investment for many families around the world, and reliable information on private property prices and their determinants is essential to HHs' decision-making. The relationship between property prices and HHs' disposable incomes determines what type of property HHs can afford. Therefore, from a policy standpoint, reliable property price data are key to understanding the scale of affordability issues and identifying market failures leading to supply-demand mismatches in the housing sector.
House price data also bear key insights on countries' financial and macroeconomic stability. Disproportionately rising private property prices, or high price-to-income ratios, can be an early indication of imbalances and risks in the financial system (e.g., Anundsen et al. 2016; Drehmann & Juselius 2014). Decreasing house prices, in turn, may be associated with a decrease in HH wealth and a decline in consumption (e.g., Campbell & Coco 2007; Mian et al. 2017). If monitored over time, property prices can function as an early warning system of systemic banking stress or economic downturn. In addition, urban planners, private developers, economists, and policy makers depend on reliable property price data to update zoning regulations, formulate housing policies, and decide how to efficiently allocate scarce resources to support housing solutions where they are needed most.
Despite the importance of property price data for various stakeholders, comprehensive house price data are scarce in developed economies and virtually non-existent for emerging markets and developing economies (EMDEs). If available at all, residential property prices are mostly published in indexed format, which allows changes to be tracked over time; however, indices are not informative about distributional aspects of prices, affordability issues, or the extent to which price segments are underserved in a market. When housing markets mature, properties' characteristics and attributes may be more stable over time. Hence, changes captured by price indices reflect price movements over a relatively constant set of properties in the market. However, in many EMDEs, the type of properties being built by formal developers is changing dynamically. Because these markets are nascent, formal developers and builders in EMDEs are expanding their portfolios to also offer housing solutions to HHs with lower or even informal incomes. In such circumstances, aggregate price indices may not be fit for purpose and may suffer from biases due to rapidly changing underlying property types.
The wide gap in publicly available property price data in EMDEs is a key impediment to extending the understanding of housing markets, widely available for developed economies, to these markets. The determinants of property prices and their drivers over time are heavily researched subjects, but this research relies exclusively on developed economies. If comparable property price data were available in EMDEs, they could provide crucial insights on potential inefficiencies along the housing value chain. Analyzing property price data in emerging economies can also help, for instance, to point towards constraints in developers' ability to access finance, prohibitively high construction costs, or regulatory bottlenecks in land acquisition and titling. In addition, property price data in EMDEs can help point towards affordability issues and potential ramifications for HH consumption and spending. The breadth of these challenges is often difficult to grasp as the housing value chain is very interconnected and complex,

Literature: Collecting and Estimating House Prices
Despite their importance, private property price data are not readily available for most countries and are particularly scarce for EMDEs. Further, the lack of standardization of property prices and the heavy reliance on indices to monitor changes over time make comparisons across countries cumbersome. Addressing this gap, the paper places itself at the crossing of three streams of literature: existing efforts to collect or collate data on residential property prices and to construct residential property price indices (Section 2.1); studies on determinants of house prices (Section 2.2); and a relatively new area of big data and machine learning approaches to estimate (determinants of) house prices (Section 2.3). Finally, we summarize the existing gap within this line of research and outline how we address it (Section 2.4).

Current Efforts to Collect House Price Data and Existing House Price Indices
Since housing plays a key role in the growth of many aspects of a country's economy, including the development of the construction industry, job creation, and improving the living conditions of many HHs, governments have a great interest in understanding property prices and their developments over time. Most advanced economies' statistical offices or central banks, therefore, started collecting data on residential property prices in the 1970s (Knoll et al. 2017). Also, tax authorities, land registries, or real estate associations collect, hold, and sometimes even publish data on residential property prices in many advanced economies. With the 2008-2009 global recession, which many scholars attributed to misalignments in housing and housing-related asset prices, the interest in dynamics in housing markets rose significantly (e.g., Goodhart & Hofmann 2008; Del Negro & Otrok 2007). Since then, several international organizations and central banks have increased efforts to develop global property price indices and have collated real property price data for various, predominantly developed, economies to monitor macro-financial stability and price developments. One early methodology to track property prices is the Case-Shiller National Home Price Index, which measures the value of residential real estate in major US metropolitan areas and serves as a blueprint for subsequent price indices in other developed countries. Extending the collection of residential property prices to EMDEs has been slow, in either indexed or other forms. In the following section, we briefly discuss the most important sources.
First, a primary source for residential property prices, covering a relatively large number of countries, is provided by the Bank for International Settlements (BIS). BIS collects quarterly data on residential property prices for 60 countries, predominantly focusing on advanced economies. Property prices are harmonized as much as possible by BIS according to the recommendations outlined in the Handbook on Residential Property Price Indices (RPPIs), which summarizes best practices in how to calculate property price indices (European Union [EU] et al. 2013). As BIS compiles data from various sources, the data series differ from country to country, varying in frequency, type of property, covered area, priced unit, compilation method, or seasonal adjustment. In addition, while BIS publishes some actual prices, most data are only available in an indexed format, allowing for tracking aggregate price movements over time. BIS does not provide insights on the distribution of property prices. Despite these shortcomings, this database currently offers the most comprehensive data series on house prices.
Second, the International Comparison Program (ICP 2011), which collects prices for a range of goods and services that make up final consumption expenditure and gross capital formation, also captures housing expenditures. The ICP survey collects annual rental prices and dwelling stock data. Rents are either captured as actual or imputed rents (World Bank 2020). In the most recent ICP cycle (2017), participating economies collected rental data for 21 different dwelling types, ranging from one-bedroom apartments to single-family homes.
Third, a more regional-focused data source on property prices is provided by the Organization for Economic Co-operation and Development (OECD), which publishes nominal residential property price indices for OECD countries, as well as price-to-rent and price-to-income ratios. The database particularly focuses on house price developments across regions and cities within countries to capture spatial price variation. For select countries, OECD also offers the number and value of housing transactions. While insightful for advanced economies, this database does not cover any emerging economies and mostly publishes indexed data to track price changes over time.
Fourth, institutions such as the International Monetary Fund (IMF) or the United States Federal Reserve Bank collate property price data from various national sources. IMF's Global Housing Watch platform, for instance, tracks developments in housing markets across the world on a quarterly basis. The database collates property price data from different sources (e.g., BIS, European Central Bank, Federal Reserve, and national sources) for 63 countries, mostly advanced economies, to assess valuation in housing markets. Further, it provides metrics such as price-to-rent and price-to-income ratios. Similarly, the Dallas Federal Reserve Bank's International House Price Database publishes quarterly house prices for 25 mostly developed economies by drawing on national public sources primarily from central banks, statistical offices, or other non-government organizations (Mack & Martínez-García 2011). The data collected by these institutions mostly cover developed economies, are collated from a plethora of secondary sources, and are primarily made available in an indexed format.
Fifth, in collecting and analyzing actual house prices across emerging economies, the Center for Affordable Housing Finance in Africa (CAHF) is unique in its efforts. It systematically collects house prices for African countries by surveying local housing experts on the cost and size of the cheapest house built by a private developer. Figures are published in CAHF's annual housing finance yearbook, covering the last decade. CAHF's approach also turns the conversation on house prices away from mean or aggregate measures and to the lower tail of the formal market. From the perspective of policy dialog on affordable housing, this approach may be more appropriate. However, the usability of the data for policy purposes suffers partly because the price point provided i) represents only the extreme lower end of the formally developed new housing units and ii) is not paired with information regarding the quantity supplied at or near this price range.
Finally, in recent years, crowd-sourced platforms such as Numbeo, which rely on user inputs on property prices in various locations around the world, have added their own house price index along with publicly available per square foot price ranges for properties within the city center and outside the city center. These platforms add more distributional aspects to the average house prices and point predictions published by other indices, but suffer from the limited reliability of self-reported data.
Despite the apparent issues in comparability across countries, the listed sources are the most comprehensive databases on property prices currently available for a larger set of countries. Therefore, many papers draw on these indices to conduct country or region-specific analyses on property price developments over time (e.g., Girouard et al. 2006; Igan & Loungani 2012; Yoshino & Helble 2016). One of the earliest systematic presentations of house prices is a historical time series data set of nominal residential property prices in 13 advanced economies by Borio et al. (1994). Some studies that provide comparative assessments combine the data sources outlined above or enhance them with some primary data collection on some additional countries that are not yet covered by the indices (e.g., Deghi et al. 2020).

Determinants of House Prices
The volume of research on the housing market, particularly on estimating its impact on real economic activity, has grown steeply since the global financial crisis of 2008-2009. Most studies in this realm investigate the various channels through which housing and house prices affect macroeconomic and financial outcomes, particularly as housing bubbles are associated with significant output losses (e.g., Catte et al. 2004; IMF 2008; Jordà et al. 2015). Single-country studies on house prices and house price developments mainly focus on developed economies, particularly OECD countries, EU countries, and the United States or Canada (e.g., Alter & Mahoney 2021; Davis & Heathcote 2005; Knoll et al. 2017; Philiponnet & Turrini 2017).
Most studies investigate determinants of house prices over time. Jordà et al. (2016), for instance, have gathered time series data on disaggregated bank credit for 17 advanced economies since 1870. With historical data on the total value of the residential housing stock (structures and land), the authors relate household mortgage debt to asset values, showing that the rise in mortgage credit has financed a substantial expansion of home ownership from about 40 percent in 1950 to 60 percent in the 2000s. Similarly, Knoll et al. (2017) assess how house prices have evolved over time for 14 advanced economies, gathering historical house price data to estimate what drives changes in house prices. The authors show that changes in house prices are largely attributed to changes in land prices. This finding is corroborated by others who also attribute rising property prices to sharp increases in residential land prices, while construction costs have remained relatively stable over time (e.g., Glaeser & Ward 2009; Gyourko et al. 2013). In major metropolitan areas, it is not uncommon for the cost of land to exceed 40 percent of total property price; in extreme cases, like San Francisco, the share can stretch to as much as 80 percent (McKinsey Global Institute 2014). Gao et al. (2019) dissect property features into two groups when predicting house prices: non-geographical features, such as the number of bedrooms and floor space area, and geographical features, such as the distance to the city center and the quality of nearby schools. The importance of location is also documented by Gröbel and Thomschke (2018), who show that housing prices are largely determined by the physical location of the property. In addition, the number of bedrooms and the size of a private property are consistently found to be positively related to the property price (e.g., Fletcher et al. 2000; Garrod & Willis 1992; Rodriguez & Sirmans 1994). Other attributes studied include crime rates (e.g., Ceccato & Wilhelmsson 2020) and proximity to transportation (e.g., Zhang et al. 2021; Zong & Li 2016). Other authors estimate that house prices have particularly increased since the financial crisis in 2008 due to the rise of economic activity paired with unusually low mortgage interest rates in most advanced economies (Claessens & Schanz 2019). Also, price changes in major cities are hypothesized to be driven by institutional investors trying to find high yields or safe assets in a low-interest rate environment (IMF 2018; Gauder et al. 2014).
While there is a plethora of literature on property price determinants in developed economies, studies on EMDEs are scarce. Some single-country studies focusing on EMDEs analyze existing house price data published by commercial banks, such as Absa, Standard Bank, and First National Bank for South Africa (e.g., Balcilar et al. 2011; Luüs 2005). Other authors collect their own data by either surveying real estate agencies to estimate the relative importance of housing attributes to house prices (Owusu-Manu et al. (2019) for Ghana), by surveying developers (Libertun de Duren (2018) for peri-urban areas in Brazil and Mexico), or by conducting a household survey to collect data on housing costs (Uwayezu & de Vries (2020) for Kigali city in Rwanda). High property prices in EMDEs are often attributed to prohibitively high building costs due to the need to import materials, the shortage of local skills, and the absence of financial mechanisms that allow for materials to be bought in bulk (e.g., Gardner & Pienaar 2019). While unique in their efforts to shed some light on the housing market in emerging economies, these studies provide only a snapshot of the housing market of one country (or a handful of countries), often with a regional focus or a focus on the biggest cities.

Big Data Approaches and Machine Learning for Private Property Price Estimation
In addition to the traditional approaches to collecting private property price data discussed in the previous section, in recent years, with big data and machine learning gaining momentum in economics, more studies have started to gather property price data from online listing websites. While less than a decade ago most private properties were listed for sale in local newspapers or with private realtors, today much of the listing activity has moved to websites concentrating on housing advertisements.
Analyses that draw on property price data collected from listing websites allow for fine-grained spatial and temporal assessments of the entire housing market. Further, big data approaches to private property prices will enable one to investigate a particular housing market in more detail or add distributional aspects to the mostly averaged house prices made available by indices discussed in Section 2.1.
A predecessor of web scraping approaches to collect property price data includes Kim's (2007) study on Vietnam. The author manually collated over 5,000 observations on property prices and property attributes drawing on classified advertisements in Vietnam's most prominent newspaper. Applying a hedonic price model, Kim assesses the price differences between Hanoi and Ho Chi Minh City to investigate the impact of social norms on property prices. Over time and with the increased penetration of property listing websites, private property price collection efforts have transitioned to online listings where data collection can be automated. Anenberg & Laufer (2017), for instance, use listing information to construct a new house price index to monitor house price developments in the US. Using property listings, the authors construct a new repeat-sales house price index that describes house values at the contract date when the price is determined rather than the closing date when the property is transferred. Other big data price collection efforts include, for instance, scraping of online listings in Great Britain (e.g., Rae 2015), the US (e.g., Boeing et al. 2021), the Netherlands (ten Bosch & Windmeijer 2014), Türkiye (Keskin & Watkins 2017), Japan (Sadayuki 2018), or China (Hu et al. 2019; He et al. 2019; Wang et al. 2020).
In their data collection efforts, most authors focus exclusively on a localized housing market (i.e., a particular region, city, or neighborhood) in developed countries for which well-structured property listing websites with a plethora of private properties listed for sale are available. Thus, while very comprehensive in scope, most web scraping efforts for private property prices are centered on developed markets. Similar approaches in EMDEs, particularly in low-income economies, are scarce. A notable exception is Gnagey and Tans (2018), who collate a data set of over 64,000 properties in 2016 from listing websites to estimate house prices in Indonesia. The authors find that desirable housing attributes, structural quality, advantageous location on major thoroughfares, and secure land tenure increase property asking prices.
Almost all studies that collect price data from online listing websites focus on only one or a few markets within a particular region. One notable exception is the recent HouseLev database project, which assembled house prices for 40 countries, mainly European and advanced economies, including some emerging economies such as Türkiye or the Russian Federation (Bricongne et al. 2019). The authors do not solely rely on web scraping for all 40 countries. Instead, they draw on national accounts data and implement web scraping as a "fallback methodology" in case of missing data. As the authors use both methods, national accounts as well as web scraping, for a sub-sample of European countries, they can compute the median level of estimated upward bias arising from the use of listed rather than transaction prices, which is then applied as a correction factor to improve comparability of price level data obtained with the two methods (Bricongne et al. 2019: 6). HouseLev, to the best of our knowledge, is the most comprehensive web scraping project of private property prices, primarily focused on developed economies.
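The correction-factor idea described above can be illustrated with a minimal sketch. It assumes the bias is measured as the ratio of listed to transaction price levels and that the median ratio is used as a divisor; the function names and the calibration figures are hypothetical and are not taken from Bricongne et al. (2019), whose exact procedure may differ.

```python
from statistics import median

def listing_bias_factor(pairs):
    """Median ratio of listed to transaction price across markets
    for which both price levels are observed (assumed definition)."""
    ratios = [listed / transacted for listed, transacted in pairs if transacted > 0]
    return median(ratios)

def correct_listed_price(listed_price, factor):
    """Scale a listed price down by the estimated upward bias."""
    return listed_price / factor

# Hypothetical (listed, transaction) price levels per square meter
# for three markets where both sources exist.
calibration = [(2100.0, 2000.0), (3300.0, 3000.0), (1575.0, 1500.0)]

factor = listing_bias_factor(calibration)
# Apply the factor to a market where only listed prices are available.
corrected = correct_listed_price(2520.0, factor)
print(round(factor, 3), round(corrected, 1))
```

Using the median rather than the mean keeps the factor robust to a single market with an unusually large gap between asking and transaction prices.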
Advanced price estimation techniques have also evolved with the increased usage of big data approaches to collecting property price data from listing websites. Traditionally, the hedonic price model, which draws on Lancaster's consumer theory, has been the predominant model to estimate property prices (Lancaster 1966; Rosen 1974). Property prices are modeled in multiple regression analysis, assessing the association between property price and several hedonic attributes through parametric estimation (Oladunni & Sharma 2016). Attributes frequently applied in hedonic price models include, for instance, number of bathrooms, number of bedrooms, area size, neighborhood, or accessibility of the property (e.g., Borba & Dentinho 2016; Can 1992; Krol 2013). While very simple in their interpretation, hedonic price models require the fulfillment of strong model assumptions, including the functional form of the conventional hedonic pricing model, homoscedasticity, independence, and the absence of multicollinearity (e.g., Anderson 2000; Pérez-Rave et al. 2019). In recent years, the applicability of alternative methods to hedonic price estimation has expanded, and machine learning (ML) has emerged as an alternative for predicting house prices (Borde et al. 2017; Čeh et al. 2018; Fan et al. 2006; Mullainathan & Spiess 2017; Pérez-Rave et al. 2019; Truong et al. 2020; Yan & Zong 2020). Within that realm, Fan et al. (2006) constitute one of the earliest contributions that move beyond hedonic price models to predict property prices. Applying a decision tree technique, the authors explore the relationship between house prices and housing characteristics, which aided the determination of the most important variables for price predictions.
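For reference, the hedonic price model discussed above is frequently written in a semi-logarithmic form. The specification below is a generic textbook version given for illustration; the notation is an assumption and is not taken from any of the cited studies.

```latex
\ln P_i = \beta_0 + \sum_{k=1}^{K} \beta_k \, x_{ik} + \varepsilon_i
```

Here $P_i$ is the price of property $i$, $x_{ik}$ is its $k$-th hedonic attribute (for example, number of bedrooms or floor area), $\beta_k$ is the implicit price of that attribute, and $\varepsilon_i$ is an error term assumed to be independent and homoscedastic, which are among the assumptions listed above.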
While ML techniques are comparatively weak in inference, they have strong predictive power, manage to fit complex data, are very flexible in assumptions on functional form without overfitting, and work well in out-of-sample estimations (e.g., Athey 2018; Mullainathan & Spiess 2017). ML estimators such as random forest (RF) have become a suitable and frequently applied alternative to hedonic price estimates, particularly for property price estimation. While RF and other decision tree-based models also rely on model assumptions, they are better at modeling non-linear relationships than simple multiple linear regression.
Authors applying ML to price estimations mostly focus on narrowly defined housing markets in developed economies such as Ljubljana, Slovenia (e.g., Čeh et al. 2018), Gangnam, Republic of Korea (e.g., Hong et al. 2020), London, Great Britain (e.g., Levantesi & Piscopo 2020), Arlington County, USA (e.g., Wang & Wu 2018) or housing markets in upper-middle income economies such as Mamak District, Ankara, Türkiye (Yilmazer & Kocaman 2020), Petaling Jaya, Selangor, Malaysia (Mohd et al. 2019), or St. Petersburg, Russian Federation (Antipov & Pokryshevskaya 2012). In assessing the housing sector, many of these authors contrast the predictive performance of ML algorithms with standard regression techniques. Across the board, the authors find that RF (significantly) outperforms parametric estimation techniques in terms of accuracy and predictive power.

Data Gap: EMDEs Are Largely Absent in Property Price Analyses
This overview of existing studies and data sources within the realm of house prices points to five major gaps that we try to address with this paper. First, there is a striking data gap in the availability of house prices, particularly for EMDEs. Most existing property price compilation efforts concentrate on developed economies, publish data only in indexed format, and do not report underlying actual house prices. This may be attributed to the fact that national sources, such as central banks or statistical offices on which these indices base their data, do not collect, report, or publish property price data. Further, as underlying data to these indices are very country- and context-specific, they fit the purpose of monitoring price changes over time within a specific country but do not facilitate cross-country assessments. Some countries, for instance, only consider prices for family homes in the capital while other countries use flats in urban areas for the index. The same applies for prices: some report the transaction prices while others draw on listed prices, while yet others average prices (cf. BIS database; Mack & Martínez-García 2011). Mack & Martínez-García (2011), who collate publicly available national sources to build a database of (nominal and real) house prices for developed economies, acknowledge this flaw, outlining that the main contribution of their database is "sorting out the existing data by country, selecting the most similar series and documenting the differences across countries to clarify the extent to which international sources can be made comparable for empirical analysis purposes" (Mack & Martínez-García 2011: 3). Achieving comparability across countries with the existing data sources is almost impossible.
Second, while price indices present equilibrium outcomes of housing markets, they do not cover details about, broadly, the quantity of housing. They often only include a particular type of housing for which prices are tracked. Whereas in high-income economies the latter may remain relatively stable in the short term, in EMDEs, with rapidly expanding formal housing markets, quantity and type of housing are important elements to capture. They provide context to changes in prices as the sample over which prices are indexed changes, and as price, quantity, and type of housing supplied are highly interrelated. Also, they have important policy implications in the context of markets' ability to supply homes for different market segments, and formal developers' ability to reach lower income groups. The measurement of these dimensions of the housing markets is absent across EMDEs.
Third, primary data collection for actual, non-indexed house prices is still somewhat limited and, if available, almost exclusively covers advanced economies. With a few notable exceptions, there is a severe lack of contributions in the literature on property price estimations in EMDEs. Property price data for EMDEs are virtually non-existent, both for within-country assessments and even more so for cross-country comparison. Most studies on house prices obtain data from readily available sources such as land registries, real estate agencies, or commercial banks, or tap into established indices. Given the significant effort required to collect original data on house prices, such attempts are very limited. Since the scope and focus of these studies differ, or as they purely rely on price indices, comparing property prices across studies is not feasible.
Fourth, efforts to investigate property prices in EMDEs mainly concentrate on analyzing the determinants of mean house prices. Distributional efforts in property price collection for different income segments within emerging economies are largely absent. CAHF is unique in its effort to approach house price estimation from the perspective of low-cost developers. Yet, CAHF takes it to the other extreme. It only collects the cheapest price of a house built by a formal developer in African countries and does not factor in otherwise transacted housing units in the formal housing market. While insightful, this approach does not allow pricing the entire housing market or understanding the quantity of "affordable" houses available to different income segments.
Lastly, studies assessing the historical developments of private property prices are concerned with measuring financial (in)stability, which they attribute to distorted household mortgage debt to asset values ratios. A myriad of studies estimating house prices in developed economies were published after the collapse of the housing bubble and the resulting financial crisis in 2008/2009, mainly concerned with how to identify housing bubbles in the first place. An examination of property prices from the perspective of affordability and demand-supply mismatches for different income segments is absent. In addition, studies assessing house price data usually draw on different methodologies to estimate property prices and rely on varying data sources. Hence, property price estimations are not comparable across studies, and scholars have only recently started to use big data approaches to collect actual property price data for a larger number of economies. Yet, their focus mostly remains on developed countries.
To fill these gaps, our paper extends the novel approaches in collecting house prices through a web scraping approach to emerging economies, thus addressing the substantial data gap in EMDEs. Further, this paper offers a methodology contributing to a more distributional understanding of private property prices in emerging economies. It also provides a comprehensive methodology to estimate a standard house price that allows for consistent price comparison across countries. These data can then facilitate the extension of the scope of the analysis to affordability assessments of property price data and the segmentation of the housing market, particularly focusing on EMDEs.

Data Collection and Processing: House Price Data in Emerging Economies
In this section, we outline how we collect house price data through a web scraping approach for five markets: Albania, Costa Rica, Morocco, Pakistan, and South Africa. We demonstrate how a big data approach, hitherto employed mainly in developed economies with good data quality, can also be extended to EMDEs to collect price data efficiently. We collected 200,000 unique property transactions for these five countries in an otherwise data-scarce environment. The web scraped data reflect the entire housing market and complement the available indexed data that (mostly) report average property prices only.
We selected these five economies to cover different regions and to factor in varying country contexts, highlighting specificities of web scraping and data processing in EMDEs. These include, for instance, types of properties listed, unique forms of data entry specific to EMDEs, or cultural aspects. We do not strive for the representativeness of these five countries for all EMDEs but seek to exemplify the unique challenges of applying a web scraping approach in EMDEs. Nevertheless, transferring this approach to other EMDEs, especially those with lower data quality, will come with additional unique challenges (as discussed in more detail in Section 3.3).
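To make the general idea of turning listing pages into structured, unique records concrete, the following is a minimal sketch using only the Python standard library. The HTML layout, class names, field names, and sample values are all hypothetical; real listing websites differ by country and would need site-specific parsing, and dropping exact duplicates is only one simple way to arrive at unique records, not necessarily the procedure used in this paper.

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collects one dict per <div class="listing"> (hypothetical page layout)."""
    FIELDS = {"price", "bedrooms", "area", "city"}

    def __init__(self):
        super().__init__()
        self.listings = []
        self._current = None   # dict for the listing being read
        self._field = None     # name of the field whose text comes next

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if tag == "div" and cls == "listing":
            self._current = {}
            self.listings.append(self._current)
        elif tag == "span" and cls in self.FIELDS and self._current is not None:
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

def clean(raw):
    """Convert scraped strings to typed values (thousands separators removed)."""
    return {
        "price": int(raw["price"].replace(",", "")),
        "bedrooms": int(raw["bedrooms"]),
        "area_m2": float(raw["area"]),
        "city": raw["city"],
    }

def unique_listings(records):
    """Drop exact duplicates, keeping the first occurrence of each record."""
    seen, out = set(), []
    for r in records:
        key = (r["price"], r["bedrooms"], r["area_m2"], r["city"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

# Hypothetical page with one listing posted twice.
SAMPLE_PAGE = """
<div class="listing"><span class="price">1,250,000</span><span class="bedrooms">3</span>
  <span class="area">120</span><span class="city">City A</span></div>
<div class="listing"><span class="price">1,250,000</span><span class="bedrooms">3</span>
  <span class="area">120</span><span class="city">City A</span></div>
<div class="listing"><span class="price">640,000</span><span class="bedrooms">2</span>
  <span class="area">75</span><span class="city">City B</span></div>
"""

parser = ListingParser()
parser.feed(SAMPLE_PAGE)
listings = unique_listings([clean(r) for r in parser.listings])
print(len(parser.listings), len(listings))
```

In practice, the cleaning step is where country-specific data entry conventions (local number formats, units, or free-text fields) would need to be handled.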

Web Scraping House Prices in EMDEs
The transaction price would be the ideal source to obtain comparable property price data. Typically, these data can be found in land registries or tax authorities, collated from real estate agencies, collected through online surveys, or obtained through appraisals or valuations as part of the mortgage process. However, none of these sources are feasible for automated data collection in EMDEs as the various institutions holding these price data do not yet have a standardized way of collecting, publishing, or even digitizing them. In addition, in markets characterized by lax regulation or enforcement, transacting parties may under-declare property prices to avoid additional registration costs or taxes.
We opted to collect property price data for EMDEs by web scraping real estate websites. While not yielding transaction prices, obtaining listing prices of formal properties is a viable alternative to gathering price data in EMDEs, where data are otherwise non-existent. At least in the context of developed markets, strong evidence exists that listing prices are correlated with, and a good leading indicator for, transaction prices (e.g., Ardila et al. 2021; Anenberg & Laufer 2017; Lyons 2019). For each economy, we scrape property prices and additional data points for all available listings, capturing, to the extent possible, location aspects. Scraping unique property listings has several advantages. First, it allows us to provide actual property price data of an entire housing market in EMDEs, facilitating analysis beyond aggregated or indexed data. Particularly in EMDEs, we expect significant differences in prices between the biggest business city and rural areas, and more considerable skewness in data even within cities. Second, by web scraping property prices of the entire formal housing market, we can analyze sub-markets in greater detail. Third, collecting house price data from the entire formal, online housing market allows us to also capture the quantity of housing available at different price segments and can, therefore, provide an overview of the distributional aspects of the housing market within a given economy. These distributional price data are very relevant for additional analysis as they can, for instance, be paired with household-level income data for affordability assessments. Finally, an overview of the entire housing market can reveal supply-demand mismatches, particularly regarding the price segments in which formal housing market activity is generally low or absent altogether.
We start the web scraping process by identifying the most up-to-date and complete websites that list private property prices for sale in the five EMDEs. We identify up to three relevant listing websites per economy. Websites were selected based on the following aspects: (i) websites with the most comprehensive number of up-to-date listings; (ii) websites that offer broad ranges of properties and do not only cater to the luxurious segment (i.e., avoiding websites exclusively targeting expats etc.); and (iii) websites that offer structured data entries on housing attributes including price and size. We limit ourselves to up to three websites since we notice considerable cross-postings in additional, usually less comprehensive, websites. We then scrape all residential properties that are listed at one point in time for sale on these websites along with all available housing features, including price, size, type (i.e., whether the property is an apartment or a house), location, number of bedrooms, number of bathrooms, and sometimes amenities such as garage, time of construction, number of floors, etc.
We extracted online listing data for the entire formal housing market of five EMDEs at one point in time, between April 2020 and August 2020. While this falls within the onset of the COVID-19 pandemic, insights on how house prices were impacted in EMDEs are qualitative and largely anecdotal. Commentary on the matter focuses on the affordability challenges for HHs rather than specifically on changes in property prices.11 Beyond qualitative insights, comprehensive analyses on price changes due to the COVID-19 crisis are preliminary and focused mostly on developed economies (e.g., Pfeifer & Steurer 2020 for Vienna and London; or Bricongne et al. 2021 for the UK). While the results are not transferable to EMDEs, they still offer some context into a largely under-researched area. On the impact of COVID-19 on housing markets in the UK, Bricongne et al. (2021) show that while the number of offers per week dropped during the first lockdown period, house prices did not change significantly (maximum of 2.6 percent increase). Pfeifer & Steurer (2020) make a similar observation for the housing market in London, while showing that the housing market in Vienna shows an upward trend following the COVID-19 crisis. Despite the timing overlap, we therefore cannot draw on existing literature to determine whether bias exists in our data or, more importantly, the direction of that bias.
As a first effort to scrape private property prices in several EMDEs, we faced unique challenges compared to similar efforts in developed markets, such as issues pertaining to the number of observations available per website, the organization and reliability of data, measurement units provided, and the overall reliability of the websites. In many EMDEs, property listing websites are often not the primary source for transactions. Often, buyers and sellers turn to real estate agents and personal interactions. Yet, online platforms are becoming increasingly popular for transacting goods and services, including properties. In Africa, for instance, Jumia.com, which is an online platform combining an e-commerce marketplace, classified websites, and applications, is widely used across the continent. In Nigeria, Property Pro (formerly Jumia) is the number one property transaction website, covering about 65 percent of the Nigerian online real estate market (Nairametrics 2018).

11 Some regional analyses in Latin America qualitatively point to an economic slowdown and increased investor uncertainty dampening growth in the short term, but also to the COVID-19 pandemic having shifted consumer preferences to larger properties with more outdoor space. Another analysis for India, for instance, reported that house prices stagnated in 2020/2021, a trend attributed, among others, to the receding demand due to the COVID-19 pandemic (Reuters 2021).
Additionally, EMDEs' websites, particularly those that do not have a regional spread like Jumia.com in Africa, have limited formal standards regarding data entry. Many websites in emerging economies do not have consistent data on an array of property characteristics and provide somewhat limited information on the listed properties. Often, they only include some pictures of the property, the listed price, and a phone number through which the seller can be reached. Formal developers, who have started to also list newly built properties on online platforms (in addition to their own online or offline platforms), provide slightly more structured information on the transacted property. Yet, they are bound by the format of the online platform, which often only requires submission of property price and size. Few developers provide exhaustive property descriptions in free text format, which, if extracted, need to be processed for data analysis through text mining. Hence, property data obtained through web scraping in emerging economies, in our experience, will not be as exhaustive in terms of property features as the data found in publications on developed markets (cf. Section 2.2).
Furthermore, across websites in EMDEs, there is no standardized way to record the size of the property. Sometimes, the website does not provide the option to insert size information at all. In addition, user-provided information on size might not necessarily comply with the unit required by the platform (e.g., users insert square feet even though the platform requires square meters). The matter is even more complicated in economies where local measurement units for properties are used alongside more "standardized" measurements. South Asian websites (Nepal; Pakistan) allow for the insertion of different size units including Biga, Kattha, Dhur, Ropani, Aana, Paise, and Daam, alongside square feet and square meters. However, not all users consistently specify the measurement unit, making data cleaning cumbersome. In addition, particularly for houses, size data can be somewhat muddled as it is unclear whether the plot size or the usable property size is indicated. To account for this difference, we distinguish apartments and houses (cf. Section 4.1) and, where available, use plot size for houses to also account for the value of the land, which, in some EMDEs, can be a significant portion of the property price (cf. Section 2.2).
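As a minimal sketch of the unit harmonization described above, the following Python snippet converts user-entered sizes to square meters. The function name, the unit labels, and the choice to return None for unlabeled or unsupported units are illustrative assumptions, not the procedure used in this paper. Only the square-foot factor is a fixed definition; local units such as those listed above would need verified, country-specific factors before being added.

```python
# Conversion factors to square meters. Only units with an unambiguous
# definition are included; local units are deliberately left out because
# their factors must be verified per country (assumption of this sketch).
SQM_PER_UNIT = {
    "sqm": 1.0,
    "sqft": 0.09290304,  # exact: (0.3048 m) ** 2
}


def to_sqm(value, unit):
    """Return the size in square meters, or None if it cannot be trusted."""
    if value is None or value <= 0:
        return None
    factor = SQM_PER_UNIT.get((unit or "").strip().lower())
    if factor is None:
        # Missing or unsupported unit: flag the listing for manual review
        # instead of guessing which unit the user meant.
        return None
    return value * factor
```

Returning None rather than guessing keeps ambiguous listings out of the per square meter price until the unit has been confirmed.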
Finally, in some EMDEs, listing websites do not specify whether a property is for rent or for sale. Usually, rental properties can easily be distinguished by relatively low prices. However, in some EMDEs, it is common to pay one year's rent upfront. In these instances, it is challenging to discern low sales prices from annual rents in cases where listings do not distinctly indicate sale versus rent. In addition, price data do not always include a currency marker, which is mainly problematic in countries where both euros and US dollars are used to transact properties in addition to local currencies.
When the aim is to estimate representative property prices of the housing stock in a country or region, scraping at least 0.5-1 percent of the number of HHs in that area is considered to be a large enough sample (cf. Bricongne et al. 2019). The same principle applies when the statistical population being analyzed is the universe of transacted properties in the market over a given period: in developed economies where residential property markets are formalized, it is possible to obtain the total number of transactions (the statistical "universe") and thus sample appropriately to achieve representativeness. In EMDEs, by contrast, we expect most transactions to occur informally and outside of what is observable publicly. Through web scraping, we constrain our analysis to the formal housing market and to what is transacted online. With this approach, we are able to obtain house prices for the formal housing market but are unable to infer the degree to which the estimations apply to informal housing markets, which in some EMDEs are admittedly large. We aim for representativeness of the formal housing market only, and to achieve this, we scrape entire websites to cover all available listings. Despite these efforts, we encounter smaller sample sizes in some countries, which are likely to widen margins of error (cf. Section 3.3). Annex 1 provides an overview of the sample coverage and percentage of formal households scraped.
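The 0.5-1 percent rule of thumb above amounts to a simple ratio. The sketch below, with hypothetical function names and a default threshold set to the lower bound of that range, illustrates the check; any numbers passed to it are placeholders, not figures from Annex 1.

```python
def coverage_share(n_listings, n_households):
    """Scraped listings as a share of the households in the same area."""
    if n_households <= 0:
        raise ValueError("n_households must be positive")
    return n_listings / n_households


def is_large_enough(n_listings, n_households, min_share=0.005):
    """Check the lower bound (0.5 percent) of the rule of thumb."""
    return coverage_share(n_listings, n_households) >= min_share
```

For example, 6,000 listings in an area with 1,000,000 households give a share of 0.6 percent, which passes the lower bound, whereas 3,000 listings (0.3 percent) do not.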

Data Processing: Data Cleaning and Outlier Removal
As with any data set that is obtained from user-inserted data, the scraped data is prone to incorrect, inconsistent, or missing information. Most online listing platforms do not run quality checks on the listed properties or require fully populated identification of property features. Preparing the data for analysis, we diligently cleaned the web scraped data, removing data entry errors, duplicates, and outliers. Given the issues outlined in the previous section, which are inherent to EMDEs, data cleaning is more tedious and time-consuming than for more structured data likely to be obtained in developed economies. We describe the data processing steps in detail below, illustrating descriptive statistics of the various stages of the process (Table 1).

Duplicates
The first step in processing the data is identifying and flagging repeat data and duplicates. These mostly arise for two reasons. First, many EMDEs' property websites allow for the re-submission of the same property within some days' interval. Realtors mainly use this option to restore the property to the top of the search results list to improve visibility on the website. Second, duplicates may also arise because of cross-listing of properties across different platforms. Ideally, we want to create a data set that removes both occurrences. Hence, we deduplicate the data set to obtain what we call the original data set, dropping all exact duplicates that either have the same listing ID or that include the same title and description. Typically, properties with the same title and the same description are a clear indication of a repeat entry of a property on the same website. The title and description of properties, however, might bear similarities in instances where multiple, newly built apartments are advertised within the same complex. In these cases, property features such as price, size, address, or number of bedrooms might also be identical while referring to unique listings. To retain these observations in the data set, we only remove exact duplicates with identical values for price, size, bedrooms, title, and description. A downside of this approach is that we run the risk of keeping observations in the data set where the title or description has been slightly altered during the re-submission of the same property listing to the website.
As we assume that we will retain some duplicates in the data set, we also perform a more rigorous deduplication where we remove all data that could potentially constitute a duplicate to understand how this alters our estimations. In this stringent deduplication process, we remove all observations that have the same values on the available property features only (price, type, bedrooms, bathrooms, and city) and disregard the title and the description of the property. While we note that this procedure is highly likely to also remove observations that are in fact unique but share the same property features, we perform this robustness test to ensure that repeat data do not drive estimated property prices. More sophisticated duplicate removal would include the use of text analysis techniques to measure the similarity of the title or description of the property and remove those observations that have only been slightly changed during re-submission. Given the significant time effort of this technique, we opted for the more stringent data removal as a robustness test for price estimations (Annex 2).
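The two deduplication passes described above can be sketched as follows. The dictionary keys (id, title, description, price, size, type, bedrooms, bathrooms, city) and the function names are hypothetical, and keeping the first occurrence of each repeated record is an assumption of this sketch rather than a documented detail of our procedure.

```python
EXACT_KEYS = ("price", "size", "bedrooms", "title", "description")
FEATURE_KEYS = ("price", "type", "bedrooms", "bathrooms", "city")


def deduplicate(listings):
    """Drop repeats that share a listing ID or identical price, size,
    bedrooms, title, and description; the first occurrence is kept."""
    seen_ids, seen_keys, kept = set(), set(), []
    for item in listings:
        key = tuple(item[k] for k in EXACT_KEYS)
        if item["id"] in seen_ids or key in seen_keys:
            continue
        seen_ids.add(item["id"])
        seen_keys.add(key)
        kept.append(item)
    return kept


def deduplicate_stringent(listings):
    """Robustness variant: keep one listing per combination of property
    features, ignoring title and description entirely."""
    seen, kept = set(), []
    for item in listings:
        key = tuple(item[k] for k in FEATURE_KEYS)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept
```

In this sketch, two apartments in the same complex with different titles survive the first pass but collapse into one record in the stringent pass, which is the trade-off discussed above.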

Data Filtering
Next, we filter the data by excluding scraped data that are clearly not residential properties. These include storage units, garages, parking lots, undeveloped or agricultural land, or commercial properties. In addition, we truncate the data on price and size to exclude data entry errors and rental data. These include, for instance, rental data likely erroneously captured as sales prices (particularly for properties listed with yearly rents), spam or negotiable listings often detectable by "1" entered as the sales price, and random data entries on square meter data. We apply a direct data filter to remove obvious errors and undesired data to obtain the truncated data set. We assume that all observations of below 9 square meters (sqm) and above 3,000 sqm are either data entry errors or properties that cannot be considered residential properties (e.g., storage units; large farmland). Also, we assume that properties of less than 9 sqm are not habitable for one person, aligning with the definition of the UN (UN-Habitat 2007). Regarding the truncation on price data, we remove any properties below 5,000 US dollars and above 50 million US dollars to account for data entry errors, rental prices that are accidentally listed as sales, as well as (a very typical feature in EMDEs) entire apartment complexes that are sold in bulk as an investment project. The major issue with apartment complexes or several apartment units being sold in bulk is the mismatch between the size and price data. Often, the price reflects the price of the overall apartment complex while the size reflects that of a single unit. Since it is often impossible to infer the actual per unit price and size, we exclude these properties to avoid distortion. While we apply a context-driven data filter to maintain as many observations as feasible in the truncated data set, we also apply a more rigorous winsorization to the data, common in large data sets such as ours (e.g., Bricongne et al. 2021 for HouseLev Data). We remove observations below the 1st and above the 99th percentile of price and size and outline how this winsorization alters summary statistics (cf. Annex 3).
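The filters described above can be sketched in a few lines of Python. The field names sqm and price_usd and both function names are hypothetical; the bounds are the ones stated in the text. The percentile step is written as removal of observations outside the 1st and 99th percentiles, following the description above, and the percentile interpolation is whatever the standard library's default method produces, which is an assumption of this sketch.

```python
import statistics


def truncate(listings, min_sqm=9, max_sqm=3000,
             min_usd=5_000, max_usd=50_000_000):
    """Keep listings whose size and price fall inside the stated bounds."""
    return [
        x for x in listings
        if min_sqm <= x["sqm"] <= max_sqm
        and min_usd <= x["price_usd"] <= max_usd
    ]


def trim_percentiles(listings, keys=("price_usd", "sqm")):
    """Remove listings below the 1st or above the 99th percentile of any key."""
    bounds = {}
    for key in keys:
        # quantiles(..., n=100) returns the 99 percentile cut points.
        cuts = statistics.quantiles([x[key] for x in listings], n=100)
        bounds[key] = (cuts[0], cuts[-1])
    return [
        x for x in listings
        if all(bounds[k][0] <= x[k] <= bounds[k][1] for k in keys)
    ]
```

The fixed bounds are context-driven and keep as many observations as possible, while the percentile step depends only on the empirical distribution and therefore removes a fixed share of each tail.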

Outlier Removal
Having obtained the truncated data set, we perform additional outlier removal to ensure that skewed data do not drive estimations. Heavily, positively skewed property price data seem to be particularly acute in EMDEs where very luxurious properties catering to expatriates or foreign investors are transacted. To avoid analyzing severely skewed data, we employed two different approaches. First, we right censored the data to remove luxurious residential properties that are not targeted at the local housing market. In doing so, we use Numbeo, a crowd-sourced global data platform that reports consumer prices, including private property prices in most countries' largest cities. As Numbeo data is likely to be biased towards data entry from higher-income individuals (with internet access), we consider Numbeo's data maximum as the "true" maximum. We therefore treat properties within the truncated data set that exceed the maximum per square meter price reported in Numbeo as outliers and remove them to obtain the right censored data set.
Second, to prevent outliers at both tails of the distribution from distorting our estimations, we perform multivariate outlier removal on the truncated data set based on the robust Mahalanobis distance of each observation in the sample.12 With this outlier detection technique, we remove outliers throughout the entire distribution, but mostly concentrated on the left- and right-hand tails.
Given the numerous data quality issues outlined above, we consider the second avenue of outlier removal, the more restrictive technique, most appropriate for our purposes and hence use the multivariate outlier removal technique to obtain the final data set. All other data sets are retained for robustness check purposes and to illustrate the data processing only (cf. Table 1). Price estimations and predictions are only performed on the final data set.
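To illustrate the idea behind outlier removal with a robust Mahalanobis distance, the sketch below works on two variables per listing (for instance log price and log size; this choice is an assumption). It is a simplified stand-in, not the estimator used for the final data set: it runs concentration steps in the spirit of the minimum covariance determinant from a single starting point, rescales the covariance with the two-dimensional normal consistency factor, and flags points beyond a chi-square cutoff. The function names, the 75 percent core share, and the 97.5 percent cutoff are all illustrative choices.

```python
import math


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def sq_distance(point, center, cov):
    """Squared Mahalanobis distance via the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def flag_outliers(points, keep_share=0.75, level=0.975, steps=20):
    """Return one boolean per point; True marks a multivariate outlier."""
    h = max(3, int(keep_share * len(points)))
    subset = list(points)
    for _ in range(steps):
        # Concentration step: re-estimate on the h closest points.
        center, cov = mean_cov(subset)
        ranked = sorted(points, key=lambda p: sq_distance(p, center, cov))
        if ranked[:h] == subset:
            break
        subset = ranked[:h]
    center, cov = mean_cov(subset)
    # The covariance of the central core understates the spread; rescale it
    # with the consistency factor for bivariate normal data.
    alpha = h / len(points)
    if alpha < 1.0:
        q = -2.0 * math.log(1.0 - alpha)      # chi-square(2) quantile at alpha
        factor = alpha / (1.0 - (1.0 - alpha) * (1.0 + q / 2.0))
        cov = tuple(factor * c for c in cov)
    cutoff = -2.0 * math.log(1.0 - level)     # chi-square(2) quantile at level
    return [sq_distance(p, center, cov) > cutoff for p in points]
```

Because location and scatter are estimated from the central core of the data only, a few extreme listings cannot inflate the covariance and thereby mask themselves, which is the main reason to prefer a robust distance over the classical one.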

Data Sets Illustrating Data Processing
Table 1 summarizes the property price data of the five economies for the different stages of the data processing. In the original and truncated data set, the means of price, size, and per square meter price are (much) greater than their medians as the distribution is positively (and in some cases strongly) skewed by outliers. While more robust statistics such as the median and the interquartile range (IQR) stay relatively consistent across data sets, the standard deviation and mean drop significantly from the original data set to the final data set. This pattern holds across the five countries and suggests that the outlier removal process, while comprehensive on distance metrics, does not significantly alter the balance of the right and left tail and the order of the distribution.
In Albania, the difference in standard deviation between the original data set and the final data set is stark despite the relatively low number of outliers being removed. In Morocco, the right censored data set is the same as the truncated data set as the maximum price listed in Numbeo is larger than the maximum price in the truncated data set; hence, no right censoring is applied here. In South Africa, the truncated data set, particularly if compared to the other four countries, excludes a relatively larger share of data entry errors and potential rental data. This might be attributed to large farms being included for sale on the website we used for South Africa (property24.com). In Costa Rica, the multivariate outlier removal technique detected particularly high-end, luxurious properties.

Note: The original data set contains the original set of all listings. The truncated data set retains listings that contain sale prices, size data, and whether the property is an apartment or house and truncates the data based on sqm < 9 or sqm > 3,000 and Price < US$ 5,000 or Price > US$ 5,000,000. The right censored data set retains unique listings that contain sale prices, size data, and whether the property is an apartment or house, truncates the data based on sqm < 9 or sqm > 3,000 and Price < US$ 5,000 or Price > US$ 5,000,000, and additionally removes the most extreme price data points on the upper end of the spectrum. The final data set retains thorough listings with reasonable values for price, size, and per square meter price, cleaned through multivariate outlier removal. SD = Standard Deviation; IQR = Interquartile Range.

Limitations
While applying a web scraping approach to obtain price data in EMDEs is a very cost-effective and efficient way to collect data, particularly compared to conducting expensive surveys, there are several limitations that apply particularly in the context of EMDEs.
To start with, the web scraping approach does not necessarily yield data that are representative of all properties in the market as we are only able to capture properties of sellers who have access to the internet and who are able and willing to post their property online. By the same token, accessing house prices on real estate listing websites in emerging economies requires buyers to have access to these online listings. This might not always be the case, especially in lower-income segments of a given market. Particularly in developing areas, information density is low and might lead to data blind zones (Li et al. 2019). Recent research shows that online platforms used for home sales, even in developed markets, may reproduce and even intensify existing forms of inequality within cities (Boeing et al. 2021; Angelo & Vormann 2018).
While internet access is less of a concern in the five countries we outline here (cf. Annex 1), expanding the methodology to other countries might become problematic. In Burundi, for instance, only 5 percent of the overall population use the internet either via computer, mobile phone, or other digital devices (World Bank 2022). In comparison, in Brazil, close to 74 percent of the population use the internet (World Bank 2022). In countries with relatively low internet penetration rates, HHs might rely on alternative pathways to buy properties: real estate agents, classified ads in newspapers, or informal, personal interaction. Hence, the web scraping approach might not be suitable to capture local property markets where online advertisements are not frequently used and might, therefore, weaken the generalization of the results to localized markets.
Second, the price data collected are concentrated in countries' biggest business cities and urban centers. This is not surprising, since urban centers are the place where most new housing units are being built, responding to the accelerated urbanization rates currently observed in EMDEs. Further, urban dwellers are more likely to formally transact their property and to use online sources to sell or buy properties. Given the diversity of urbanization across EMDEs, the level of geographical disaggregation differs significantly. Disaggregated data for geographical areas beyond the major business city might not be sufficiently large to provide price estimations beyond the largest urban area. Due to data limitations beyond the biggest business cities and the absence of location markers on many housing listing websites in EMDEs, highly complex and spatially heterogeneous housing markets cannot fully be delineated.
Third, the data listed on real estate websites include newly developed properties and the secondary housing market, which might bias house price estimations. In addition, some new housing developments are sold as investment projects, often tailored towards foreign investors. These properties are usually sold in bulk, i.e., entire apartment complexes containing several apartment units. Accounting for this potential bias, we conduct careful, multivariate outlier removal. In addition, we differentiate between property types (apartments and houses) and provide distinct price estimates for both property types.
Fourth, the final transaction price is likely to be different from the listed price, which often appears to be the price ceiling that precludes the possibility of sales at higher prices (Horowitz 1992). Furthermore, the listed prices advertised online represent the user-inserted price, which could include either the appraised values from some third party such as a tax assessor, or the self-appraised property values of homeowners. Regarding the latter, several studies have pointed towards a large variance of self-appraised values; in large enough samples like ours, however, positive and negative errors tend to cancel each other out (e.g., Follain & Malpezzi 1981; Goodman & Itter 1992). The difference between the transaction and the listed property price is dependent on multiple factors including the overall state of the housing market, the demand for housing, the time the property remains on the market before willing and able buyers come forward, as well as cultural aspects pertaining to, e.g., negotiation. Despite these issues, in the absence of transaction price data sets, listing prices offer a good proxy to estimate the state of the housing market as researchers have consistently found rather low deviations between listed price and transaction price (Arnott 2009; McGreal & Taltavull de La Paz 2013; Haurin et al. 2010).

Descriptive Statistics
Across the final data set of the five EMDEs in our sample, we observe different patterns in terms of availability of apartments versus single-family houses (Figure 1). While in Costa Rica, Pakistan, and South Africa, the numbers of apartments and houses are evenly distributed, Albania and Morocco have many more apartments than houses available within the data. In South Africa and Costa Rica, the right-hand side of the distribution is dominated by rather expensive single-family houses and only a few expensive apartments. Similar patterns are observable in the price-size relationship (Figure 2). While in Pakistan the price and size differences between houses and apartments are marginal, South Africa and, to a lesser extent, Costa Rica have noticeable price differences between houses and apartments, which could potentially be attributed to composition effects as houses and apartments are not equally located in all places. The strength of observed correlation between house price and size also varies across countries (Figure 3). In South Africa, this correlation is weakest among apartments, houses, and overall. This suggests that size may not be the primary driver of price, and that other attributes collected (e.g., location) may have more explanatory power.
Equally insightful are frequency distributions of smaller-sized apartments and houses within the data (Annex 4). Across countries, units smaller than 200 sqm are usually apartments. In Costa Rica and Pakistan, however, there are a significant number of smaller-sized houses, compared to the other emerging economies in our sample. In Pakistan, houses are dominated by 5-Marla14 houses (equal to about 126 sqm), which are considered a typical house for a small family. 5-Marla houses are particularly prominent in Lahore, Rawalpindi, Islamabad, and Peshawar.

Application: Estimation of the Basic House Price
This section demonstrates how the large volumes of property price data collected for select EMDEs can be used beyond descriptive statistics. To do so, we introduce the notion of the Basic House Price (BHP), the price of a standard house that is defined identically across all markets. By holding the type of house constant, the BHP aims to provide a data point on price that is independent of the distribution of type/quality of housing that varies widely across markets. The BHP is a key concept that allows for the comparison across and within EMDEs, assessing critical drivers of price and performance of housing markets at the lower end of the price spectrum.
We apply a machine learning technique, Random Forest (RF), to estimate the BHP from the collected web scraped data. Compared to Ordinary Least Squares (OLS) regression, Random Forest has consistently been found to perform better and provide more accurate price predictions (cf. Section 2.3). While the results are presented for five countries by way of application in the next section, the methodology introduced can be rolled out for all emerging economies.

Defining a Basic House
Housing costs reflect the value of the land, the price of the house, the age, condition, and location of the property, as well as the local market. Also, private property prices depend on macroeconomic as well as demographic conditions including migration, urbanization rates, population growth, income growth, and a country's housing finance system, including current interest rates and the availability of mortgage lending for all segments of the population. Other aspects relevant to house prices include challenges on the supply side such as a restrictive regulatory environment with lengthy permit granting processes, a shortage of labor and low mobility, as well as high construction costs.
In estimating the Basic House Price, we focus on formal housing built by private developers, through public-private partnerships between developers and governments, or by private individuals. We disregard projects that are purely government-sponsored, or housing that is self-built and probably not transacted in the housing market. Formal housing combines specializations in the housing value chain to deliver titled properties that can be pledged as collateral for a mortgage, that are structurally sound, and that comply with local planning standards and building codes (World Bank 2015). As formal housing often remains unaffordable to low-income HHs in many emerging economies, many families in these economies resort to incremental self-building. Self-building is particularly common at the outskirts of larger cities or in smaller towns. A recent study in India, for instance, found that 62 percent of newly financed houses are self-built (Das et al. 2018). In self-built environments, the initial house serves as an anchor for a multi-room home that accommodates multiple unrelated people or households (World Bank 2015). While self-builders help provide shelter to many families for whom the alternative is often homelessness, we disregard these houses for this project as self-built houses are often highly insecure in terms of tenure, do not necessarily comply with quality housing standards, building codes, or zoning regulations, and are not transacted in formal housing markets.15 Also, we do not consider factors such as HH preferences over a set of amenities or locations that might differ across HHs and countries. Given the limited information available and to maintain comparability across countries, we exclude these preferences from our model.16 Finally, we are only concerned with home ownership and defer scraping of rental data to further research. The relationship between property prices and rental prices has been discussed in depth elsewhere (e.g., Campbell et al. 2009; Engsted & Pederson 2015; Gallin 2008).
We define a basic house as a formally supplied 50 square meters (sqm) one-bedroom, one-bathroom apartment located in an urban area within a given country, assumed to provide basic municipal or on-site services including water, sanitation, road access, and an energy source.
While we presented summary statistics for both houses and apartments, the BHP deliberately only includes apartments. We constrain the BHP to apartments for comparability purposes and to avoid potential distortions that can be attributed to the different reporting of size (plot size versus usable surface size) in houses. In addition, by reporting the BHP exclusively for apartments, we also account for the ongoing debate regarding the need to increase the housing density in emerging economies, particularly in cities that experience an influx of migration and growth, through densification of existing settlements or the building of multi-story, complex buildings. This is particularly relevant for Africa, where cities are 20 percent more fragmented compared to cities in Asia, more expensive, and less accessible for most (Lall et al. 2017). In some African countries, the densification, which is not served by the market, takes place in the informal realm. In many countries, single-family homes built on a plot of land are turned into mini-compounds where a main house is surrounded by 'backyard shacks' that are rented. This phenomenon of backyarding is particularly well documented for South Africa, where backyarding increased from 1.1 million in 2011 to about 1.8 million in 2016, providing many families an informal way of overcoming the limitation of housing supply in urban areas (e.g., Brueckner et al. 2018).17 Densification of houses has many beneficial effects, including a reduction of land use costs as well as the cost of connecting to utility infrastructure and services, particularly in areas of accelerated urbanization (e.g., Kurvinen & Saari 2020). While we deliberately apply a narrow definition of the BHP, the methodology presented in this paper can easily be transferred to other comparative units similar to the BHP that might be more suitable for other researchers' focus.

Estimating the Basic House Price: Random Forest Estimation
To estimate the BHP for each country, we run the following predictive regression specification using house-level data that we obtain by web scraping online listings:
$$Price_i = f(Size_i, Char_i, Location_i)$$
where $Price_i$ is the listed price of property $i$; $Size_i$ is its size in square meters; $Char_i$ is a vector of characteristics of property $i$, including the number of bedrooms and bathrooms; and $Location_i$ is a vector denoting the location of property $i$ and includes, where available, the municipality, county, and/or city.
The predictive framework above is estimated in its linear form using OLS and non-parametrically using the RF model. ML approaches to predicting house prices form an expanding list that includes Random Forests, Quantile Regression, LASSO Regression, Adaptive Regression Splines, and Neural Nets (cf. Steurer et al. 2021), as well as gradient boosting machines (GBM) and support vector machines (SVM) (e.g., Ho et al. 2021; Truong et al. 2020). Since previous research has found that Random Forest algorithms yield the most accurate predictions, we chose this non-parametric estimation technique and present other estimations as robustness checks (Mohd et al. 2019; Mullainathan & Spiess 2017; Pérez-Rave et al. 2019).
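As a sketch of what the linear form could look like, assuming the predictors enter additively with an idiosyncratic error term, the OLS version of the framework can be written as below. The coefficient symbols ($\beta_0$, $\beta_1$, $\gamma$, $\delta$) and the error term $\varepsilon_i$ are illustrative notation and are not taken from the paper.

```latex
\[
Price_i = \beta_0 + \beta_1 \, Size_i + \gamma' \, Char_i + \delta' \, Location_i + \varepsilon_i
\]
```

Under this reading, the RF estimation replaces the linear index on the right-hand side with a flexible function $f(\cdot)$ of the same predictors that is learned from the data.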
RF (Breiman 2001) has recently gained popularity in property price prediction. RF models are based on classification and regression trees, which follow binary rule-based decisions that indicate how an input is related to its predictor variable (cf. Yoo et al. 2012). The RF is random in two ways: (1) each tree is based on a random subset of the observations, and (2) each split within each tree is created based on a random subset of variables (Grömping 2009: 311). In RF models, node splitting does not use all predictors, as is conventional in regression trees. Instead, each node is split using a random subset of predictors chosen at that node (e.g., Breiman 2001; Liaw & Wiener 2002). Hence, RF is an ensemble tree-based learning algorithm that averages predictions over many individual trees using bootstrap aggregation (also known as bagging) (Breiman 2001).
Applied to the real estate sector, RF maps each vector of house characteristics to a predicted value. The prediction function takes the form of a tree that splits at every node given the value of a particular housing characteristic (e.g., sqm; number of rooms) (Mullainathan & Spiess 2017). Given its very flexible functional form, RF is well suited for out-of-sample predictions and for varied data structures. Unlike other econometric estimation techniques, RF models do not require training data to be normally distributed, which is particularly beneficial for property price research in EMDEs, where data might be heavily skewed.
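The mechanics described above (bootstrap samples of listings, a random subset of predictors at each split, and averaging over trees) can be illustrated with a deliberately small, from-scratch sketch. It is not the implementation used in the paper: the synthetic listings, the three features (size, bedrooms, a city indicator), the price formula, and all function names are assumptions made purely for illustration.

```python
import random

def sum_sq_dev(values):
    """Sum of squared deviations from the mean (the split criterion)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, prices, n_try, rng, depth=0, max_depth=4, min_leaf=5):
    """Grow one regression tree; each split considers only n_try random features."""
    leaf_value = sum(prices) / len(prices)        # mean price of listings in this node
    if depth >= max_depth:
        return leaf_value
    best = None                                   # (sse, feature, threshold)
    for feature in rng.sample(range(len(rows[0])), n_try):
        for threshold in sorted({r[feature] for r in rows})[1:]:
            left = [p for r, p in zip(rows, prices) if r[feature] < threshold]
            right = [p for r, p in zip(rows, prices) if r[feature] >= threshold]
            if min(len(left), len(right)) < min_leaf:
                continue
            sse = sum_sq_dev(left) + sum_sq_dev(right)
            if best is None or sse < best[0]:
                best = (sse, feature, threshold)
    if best is None:                              # no admissible split: keep as leaf
        return leaf_value
    _, feature, threshold = best

    def subtree(keep):
        idx = [i for i, r in enumerate(rows) if keep(r[feature])]
        return build_tree([rows[i] for i in idx], [prices[i] for i in idx],
                          n_try, rng, depth + 1, max_depth, min_leaf)

    return (feature, threshold,
            subtree(lambda v: v < threshold), subtree(lambda v: v >= threshold))

def predict_tree(node, row):
    """Follow the binary rules down to a leaf and return its mean price."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if row[feature] < threshold else right
    return node

def fit_forest(rows, prices, n_trees, n_try, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample (with replacement)
        forest.append(build_tree([rows[i] for i in idx],
                                 [prices[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Bagging: average the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)

# Synthetic listings (hypothetical data): (size in sqm, bedrooms, large-city flag).
rng = random.Random(1)
rows, prices = [], []
for _ in range(300):
    size, bedrooms, city = rng.choice(range(30, 160, 10)), rng.choice([1, 2, 3]), rng.choice([0, 1])
    rows.append((size, bedrooms, city))
    prices.append(900 * size + 4000 * bedrooms + 15000 * city + rng.gauss(0, 3000))

forest = fit_forest(rows, prices, n_trees=30, n_try=2, seed=1)
basic_house = (50, 1, 1)      # 50 sqm, one bedroom, in the hypothetical large city
bhp = predict_forest(forest, basic_house)
print(round(bhp))
```

Because every leaf value is a mean of training prices, the predicted price of the basic unit always lies within the range of observed prices; a production analysis would use an established RF library rather than this toy version.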
While relatively new to property price estimation, RF models have a variety of advantages over traditional estimation techniques, particularly in EMDEs. First, RF models outperform other algorithms, offering more precise price estimates (cf. Section 2.3). Second, housing markets in EMDEs often have a series of sub-markets clustered around housing size, type of housing, or income group. Traditional estimations like hedonic price models would often fail to capture these sub-markets. By contrast, if the data set sufficiently covers the characteristics of the property, the RF model is expected to replicate the complex structure of the property price determination process more sensitively (cf. Hong 2020: 142). Third, RF models do not require a detailed model specification and are hence more suitable for EMDEs with potentially more skewed distributions. Finally, while hedonic price models have been geared more towards inference, RF models focus more on prediction (Yoo et al. 2012).

Parameter Optimization
The model-training process starts by randomly splitting each country's data set into training and testing data, ensuring a random sort order. We split each country's data set into two subsets: 50 percent of the data are used for training, and 50 percent are used for testing (validation) (cf. Schonlau & Zou 2020). The 50-50 split is the most common split in RF applications. Alternate splits are also tested, and the results are presented in Annex 5. More in-depth discussions of the effect of alternate splitting options are offered elsewhere (cf. Biau 2012; Ishwaran 2014).
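A random 50-50 split per country can be sketched in a few lines. The helper name `split_half`, the seed, and the placeholder listings are hypothetical; the paper does not state which tool or seed was used.

```python
import random

def split_half(listings, seed=42):
    """Shuffle one country's listings and split them 50/50 into train and test."""
    shuffled = list(listings)                 # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)     # random sort order, reproducible via seed
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]

# Placeholder listing IDs for two hypothetical countries.
listings_by_country = {"A": list(range(10)), "B": list(range(7))}
splits = {country: split_half(rows) for country, rows in listings_by_country.items()}

for country, (train, test) in splits.items():
    print(country, len(train), len(test))
```

With an odd number of listings, the extra observation falls into the testing half in this sketch; that tie-breaking rule is an arbitrary choice.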
Having split the data set into training and testing data, we tune the hyperparameters to determine the model with the highest testing accuracy, focusing on the number of sub-trees and the number of variables randomly investigated at each split. A benefit of RF is that few hyperparameters have the potential to strongly influence the model's performance (cf. Hong 2020). RF does not require an external cross-validation procedure to estimate the model's accuracy. Model selection and parameter tuning are driven by the parameters that produce the lowest out-of-bag (OOB) errors.18 First, we fix the number of sub-trees (number of iterations). As RF OOB error rates converge once the number of iterations is large enough, we set the iterations to 500 for all models instead of tuning the number of iterations for each country's data set individually (cf. Breiman 2001; Schonlau & Zou 2020). While some scholars applying RF spend a fair amount of time tuning the number of sub-trees, recent research has shown that increasing the number of trees does not harm the model and that the biggest performance gain is achieved within the first 250 trees (Probst & Boulesteix 2018). To check the robustness of this assumption for our data, we iteratively run the model for two countries, testing incrementally how increasing the iterations from 10 to 500 alters the OOB error rate. As error rates stabilize with increasing iterations for both countries, we choose the highest number of sub-trees (500) for all our models (Figure 3).

Figure 3 (panels: Albania, Morocco)
Second, we select the number of variables to randomly investigate at each split, that is, the depth of the decision trees. RF models applied to property price estimation in developed markets often have many property attributes to choose from (e.g., square meters, bedrooms, bathrooms, garage, age of property, location, distance to markets, etc.). In these scenarios, to select the best RF model, authors often remove less important property attributes one at a time to estimate the relative performance of the model (e.g., Hong 2020) or use "only" the ten most important predicting variables in the final model (e.g., Čeh et al. 2018). Selecting the number of attributes at which the OOB error rate is lowest is another common decision factor in RF model selection (Schonlau & Zou 2020).
18 The error of the Random Forest is approximated by the OOB error during the training process. Each tree is built on a different bootstrap sample which, by random chance, leaves out about one-third of the observations. These left-out observations for a given tree are referred to as the OOB sample. Finding parameters that produce a low OOB error is often a key consideration in model selection and parameter tuning (cf. Schonlau & Zou 2020: 6).
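The statement that a bootstrap sample leaves out about one-third of the observations can be checked with a short simulation. This is a minimal sketch on placeholder data (1,000 observations, 200 bootstrap draws, both chosen arbitrarily); it compares the simulated share of left-out observations with the textbook value $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ for large $n$.

```python
import random

def oob_fraction(n_obs, rng):
    """Share of observations never drawn in one bootstrap sample of size n_obs."""
    drawn = {rng.randrange(n_obs) for _ in range(n_obs)}   # sample with replacement
    return 1 - len(drawn) / n_obs

rng = random.Random(0)
n_obs = 1000
fractions = [oob_fraction(n_obs, rng) for _ in range(200)]
average_oob = sum(fractions) / len(fractions)
theoretical = (1 - 1 / n_obs) ** n_obs    # probability that a given row is never drawn

print(f"simulated OOB share:   {average_oob:.3f}")
print(f"theoretical OOB share: {theoretical:.3f}")
```

These left-out rows are what allows each tree to be evaluated on data it did not see, which is why the OOB error can stand in for an external validation set during tuning.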
Since the number of property attributes is rather limited in most EMDEs that we cover, we include all available attributes in the RF model. For most of the five EMDEs presented in this paper, this includes at least the type of property (apartment or house), size of the property, location (city, region, or district, depending on availability), number of bedrooms, and number of bathrooms. The data for Costa Rica, unfortunately, do not include the number of bedrooms and bathrooms, and hence only Size, Type of Property, and City are included as predictor variables. The exact variables used for each country are summarized in Table 2. Since all property attributes that we use for property price estimation have consistently been found to be relevant for price predictions (cf. Section 2.2) and since the number of property attributes is limited overall, we abstain from successively identifying the optimal number of features in the RF model and include all available attributes in our final model.19

Results: Private Property Prices in Five EMDEs
In this section, we discuss the results on private property prices in five emerging economies across different regions: Albania, Costa Rica, Morocco, Pakistan, and South Africa. We selected these economies to demonstrate how a big data approach can be applied to notoriously data-scarce environments such as EMDEs.

Basic House Prices in Five Economies and Their Largest Cities
Having provided an overview of the availability of houses and apartments in the market and discussed descriptive statistics on the price and size of all private properties in the data, we now present the estimation of the BHP, which focuses solely on apartments. Table 3 summarizes the results of the estimation model for the property price of a Basic House as defined in Section 4.1. To compare performance, we use the same explanatory variables available for every country across models (as outlined in Table 2). All results are robust to a more rigorous removal of potential duplicates in the data (Annex 2) as well as to the application of alternate splits (Annex 5).
Local property markets have their own characteristics stemming from the market itself and the products offered. The national averages in Table 3 mask differences within countries. Particularly in capitals or the largest business cities, house prices are expected to be higher than in less urbanized areas. To capture the different price dynamics, we provide price estimations for the BHP for the countries' largest cities in Table 3. The estimations show that among the five countries, at the national level and in US$ terms, Pakistan has the cheapest BHP (US$ 21,849), followed by Morocco (US$ 53,282), South Africa (US$ 63,745), Albania (US$ 71,205), and Costa Rica (US$ 113,064). In PPP$, however, the price levels for a basic apartment are more equal across the five countries. Visualization of the raw data in Figures 1 and 2 offers some intuition behind the cross-country differences. Comparing Pakistan and Costa Rica, at opposite ends of the spectrum, we observe that the availability of apartments in Pakistan is concentrated at the lower end of the price spectrum, while the distribution in Costa Rica is much more even (Figure 1). Moreover, while the price-size relationship (Figure 2) clearly points to apartments being listed at lower prices than similar-size houses in Pakistan, the opposite is true in Costa Rica. As the BHP represents the typical estimated market price for a standard 50 sqm apartment, the prevalence of more luxurious or expensive apartments in Costa Rica is expected to drive up the benchmark price.
Within countries, there are also significant differences across regions and cities. In South Africa, for instance, differences in the BHP between cities are stark. House prices in Cape Town are at the higher end of the spectrum, where even a basic house is priced at US$ 106,536. Cape Town is one of the most popular tourist destinations in Africa, and its property market is known to be tailored to more affluent retirees and foreign property buyers. In Morocco, Casablanca is the most expensive city, followed by Marrakesh. In addition to a consistently high prevalence of European buyers, many wealthy Moroccan families live in the suburbs of these cities, including Palmeraie in Marrakesh and Bouskoura in Casablanca, where prices usually start around US$ 700,000. In Pakistan, while the country is nationally at the lowest end of the price spectrum of the five economies covered in this study, there are also significant intra-country differences. The capital, Islamabad, is a thriving real estate market in a fast-growing emerging economy. While there are many houses at the lower end of the price spectrum, with the cheapest house advertised at roughly US$ 9,000, property prices in Islamabad can go as high as US$ 6 million.

Cross-Validation
The apartment prices predicted by both models were compared with the observed apartment prices to determine the predictive power of the different models. One standard measure often used in price estimation models is the ratio of the predicted price to the observed price of the property. The acceptable median ratio between predicted and observed price is 0.9-1.1 (cf. International Association of Assessing Officers 2014; Čeh et al. 2018).
Additionally, we evaluate the performance of the different models with the mean absolute percentage error (MAPE), which measures the average percentage deviation of predicted prices from actual property prices, expressed as $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{p}_i - p_i}{p_i}\right| \times 100$, where $\hat{p}_i$ is the predicted property price, $p_i$ the actual property price of property $i$, and $n$ the number of properties. To understand differences in predictions for the BHP, we estimate a localized MAPE for apartments between 50 and 60 square meters. Comparing the predictive performance of the models based on these measures (ratio of predicted vs. actual value, MAPE, localized MAPE), we obtain more precise results from the RF models for all countries. All RF models are within the suggested range of the predicted/observed price ratio of 0.9-1.1, including the estimations where all potential duplicates are rigorously removed (Annex 2) and where different splits are applied (Annex 5). In the main OLS estimation, South Africa exceeds the acceptable median ratio between predicted and observed price. The MAPEs indicate that the percent deviation of the RF prediction from the actual property price ranges between 27 percent for Costa Rica and 31 percent for Pakistan. Across the board, while the MAPE is relatively high in both RF-based and OLS-based predictions, the RF estimation consistently performs better than the OLS prediction. While the quality and quantity of information used in both estimation techniques were identical, the predictor in the RF model explores the hierarchical structure of features and can therefore more sensitively track the possibility that the effect of each attribute on price varies by context (Hong 2020).
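The three performance measures discussed above can be computed as in the following sketch. The function names, the inclusive 50-60 sqm bounds, and the four example listings are assumptions made for illustration; they are not values from the paper.

```python
from statistics import median

def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p_hat - p) / p for p_hat, p in zip(predicted, actual)]
    return 100 * sum(errors) / len(errors)

def median_ratio(predicted, actual):
    """Median of predicted/observed price; 0.9-1.1 is the acceptable range cited above."""
    return median(p_hat / p for p_hat, p in zip(predicted, actual))

def localized_mape(predicted, actual, sizes, low=50, high=60):
    """MAPE restricted to units whose size lies between low and high sqm (inclusive here)."""
    kept = [(p_hat, p) for p_hat, p, s in zip(predicted, actual, sizes) if low <= s <= high]
    if not kept:
        raise ValueError("no observations in the requested size band")
    return mape([p_hat for p_hat, _ in kept], [p for _, p in kept])

# Hypothetical listings: observed price, predicted price, and size in sqm.
actual = [100_000, 200_000, 50_000, 80_000]
predicted = [110_000, 180_000, 50_000, 100_000]
sizes = [55, 120, 50, 58]

print(f"MAPE:           {mape(predicted, actual):.2f}")                    # 11.25
print(f"Median ratio:   {median_ratio(predicted, actual):.2f}")            # 1.05
print(f"Localized MAPE: {localized_mape(predicted, actual, sizes):.2f}")   # 11.67
```

In this toy example the median ratio falls inside the 0.9-1.1 band even though individual predictions miss by up to 25 percent, which illustrates why the ratio and the MAPE are reported side by side.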
The limited coverage of observable property features within EMDEs could be one explanation for the acceptable accuracy of the estimation. In addition, while most studies applying the RF model focus on a very narrow housing market (e.g., Čeh et al. 2018; Levantesi & Piscopo 2020), our data extend to the entire housing market of the five emerging economies, covering heterogeneous properties with varying amenities, including interior decorations, building age, or other features that are not captured in the model because they were not consistently available on listing websites. Equally, we jointly estimate property prices for a large swathe of locations within an economy, going beyond different neighborhoods within a city and aggregating both rural and urban areas. This deviates from the use of RF models in the literature to predict property prices for a well-defined, narrow set of locations, typically a city or a province/state. The limited property features and available explanatory variables limit the use of the model for predicting individual property prices across the spectrum. However, the purpose of the prediction in our case is to arrive at a typical price for a standard property that can be compared across countries and contexts. The RF- and OLS-based approaches, notwithstanding the relatively high MAPE, can be considered improvements over the alternative of considering only one-dimensional summary statistics of price.

Conclusion
Given the difficulty of obtaining reliable private property price data in emerging economies, most analyses of house prices or affordability assessments are constrained to developed economies and limited in scope.
Most cross-country analyses that assess trends in house prices are based on available indices, which often aggregate to the national level, masking important within-country dynamics and regional differences. To overcome this limitation and provide more in-depth insights into housing markets in emerging economies, we demonstrate how to collect a large range of localized data through web scraping of property listing websites. Further, to compare property prices across countries, we introduce the concept of the Basic House Price, which constitutes the average price of a basic one-bedroom apartment of 50 square meters in an urban area and allows for comparability across countries.
By demonstrating the methodology and data processing for five EMDEs, we show that web scraping offers a cost-effective way to obtain a large amount of price data for countries where official data are absent and where alternate data sources on prices are not available. The main constraint to this approach remains the unorganized structure of listing websites and the limited information available on property features and attributes. This approach will only improve over time as the capabilities of listing websites improve and as they become the preferred method of listing. There is also room to improve the web scraping on several fronts. For instance, image recognition software can extract information that is not supplied systematically on listing websites and could improve model precision. In addition, addresses could be geotagged to incorporate crucial details about location and to capture within-city variation.
The paper aims to outline one efficient way to address the wide data gap on property prices in emerging economies. In addition, the paper outlines how, once collected, these data can be used to estimate the price of a standard house consistently. With the consistent methodology proposed in our paper, the BHP can then be applied in several avenues for further research.
First, the available data and analysis can feed into further research on the determinants of house prices and the drivers of changes over time in emerging economies. While determinants of house prices are a well-researched subject in the literature, gaining increasing attention post-2008/2009, analyses mainly rely on data from developed economies. If available at all, price estimations in emerging economies are primarily available for very localized markets. To what extent these findings extend to a larger sample of emerging economies may be an area of research triggered by the data proposed in this paper. In addition, research areas more relevant for emerging economies, such as those related to empirically assessing inefficiencies in the housing value chain, would be possible with the data and analysis proposed by the paper. From an affordability perspective, this paper provides an important variable that may be combined with other data sources, for example, households' disposable income. Bringing these various elements together in the analysis of a country's housing market affordability is fundamental for more fully understanding the housing needs and challenges faced by households in emerging economies.

Figure 1. Frequency Distribution of Property Type, by Country

Table 3. Basic House Prices, by City
MAPE refers to mean absolute percentage error; l_MAPE refers to the mean absolute percentage error based on the prediction accuracy for apartments sized between 50 and 60 square meters; local ratio refers to the ratio between observed and predicted values for apartments sized between 50 and 60 square meters. Exchange rates are based on 2019 conversions.
The estimations show that among the five countries, at the national level and in US$ terms, Pakistan has the cheapest BHP (US$ 21,849), followed by Morocco (US$ 53,282), South Africa (US$ 63,745), Albania (US$ 71,205), and Costa Rica (US$ 113,064).
Annex 3: Alternate Outlier Removal
The original data set contains the original set of all listings. The truncated data set retains listings that contain sale prices, size data, and whether the property is an apartment or house, and truncates the data based on sqm < 9 or sqm > 3,000 and Price < US$ 5,000 or Price > US$ 5,000,000. The winsorized data set removes the first and 99th percentiles of price and size. SD = standard deviation; IQR = interquartile range.
Annex 4: Frequency Distribution of Smaller Apartments and Houses
Note: